The business model of a Semantic Web service

What could be the business model of a Semantic Web service? First of all, what is a Semantic Web service? We could define it as a web service that broadcasts its content using technologies like RDF or OWL. The content is formatted in such a way that software agents can easily read it and infer knowledge from it. The key principle of a Semantic Web service is its openness.

I was talking yesterday with one of the creators of the search engines used by Talk Digger. During our conversation he asked me:

Are you planning to share any revenues with the blog engines? Do you have a business model for Talk Digger?

His question was legitimate but my answer was short: no. No, I do not have any business model for Talk Digger. Right now it is an experiment; a little personal research project. I could share revenues; however, I doubt they would interest anyone, considering that they barely pay the hosting fees.

However, I have already thought about it, and the question is a good one: what could be a business model for a service such as Talk Digger? Even more: what could be a business model for a Semantic Web service; a service that broadcasts its computations at large, without any fees (Talk Digger could become such a service… it is a wish, remember?).

I have no idea.

John Heilemann wrote in the New York Metro:

“Alan Murray wrote a column in the Wall Street Journal that called Google’s business model a new kind of feudalism: The peasants produce the content; Google makes the profits,”

He is right, and not just about Google. Search engines gather and index web content from everywhere. They even cache entire web pages and republish them without any kind of permission (remember the “cache” feature on many search engines?). I remember that secret US military documents have already been indexed by Google (just a rumor or a fact? I don’t remember). I also know that Google indexes the content of millions of books without any kind of permission.

So, the question remains: what could be the business model for Talk Digger? And if revenues are generated with such a model, should I share them with the other search engine companies? Sure I should; they can ban my crawler’s IP address at any time. However, is it to their advantage? Considering the picture I just painted of some current search engines, why couldn’t I scrape their web pages for some results? If we think about it, their results pages are web documents like any other web page on the Internet. If they can scrape others’ web pages, why couldn’t I?

In the end, everything is about money. However, they tell us that they democratize the Internet by making it searchable. I would tell them that I democratize it too, by aggregating their results in a “novel” way.

What is the Internet? The democratization of the world’s information, or a cash cow? I hope not the latter.


Alexa opens its teragigs of indexes: can Talk Digger take advantage of it?

Alexa (Amazon.com) just launched a new web service that gives access to Alexa’s databases to anyone who needs it. It is really great news. I am excited to see big companies opening up and making their data publicly available to anyone who needs it.

I have been talking for some months about how I see the future of the Web: the vision I have of the future of the Internet with the Semantic Web, and so on. I have talked about how the Web could change if everybody made their gathered/processed/indexed content publicly available.

Yesterday I released a totally new version of Talk Digger and talked about how I would like to make its computed results available to anyone who needs them. It is a dream I have; it is a reality that Amazon is making. Talk Digger’s and Alexa’s results are not the same, and neither are their users, but in either case it fits into a vision of things that could change the way we use the Internet, the way the Internet grows.

The new version of Talk Digger uses a Google web service: PageRank. It is a great way to gauge the credibility of the people talking about a URL; a great way to know who the people participating in a conversation are. It is certainly not the only or the best way to do that, but it is a good start. In fact, I am designing a new Talk Digger feature that I think could be a good way to see, analyze and interpret these conversations. In any case, it is a great feature that will be part of Talk Digger for a long time (as long as Google gives access to their API through a web service).

That is the point: Talk Digger goes even further by displaying its results using the service of another company.

Now, would it be possible to integrate the new Alexa web service to enhance Talk Digger’s results? It would be really great, considering all the data we get access to through the web service. I could even combine Google’s PageRank with Alexa’s popularity ranking to compute a single indicator that uses both services (neither is foolproof, but they could be complementary).

The problem with Alexa’s service is that I am restricted to one request per IP per second. The thing is that if you start a search for a URL and receive 70 results, Talk Digger requests the PageRank of these 70 URLs in less than a second, and it would have to query Alexa at the same pace. So I cannot really implement Alexa’s new web service in Talk Digger with this restriction.

In any case, Amazon has done a great thing by creating this new web service. I hope that other companies will follow in that direction.


Why does Microsoft seem to reinvent the wheel with RSS?

I cannot understand why Microsoft seems to be trying to reinvent the wheel with RSS 2.0. Okay, I am a little bit late on this one, but I just discovered that they presented an “extension” to RSS 2.0 called the “Simple List Extensions Specification” at Gnomedex 2005.

Well, what is this SLES all about? “The Simple List Extensions are designed as extensions to existing feed formats to make exposing ordered lists of items easier and more accessible to users”.

Then I was lost…

Why does Microsoft publish such a specification for RSS 2.0? RSS 1.0, built on XML Namespaces and RDF, already uses such an ordered list, called an rdf:Seq, to do exactly the same thing. This capability is provided directly by RDF.
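
To show what I mean, here is a minimal, made-up RSS 1.0 channel (the URLs are placeholders): the <items> element already carries an ordered list of the channel’s items using RDF’s built-in rdf:Seq container, which is exactly the kind of ordered list the Microsoft extension reintroduces.

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/feed.rdf">
    <title>Example feed</title>
    <link>http://example.org/</link>
    <description>A minimal RSS 1.0 channel</description>
    <items>
      <!-- rdf:Seq is RDF's standard ordered container -->
      <rdf:Seq>
        <rdf:li rdf:resource="http://example.org/post-1"/>
        <rdf:li rdf:resource="http://example.org/post-2"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://example.org/post-1">
    <title>First post</title>
    <link>http://example.org/post-1</link>
  </item>
  <item rdf:about="http://example.org/post-2">
    <title>Second post</title>
    <link>http://example.org/post-2</link>
  </item>
</rdf:RDF>
```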

I already wrote about the difference between RSS 1.0 and RSS 2.0 and I really do not understand why Microsoft develops modules for RSS 2.0 instead of implementing everything using RSS 1.0 and RDF.

I read somewhere that Microsoft has no plans to develop an RDF parser for their .NET Framework. It is probably one of the reasons why they do not use RSS 1.0: they do not have any tool to implement it and no plans to develop one.

Why? Could someone help me with that one?

Right now I think that my greatest wish is to have the Jena framework ported to C#. I guess I can’t rely on Microsoft for that one.

Finally, it seems that I am not the only person with questions about this move in relation to RSS 1.0.

RSS 1.0, RSS 2.0: make it simple, not simpler


Update to the discussion about RSS 1.0 vs. RSS 2.0

Why use RDF instead of XML? [25 May 2006]

“Make everything as simple as possible, but not simpler”. – Albert Einstein.

I love that quote from Albert Einstein. A few words that say so much to designers. Make things simple for the user; make it so that he does not even know he is using what you designed (okay, it is a utopia). But beware: do not make it simpler; do not compromise on the capabilities of what you are designing just to make it simple (that is the whole art of design).

That said, I am currently rewriting the Talk Digger RSS feed generator for the next release, planned in a week or two. While working on it, I found that I had made exactly that mistake: I made it simpler while wishing to make it simple.

Let me explain the situation. Some months ago, I chose to create the feeds in RSS 2.0 instead of RSS 1.0. What is the problem then? RSS 2.0 should be much more evolved than RSS 1.0, shouldn’t it? No, it is not. RSS 2.0 is about two years younger than RSS 1.0, but much simpler. Why do I say that the file format is much simpler? Because RSS 1.0 feeds are serialized in RDF and RSS 2.0 feeds are serialized in plain XML.

Where is the problem then? XML files are much easier to read than RDF ones; in fact, aren’t RDF files just cluttered XML files? No, definitely not. It is true that RDF/XML files (there exist other serialization formats, like N3, that can also serialize RDF) are less intuitive for humans to read, but they are much more powerful for answering certain needs.
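
To make the difference concrete, here is roughly the same made-up item expressed both ways (the URLs are placeholders). In RSS 2.0 the item is an anonymous chunk of XML; in RSS 1.0 it is an RDF resource identified by a URI, so other RDF statements, from other vocabularies, can say things about it.

```xml
<!-- RSS 2.0: plain XML; the item has no identity outside this file -->
<item>
  <title>New conversation found</title>
  <link>http://example.org/post-1</link>
  <description>Someone linked to your site.</description>
</item>

<!-- RSS 1.0: the same item as an RDF resource, addressable by its URI -->
<item rdf:about="http://example.org/post-1"
      xmlns="http://purl.org/rss/1.0/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <title>New conversation found</title>
  <link>http://example.org/post-1</link>
  <description>Someone linked to your site.</description>
</item>
```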

Personally, I see RSS 2.0 as a lesser version of RSS 1.0. Why? Because applications that only support RSS 2.0 can be much simpler (a thing that we do not want here), since they only have to handle XML files instead of full RDF ones.

Fred, are you telling us that RSS 1.0 is much more powerful than RSS 2.0? Yes. All the power of RSS 1.0 resides in the fact that it supports modules. This capability comes from RDF and its ability to import external RDF schemas to extend its vocabulary. What is a module? A module gives the content publisher the possibility to extend the file format’s vocabulary by importing external RDF schemas.

Okay, but what is the advantage of using these modules? I will explain it with a Talk Digger example. I am currently thinking about creating an RDF schema that would model some of the semantic relations Talk Digger computes from the search engines’ returned results. I want to make that information publicly available to anyone who would like to access it and do something with it. That said, I am also thinking of broadcasting the information directly in the RSS feed: I want to create only one source of information that broadcasts everything. RSS 1.0 gives me that possibility (in fact, an RSS 1.0 web feed is a normal RDF/XML file using the RSS 1.0 schema). It is beautiful: I can make all the information I want available to anyone, in a single source. If a piece of software that reads the feed does not understand part of the information I broadcast (in reality, it does not know the RDF schema I am using), it simply skips it, continues to read the source of information (the web feed), and does what it has to do with the information it does understand. I cannot do that with RSS 2.0, because it is serialized in plain XML and not in RDF. I could even add OWL elements to my feeds to model relations between the knowledge represented in them; that way an application could infer new knowledge from it! An example of a popular module is the Dublin Core metadata initiative.
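
As a sketch of what I have in mind (the td namespace and its elements are hypothetical, something I might define later; the URLs are placeholders), an RSS 1.0 item could mix the standard vocabulary, the Dublin Core module and a Talk Digger schema in a single feed:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:td="http://talkdigger.example/schema#">
  <item rdf:about="http://example.org/post-1">
    <title>Someone is talking about your site</title>
    <link>http://example.org/post-1</link>
    <!-- Dublin Core module: standard metadata any RDF-aware reader can use -->
    <dc:creator>Jane Blogger</dc:creator>
    <dc:date>2005-12-15</dc:date>
    <!-- Hypothetical Talk Digger vocabulary: a reader that does not know
         this schema simply skips these statements -->
    <td:conversationSize>42</td:conversationSize>
    <td:linksTo rdf:resource="http://example.org/your-site"/>
  </item>
</rdf:RDF>
```

A reader like RSS Bandit would show the Dublin Core fields, a future Talk Digger client could also use the td statements, and nothing breaks for anybody else.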

You are probably thinking: yeah Fred, but readers only have to support both formats, publishers only have to publish both formats, and everybody will be happy. Bad design thinking: do not forget that the goal is to make applications for real users. How do you think I will explain the difference between RSS 1.0 and RSS 2.0 to my mother? How will I explain to her which one to choose if she has the possibility to subscribe to more than one feed? Will she choose RSS 0.91, RSS 0.92, RSS 1.0, RSS 2.0, ATOM 0.9, or ATOM 1.0 (because some websites offer them all)? Sorry, but I do not want to have to.

One of the current problems

One of the problems is the way applications handle all these file formats and serializations. I will explain it with a problem I faced today while testing the new Talk Digger RSS feed with Bloglines.

One thing I wanted was to use the Dublin Core element dc:description instead of the normal “<description></description>” tag of the RSS 1.0 specification. I thought it would scale much better because the Dublin Core RDF schema is widely used by many, many applications across the Internet. First I tested it using RSS Bandit. It worked like a charm: all the Dublin Core elements I added to my RSS feed were handled by it. Wow! Then I tested it with Bloglines: nothing. Bloglines just doesn’t handle that Dublin Core tag. What a disappointment.

Then I declared this namespace in my RDF file: xmlns:ct="http://purl.org/rss/1.0/modules/content/". I re-tested it: nothing. It should work, shouldn’t it? Then I tried something else: I changed the prefix “ct” to “content”, and it worked. What a disappointment: Bloglines does not care about the local namespace prefix; in fact, it seems to parse RSS 1.0 feeds (which are really RDF files) by matching fixed strings. The system should treat “ct” exactly like “content”, because these prefixes are just local aliases that I bind to the namespace URI in my own file. It is a perfect example of a bad implementation of a specification in software.
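
Here are two illustrative fragments (assuming the module’s encoded element, which is what that content namespace defines) that should be strictly equivalent to any conforming parser, since only the namespace URI matters, not the local prefix:

```xml
<!-- Declared with the prefix "ct": the version Bloglines ignored -->
<item rdf:about="http://example.org/post-1"
      xmlns="http://purl.org/rss/1.0/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:ct="http://purl.org/rss/1.0/modules/content/">
  <title>Example</title>
  <ct:encoded><![CDATA[<p>Full content here.</p>]]></ct:encoded>
</item>

<!-- Declared with the prefix "content": same namespace URI, same meaning,
     but the only version Bloglines understood -->
<item rdf:about="http://example.org/post-1"
      xmlns="http://purl.org/rss/1.0/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <title>Example</title>
  <content:encoded><![CDATA[<p>Full content here.</p>]]></content:encoded>
</item>
```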

The problem here is that Bloglines is the most popular web feed reader out there. So I have to change the way I build my feeds to work around that fact, but I shouldn’t have to (it is really frustrating). Will I have to change the way I build my feeds every time I discover that an application is not parsing and using them properly? I hope not; I shouldn’t have to, because I follow the specification when building them.

I hope they will look into that problem with their parser and hire somebody to develop a robust system that parses and handles the RDF specification, instead of only parsing RSS 1.0 feeds as simple text files with a certain format… (Could I change the skill requirement “Familiarity with RSS and blogs” to “Strong understanding of RDF, RSS and blogs” in that job posting, to match the responsibility “maintain and improve RSS crawling and parsing processes”?)

I hope to be able to show you soon how RSS 1.0 can be extended, using a future version of Talk Digger.


Semantic-Web-Of-Trust

The current problem of the Web

The current problem of the Web is that most (virtually all) of the documents it holds are formatted for humans. For example, HTML is a markup language that is used to present information to humans, to make documents easily understandable by them.

You wonder why I say that this is the current problem of the Web? The problem resides in the fact that these human-oriented documents are not easily processable by computers. The information is not formatted for them. They can’t easily understand what a document is about: its subject, its meaning, its semantics, and so on.

A possible solution to that problem

A solution we could use to try to solve this problem is to annotate these human-readable documents with computer-processable metadata. This is exactly the purpose of new file formats like RDF and OWL. Their primary purpose is to make digital documents (a file, a photo, a video, anything that is digital) computer-processable.

Such a document would describe the meaning and the semantics of a digital document in a form that computers can easily understand. That way, software agents could easily read these documents, understand them and even infer new facts and knowledge from them. This is the idea behind the Semantic Web.
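
As a minimal illustration (the URL and values are made up), a small RDF/XML document using the Dublin Core vocabulary could sit next to, or be linked from, a human-readable HTML page and state in a machine-readable way what that page is about:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- Machine-readable statements about a human-readable page -->
  <rdf:Description rdf:about="http://example.org/articles/semantic-web">
    <dc:title>An introduction to the Semantic Web</dc:title>
    <dc:creator>Jane Author</dc:creator>
    <dc:subject>Semantic Web</dc:subject>
    <dc:date>2006-01-10</dc:date>
    <dc:language>en</dc:language>
  </rdf:Description>
</rdf:RDF>
```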

The possible problems with such annotated metadata

Remember the early days of the Web, when people put metadata in their HTML headers? Remember when search bots used this information to return relevant results to users? Remember when search bots stopped using it because people were only using it to trick the bots into bringing visitors to their web pages, even when the search queries had nothing to do with the content of the returned pages? That is exactly why people lost faith in metadata. And it is exactly why I have doubts about social tagging (but that is another story).

The problem with this early way of annotating documents with metadata is that people could annotate their web documents with any metadata, related or not to the content of those documents. In the end, web publishers were not annotating their documents with information relevant to their content, but only with information that would bring traffic to their websites.

You are probably thinking something like this: “Fred, you said that the Semantic Web formats, RDF, OWL, or any other, are simply a sort of metadata file that can be attached to current web documents to describe them, their meaning and their semantics. So, don’t you think the result would be the same as with the HTML headers’ metadata? That people would try to trick the Semantic Web search engines, crawlers and software agents?”

The solution: Semantic-Web-Of-Trust

Below is a short description of the Web of Trust as seen by Tim Berners-Lee, the father of the Web and the Semantic Web, written in 1997.

“In cases in which a high level of trust is needed for metadata, digitally signed metadata will allow the Web to include a “Web of trust”. The Web of trust will be a set of documents on the Web that are digitally signed with certain keys and contain statements about those keys and about other documents. Like the Web itself, the Web of trust will not need to have a specific structure, such as a tree or a matrix. Statements of trust can be added in such a way as to reflect actual trust exactly. People learn to trust through experience and through recommendation. We change our minds about who we trust and for what purposes. The Web of trust must allow us to express this.”

At that time, Mr. Berners-Lee saw digital signatures as a way to establish who the author of a metadata annotation is, and thereby to add trust to that metadata. Some people might also think of PGP’s [PKI] web of trust system.

Other people, like Shelley Powers, have thought about attaching RDF content to links (for example, annotating descriptive information about a link to a local hardware store), and about using reification to infer trust along the relation: I trust him, you trust me, so you trust him.
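
In RDF terms, reification means turning a statement itself into a resource that other statements can describe. A minimal sketch (the trust vocabulary and the URIs here are hypothetical, only there to show the mechanics):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:trust="http://example.org/trust#">
  <!-- The base statement: Alice trusts Bob -->
  <rdf:Description rdf:about="http://example.org/people/alice">
    <trust:trusts rdf:resource="http://example.org/people/bob"/>
  </rdf:Description>

  <!-- The reified statement: a resource standing for "Alice trusts Bob",
       which a third party can endorse, rate or digitally sign -->
  <rdf:Statement rdf:about="http://example.org/statements/alice-trusts-bob">
    <rdf:subject rdf:resource="http://example.org/people/alice"/>
    <rdf:predicate rdf:resource="http://example.org/trust#trusts"/>
    <rdf:object rdf:resource="http://example.org/people/bob"/>
    <trust:assertedBy rdf:resource="http://example.org/people/carol"/>
  </rdf:Statement>
</rdf:RDF>
```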

Many studies are trying to find the best way to add trust to the Web and, in the near future, to the Semantic Web. Some techniques, like PGP’s, are tested and effective. However, could they be applied to the Semantic Web? What is the best system we can use for the Semantic Web? Does that system already exist? Does it remain to be created?

One thing is sure: such a system will have to be part of the Semantic Web if we want it to succeed.