Distribution of semantic web data

Three days ago I talked about the importance of both RDF data dumps and dereferenceable URIs for distributing RDF data over the Web. However, yesterday Marc from Geonames.org ran into problems with an impolite semantic web crawler. In his article he points out that:

“It simply does not make sense to download a huge database record by record if a full dump is available.”

In an ideal world it doesn't make sense, but unfortunately this is how the Web has always worked. Think about Google, Yahoo! and MSN Search; this is exactly what they do, and it doesn't make sense. The difference is that they are probably more polite. Marc did the only thing he had to do: ban the belligerent crawler.

The problem with data dumps is that they are generally not that easy to find (when available at all) on a service's web site. So some developers won't take the time to find them and will fetch everything from the web server, page by page.

All that said, this story raises a question: how could we make these data dumps more visible? Ecademy.com uses a <link> element on its home page to link to a dump of the URLs of its FOAF profiles. However, if you don't look at the HTML code of the page, you will never be aware of it. A first step would probably be to create a repository of these data dumps.
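For what it's worth, here is a sketch of how a crawler could auto-discover such a dump from a page's <link> elements. The rel/type values I test for follow the FOAF autodiscovery convention; the exact markup Ecademy.com uses may differ, so check its page source.

```python
# Sketch: discover an RDF dump advertised via <link> in a page's head.
from html.parser import HTMLParser
from urllib.request import urlopen

class DumpLinkFinder(HTMLParser):
    """Collects href values of <link rel="meta" type="application/rdf+xml">."""
    def __init__(self):
        super().__init__()
        self.dump_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        # FOAF autodiscovery convention; a site may use different values.
        if a.get("rel") == "meta" and a.get("type") == "application/rdf+xml":
            if "href" in a:
                self.dump_urls.append(a["href"])

html = urlopen("http://www.ecademy.com/").read().decode("utf-8", "replace")
finder = DumpLinkFinder()
finder.feed(html)
print(finder.dump_urls)
```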

The SWEO Community Project started the “Linking Open Data on the Semantic Web” project, which is basically a list of RDF dumps from different web sites and projects.

Personally, what I will do to help people find these RDF dumps (and to make them aware of their existence) is create a repository of them on Pingthesemanticweb.com (it should be available later this week).

That way, developers using Pingthesemanticweb.com will check that list and download the data they need. After that, they will only use PTSW to sync their triple store with the remote service's database (Geonames.org, for example).


RDF dump vs. dereferencable URIs

In a recent mail thread, someone asked what the best way is to get RDF data from a source [having more than a couple of thousand documents]: an RDF dump or a list of dereferenceable URIs?

Neither is better than the other. Personally, I prefer to use both.

If we take the example of Geonames.org, getting all 6.4 million RDF documents from dereferenceable URIs would take weeks. On the other hand, updating your triple store with new or updated RDF documents from an RDF dump alone would force you to download and re-index it completely every month (or so), a task that would take days.
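Some back-of-the-envelope arithmetic makes the "weeks versus days" claim concrete (the request and indexing rates are my own assumptions):

```python
# A polite crawler fetching one document per second, around the clock:
documents = 6_400_000
crawl_days = documents / 1 / 86_400
print(f"Crawling URI by URI: ~{crawl_days:.0f} days")  # ~74 days

# Re-indexing a full dump is bounded by indexing speed, not network
# round trips -- assume 25 documents per second:
index_days = documents / 25 / 86_400
print(f"Re-indexing a full dump: ~{index_days:.0f} days")  # ~3 days
```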

So what is the best way then? Here is what I propose (and currently do):

In fact, the first time I indexed Geonames into a triple store, I requested an RDF dump from Marc. Then I asked him: would it be possible for you to ping Pingthesemanticweb.com each time a new document appears on Geonames or each time a document is updated? Within a couple of hours he answered my mail, and Geonames was pinging PTSW.

So, what does this mean? It means that I populated my triple store with Geonames data from an RDF dump the first time. By proceeding that way I saved one to two weeks of work. I am now updating the triple store via Pingthesemanticweb.com, which saves me two or three days each month.

So what I suggest is to use both methods. The important point here is that Pingthesemanticweb.com acts as an agent that sends you the new and updated files for a specific service (Geonames in the above example). This simple infrastructure could save many semantic web developers precious time.
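Here is a minimal sketch of that hybrid approach in Python with rdflib. The dump filename and the PTSW export URL are assumptions on my part (check Pingthesemanticweb.com for the actual export API), and a real system would use named graphs so an updated document replaces its old triples instead of piling on top of them.

```python
from urllib.request import urlopen
from rdflib import Graph

store = Graph()

# Step 1 (done once): bootstrap the triple store from the full RDF dump.
store.parse("geonames-dump.rdf")  # local copy of the dump (hypothetical name)

# Step 2 (run periodically): ask PTSW for documents created or updated
# since the last sync, and fetch only those. The endpoint below is
# hypothetical -- see the service's documentation for the real one.
EXPORT_URL = "http://pingthesemanticweb.com/export/?domain=geonames.org"
for line in urlopen(EXPORT_URL):
    doc_url = line.decode("utf-8").strip()
    if doc_url:
        store.parse(doc_url)  # fetch and merge just this one document
```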


Revision 1.02 of the Music Ontology: a first crystallization of the ontology

 

For about a month now, people have been looking at the Music Ontology; they talk about it, they revise it, and they give suggestions. Revision 1.02 of the ontology has been made possible by all their comments and suggestions. It should be a good compromise between all users’ visions.

As I previously explained in one of my blog posts, the FRBR Final Report has been used as the foundation of the ontology. It describes musical entities and their relationships.


Finally, revision 1.02 is the first “crystallization” of the Music Ontology. People can now start to use it in their FOAF profiles. Artists can start to describe themselves, their musical groups, their musical creations, etc. Music stores could start to export the content of their inventory using the ontology. Internet radios could start to give information about the tracks they stream.
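For instance, here is a hedged sketch (in Python with rdflib) of what an artist's self-description could look like. The mo: class and property names are my reading of the ontology, so check the published spec for the exact terms; all URIs are made up.

```python
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import FOAF, RDF

MO = Namespace("http://purl.org/ontology/mo/")
g = Graph()
g.bind("mo", MO)
g.bind("foaf", FOAF)

me = URIRef("http://example.org/people/jane#me")        # hypothetical URIs
band = URIRef("http://example.org/bands/thefoos#band")

g.add((me, RDF.type, MO.MusicArtist))      # term names assumed from the spec
g.add((me, FOAF.name, Literal("Jane Doe")))
g.add((me, MO.member_of, band))
g.add((band, RDF.type, MO.MusicGroup))
g.add((band, FOAF.name, Literal("The Foos")))

print(g.serialize(format="turtle"))
```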

If you are interested in participating in the elaboration of the ontology, I would encourage you to subscribe to its mailing list. I would like to thank everybody subscribed to this mailing list, since this new revision wouldn’t have been possible without their insightful comments and suggestions.

 


Wikipedia concepts to help manage subjects described on the Semantic Web

 

This article is an aggregation of thoughts I had while working on one of my recent semantic web projects. I have no idea what these ideas are worth, but I hope to open a discussion about the problem described below.

 

The problem

What is beautiful about the semantic web is that anyone can define anything about anything. I can describe myself; I can describe a project; I can describe a car; etc.

What is so beautiful and powerful is also a huge problem, and probably the biggest weakness of the semantic web.

In fact, I can describe myself with a FOAF profile. This FOAF profile describes relations about me. But nothing stops anybody else from defining other relations about me. One can say good things about me, but one can say bad things as well.

 

How does it happen?

Readers: you will need some understanding of semantic web concepts to read the rest of this article.


Basically, I can use any URI (Uniform Resource Identifier) to describe relations about me. But anybody else can define new relations with that URI as well.

The problem arises if I take the RDF files from these two authors (me and the other person) and index them in the same triple store. Then I get many relations, defined by two different people, for the same subject: me.
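A small sketch of the situation with Python and rdflib; both documents and all URIs are made up for illustration:

```python
from rdflib import Graph, URIRef

me = URIRef("http://example.org/fred#me")  # hypothetical URI for "me"

# My own description of myself...
mine = """@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.org/fred#me> foaf:name "Fred" ;
    foaf:interest <http://www.w3.org/2001/sw/> ."""

# ...and a description somebody else published about the very same URI.
bobs = """@prefix ex: <http://example.org/terms#> .
<http://example.org/fred#me> ex:opinion "a terrible person" ."""

store = Graph()
store.parse(data=mine, format="turtle")
store.parse(data=bobs, format="turtle")

# The store now returns both authors' statements for the same subject.
for s, p, o in store.triples((me, None, None)):
    print(p, o)
```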

Do you see what the situation is starting to look like?

 

Same as Wikipedia

Yeah, the problem seems to be the same as Wikipedia’s. In fact it is the same thing, except that Wikipedia is centralized on a single infrastructure, while the semantic web is decentralized over the Web.

Since we can’t restrict who uses a URI to describe something, we have to find another solution.

 

Some possible solutions

Possible solutions exist. For example, people indexing these documents could index only URIs that are dereferenceable (meaning URIs that can be looked up over the HTTP protocol to get an RDF file describing the identified resource).

That way, they would prevent Bob from defining anything about my URI from a web server other than my own.
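A sketch of both checks in Python (a real indexer would handle redirects and be more tolerant about content types):

```python
from urllib.parse import urlparse
from urllib.request import Request, urlopen

def is_dereferenceable(uri: str) -> bool:
    """Can the URI be looked up over HTTP to get an RDF document?"""
    try:
        req = Request(uri, headers={"Accept": "application/rdf+xml"})
        with urlopen(req, timeout=10) as resp:
            return "rdf" in resp.headers.get("Content-Type", "")
    except OSError:
        return False

def same_authority(subject_uri: str, document_uri: str) -> bool:
    """Accept statements about a subject only from the subject's own host."""
    return urlparse(subject_uri).netloc == urlparse(document_uri).netloc
```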

A variant of that solution would be to get documents only from a list of trusted sources (like most of the current memetrackers do).

However, all these solutions downgrade the power of RDF: defining anything about anything. That power means that, if I want, I can describe my dog “Bud” even if his URI is not resolvable over HTTP.

 

Another solution: a Wikipedia-like supervising authority

One solution could be to create an authority that would supervise the evolution of URI descriptions on the semantic web.

Think of the interface as Wikipedia’s, except that we would not use WikiWords but URIs to identify things.

So people could register with that “authority”. They could define things about these URIs. Disputed URIs could be flagged as such. Discussions about a URI’s description could be opened, etc.

 

How could such a system solve the problem?

The authority could create an ontology to express this “meta-information” about URIs. It would then gather all the information from the service and publish it in RDF using that ontology.

From that point, everybody who displays information about URIs could add this “meta-information” to their results. That way, their users would see whether false information is included in the result, whether the things defined for the URI (the subject) are disputed, etc.
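To make the idea concrete, here is a purely hypothetical sketch: the auth: namespace and all of its terms do not exist anywhere, they only illustrate the kind of meta-information such an authority could publish.

```python
from rdflib import Graph, Namespace, URIRef, Literal

AUTH = Namespace("http://authority.example.org/ontology#")  # hypothetical
g = Graph()
g.bind("auth", AUTH)

subject = URIRef("http://example.org/fred#me")  # the described URI
g.add((subject, AUTH.status, Literal("disputed")))
g.add((subject, AUTH.disputedBy, URIRef("http://example.org/people/bob#me")))
g.add((subject, AUTH.discussion, URIRef("http://authority.example.org/talk/fred")))

print(g.serialize(format="turtle"))
```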

 

Conclusion

For semantic web developers, this solution is as simple as indexing RDF data into a triple store: (1) download, (2) index, (3) query and (4) display.
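As a minimal rdflib sketch of those four steps (the document URL is hypothetical):

```python
from rdflib import Graph

g = Graph()
# (1) download and (2) index the RDF document.
g.parse("http://example.org/people/fred.rdf")

# (3) query the indexed triples with SPARQL.
results = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name WHERE { ?person foaf:name ?name }
""")

# (4) display the results.
for row in results:
    print(row.name)
```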

This solution is not perfect, but it could help developers display meaningful information to their users while keeping the descriptive power of RDF.

 


Reaching at least 600 000 people with 19 contacts

Sergio Fernández and Iván Frade recently started a really interesting experiment called Futil. It is a small computer program that takes Sergio’s FOAF profile as a seed to discover new people from his relations (the friend of a friend of a friend, etc.). The experiment is to discover how many people you can find starting only from the people you know. So far, Sergio’s Futil program has found about 600 000 people. I guess it should discover around 2 500 000 people before it finishes.

The experiment is quite interesting in many ways. It gives some insight into how people are connected, and, even more important for today’s web, how communities of users interact with the Web.

The graph below shows how Futil is discovering these profiles. The Y-axis represents the size of the pool of people it still has to fetch from the Web, and the X-axis is the number of profiles it has fetched so far.

The first 50 000 people Futil found came from different places on the web: a personal web page, the web page of an organization, etc. Then, eventually, Futil found a couple of links to people belonging to an online community called Tribe. People of that community only link to other people of the same community. What is interesting is that as soon as Futil started to crawl a couple of people from that community, it eventually found all 200 000 people belonging to it. The same thing is currently happening with another, much bigger community called Livejournal, with about 2 million users.

Why did Futil only crawl people from the same community? The answer is easy: because these communities are closed. They don’t interact with the rest of the Web, so one user can only link to other users of the same community.
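I haven't seen Futil's source code, but here is a sketch of the general technique such a FOAF crawler (a "scutter") uses: fetch a profile, collect the people it foaf:knows, follow their rdfs:seeAlso links to their own FOAF documents, and repeat breadth-first. A closed community shows up as a queue that fills up with URLs from a single host.

```python
from collections import deque
from rdflib import Graph
from rdflib.namespace import FOAF, RDFS

def crawl(seed_url: str, limit: int = 100) -> set:
    """Breadth-first discovery of FOAF documents, starting from a seed."""
    queue, seen = deque([seed_url]), {seed_url}
    while queue and len(seen) <= limit:
        url = queue.popleft()
        g = Graph()
        try:
            g.parse(url)
        except Exception:
            continue  # unreachable or malformed profile
        # Every person this profile knows...
        for person in g.objects(None, FOAF.knows):
            # ...may point to their own FOAF document via rdfs:seeAlso.
            for doc in g.objects(person, RDFS.seeAlso):
                if str(doc) not in seen:
                    seen.add(str(doc))
                    queue.append(str(doc))
    return seen
```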

 

How can a community open up and let its users interact with users of other online communities?

A first step would be to let people describe their relationships with people outside of their community.

One example of such an online community is Talk Digger. This system lets its users import (and sync) their FOAF profile from another location on the Web. It also lets its users define their relationships with people outside of the community. For example, a user can say that he knows persons X and Y on Talk Digger, but he can also specify that he knows a person Z from outside of the community, or from another online community.
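In FOAF terms, such an outside relationship is nothing exotic. A sketch, with all URIs made up for illustration:

```python
from rdflib import Graph, URIRef
from rdflib.namespace import FOAF, RDFS

g = Graph()
g.bind("foaf", FOAF)

me = URIRef("http://community-a.example/users/fred#me")       # hypothetical
inside = URIRef("http://community-a.example/users/alice#me")  # same community
outside = URIRef("http://community-b.example/users/bob#me")   # another one

g.add((me, FOAF.knows, inside))
g.add((me, FOAF.knows, outside))
# The seeAlso link is what lets a crawler escape the community:
g.add((outside, RDFS.seeAlso, URIRef("http://community-b.example/users/bob.rdf")))

print(g.serialize(format="turtle"))
```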

In fact, if other online communities added such a feature to their systems, inter-community communication and relationships would become possible.

You can read an old blog post that explains how Talk Digger handles FOAF profiles.

 

Why should online community systems open themselves up?

Why will a user use one online community and not another? It depends; I would say that it principally depends on the topic(s) of the community, the people he knows in that community, and the user interface of that community (after all, one interface doesn’t work for everybody).

So, why shouldn’t online communities let their users interact with other online communities’ users?

I think it is an error caused by the fear of losing users, and it explains why Futil behaved that way: current online communities don’t let their users interact with people from outside the community.

 

Futil is pinging Pingthesemanticweb.com as well

Well, each time Futil discovers a new FOAF profile, it pings Pingthesemanticweb.com. So far it has pinged about 300 000 new FOAF profiles. It is a good example of how this semantic web pinging service can be used.
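For crawler authors, the ping itself is a single HTTP request. A sketch, assuming a REST-style endpoint that takes the document URL as a parameter (check Pingthesemanticweb.com's documentation for the exact API):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def ping_ptsw(rdf_url: str) -> None:
    """Notify PTSW that an RDF document is new or was updated."""
    query = urlencode({"url": rdf_url})
    # Endpoint path is an assumption -- see the service's documentation.
    urlopen(f"http://pingthesemanticweb.com/rest/?{query}", timeout=10)

ping_ptsw("http://example.org/people/new-profile.rdf")  # hypothetical URL
```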

Now everybody has access to these new FOAF files. The best thing would be for such online communities (like Tribe.net and Livejournal.com) to ping the service each time there is a new user, or each time a user updates his profile. But in the meantime, independent crawlers such as Futil do the job very well.

 

Conclusion

What I wish for now is that future online communities start to let their users interact with users from other communities. A good start in that direction would be to let them describe their relationships not only with people of the same community, but also with people from outside of it. From there, meta-communities should start to emerge.
