Three days ago I talked about the importance of both RDF data dumps and dereferencable URIs to distribute RDF data over the Web. However yesterday Marc from Geonames.org got some problems with an impolite semantic web crawler. In his article he point out that:
“It simply does not make sense to download a huge database record by record if a full dump is available.”
In the best of the world it doesn’t make sense, but unfortunately it is how the Web always worked. Think about Google, Yahoo! and MSN Search; this is exactly what they do, and it doesn’t make sense. The difference is that they are probably more polite. Marc did the only thing he as to do: banning the belligerent crawler.
The problem with data dumps is that they are generally not that easy to find (if available) on services web site. So some developers will note care taking the time to find them and will fetch everything from the Web server, page by page.
However all that story brings a question: how could we make these data dump more visible? Ecademy.com use a <link> element from their home page to link to a dump of the URLs to their FOAF profiles. However, if you don’t check at the HTML code of the page, you will never be aware of it. A first step would probably be to create a repository of these data dumps.
The SWEO Community Project started the “Linking Open Data on the Semantic Web” project that is basically a list of RDF dumps from different web site or projects.
Personally what I will do to help people finding these RDF dumps (and to make them aware of their existence) is to create a repository of these RDF dump on Pingthesemanticweb.com (should be available later this week).
That way, developers using Pingthesemanticweb.com will probably check that list and then download the data they need. After that, they will only use PTSW to synch their triple store with the remote service’s database (Geonames.org for example).
Technorati: Semantic | web | data | dump | rdf | crawler | spam |