In a recent mail thread, someone asked what the best way is to get RDF data from a source [with more than a couple of thousand documents]: an RDF dump or a list of dereferenceable URIs?
Neither is better than the other. Personally, I prefer to use both.
Take the example of Geonames.org: getting all 6.4 million RDF documents from dereferenceable URIs would take weeks. On the other hand, keeping your triple store up to date with new or updated RDF documents using only an RDF dump would force you to download and re-index the whole thing every month (or so), a task that would take a few days.
So what is the best way then? Here is what I propose (and currently do):
The first time I indexed Geonames into a triple store, I requested an RDF dump from Marc. Then I asked him: would it be possible to ping Pingthesemanticweb.com each time a new document appears on Geonames or an existing one is updated? Within a couple of hours he answered my mail, and Geonames was pinging PTSW.
So, what does this mean? It means that I populated my triple store with Geonames from an RDF dump the first time, which saved me one to two weeks of work. I now update the triple store via Pingthesemanticweb.com, which saves me two or three days each month.
So what I suggest is to use both methods. The important point here is that Pingthesemanticweb.com acts as an agent that sends you the new and updated files for a specific service (Geonames in the above example). This simple infrastructure could save precious time for many semantic web developers.
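To make the idea more concrete, here is a minimal sketch of this dump-then-ping workflow in Python. It assumes a hypothetical PTSW export URL that returns one pinged document URL per line, and it uses an in-memory rdflib graph as a stand-in for a real triple store; the actual export endpoint and format may well differ, so treat this as an illustration of the workflow, not as the real API.

```python
import urllib.request
from rdflib import Graph

# Hypothetical export endpoint: assumed to return one RDF document URL per line.
PTSW_EXPORT = "http://pingthesemanticweb.com/export/"


def fetch_recent_pings():
    """Return the list of RDF document URLs recently pinged for a service."""
    with urllib.request.urlopen(PTSW_EXPORT) as response:
        return [line.strip().decode("utf-8") for line in response if line.strip()]


def update_store(store, urls):
    """Dereference each new or updated document and merge its triples into the store."""
    for url in urls:
        request = urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})
        with urllib.request.urlopen(request) as response:
            store.parse(data=response.read(), format="xml")


store = Graph()
# First run: bulk-load the RDF dump instead of dereferencing millions of URIs.
# store.parse("geonames-dump.rdf", format="xml")
# Subsequent runs: apply only the new and updated documents reported by PTSW.
update_store(store, fetch_recent_pings())
```

The point of the split is simply that the expensive full download happens once (from the dump), while the periodic updates only touch the handful of documents that actually changed.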
Technorati: Uri | rdf | dump | geonames | pingthesemanticweb | semantic | web |
marc
February 3, 2007 — 3:21 pm
Frédérick,
Not only is a dump preferable for huge datasets from a crawler’s point of view, it also eases the strain on the data provider’s resources. Fetching a database with millions of documents row by row requires a lot of resources to create and deliver the documents. A semantic web crawler may thus have the effect of a denial-of-service attack. More about a recent episode of a semantic web crawler DDoS on my blog:
http://geonames.wordpress.com/2007/02/03/friendly-fire-semantic-web-crawler-ddos/
Marc
Fred
February 4, 2007 — 1:20 pm
Hi Marc,
Yeah, I read the story. What I will do, as I said in my last blog post, is add a repository of available RDF data dumps to PTSW, hoping it can prevent such situations in the future. However, you did the only thing you could do: ban the IP from crawling Geonames. This is unfortunately the only thing that really works (semweb or not 😉 ).
Take care,
Fred