RDF dump vs. dereferenceable URIs

In a recent mail thread, someone asked about the best way to get RDF data from a source with more than a couple of thousand documents: an RDF dump or a list of dereferenceable URIs?

Neither is better than the other. Personally, I prefer to use both.

If we take the example of Geonames.org, getting all 6.4 million RDF documents from dereferenceable URIs would take weeks. On the other hand, keeping your triple store current from an RDF dump alone would force you to download and re-index the whole thing every month (or so), a task that takes a few days.
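To put a rough number on the "weeks" claim, here is a back-of-envelope sketch. The crawl rate is purely an assumption for illustration; the sustainable rate depends entirely on the server and on how polite the crawler is.

```python
# Rough estimate of dereferencing every Geonames document one by one,
# assuming a polite crawl rate of ~5 requests per second (assumption).
documents = 6_400_000
requests_per_second = 5
days = documents / requests_per_second / 86_400
print(round(days, 1))  # ~14.8 days of continuous crawling, i.e. roughly two weeks
```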

So what is the best way, then? Here is what I propose (and currently do):

In fact, the first time I indexed Geonames into a triple store, I requested an RDF dump from Marc. Then I asked him: would it be possible to ping Pingthesemanticweb.com each time a new document appears on Geonames, or each time an existing document is updated? Within a couple of hours he answered my mail, and Geonames was pinging PTSW.

So what does this mean? It means that I populated my triple store with Geonames data from an RDF dump the first time around, which saved me one to two weeks of work. Now I keep the triple store up to date via Pingthesemanticweb.com, which saves me two or three days each month.

So what I suggest is to use both methods. The important point here is that Pingthesemanticweb.com acts as an agent that sends you the new and updated files for a specific service (Geonames in the example above). This simple infrastructure could save precious time for many semantic web developers.
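As a rough illustration of this hybrid workflow, here is a minimal Python sketch using rdflib. The dump file name and the ping-export URL below are assumptions made for the example, and PTSW's actual export format would need proper parsing; the point is only to show the two steps: one bulk load, then small incremental updates.

```python
# Minimal sketch of the dump-then-ping workflow. The dump file name and the
# export URL are illustrative assumptions, not real endpoints.
import urllib.request
from rdflib import Graph

store = Graph()

# Step 1: one-time bulk load from the provider's RDF dump.
store.parse("geonames-dump.rdf", format="xml")

# Step 2: periodic incremental update. Ask the ping service which documents
# are new or updated, then dereference only those URIs.
PING_EXPORT_URL = "http://pingthesemanticweb.com/export/recent"  # hypothetical

def changed_uris(export_url):
    """Return document URIs reported as new or updated (one URI per line assumed)."""
    with urllib.request.urlopen(export_url) as resp:
        return [line.decode().strip() for line in resp if line.strip()]

for uri in changed_uris(PING_EXPORT_URL):
    # Naive: this adds the document's triples as-is; a real loader would first
    # drop the previously stored triples for that document before re-parsing it.
    store.parse(uri, format="xml")
```

The design point is that the ping service, not the crawler, decides which documents need to be re-fetched, so the data provider only has to serve the documents that actually changed.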


2 Responses to “RDF dump vs. dereferenceable URIs”


  1. marc, Feb 3rd, 2007 at 3:21 pm

    Frédérick,

    Not only is a dump preferable for huge datasets from a crawler’s point of view, it also eases the strain on the data provider’s resources. Fetching a database with millions of documents row by row requires a lot of resources to create and deliver the documents. A semantic web crawler may thus have the effect of a denial-of-service attack. More about a recent episode of a semantic web crawler DDoS on my blog:

    http://geonames.wordpress.com/2007/02/03/friendly-fire-semantic-web-crawler-ddos/

    Marc

  2. Fred, Feb 4th, 2007 at 1:20 pm

    Hi Marc,

    Yeah, I read the story. As I said in my last blog post, I will add a repository of available RDF data dumps to PTSW, hoping it can prevent such situations in the future. However, you did the only thing you could do: banning the IP from crawling Geonames. That is unfortunately the only thing that will really work (semweb or not ;) ).

    Take care,

    Fred
