In a recent mail thread, someone was asking what was the best way to get RDF data from a source [having more than a couple of thousands of documents]: a RDF dump or a list of dereferencable URIs?
None is better than the other. Personally what I prefer is to use both.
If we take the example of Geonames.org, getting all the 6.4 million of RDF documents from dereferencable URI would take weeks. However, updating your triple store with new or updated RDF documents with a RDF dump would force you to download and re-index it completely every month (or so). This task would take some days.
So what is best way then? There is what I propose (and currently do):
In fact, the first time I indexed Geonames into a triple store, I requested a RDF dump to Marc. Then I asked him: would it be possible for you to ping Pingthesemanticweb.com each time a new document appears on Geonames or each time a document is updated? In less than a couple of hour he answered to my mail and then Geonames was pinging PTSW.
So, what it means? It means that I populated my triple store with geonames with a RDF dump for the first time. By proceeding that way I saved one to two weeks of work. Then I am now updating the triple store via Pingthesemanticweb.com. By proceeding that way, I save 2 or 3 days each month.
So what I suggest is to use both methods. The important point here is that Pingthesemanticweb.com acts as an agent that send you new and updated files for a specific service (Geonames in the above example). This simple infrastructure could save precious time to many semantic web developers.