Integration of Zotero in a Semantic Web environment to find, search and browse the Web’s citations

Zotero is a great Firefox add-on that lets its users find, search, edit and save citations they come across on the Web while browsing it. All the power of Zotero resides in its “translation modules”. These modules detect citations in various types of web pages; when Zotero detects one, it notifies the user and gives them the opportunity to save it.

What interests me is that Zotero already uses some ontologies to export users’ citation libraries as RDF. When I noticed that, I started to wonder: what could we do with Zotero now?

The Zotero vision

Zotero is the best-integrated citation tool for the Web that I know of. A phenomenal number of citations can be discovered on the Web via the Zotero user community.

Remember what we did with the Semantic Radar a couple of months ago? This Firefox add-on detects SIOC RDF documents in web pages. I contacted Uldis Bojar to ask him to make it ping PingtheSemanticWeb.com each time a user detected an RDF file while browsing the Web. Now a good share of the RDF data pinged to PTSW comes from Semantic Radar users. This is a sort of “social semantic web discovery” technique.

What I would like to do is the same thing but for Zotero.

[Figure: the Zotero → PingtheSemanticWeb.com → Zitgist workflow (zotero-ptsw-zitgist.jpg)]

  1. Zotero users browse the Web, discover citations and save them into their personal libraries.
  2. Each time a Zotero instance discovers a citation, it would send the URL where that citation can be found to PingtheSemanticWeb.com.
    1. Note: users should be made aware of this functionality via an option in Zotero that explains what the feature is all about and gives them the possibility to disable it.
    2. Note: Zotero would ping PTSW each time it detects a citation (i.e., when the icon appears in Firefox’s URL bar), and not each time a user saves one.
  3. Via the Virtuoso Sponger, PingtheSemanticWeb.com will process each incoming URL from Zotero users to find citations as well. If a citation is found, it will be added to its list of known citations and its content archived.
  4. PingtheSemanticWeb.com will then send the new citations to Zitgist so that it can include them in its database.
    1. Note: here Zitgist could be replaced by any web service wanting the data. Remember that PTSW acts as a data multiplexer.
  5. Via Zitgist (a semantic web search engine), users from around the world will be able to search among these citations (discovered by Zotero users) and browse them.

Zitgist as a Zotero citation provider

What is fantastic here is that Zitgist becomes a source of citations. If a Zitgist user has Zotero installed, he will be able to batch-save the list of results returned by Zitgist; and when he browses Zitgist’s citations, he will be able to save them into his Zotero instance, just as if Zitgist were Amazon.com or any other citation website.

That way, the data found by Zotero users would become accessible to other Zotero users via Zitgist, which would then become a citation provider (mainly fed by the Zotero community).
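
One way a site like Zitgist could expose its citations to Zotero is by embedding COinS metadata in its pages, since Zotero detects COinS spans. This is my illustration of one possible mechanism, not something the post specifies; the sketch below (Python for brevity) builds such a span:

```python
# Sketch: exposing a citation to Zotero via a COinS span. This mechanism is
# my suggestion for illustration; the post does not specify how Zitgist
# would expose its citations.
import urllib.parse

def coins_span(title: str, author: str) -> str:
    # Build an OpenURL ContextObject and wrap it in the COinS span format.
    ctx = urllib.parse.urlencode({
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:book",
        "rft.btitle": title,
        "rft.au": author,
    })
    # In a real page the "&" characters would be escaped as "&amp;".
    return '<span class="Z3988" title="' + ctx + '"></span>'

print(coins_span("Example Book", "A. Author"))
```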

You see the interaction?

What has to be developed?

A few things have to be developed to make this vision work. There is no major development involved, only a couple of features to add to each system.

Integration of Ping the Semantic Web into Zotero

The integration of Ping the Semantic Web into Zotero is quite straightforward.

Pinging PingtheSemanticWeb.com via a web service

The first step is to make Zotero notify PTSW each time it comes across a citation: it has to send the URL of the page where the citation(s) can be found, via XML-RPC or REST.

That is it. Each time Zotero detects a citation, it sends a simple ping to PTSW via an XML-RPC or REST request.
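
Here is a minimal sketch of what such a REST ping could look like, written in Python rather than the add-on’s own JavaScript for brevity. The endpoint and the "url" parameter name are assumptions, not PTSW’s documented interface; the `enabled` flag mirrors the opt-out option discussed below:

```python
# Hypothetical sketch of the ping Zotero would send to PTSW. The endpoint
# and the "url" parameter are assumptions, not PTSW's documented interface.
import urllib.parse
import urllib.request

PTSW_ENDPOINT = "http://pingthesemanticweb.com/rest/"  # assumed endpoint

def ping_ptsw(citation_page_url: str, enabled: bool = True) -> bool:
    """Notify PTSW that a citation was detected on citation_page_url."""
    if not enabled:  # the user turned the feature off in the options
        return False
    query = urllib.parse.urlencode({"url": citation_page_url})
    with urllib.request.urlopen(PTSW_ENDPOINT + "?" + query, timeout=10) as resp:
        return resp.status == 200

# Example: a translator just detected a citation on this page.
ping_ptsw("http://www.amazon.com/gp/product/0123456789")
```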

Adding a pinging option to Zotero

Another thing the Zotero team would have to add to their add-on is an option giving users the possibility to disable that feature, in case they don’t want a notification to be sent to PTSW each time they come across a citation while browsing the Web.

Development of Zotero translators into Sponger Metadata Cartridges

The biggest development effort would be to convert the Zotero translators into Virtuoso Sponger Metadata Cartridges.

Right now, Metadata Cartridges exist for Google Base, Flickr, microformats (hReview, hCalendar, etc.), and so on. These cartridges are the same thing as Zotero translators, but for the Virtuoso Sponger. By developing these cartridges, everybody running Virtuoso would be able to see these citations (from Amazon, etc.) as RDF data (mapped using some ontologies).
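
To make the idea concrete, here is a minimal sketch of the extraction-to-RDF mapping step such a cartridge (or translator) performs. The field names, the use of rdflib, and the choice of Dublin Core and the Bibliographic Ontology are illustrative assumptions, not the Sponger’s actual API:

```python
# Minimal sketch of the extraction-to-RDF mapping a cartridge (or a Zotero
# translator) performs. The `fields` dict stands in for whatever a real
# translator scrapes from the page; the property choices are illustrative.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

BIBO = Namespace("http://purl.org/ontology/bibo/")  # one possible citation ontology

def citation_to_rdf(page_url: str, fields: dict) -> Graph:
    g = Graph()
    doc = URIRef(page_url)
    g.add((doc, RDF.type, BIBO.Document))
    g.add((doc, DC.title, Literal(fields["title"])))
    for creator in fields.get("creators", []):
        g.add((doc, DC.creator, Literal(creator)))
    return g

g = citation_to_rdf(
    "http://www.amazon.com/gp/product/0123456789",
    {"title": "Example Book", "creators": ["A. Author"]},
)
print(g.serialize(format="xml"))
```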

Documentation about how to develop these cartridges will be available in the coming days. From there, we would be able to set up an effort to convert the Zotero translators into Sponger Metadata Cartridges.

Conclusion

This is the vision I have of the integration of Zotero into the current Semantic Web environment. Any ideas, suggestions or collaboration propositions are warmly welcome.

Note: a discussion about this subject has started on Zotero’s web forum.

Distribution of semantic web data

Three days ago I talked about the importance of both RDF data dumps and dereferenceable URIs for distributing RDF data over the Web. However, yesterday Marc from Geonames.org ran into problems with an impolite semantic web crawler. In his article he points out that:

“It simply does not make sense to download a huge database record by record if a full dump is available.”

In an ideal world it wouldn’t make sense, but unfortunately this is how the Web has always worked. Think about Google, Yahoo! and MSN Search: fetching record by record is exactly what they do. The difference is that they are probably more polite. Marc did the only thing he had to do: ban the belligerent crawler.

The problem with data dumps is that they are generally not that easy to find (when available at all) on a service’s web site. So some developers won’t care to take the time to find them and will fetch everything from the Web server, page by page.

All that said, this story raises a question: how could we make these data dumps more visible? Ecademy.com uses a <link> element on its home page to point to a dump of the URLs of its FOAF profiles. However, if you don’t look at the HTML code of the page, you will never be aware of it. A first step would probably be to create a repository of these data dumps.
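
As an illustration, here is a small sketch of how a client could auto-discover such a <link> element. The type value checked is an assumption, since sites are not consistent about how they advertise dumps, which is exactly the visibility problem:

```python
# Sketch: auto-discovering an RDF data dump advertised via a <link> element
# on a site's home page. The type value checked is an assumption; sites are
# not consistent, which is exactly the visibility problem described above.
from html.parser import HTMLParser
import urllib.request

class LinkFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            a = dict(attrs)
            if a.get("type") == "application/rdf+xml":
                self.hrefs.append(a.get("href"))

html = urllib.request.urlopen("http://www.ecademy.com/").read().decode("utf-8", "replace")
finder = LinkFinder()
finder.feed(html)
print(finder.hrefs)  # URLs of advertised RDF resources, if any
```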

The SWEO Community Project started the “Linking Open Data on the Semantic Web” project, which is basically a list of RDF dumps from different web sites and projects.

Personally, what I will do to help people find these RDF dumps (and to make them aware of their existence) is to create a repository of them on Pingthesemanticweb.com (it should be available later this week).

That way, developers using Pingthesemanticweb.com will probably check that list and download the data they need. After that, they will only use PTSW to sync their triple store with the remote service’s database (Geonames.org, for example).


RDF dumps vs. dereferenceable URIs

In a recent mail thread, someone asked what the best way is to get RDF data from a source [with more than a couple of thousand documents]: an RDF dump or a list of dereferenceable URIs?

Neither is better than the other. Personally, I prefer to use both.

If we take the example of Geonames.org, getting all 6.4 million RDF documents from dereferenceable URIs would take weeks. On the other hand, keeping your triple store up to date with new or changed RDF documents from an RDF dump alone would force you to download and re-index the dump completely every month (or so), a task that would take days.

So what is the best way then? Here is what I propose (and currently do):

The first time I indexed Geonames into a triple store, I requested an RDF dump from Marc. Then I asked him: would it be possible for you to ping Pingthesemanticweb.com each time a new document appears on Geonames or an existing document is updated? In less than a couple of hours he answered my mail, and Geonames was pinging PTSW.

So what does this mean? It means that I populated my triple store with Geonames from an RDF dump the first time, saving one to two weeks of work. I now update the triple store via Pingthesemanticweb.com, which saves me two or three days each month.

So what I suggest is to use both methods. The important point here is that Pingthesemanticweb.com acts as an agent that sends you the new and updated files of a specific service (Geonames in the above example). This simple infrastructure could save precious time for many semantic web developers.
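
Here is a minimal sketch of this two-phase approach, with rdflib standing in for whatever triple store you use; the PTSW export URL and its `domain` parameter are assumptions for illustration:

```python
# Sketch of the two-phase approach: bulk-load a dump once, then keep the
# store in sync from PTSW pings. The export URL and parameter are assumed.
import urllib.request
from rdflib import Graph

store = Graph()

# Phase 1 (done once): load the full RDF dump obtained from the provider.
store.parse("geonames-dump.rdf")  # hypothetical local dump file

# Phase 2 (periodically): fetch only the documents PTSW reports as new or
# updated. A real system would replace each document's old triples instead
# of simply merging, e.g. by keeping one named graph per source document.
def sync_from_ptsw(ping_list_url: str) -> None:
    with urllib.request.urlopen(ping_list_url, timeout=30) as resp:
        urls = resp.read().decode("utf-8").split()
    for url in urls:
        store.parse(url)

sync_from_ptsw("http://pingthesemanticweb.com/export/?domain=geonames.org")
```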


First web service developed using Ping the Semantic Web data

A couple of days ago, the first web service (that I am aware of) developed using Ping the Semantic Web data was made public.

Doap:store is a search, browsing and visualization tool for DOAP (Description Of A Project) documents. It was developed by Alexandre Passant, one of the most active contributors to the SIOC ontology.

From Alex’s blog:

“Then, doap:store provides a common search engine and browsing interface for these decentralized project description, while authors keep control over their data. Data is updated each time PTSW has a new ping for it (in the future, PTSW should store new pings only if the document has changed, so updated will be made only for real document updates).”

That is it: a web service helping people find projects matching some criteria. At the time of writing, the web service focuses mainly on software projects and the programming languages they involve. If you look at the bottom of the page, you will see categories of projects grouped by the programming language(s) involved.

These categories are dynamically generated from the DOAP documents the service aggregates. So as soon as new documents from Ping the Semantic Web are aggregated by Doap:store, they affect this feature, according to the languages defined in the project descriptions.
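
Here is a minimal sketch of how such categories could be derived from aggregated DOAP documents, assuming rdflib and the standard DOAP namespace; the grouping logic is my guess at what Doap:store does, not its actual code:

```python
# Sketch: grouping aggregated DOAP documents by programming language, the
# way Doap:store's category feature appears to work. This guesses at the
# logic; it is not Doap:store's actual code.
from collections import defaultdict
from rdflib import Graph, Namespace

DOAP = Namespace("http://usefulinc.com/ns/doap#")

def categories(doap_urls):
    by_language = defaultdict(list)
    for url in doap_urls:
        g = Graph()
        g.parse(url)  # DOAP documents are plain RDF fetched by URL
        for project in g.subjects(DOAP.name, None):
            name = g.value(project, DOAP.name)
            for lang in g.objects(project, DOAP["programming-language"]):
                by_language[str(lang)].append(str(name))
    return dict(by_language)

# Example with a single document (the URL is illustrative).
print(categories(["http://usefulinc.com/doap/doap.rdf"]))
```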

Doap:store and Ping the Semantic Web

This new web service is the perfect example of the utility of Ping the Semantic Web. With the data in hand, a developer can create wonderful systems and user interfaces to manipulate, manage and search it.

In this example, Alex didn’t have to worry about where to find DOAP documents; he only had to retrieve the list of the latest created or updated DOAP documents from the Ping the Semantic Web service (he does that every hour).

Keeping control over one’s data

This is one of Alex’s most important observations. Such an information infrastructure lets users do whatever they want with the data they generate.

This is the idea I had in mind when I wrote a blog post called “Communities’ websites should use FOAF profiles to help users managing their online persona”, and this is the reason why I created an import/export feature for users’ profiles using FOAF documents: to give the power back to users.

Conclusion

Doap:store really shows how Ping the Semantic Web can be used by any developer. I hope other people will start using the service the same way Alex does. Ultimately, I hope that Ping the Semantic Web will become a vector of development for semantic web projects.


Ping the Semantic Web: a new ping export feature

The ping export feature of Ping the Semantic Web was a little bit messy and I was really not satisfied with it. So I took the time to rework it, and I think I came up with something much better (probably what people were expecting from the beginning).

The new way to request pings

The new way to request a list of pings from Ping the Semantic Web is quite simple: you start from the set of all pings received by the service so far, and you apply constraints on that set to get the subset of pings you really want for your application.

There are six different constraints you can apply:

  1. Constrain pings to a specific type of RDF document: SIOC, FOAF, DOAP, RDFS or OWL
  2. Constrain pings to a specific serialization language: XML or N3
  3. Constrain pings to a time frame: last hour, yesterday or any time
  4. Constrain pings to a number of results: 0 to x
  5. Constrain pings to a specific domain name; for example, getting all the pings from www.talkdigger.com
  6. Constrain pings to a specific namespace; for example, getting all the pings for documents that use the namespace “http://purl.org/dc/elements/1.1/”

This new method is much more powerful: you can easily get the specific subset of pings suited to the specialized needs of your web service or software agent.
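
For illustration, here is how a client could combine these constraints into a single export request. The endpoint path and the parameter names are assumptions, not PTSW’s documented interface:

```python
# Sketch: building a constrained pings-export request. The endpoint path and
# the parameter names are assumptions, not PTSW's documented interface.
import urllib.parse

def export_url(**constraints) -> str:
    base = "http://pingthesemanticweb.com/export/"  # assumed endpoint
    return base + "?" + urllib.parse.urlencode(constraints)

# All DOAP pings from the last hour, serialized as XML, at most 100 results:
print(export_url(type="DOAP", serialization="XML", period="last_hour", nb_results=100))
```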

The new way to handle namespaces

Reworking this feature led me to rework the way Ping the Semantic Web handles namespaces.

Now the service also aggregates all the namespaces used by the RDF documents it indexes.

This means two things:

  1. You can get the RDF documents that define or use a specialized namespace
  2. You can take a look at the list of namespaces known by Ping the Semantic Web

For the moment the service knows about 400 namespaces, but it is discovering new ones at a rapid pace.
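
As an illustration, the namespaces used by a document can be harvested by splitting each predicate URI at its last “#” or “/”. This sketch (using rdflib) guesses at the approach; PTSW’s actual implementation is not published:

```python
# Sketch: collecting the namespaces actually used by an RDF document, by
# splitting each predicate URI at its last '#' or '/'. This guesses at the
# approach; PTSW's real implementation is not published.
from rdflib import Graph

def used_namespaces(doc_url: str) -> set:
    g = Graph()
    g.parse(doc_url)
    namespaces = set()
    for _s, predicate, _o in g:
        uri = str(predicate)
        cut = max(uri.rfind("#"), uri.rfind("/")) + 1
        namespaces.add(uri[:cut])
    return namespaces

print(used_namespaces("http://usefulinc.com/doap/doap.rdf"))
```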

Conclusion

I am stabilizing the system right now, and the redevelopment of this feature resulted from that stabilization work. My updates are mostly finished, and soon enough a first version of a SPARQL endpoint (and user interface) should be publicly available.
