Frederick Giasson – Page 32 – Machine Learning, Engineering & Data

Content negotiation: bad use cases I recently observed

July 6, 2007 Frederick Giasson

In the projects I am currently working on, I see content negotiation misused daily, particularly the Accept and Content-type HTTP headers.

As you will see below, I came across many misuses of these HTTP headers, either because they are misunderstood or because people simply forget to set them properly when content is negotiated between their web servers and the applications requesting pages.

In any case, people should take greater care to set up content negotiation properly between their servers and other applications. I saw many examples, on many web servers, from semantic web research groups to hobbyists.

The principle

The principle is simple. If a requester sends an HTTP request with the Accept header:

Accept: text/html, application/rdf+xml

The web server should check the priority of the MIME types the requester asks for and send back the document type with the highest priority that it can serve, along with the Content-type of the document in the HTTP response headers.

The Content-type parameter is quite important: if a user application requests a list of 10 MIME types that all have the same priority, it needs to know which of them has been sent by the web server.
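
A minimal sketch of the client side of this check, in Python. The function names are hypothetical and only illustrate the idea: the client compares the Content-type it received with the list of types it asked for, ignoring case and parameters such as the charset.

```python
def media_type(header_value):
    """Return the bare media type of a header value, lowercased.

    Parameters such as "; charset=utf-8" or ";q=0.9" are dropped.
    """
    return header_value.split(";", 1)[0].strip().lower()


def served_type(requested, content_type):
    """Return which of the requested MIME types the server says it sent.

    Returns None when the declared type is not one the client asked for.
    """
    declared = media_type(content_type)
    wanted = {media_type(item) for item in requested}
    return declared if declared in wanted else None


requested = ["text/html", "application/rdf+xml"]
print(served_type(requested, "application/rdf+xml; charset=UTF-8"))
print(served_type(requested, "image/png"))
```

The first call reports application/rdf+xml; the second returns None, which tells the client that the server answered with something it never asked for.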

Trusting the Content-type parameter

It is hard.

In fact, Ping the Semantic Web does not trust the Content-type returned by any web server. This parameter is misused so often that it becomes useless. So I had to develop procedures to detect the type and the encoding of the files it crawls.

For example, people will sometimes return the MIME type TEXT/HTML when in fact the file is RDF/XML or RDF/N3; this is just one example among many others.
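
To give an idea of what such a detection procedure can look like, here is a rough sketch in Python. It is not the code used by Ping the Semantic Web; the function name and the heuristics are assumptions made for illustration, and they will misclassify unusual documents (for example RDF/XML that uses a prefix other than rdf).

```python
import re


def sniff_format(body):
    """Guess a document's format from its first bytes, ignoring Content-type.

    Returns "rdf+xml", "html", "n3" or "unknown".
    """
    head = body[:2048].decode("utf-8", errors="replace").lower()
    if "<rdf:rdf" in head:
        return "rdf+xml"
    if "<html" in head or "<!doctype html" in head:
        return "html"
    # N3 and Turtle documents usually start with @prefix declarations.
    if re.search(r"^\s*@prefix\s+\S*:\s*<", head, re.MULTILINE):
        return "n3"
    return "unknown"


print(sniff_format(b"@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"))
```

A crawler could compare the result with the declared Content-type and keep the sniffed value whenever the two disagree.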

The Q parameter

Another situation I came across recently was with the “priority” of each mime in an Accept parameter.

Ping the Semantic Web is sending this Accept parameter to any web server from which it receives a ping:

Accept: text/html, html/xml, application/rdf+xml;q=0.9, text/rdf+n3;q=0.9, application/turtle;q=0.9, application/rdf+n3;q=0.9, */*;q=0.8

The issue I came across is that one of the web servers was sending me an RDF/XML document for that Accept parameter string, even though it was able to send a TEXT/HTML document. In fact, as soon as the server read “application/rdf+xml” in the Accept parameter, it automatically sent an RDF document, even though that type has a lower priority than the TEXT/HTML document.

In fact, this Accept parameter means:

Send me text/html or html/xml if possible.

If not, then send me application/rdf+xml, text/rdf+n3, application/turtle or application/rdf+n3.

If not, then send me anything; I will try to do something with it.

It is really important to consider the Q parameter, or its absence, because both its presence and its absence carry meaning: a MIME type without a Q value has the default priority of 1.
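
The reading above can be sketched in Python as follows. This is a simplified illustration, not a complete implementation of HTTP content negotiation: the function names are hypothetical, partial wildcards such as text/* are not handled, and the list of types the server can produce is an assumption of the example.

```python
def parse_accept(header):
    """Parse an Accept header into (media_type, q) pairs, highest q first.

    A media type without a q parameter gets the default weight 1.0.
    The sort is stable, so ties keep the order used in the header.
    """
    entries = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for field in fields[1:]:
            name, _, value = field.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unreadable weight as "not acceptable"
        entries.append((fields[0].lower(), q))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose(header, available):
    """Pick the type from `available` that the client weights highest."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type == "*/*" and available:
            return available[0]
        if media_type in available:
            return media_type
    return None


PTSW_ACCEPT = (
    "text/html, html/xml, application/rdf+xml;q=0.9, text/rdf+n3;q=0.9, "
    "application/turtle;q=0.9, application/rdf+n3;q=0.9, */*;q=0.8"
)
print(choose(PTSW_ACCEPT, ["application/rdf+xml", "text/html"]))
```

A server that can produce both RDF/XML and HTML picks text/html here, because it carries the implicit weight of 1 while application/rdf+xml is only weighted 0.9.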

Discrimination of software User-agents

Recently I faced a new kind of cyber-discrimination: discrimination based on the User-agent string of an HTTP request. Even though I was sending “Accept: application/rdf+xml”, I was receiving an HTML document. So I contacted the administrator of the web server, and he pointed me to an example available on the W3C’s web site, in Best Practice Recipes for Publishing RDF Vocabularies, which explained why he had done that:

# Rewrite rule to serve HTML content from the namespace URI if requested
RewriteCond %{HTTP_ACCEPT} text/html [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*
RewriteRule ^example5/$ example5-content/2005-10-31-docs/index.html [R=303]

# Rewrite rule to serve HTML content from class or prop URIs if requested
RewriteCond %{HTTP_ACCEPT} text/html [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*
RewriteRule ^example5/(.+) example5-content/2005-10-31-docs/$1.html [R=303]

So, in the .htaccess file published in that document, we can see that if the User-agent string starts with “Mozilla”, the server will send an HTML document.

However, the same document says:

Note that, however, with RDF as the default response, a ‘hack’ has to be included in the rewrite directives to ensure the URIs remain ‘clickable’ in Internet Explorer 6, due to the peculiar ‘Accept:’ header field values sent by IE6. This ‘hack’ consists of a rewrite condition based on the value of the ‘User-agent:’ header field. Performing content negotiation based on the value of the ‘User-agent:’ header field is not generally considered good practice.

So no, it is not good practice, and people should really be careful about this.
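
The effect of that rule on a crawler can be illustrated with a small Python sketch that mirrors the three OR-ed RewriteCond lines of the recipe. The function names and the crawler's User-agent string are hypothetical; the point is only to show that a client whose User-agent starts with “Mozilla/” gets HTML whatever its Accept header says.

```python
import re


def serves_html(accept, user_agent):
    """Mirror the three OR-ed RewriteCond lines of the recipe.

    Returns True when the rule would redirect the client to the HTML page.
    """
    return (
        "text/html" in accept
        or "application/xhtml+xml" in accept
        or re.match(r"Mozilla/.*", user_agent) is not None
    )


def serves_html_accept_only(accept):
    """Same decision without the User-agent condition."""
    return "text/html" in accept or "application/xhtml+xml" in accept


# Hypothetical crawler that identifies itself with a Mozilla-compatible string.
crawler_agent = "Mozilla/5.0 (compatible; ExampleBot/1.0)"
print(serves_html("application/rdf+xml", crawler_agent))
print(serves_html_accept_only("application/rdf+xml"))
```

With the User-agent condition the crawler is sent to the HTML page even though it only asked for RDF/XML; without it, the Accept header alone decides.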

Conclusion

People should really pay attention to the Accept parameter when their server receives a request, and send back the correct Content-type for the document they return to the requester. Content negotiation is becoming the main way to find and access RDF data on the Web, so these behaviors should be fixed by web server administrators and developers.

Zitgist’s RDF Browser: Browse the Semantic Web

June 20, 2007 Frederick Giasson

I am pleased to announce the pre-release of the Zitgist RDF Browser. This new tool from Zitgist will help users browse the information available on the Semantic Web. As you will see below, this tool is a sort of information shape-shifter. Depending on the data available for a given Thing (a resource), it will shape its user interface so that the data is displayed in the way that best conveys its semantics and gives the best browsing experience.

This pre-release version is usable by anybody; however, I would appreciate it if you reported any bugs, issues or suggestions to me so that I can enhance the browser to meet people’s expectations.

Introducing Zitgist’s RDF Browser

The Templating system

The core of this new RDF browser is its templating system. This system will enhance the RDF browsing experience of users along with their understanding of the information displayed to them. People can see it as a typical web browser such as Internet Explorer or Firefox, but instead of reading and displaying HTML, it displays RDF data. Users only have to enter the URI of a resource (it can be a URL where the browser can find RDF information about this Thing), then press the “browse” button.

Then, depending on the information available about this Thing, the RDF browser will shape its interface to optimize users’ browsing experience with the data.

Sources of data

Data displayed in the Zitgist RDF Browser can come from many different data sources:

Zitgist’s internal RDF datastore
URI dereferencing
On-the-fly conversion of data sources such as:
- Microformats
- RDFa
- eRDF
- HTML meta tags
- API data sources such as: Amazon.com, Google Base, etc.

So, depending on what information is available for a given URI, the browser will mash up these data sources and display the information to the user.

First example of the templating system

This first example shows how the browser will create a web page out of an RDF data source. In this case, the data source is a URI where Madonna’s album “Confessions on a Dance Floor” is described.

The browser will check for that URI: http://zitgist.com/music/record/d7929b28-5812-4b8f-a99f-1800983c71fb
No information is available in its data store, so it will dereference the URI to get the RDF triples describing the album.
All in all, 15 different URIs will be dereferenced to create the web page.
The browser will detect that the type of the entity related to this URI is a mo:Album, so it will trigger the “moAlbum” template to skin the data source so that the user can easily see and understand the information we have about this resource (a music album).
Then the skinned information is displayed to the user.
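
The type-detection step above can be sketched in a few lines of Python. This is not the browser's actual code: the registry, the function name and the “generic” fallback are assumptions for illustration; only the template names moAlbum, foafPerson and foafGroup come from this post.

```python
RDF_TYPE = "rdf:type"

# Hypothetical registry mapping an RDF class to the template that skins it.
TEMPLATES = {
    "mo:Album": "moAlbum",
    "foaf:Person": "foafPerson",
    "foaf:Group": "foafGroup",
}


def pick_template(triples, subject, default="generic"):
    """Return the template for the first known rdf:type of `subject`."""
    for s, p, o in triples:
        if s == subject and p == RDF_TYPE and o in TEMPLATES:
            return TEMPLATES[o]
    return default


album = "http://zitgist.com/music/record/d7929b28-5812-4b8f-a99f-1800983c71fb"
triples = [
    (album, RDF_TYPE, "mo:Album"),
    (album, "dc:title", "Confessions on a Dance Floor"),
]
print(pick_template(triples, album))
```

A resource without any known type falls back to the default template, so every dereferenced URI can still be displayed.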

The templating system in action

Now we will see the templating system in action. In fact, the RDF browser does much more than skinning a single data source.

If you put that URI in the browser, you will see Sebastian’s profile. The browser will fire the foafPerson template, and his profile will be skinned according to this template.

However, what is interesting in that example is not only Sebastian’s profile, but the entities it links to. In fact, if you take a closer look and go down the page a little bit, you will notice the “Current projects” section of his profile. Then you will see a list of projects.

The first project is a musical group described as a foaf:Group. So, the browser will check the URI that Sebastian’s profile links to, get information about it, skin it according to the foafGroup template, and embed the result within Sebastian’s profile page.

Since we could embed such entities ad infinitum, the browser restricts this automatic browsing to three levels deep in the graph.
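
Such a depth limit can be sketched in Python as a recursive walk that stops at a fixed level. The names below are hypothetical, and the sketch assumes that the resource being browsed counts as level 1, which the post does not specify.

```python
MAX_DEPTH = 3  # assumption: the starting resource is level 1


def embed(uri, links, depth=1, seen=None):
    """Return a nested (uri, children) tree limited to MAX_DEPTH levels.

    `links` maps each URI to the URIs it refers to. URIs that were already
    embedded are skipped so that cycles in the graph cannot loop forever.
    """
    seen = set() if seen is None else seen
    seen.add(uri)
    children = []
    if depth < MAX_DEPTH:
        for target in links.get(uri, []):
            if target not in seen:
                children.append(embed(target, links, depth + 1, seen))
    return (uri, children)


links = {"profile": ["group"], "group": ["member"], "member": ["homepage"]}
print(embed("profile", links))
# -> ('profile', [('group', [('member', [])])])
```

The fourth resource in the chain is never fetched; the user would have to look it up explicitly to see it.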

Finally, we can “lookup” an individual embedded item by clicking on the lookup icon at the upper right corner of each entity.

Sidebar Navigator

In some cases the generated web pages can be quite large, so a navigation widget has been developed to help users navigate generated documents. The navigation of a document is based on the entities displayed in it.

For example, if we run the Zitgist RDF Browser for that URI: http://www.macosxhints.com, we will notice that the information displayed is many pages long. So, to help us navigate this long document, we will use the entity navigator widget.

All the types available in that web page are listed in the sidebar, and for each type you have all the instances available.

In that example, you can easily browse the web feed of that web page. In a click, you can see all Posts, Feeds and Authors.

Interesting examples

Here is a list of starting points to see the Zitgist RDF Browser in action:

http://www.macosxhints.com/
- Browsing a web feed converted into RDF.

http://swaml.berlios.de/doap.rdf
- The generic template used to display the description of a doap:Project

http://homepages.cwi.nl/~ivan/AboutMe/CV/publist.rdf
- Ivan Herman’s list of publications.

http://b4mad.net/2006/05/30/googlegroups-sioc-dev.rdf
- Google group described using SIOC.

http://iswc2004.semanticweb.org/posters/metadata.rdf
- Poster abstracts of the ISWC2004 conference.

And all the examples above.

Bookmarklet

The Zitgist RDF Browser can process any URI. So, from any web page, a user can launch the browser to see what semantic web information is available for that URI. Then, all the information the browser can find/generate out of that data source will be displayed to the user.

To help users, I developed this really simple bookmarklet that gets the URI of the current web page, sends it to the browser, and then redirects the user to the browser’s generated page.

Zitgist RDF Browser’s Bookmarklet

Conclusion

As you noticed above, this new RDF browser is a sort of information shape-shifter. Depending on the information available for a given URI, it will skin it to make it easier to browse and understand for users.

My Personal Library and the Semantic Web

June 4, 2007 Frederick Giasson

For the last couple of years, I kept reminding myself to put all the titles, authors and ISBNs of my library in a database so that I could say to my insurance company: here is the list of books I had before this thing happened in my apartment and destroyed all of them. Considering the time it would have taken, I always put this work off.

Then recently this thought started to haunt me again, so I asked my girlfriend: would you like to do this for me before we move to the other apartment? So naturally, she said yes 🙂 So I explained to her how we would proceed to save some time and to archive as much information as possible about these books.

I told her: “take this laptop and open Firefox. You notice the small “Z” icon in the lower right corner of the screen? This is Zotero; this software will save you a lot of time in getting the work done.” Naturally, she was dubious.

So I told her to go to Amazon.com, to get the ISBN of each book, to get to the Amazon web page of the book, and to click on the Zotero icon to save the information about the book. So, in 2 clicks, we were saving all the information describing each book: its title, its authors, its ISBN, its publisher, etc. It was taking about 30 seconds per book.

With that procedure, we archived all the information about the books in my library in about 5 hours.

Then I told myself: fantastic, now I even have all this information in RDF, thanks to Zotero! I had to do something with that, so I loaded the Zotero RDF export file into a Virtuoso triple store. In less than a minute I had all the information about my books inside a triple store, ready to be queried with SPARQL.

Querying and browsing my book library

Here is the list of the books in my library and their authors. You will notice a “book_uri” and an “author_uri” column. By using these columns, you can browse information about each book and each author in the list. All you have to do is click on the “link”; a contextual window will appear, and then you click on the “Explore” link. That way, you can browse information about each resource. If you want to come back to the previous window (the results), you only have to click on the “small left arrow” at the top of the page.
Here is the list of books and their publishers.
My library is composed of 115902 pages. This query is possible thanks to the aggregate capabilities of Virtuoso’s SPARQL parser.

Linking authors’ quotes to each book of the library

Then I wanted to know what the authors who wrote the books in my library have said. So I took the QuotationsBook.com quotes database and linked it to the information I have about my library.

That is why you can read some quotes of Nietzsche at that web page. (Note that I created the totally arbitrary “foaf:quote” property to attach the quotes to Zotero’s author resource.)

Getting more information about the authors

Then, I needed to get more information about the authors I read. To get that additional information, I linked the information I have about the authors in the library with the DBpedia database (an RDF version of Wikipedia).

The result is quite impressive. Go to the books/authors page. Then, click on Nietzsche’s URI (rdf:#$kajXe; it’s the first line in the result table). Then click on the “Explore” link once the contextual window appears. From there, you will see a “sameAs” property, so click on the http://dbpedia.org… link. Then click on the “Get Data Set (Dereference)” link once the contextual window appears. That is it; you get all the information, available on Wikipedia, related to this author in my library.

Then, I know that Nietzsche was born in 1844 and died in 1900, etc. So now, I can browse this new and enhanced dataset to learn facts about authors I read in the past.

The idea here is to say that the author described by Zotero (in the RDF export file created by the software) is the same as the one in DBpedia. Knowing that the entity defined in Zotero is the same as the one defined in DBpedia tells us that the facts about the former and the latter are true for both entities (because in reality both entities, although they have different URIs, are the same).
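
This “same as” reasoning can be sketched in Python without any triple store: every URI in a group of equivalent URIs ends up with the union of the facts known about each of them. The function, the property names and the DBpedia identifier below are placeholders for illustration; a real store such as Virtuoso handles this internally and differently.

```python
def merge_same_as(facts, same_as):
    """Share facts between URIs declared equivalent.

    facts:   dict mapping a URI to a set of (property, value) pairs
    same_as: list of (uri, uri) pairs declared to be the same entity
    Returns a dict mapping every URI to the union of its group's facts.
    """
    parent = {}

    def find(uri):
        # Small union-find: follow parents up to the group representative.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    for left, right in same_as:
        parent[find(left)] = find(right)

    merged = {}
    for uri, properties in facts.items():
        merged.setdefault(find(uri), set()).update(properties)
    return {uri: merged.get(find(uri), set())
            for uri in parent.keys() | facts.keys()}


zotero_author = "rdf:#$kajXe"
dbpedia_author = "example:dbpedia-nietzsche"  # placeholder, not the real URI
facts = {
    zotero_author: {("name", "Nietzsche")},
    dbpedia_author: {("birthYear", 1844), ("deathYear", 1900)},
}
result = merge_same_as(facts, [(zotero_author, dbpedia_author)])
print(sorted(result[zotero_author], key=str))
```

After the merge, the birth and death years coming from the second dataset can be read directly from the Zotero author resource.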

Geographical data

From there, we can think about integrating the current data with any other type of datasets. One of them could be a geographical dataset such as Geonames.org.

In fact, we know that Nietzsche died in the city of Weimar. So, if we link the Geonames dataset with the DBpedia dataset (this is supposed to be done already, but it seems that some things changed in the DBpedia dataset and that the links are no longer available; anyway, it can easily be done), we could have much more information about the place where Nietzsche died.

Conclusion

So, as you can see, in a couple of hours, I have been able to digitize my library. Then, I have been able to get quotes by the authors of each of my books. Then, I have also been able to get more information about each author I read.

This is really fantastic. That way, I only have to browse this new dataset to find new facts about authors and books that I didn’t know before, and that would have taken me days to find (for my entire library). Thanks to the semantic web, everything has been possible in only a couple of hours.

We could push the experiment even further and display on a map where the authors of my books were born. That way, I could find where most of the authors I have read in my life were born. Do I mostly read books written in Europe, the United States or Canada? Where did part of my knowledge come from? Which parts of the world have influenced me? Etc.

The Music Data Space

May 24, 2007 (updated June 4, 2007) Frederick Giasson

Kingsley has been talking about Data Spaces for a long time. But what is a Data Space? Nothing is better than an example to understand something, so I will try to explain it with a single data space that has just been created, the Music Data Space:

This is the Music Data Space. This Data Space contains information about musical things. These things are described mainly by using the Music Ontology, but also by using other ontologies like FOAF. Finally, things (musical things) belonging to this space are accessible, on the Web, via dereferenceable URIs.

So, the Music Data Space is a place where all musical things are defined on the Semantic Web, and accessible via the Web.

That is it, and it is what we created last Monday.

Now, some of you could wonder: why on earth does Amazon.com belong to the Music Data Space?

Amazon.com belongs to the Music Data Space too!

Amazon.com lives in the Music Data Space too, via its API. In fact, a simple experiment with the OpenLink RDF Browser clearly demonstrates that Amazon.com’s data belongs to the Music Data Space.

Open the RDF Browser by following that link

Now you will visualize RDF information about an album called “Chore of Enchantment”. Take a look at this line:

amazon_asin: http://amazon.com/exec/obidos/ASIN/B00003XAA7/searchcom07-20

Click on the link to Amazon. A window should popup. Select the Get Data Set (dereference) option.

At this point, some magic will happen. In fact, the new information that is displayed in the RDF Browser is coming directly from Amazon.com’s web server.

This is why I say that Amazon.com belongs to the Music Data Space too.

In fact, the Virtuoso Sponger will connect to Amazon.com via their API to get some information about that album. It will convert the data into RDF and will display it to the user via the RDF browser’s interface.
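
To give a feel for what “converting the data into RDF” means, here is a very small Python sketch that turns a flat record into N-Triples lines. It says nothing about how the Virtuoso Sponger is actually implemented: the function name, the record fields and the example.org namespace are assumptions for illustration; only the album title, the ASIN and the Amazon URL come from this post.

```python
def rdfize(subject, record, namespace):
    """Turn a flat dict into N-Triples lines with plain literal objects."""
    lines = []
    for key, value in sorted(record.items()):
        literal = (
            str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'<{subject}> <{namespace}{key}> "{literal}" .')
    return lines


subject = "http://amazon.com/exec/obidos/ASIN/B00003XAA7/searchcom07-20"
record = {"title": "Chore of Enchantment", "asin": "B00003XAA7"}
for line in rdfize(subject, record, "http://example.org/amazon#"):
    print(line)
```

Each field of the record becomes one triple about the album URI, which is what lets a generic RDF browser display data that was never published as RDF in the first place.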

One step further: the JPG file also belongs to the Music Data Space!

Yes! Information about the JPG file, hosted on Amazon.com’s web servers, also belongs to the Music Data Space, and here is the proof:

Open that same RDF Browser page by following that link

Click on the Image (JPG) representing the cover of this album. A window should popup. Select the Get Data Set (dereference) option.

Check the triples that have been created from this image. The Virtuoso Sponger downloaded the JPG file, analyzed its header, RDFized everything and sent the information back to the RDF Browser so that the user can see the information available for that image.

Where is the end? I have no idea… probably at the same place where the imagination ends too.

Unifying everything

It is that simple. All data sources (relational databases, remote data accessible via APIs, native RDF data, etc.) are unified via the Music Data Space. And this Music Data Space is accessible, via URI dereferencing, at http://zitgist.com/music/

Other Data Spaces available

Conclusion

The Music Data Space is the starting point, and many other types of data spaces should emerge soon.

Browsing Musicbrainz’s dataset via URI dereferencing

May 22, 2007 (updated June 4, 2007) Frederick Giasson

Musicbrainz’s dataset can finally be browsed, node-by-node, using URI dereferencing.

What does this mean?

Since the Musicbrainz relational database has been converted into RDF using the Music Ontology, all relations existing between Musicbrainz entities (an entity can be a Music Artist, a Band, an Album, a Track, etc.) form a graph of musical relations. Each node of the graph is a resource and each arc is a property between two resources. Welcome to the world of RDF.

This means that from a resource “Madonna” we can browse the musical relations graph to find other entities such as Records, People, Bands, etc.

Kingsley, inspired by Diana Ross, said: “URI Everything, and Everything is Cool!”

This is cool! Now Diana Ross has her own URI on the semantic web: http://zitgist.com/music/artist/60d41417-feda-4734-bbbf-7dcc30e08a83

Paul McCartney:
http://zitgist.com/music/artist/ba550d0e-adac-4864-b88b-407cab5e76af

The Beatles:
http://zitgist.com/music/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d

Madonna:
http://zitgist.com/music/artist/79239441-bfd5-4981-a70c-55c3f15c1287

Have their own too!
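
Dereferencing one of these URIs from a program boils down to an HTTP GET with an Accept header asking for RDF. A minimal sketch with Python's standard library follows; the function names are hypothetical, and nothing here assumes that the server still answers at these addresses.

```python
import urllib.request

RDF_XML = "application/rdf+xml"


def build_request(uri):
    """Prepare a GET request that asks the server for RDF/XML."""
    return urllib.request.Request(uri, headers={"Accept": RDF_XML})


def dereference(uri, timeout=10):
    """Fetch the URI and return (declared Content-type, raw body).

    Performs a real network call, so it is not executed in this example.
    """
    with urllib.request.urlopen(build_request(uri), timeout=timeout) as response:
        return response.headers.get("Content-Type"), response.read()


madonna = "http://zitgist.com/music/artist/79239441-bfd5-4981-a70c-55c3f15c1287"
request = build_request(madonna)
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Each URI found in the returned RDF can be dereferenced the same way, which is what browsing the graph node by node amounts to.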

URIs for Musical Things

These URIs are not only used to refer to Musicbrainz entities. In fact, these URIs are used to refer to any musical entity that you can describe using the Music Ontology. In the near future, the Musicbrainz data will be integrated along with data from Jamendo and Magnatune. Later, we will be able to integrate any sort of musical data in the same place (radio station data, relations between users’ FOAF profiles and musical things, etc.). So from a single source (http://zitgist.com/music/) all these different sources of musical data will be queryable at once.

URI schemes

The URI schemes are defined in the Musicbrainz Virtuoso RDF View:

http://zitgist.com/music/artist/*******
http://zitgist.com/music/artist/birth/*******
http://zitgist.com/music/artist/death/*******
http://zitgist.com/music/artist/simlink/*******
http://zitgist.com/music/record/*******
http://zitgist.com/music/performance/*******
http://zitgist.com/music/composition/*******
http://zitgist.com/music/musicalwork/*******
http://zitgist.com/music/sound/*******
http://zitgist.com/music/recording/*******
http://zitgist.com/music/signal/*******
http://zitgist.com/music/track/*******
http://zitgist.com/music/track/duration/*******

Each term in these URI schemes refers to the description of the corresponding Music Ontology class.
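
Given the schemes listed above, a client can split a URI into its kind and its identifier. The Python sketch below is only an illustration: the function name is hypothetical, and it assumes that the identifier is whatever follows the scheme prefix.

```python
BASE = "http://zitgist.com/music/"

# Longer prefixes come before the shorter ones they start with, so that
# "artist/birth/..." is not mistaken for a plain "artist/..." URI.
KINDS = [
    "artist/birth", "artist/death", "artist/simlink", "artist",
    "record", "performance", "composition", "musicalwork", "sound",
    "recording", "signal", "track/duration", "track",
]


def parse_music_uri(uri):
    """Return (kind, identifier) for a Music Data Space URI, else None."""
    if not uri.startswith(BASE):
        return None
    path = uri[len(BASE):]
    for kind in KINDS:
        prefix = kind + "/"
        if path.startswith(prefix) and path[len(prefix):]:
            return kind, path[len(prefix):]
    return None


print(parse_music_uri(
    "http://zitgist.com/music/artist/79239441-bfd5-4981-a70c-55c3f15c1287"
))
# -> ('artist', '79239441-bfd5-4981-a70c-55c3f15c1287')
```

The kind can then be used to decide how the resource should be treated, for example which class description to expect when the URI is dereferenced.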

Conclusion

I am getting closer and closer to the first goal I set for myself when I first started to write the Music Ontology. This first goal was to make the Musicbrainz relational database available in RDF on the Web. Months later, and with the help of the Music Ontology community (especially Yves Raimond, who worked tirelessly on the project) and the OpenLink Software Inc. team, we have finally made this data available through URI dereferencing.

From there, we will build new music services, integrate more musical datasets into the Music Data Space, etc. It is just the beginning of something much bigger.