Given the current projects I am working on, I daily see misuse of content-negotiation methodology, particularly the misuse of the Accept and Content-type HTTP header parameters.
As you will see bellow, I came across many misuse of these HTTP header parameters: potentially by their misunderstanding, or simply by forgetting to set them properly when content is negotiated between their web servers and applications requesting pages.
In any way, people should take a greater care about setting the content-negotiation properly between their servers and other applications. In fact, I saw many examples, on many web servers: from the semantic web research groups, to the hobbyists.
The principle
The principle is simple, if a requester sends a HTTP query with the Accept header:
Accept: text/html, application/rdf+xml
The web server should check the priority of the mime types that the requester is requesting and send back the requested document type with the greater priority, along with the Content-type of the document in the HTTP header answer.
The Content-type parameter is quite important, since if a user application request a list of 10 mimes having all the same priority, it should know which of them as been sent by the web server.
Trusting the Content-type parameter
It is hard.
In fact, Ping the Semantic Web do not trust any web server that returns Content-type. This parameter is so misused that it makes it useless. So I had to develop procedures to detect the type and the encoding of files it crawls.
For example, sometime, people will return the mime TEXT/HTML when it facts it’s a RDF/XML or a RDF/N3 file; this is just one example among many others.
The Q parameter
Another situation I came across recently was with the “priority” of each mime in an Accept parameter.
Ping the Semantic Web is sending this Accept parameter to any web server from which it receives a ping:
Accept: text/html, html/xml, application/rdf+xml;q=0.9, text/rdf+n3;q=0.9, application/turtle;q=0.9, application/rdf+n3;q=0.9, */*;q=0.8
The issue I came across is that one of the web servers was sending me a RDF/XML document for that Accept parameter string, even if it was able to send a TEXT/HTML document. In fact, if the server was reading “application/rdf+xml” in the Accept parameter, it was automatically sending a RDF document to it, even if it has a “lesser priority” than theTEST/HTML document.
In fact, this Accept parameter means:
Send me text/html or html/xml is possible.
If not, then send me application/rdf+xml, text/rdf+n3, application/turtle or application/rdf+n3.
If not, then send me anything; I will try to do something with it.
This is really important to consider the Q (or absence of the Q) parameter. Because its presence, or its non-presence, mean much.
Discrimination of software User-agents
Recently I faced a new kind of cyber-discrimination: discrimination based on the User-agent string of a HTTP request. In fact, even if I was sending “Accept: application/rdf+xml”, I was receiving a HTML document. So I contacted the administrator of the web server and he pointed me out to an example available on the W3C’s web site, called Best Practice Recipes for Publishing RDF Vocabularies, which explained why he has done that:
# Rewrite rule to serve HTML content from the namespace URI if requested
RewriteCond %{HTTP_ACCEPT} text/html [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*
RewriteRule ^example5/$ example5-content/2005-10-31-docs/index.html [R=303]# Rewrite rule to serve HTML content from class or prop URIs if requested
RewriteCond %{HTTP_ACCEPT} text/html [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*
RewriteRule ^example5/(.+) example5-content/2005-10-31-docs/$1.html [R=303]
So, in the .htaccess file they published in the article, we can see that if the user agent is “Mozilla” it will send a HTML document.
However, it is wrote in the same document:
Note that, however, with RDF as the default response, a ‘hack’ has to be included in the rewrite directives to ensure the URIs remain ‘clickable’ in Internet Explorer 6, due to the peculiar ‘Accept:’ header field values sent by IE6. This ‘hack’ consists of a rewrite condition based on the value of the ‘User-agent:’ header field. Performing content negotiation based on the value of the ‘User-agent:’ header field is not generally considered good practice.
So no it is not a good practice, and people should really take care about this.
Conclusion
People should really take care of the Accept parameter when their server receive a request, and send back the good Content-type depending on the document they send to the requester. Content-negotiation is becoming the main way to find and access RDF data on the Web, and such behaviors should be fixed by web server administrators and developers.