Zitgist Browser’s server stabilized

Five weeks ago I introduced the Zitgist Browser on this blog. At that time, I talked about a pre-release of the service. Those two little words go a long way toward explaining what happened in the weeks that followed.

In fact, some of you probably noticed that the Zitgist Browser was down half of the time for a couple of weeks. We found issues at many levels that rendered the browser’s server unstable. Over the last few weeks, we performed a battery of tests to fix every issue that appeared. Now, about three weeks later, the server is stable again; at least, it has been online for the last couple of days without any issues.

Thanks to the OpenLink Software Inc. development team, we have been able to stabilize the service; it wouldn’t have been possible without their help and expertise.

Finally, stay tuned for the next release of this service (more information about the next version in the next blog post). In the meantime, continue to use it and report any issues you encounter while browsing the semantic web. My apologies for any frustration you may have experienced while using the unstable version of the service.

UMBEL: Upper-level Mapping and Binding Exchange Layer

[Image: umbel_medium.png]

Mike Bergman released the first draft of his UMBEL ontology. Some other people and I helped him come up with this new ontology.

What is UMBEL? UMBEL is a lightweight subject reference structure. People can see it as a pool of subjects. Subjects are related together at the synonymy level, so subjects with related meanings will be bound together.

The objectives

The objectives of this new ontology are:

  • A reference umbrella subject binding ontology, with its own pool of high-level binding subjects
  • Lightweight mechanisms for binding subject-specific ontologies to this structure
  • A standard listing of subjects that can be referenced by resources described by other ontologies (e.g., dc:subject)
  • Provision of a registration and look-up service for finding appropriate subject ontologies
  • Identification of existing data sets for high-level subject extraction
  • Codification of high-level subject structure extraction techniques
  • Identification and collation of tools to work with this subject structure, and
  • A public Web site for related information, collaboration and project coordination.

Main applications

Given these objectives, I see a few main applications where such an ontology could be used:

  • Helping systems find data sources for a given ontology. UMBEL is much more than a subject structure: it will bind subjects to related ontologies, and those ontologies to data sources. So, for a given subject, people will be able to find related ontologies, and then related data sources.
  • Acting as a subject reference backbone. It could be used by people to link resources, using dc:subject, to their subject resources (the UMBEL subject proxy resources); see the sketch after this list.
  • Helping user interfaces handle subject (keyword) references to find related ontologies (that have the power to describe these subjects).
  • Eventually, it should be used by PingtheSemanticWeb to bind pinged data to the subject reference structure.
  • And probably many others.
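As a purely hypothetical sketch of the subject-backbone application: a resource could be bound to an UMBEL subject proxy through dc:subject. The umbel: namespace and the subject URI below are made up, since the subject pool does not exist yet:

@prefix dc:    <http://purl.org/dc/elements/1.1/> .
# Hypothetical namespace: the real UMBEL subject proxy URIs are not yet defined.
@prefix umbel: <http://umbel.org/umbel/subject/> .

# Bind an article to the "Astronomy" subject proxy.
<http://example.org/articles/my-article>
    dc:subject umbel:Astronomy .

From such a statement, a system could then follow the subject proxy to the ontologies and data sources bound to it.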

Creation of the Ontology

A procedure will be created to automatically generate the ontology. The general idea is to reuse existing knowledge bases to derive the set of subjects, and their relationships, that will make up the ontology. So, the idea is to come up with a representative set of subjects, neither too general nor too specialized. For that, we will work with knowledge bases such as WordNet, Wikipedia, Dmoz, etc. We will try to find out how we could prune unnecessary subjects out of them, how we could create such a subject reference framework by looking at the intersection of each data set, and so on. The procedure is not yet developed, but this is what the first experiments will look like.

As explained in the draft:

The acceptance of the actual subjects and their structure is one key to the acceptance — and thus use and usefulness — of the UMBEL ontology. (The other key is simplicity and ease-of-use of tools.) A suitable subject structure must be adaptable and self-defining. It should reflect expressions of actual social usage and practice, which of course changes over time as knowledge increases and technologies evolve.

A premise of the UMBEL project is that suitable subject content and structures already exist within widely embraced knowledge bases. A further premise is that the ongoing use of these popular knowledge bases will enable them to grow and evolve as societal needs and practices grow and evolve.

The major starting point for the core subject pool is WordNet. It is universally accepted, has complete noun and class coverage, has an excellent set of synonyms, and has frequency statistics. It also has data regarding hierarchies and relationships useful to the UMBEL look-up reference structure, the ‘unofficial’ complement to the core ontology.

A second obvious foundation to building a subject structure is Wikipedia. Wikipedia’s topic coverage has been built entirely from the bottom up by 75,000 active contributors writing articles on nearly 1.8 million subjects in English alone, with versions in other degrees of completeness for about 100 different languages. There is also a wealth of internal structure within Wikipedia’s templates.

These efforts suggest a starting formula for the UMBEL project of W + W + S + ? (for WordNet + Wikipedia + SKOS + other?). Other potential data sets with rich subject coverage include existing library classification systems, upper-level ontologies such as SUMO, Proton or DOLCE, the contributor-built Open Directory Project, subject ‘primitives’ in other languages such as Chinese, or the other sources listed in Appendix 2 – Candidate Subject Data Sets.

Though the choice of the contributing data sets from which the UMBEL subject structure is to be built will never be unanimous, using sources that have already been largely selected by large portions of the Web-using public will go a long ways to establishing authoritativeness. Moreover, since the subject structure is only intended as a lightweight reference — and not a complete closed-world definition — the UMBEL project is also setting realistic thresholds for acceptance.

Conclusion

If you are interested in such an ontology project, please join us on the mailing list of the ontology’s development group; ask questions, write comments and suggestions.

The next step is to start creating a first version of the subject proxies.

The Bibliographic Ontology: a first proposition

This document is about the creation of The Bibliographic Ontology. It is the first proposition, from Bruce D’Arcus and me, that should lead to the writing of the first draft of the ontology. Some things have already been developed, many questions have been raised, and the discussion that arises from this first proposition will set the basis for the first draft of the ontology.

The goal of this ontology is simple: to create a bibliographic ontology that sets the basis for describing a document, that is, a writing that provides information. If well done, it will enable other people or organizations to create extension modules that make it expressive enough to describe more specialized sub-domains, such as law documents. It also re-uses existing ontologies that already define some properties of documents.

Related materials

1. The proposed OWL/N3 file describing The Bibliographic Ontology (note: read the comments; FG comments are from me, and BD comments are from Bruce)
2. An enhanced version of the Zotero RDF dump of the book “Spinning the Semantic Web”, which shows the expressive power of the ontology by extending its content using the bibo:Part class and the locator properties (RDF/XML)
3. Other examples that show other possible descriptions, such as the description of events, places, etc. (RDF/N3)

Main concept of the ontology: a Document

The main concept of the ontology is bibo:Document. This class is described as “Writing that provides information” (from WordNet). So, basically, any writing is a Document. It is equivalent to both foaf:Document and dcterms:BibliographicResource. These two links are quite important since they will enable us to re-use these two widely used ontologies: FOAF and DCTERMS.
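In OWL/N3, these equivalences might be declared as follows. This is only a sketch: the bibo namespace URI below is an assumption, since the final one was not settled at the time of this proposition:

@prefix owl:     <http://www.w3.org/2002/07/owl#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
# Assumed namespace for the proposed ontology.
@prefix bibo:    <http://purl.org/ontology/bibo/> .

bibo:Document a owl:Class ;
    owl:equivalentClass foaf:Document ;
    owl:equivalentClass dcterms:BibliographicResource .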

Second main concept: Contributions to these Documents

The second main concept of the ontology is bibo:Contribution. This class is described as “A part played by a person in bringing about a resulting Document”. The goal of this concept is to relate people, through their contributions, to the documents they wrote or helped to write. For now, contributions are defined by three properties:

  1. bibo:role, which defines the role of the contributor: author, translator, publisher, distributor, etc.
  2. bibo:contributor, which links a contribution to its contributor
  3. bibo:position, which losslessly associates a “contribution” level with each contributor. This property is mainly used to sort the multiple authors that worked on the writing of a document. More about that in the examples document.

With these two concepts, you can describe any Document and any Contribution to any document. So you can relate any piece of writing to its contributors.
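To make this concrete, here is a hedged N3 sketch of a document with a single authoring contribution. The property linking a document to its contributions is not named in this post, so bibo:contribution below is a guess; the role individual is hypothetical, the bibo namespace is assumed, and the URIs are made up:

@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix bibo:    <http://purl.org/ontology/bibo/> .   # assumed namespace

<http://example.org/documents/my-book> a bibo:Document ;
    dcterms:title "An Example Book" ;
    bibo:contribution [                                # guessed property name
        a bibo:Contribution ;
        bibo:role <http://example.org/roles/author> ;  # hypothetical role individual
        bibo:contributor <http://example.org/people/jdoe> ;
        bibo:position 1                                # first-listed contributor
    ] .

<http://example.org/people/jdoe> a foaf:Person ;
    foaf:name "Jane Doe" .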

What is really interesting with this concept (in my opinion) is that it opens the door to much more. By using it, we can extend the idea and describe many more things about how people contributed to the writing of a document.

From these two concepts, we extended the idea to be able to cope with a larger range of use-cases.

Extensions of bibo:Document

The document class has been specialized into a series of more specific document types, each with restrictions of their own:

  • Article
  • LegalCase
  • Manuscript
  • Book
  • Manual
  • Legislation
  • Patent
  • Report
  • Thesis
  • Transcript
  • Note
  • Law

Classes or individuals?

The development of this proposition has been made with this quote from Lee W. Lacy’s OWL book in mind:

Individuals often mirror “real world” objects. When you stop having different property attributes (and just have different values) you have often identified an object (individual)

This means that if a subclass of a class didn’t have specific restrictions, or if no properties were restricted by using this class in their domain, then the class was dropped and replaced by an individual of the super-class.

One example of this is bibo_types:dissertation. Since a dissertation doesn’t differ from a thesis by anything other than its meaning, we created it as an individual of the class bibo:Thesis rather than as a subclass. Check the examples document to see what this means concretely.
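In N3, the two modeling options look like this; the bibo and bibo_types namespaces are assumed:

@prefix rdfs:       <http://www.w3.org/2000/01/rdf-schema#> .
@prefix bibo:       <http://purl.org/ontology/bibo/> .       # assumed namespace
@prefix bibo_types: <http://purl.org/ontology/bibo/types/> . # assumed namespace

# As a subclass (dropped, since it would add no new restrictions):
# bibo_types:Dissertation rdfs:subClassOf bibo:Thesis .

# As an individual of the super-class (the chosen approach):
bibo_types:dissertation a bibo:Thesis ;
    rdfs:label "Dissertation" .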

Collections of documents

Another main concept of the ontology is bibo:Collection. This concept has an inherent aggregation property: its only purpose is to aggregate bibo:Document(s). Entities of this class will act as hubs in the RDF graph (network) created out of bibliographic relations (properties).

Other types of collections, with some restrictions of their own, have also been created. These other collections, such as bibo:CourtReporter, are intended to be anchor points that can be extended by Bibliographic Ontology extension modules for particular specialized sub-domains such as law documents.

Here is the current list of specialized collections (a small sketch of a collection follows the list):

  • InternetSite
  • Series
  • Periodical
    • Journal
    • Magazine
    • CourtReporter

Part of Documents

Another important concept is bibo:Part. This concept, along with locators (more about them in the next section), enables us to specify the components of a document. Sometimes documents are aggregated to create collections, such as journals, magazines or court reporters. But sometimes, documents are embedded within a document (embedded versus aggregated). This is the purpose of bibo:Part: a bibo:Part is a document, but one that is part of another document. The special property associated with bibo:Part is dcterms:hasPart, which relates a containing document to its parts. Check the examples document to learn how bibo:Part can be used.

Locating Parts

To support the concept of parts, a set of properties called “locators” has been created. These locator properties help describe the relation between a Part and its related Document.

Three of these locators are bibo:volume, bibo:chapter and bibo:page. These properties locate Parts inside documents: for example, a chapter within a book, or a volume within a document that is a set of volumes.

Check the description of “The Art of Computer Programming” by Donald Knuth in the examples document for a good illustration of how locators can be used.

This said, we can now think of describing a document by its parts, recursively from its volumes down to its pages.
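Here is a hedged N3 sketch, loosely modeled on the Knuth example mentioned above. The URIs and locator values are made up, the bibo namespace is assumed, and I read dcterms:hasPart as pointing from the containing document to its parts:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix bibo:    <http://purl.org/ontology/bibo/> .   # assumed namespace

<http://example.org/docs/taocp> a bibo:Book ;
    dcterms:title "The Art of Computer Programming" ;
    dcterms:hasPart <http://example.org/docs/taocp/vol1> .

# Volume 1 is a Part embedded in the whole work...
<http://example.org/docs/taocp/vol1> a bibo:Part ;
    bibo:volume "1" ;
    dcterms:hasPart <http://example.org/docs/taocp/vol1/ch2> .

# ...and chapter 2 is a Part located within volume 1.
<http://example.org/docs/taocp/vol1/ch2> a bibo:Part ;
    bibo:chapter "2" ;
    bibo:page "228" .   # hypothetical locator value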

Open questions

  1. Should we develop the ontology such that we can describe the entire workflow that leads to the creation and (possibly) publication of a document? Such a workflow would be supported by the FRBR principles. At the moment, the ontology describes the manifestation of a work, and not the work itself or its expression. Take a look at The Music Ontology (its workflow) to see how it could be done for the bibliographic ontology.
  2. Is the creation of classes, and of individuals of classes, the right way to describe types of documents?
  3. Is this the right way, or are there other ways, to describe people’s contributions to the elaboration of documents?

Re-used ontologies

  • DCTERMS: re-used to describe the main properties of documents.
  • FOAF: re-used to describe people and organizations.
  • EVENT: re-used to describe events (example: conferences).
  • TIME: re-used to describe temporal properties.
  • wgs84_pos: re-used to describe geographical entities.

Conclusion

Please send any feedback, suggestions or comments directly to the mailing list of the group that develops this ontology. This group is intended to create an ontology that would build some kind of consensus among people and organizations working with bibliographic data.

Note: I disabled comments on this post only, to make sure that people comment on the mailing list.

Content negotiation: bad use cases I recently observed

Given the current projects I am working on, I see daily misuses of the content-negotiation methodology, particularly misuses of the Accept and Content-type HTTP header parameters.

As you will see below, I came across many misuses of these HTTP header parameters, caused potentially by misunderstanding them, or simply by forgetting to set them properly when content is negotiated between web servers and the applications requesting pages.

In any case, people should take greater care to set up content negotiation properly between their servers and other applications. I have seen such examples on many web servers, from semantic web research groups to hobbyists.

The principle

The principle is simple: if a requester sends an HTTP query with the Accept header:

Accept: text/html, application/rdf+xml

the web server should check the priority of the MIME types that the requester is asking for and send back the document type with the highest priority, along with the Content-type of the document in the HTTP response headers.

The Content-type parameter is quite important: if a user application requests a list of 10 MIME types all having the same priority, it needs to know which of them has been sent by the web server.
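As an illustration, here is a sketch of a well-behaved exchange (host and path are made up). Both requested types carry the same implicit priority, so the server may pick either one, but it must declare its choice in the Content-type header:

GET /resource/123 HTTP/1.1
Host: example.org
Accept: text/html, application/rdf+xml

HTTP/1.1 200 OK
Content-Type: application/rdf+xml

(RDF/XML document body follows)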

Trusting the Content-type parameter

It is hard.

In fact, Ping the Semantic Web does not trust the Content-type returned by any web server. This parameter is so misused that it becomes useless. So I had to develop procedures to detect the type and the encoding of the files it crawls.

For example, people will sometimes return the MIME type text/html when in fact the file is RDF/XML or RDF/N3; this is just one example among many others.

The Q parameter

Another situation I came across recently concerns the “priority” of each MIME type in an Accept parameter.

Ping the Semantic Web is sending this Accept parameter to any web server from which it receives a ping:

Accept: text/html, html/xml, application/rdf+xml;q=0.9, text/rdf+n3;q=0.9, application/turtle;q=0.9, application/rdf+n3;q=0.9, */*;q=0.8

The issue I came across is that one web server was sending me an RDF/XML document for that Accept parameter string, even though it was able to send a text/html document. If the server read “application/rdf+xml” in the Accept parameter, it automatically sent an RDF document, even though that type has a lesser priority than text/html.

In fact, this Accept parameter means:

Send me text/html or html/xml if possible.

If not, then send me application/rdf+xml, text/rdf+n3, application/turtle or application/rdf+n3.

If not, then send me anything; I will try to do something with it.

It is really important to consider the Q parameter (or its absence), because its presence or absence means a lot.
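To make the priorities explicit, here is how a server should read a shortened version of that header; this is only a sketch, using the rule that types without an explicit q default to q=1.0:

Accept: text/html, html/xml, application/rdf+xml;q=0.9

text/html            => q=1.0 (implicit): preferred
html/xml             => q=1.0 (implicit): preferred
application/rdf+xml  => q=0.9: send only if no q=1.0 type can be served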

Discrimination of software User-agents

Recently I faced a new kind of cyber-discrimination: discrimination based on the User-agent string of an HTTP request. Even though I was sending “Accept: application/rdf+xml”, I was receiving an HTML document. So I contacted the administrator of the web server, and he pointed me to an example available on the W3C’s web site, called Best Practice Recipes for Publishing RDF Vocabularies, which explained why he had done that:

# Rewrite rule to serve HTML content from the namespace URI if requested
RewriteCond %{HTTP_ACCEPT} text/html [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*
RewriteRule ^example5/$ example5-content/2005-10-31-docs/index.html [R=303]

# Rewrite rule to serve HTML content from class or prop URIs if requested
RewriteCond %{HTTP_ACCEPT} text/html [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*
RewriteRule ^example5/(.+) example5-content/2005-10-31-docs/$1.html [R=303]

So, in the .htaccess file they published in the article, we can see that if the user-agent string starts with “Mozilla”, the server will send an HTML document.

However, the same document also says:

Note that, however, with RDF as the default response, a ‘hack’ has to be included in the rewrite directives to ensure the URIs remain ‘clickable’ in Internet Explorer 6, due to the peculiar ‘Accept:’ header field values sent by IE6. This ‘hack’ consists of a rewrite condition based on the value of the ‘User-agent:’ header field. Performing content negotiation based on the value of the ‘User-agent:’ header field is not generally considered good practice.

So, no, it is not a good practice, and people should really take care about this.

Conclusion

People should really take care of the Accept parameter when their server receives a request, and send back the correct Content-type for the document they return to the requester. Content negotiation is becoming the main way to find and access RDF data on the Web, and such behaviors should be fixed by web server administrators and developers.

Zitgist’s RDF Browser: Browse the Semantic Web

I am pleased to announce the pre-release of the Zitgist RDF Browser. This new tool from Zitgist will help users browse the information available on the Semantic Web. As you will see below, this tool is a sort of information shape-shifter: depending on the data available for a given Thing (a resource), it will shape its user interface so that the data is best displayed, for a better understanding of its semantics and a better browsing experience.

This pre-release version is usable by anybody; however, I would appreciate it if you reported any bugs, issues or suggestions to me so that I can enhance the browser to meet people’s expectations.

Introducing Zitgist’s RDF Browser

[Image: main.jpg]

The Templating system

The core of this new RDF browser is its templating system. This system enhances users’ RDF browsing experience along with their understanding of the information displayed to them. People can see it as a typical web browser such as Internet Explorer or Firefox, but instead of reading and displaying HTML, it displays RDF data. Users only have to enter the URI of a resource (it can be a URL where the browser can find RDF information about this Thing), then press the “browse” button.

Then, depending on the information available about this Thing, the RDF browser will shape its interface to optimize users’ browsing experience with the data.

Sources of data

Data displayed in the Zitgist RDF Browser can come from many different data sources:

  • Zitgist’s internal RDF datastore
  • URI dereferencing
  • On-the-fly conversion of data sources such as:
    • Microformats
    • RDFa
    • eRDF
    • HTML meta tags
    • API data sources such as Amazon.com, Google Base, etc.

So, depending on what information is available for a given URI, the browser will mesh up these data sources and display the information to the user.

First example of the templating system

[Image: first_example_madonna.jpg]

This first example shows how the browser creates a web page out of an RDF data source. In this case, the data source is a URI where Madonna’s latest album “Confessions on a Dance Floor” is described.

  1. The browser will check for that URI: http://zitgist.com/music/record/d7929b28-5812-4b8f-a99f-1800983c71fb
  2. No information is available in its data store, so it will dereference the URI to get the RDF triples describing the album.
  3. All in all, 15 different URIs will be dereferenced to create the web page.
  4. The browser will detect that the type of the entity related to this URI is mo:Album, so it triggers the “moAlbum” template to skin the data source so that the user can easily see and understand the information available about this resource (a music album).
  5. Then the skinned information is displayed to the user.

The templating system in action

Now we will see the templating system in action. The RDF browser does much more than skin a single data source.

[Image: embedded_templates_1.jpg]

If you put that URI in the browser, you will see Sebastian’s profile. The browser fires the foafPerson template, and his profile is skinned according to it.

However, what is interesting in this example is not only Sebastian’s profile, but the entities it links to. If you take a closer look and go down the page a little bit, you will notice the “Current projects” section of his profile. There you will see a list of projects.

[Image: embedded_templates_2.jpg]

The first project is a musical group described as a foaf:Group. The browser will check the URI Sebastian’s profile links to, get information about it, skin it according to the foafGroup template, and embed the result within Sebastian’s profile page.

Since such entities could be embedded ad infinitum, the browser restricts this automatic browsing to three levels deep in the graph.

Finally, we can “look up” an individual embedded item by clicking on the lookup icon at the upper right corner of each entity.

Sidebar Navigator

In some cases the generated web pages can be quite large, so a navigation widget has been developed to help users navigate generated documents. Navigation of a document is based on the entities displayed in it.

For example, if we run the Zitgist RDF Browser on the URI http://www.macosxhints.com, we notice that the information displayed is many pages long. To help us navigate this long document, we can use the entity navigator widget.

[Image: navigator.jpg]

 

All the types available in that web page are listed in the sidebar, and for each type you have all the instances available.

In this example, you can easily browse the web feed of that page: in one click, you can see all Posts, Feeds and Authors.

Interesting examples

Here is a list of starting points to see the Zitgist RDF Browser in action:

  • All the examples above.

Bookmarklet

The Zitgist RDF Browser can process any URI. So, from any web page, a user can launch the browser to see what semantic web information is available for that URI. Then, all the information the browser can find/generate out of that data source will be displayed to the user.

To help users, I developed a really simple bookmarklet that gets the URI of the current web page, sends it to the browser, and then redirects the user to the browser’s generated page.

Zitgist RDF Browser’s Bookmarklet

 

Conclusion

As you saw above, this new RDF browser is a sort of information shape-shifter: depending on the information available for a given URI, it will skin it to make it easier for users to browse and understand.