Zitgist: a semantic web search engine

Recently I started to talk about the project I am currently working on, and people were wondering what it was. “Hey, what is Zitgist about, Fred?” – “I heard a couple of things but I can’t find any information about it” – etc. So I started to give away some information about Zitgist (pronounced “zeitgeist”): what it is, what its goals are, and so on.

Zitgist is basically a semantic web search engine. Some people may wonder: what is a semantic web search engine? How does it differ from more traditional search engines such as Google, Yahoo! or MSN Search? It differs only in the information it aggregates, indexes and uses to answer users’ queries. Instead of using human-readable documents such as HTML, PDF or DOC files, Zitgist will use semantic web documents (RDF).

The characteristic of these documents is that they describe things. In fact, these documents can describe anything: a person (their interests, their relationships with other people, etc.), objects such as books or music CDs, projects (software projects, architectural projects, etc.), geographical locations such as countries, cities and mountains, and so on. So, with semantic web documents, one can describe virtually anything.
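
To make this concrete, here is a minimal sketch of such a document: a FOAF description of a person, built with Python’s rdflib and serialized as a semantic web document. Every name and URI below is hypothetical, chosen purely for illustration:

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, RDF

g = Graph()
g.bind("foaf", FOAF)

# A hypothetical person and the things the document says about her.
me = URIRef("http://example.org/people/jane#me")
g.add((me, RDF.type, FOAF.Person))
g.add((me, FOAF.name, Literal("Jane Doe")))
g.add((me, FOAF.interest, URIRef("http://dbpedia.org/resource/Writing")))
g.add((me, FOAF.knows, URIRef("http://example.org/people/john#me")))

# Serialize the description as a semantic web document.
print(g.serialize(format="turtle"))
```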

Since Zitgist is aware of the meaning of all these descriptions, users can send powerful queries to find exactly what they are looking for. For example, a Zitgist user could send queries such as:

  • Give me the names of people interested in writing who live near London.
  • Give me the names of groups (organizations, etc.) that have Brian Smith as a member.
  • Give me the names of software projects written in C++ that run on Linux or Windows.
  • Give me the names of discussion forums related to cooking.
  • Give me the names of cities in the UK with more than 150,000 inhabitants.
  • Give me the names of documents whose topic is a person named Paul.
  • Etc.

Note: these queries are not built using natural language (full sentences), but with an easy-to-use interface that helps users build the queries they want. (Behind the scenes, such a query maps naturally onto SPARQL; a sketch follows below.)
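
To give an idea of what the interface could produce behind the scenes, here is a sketch of how the first query in the list might be expressed in SPARQL and run with rdflib. The vocabulary choices (FOAF, the DBpedia resource URIs) and the data source are my assumptions for illustration, not Zitgist’s actual internals:

```python
from rdflib import Graph

g = Graph()
# Hypothetical: an aggregated graph of FOAF profiles (placeholder URL).
g.parse("http://example.org/aggregated-foaf.rdf")

# "People interested in writing who live near London", in SPARQL.
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name WHERE {
  ?person a foaf:Person ;
          foaf:name ?name ;
          foaf:interest <http://dbpedia.org/resource/Writing> ;
          foaf:based_near <http://dbpedia.org/resource/London> .
}
"""

for row in g.query(query):
    print(row.name)
```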

Once a user has built and sent a query, the search engine returns the results matching its criteria. If the user clicks on a result that interests him, he is redirected to Zitgist’s semantic web browser, the interface that displays the semantic web documents known to Zitgist.

This is what is interesting about Zitgist. Since semantic web documents are intended to be consumed by machines, humans can’t easily read them. The semantic web browser is a user interface that displays the information held in these documents so that humans can easily and intuitively read and understand it. So from a single result (say, a person), a user will be able to surf the semantic web the way he surfs the Web: from link to link. That way, users can rely on the same habits when surfing the semantic web as they have when surfing the current Web.
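
Mechanically, “surfing from link to link” boils down to dereferencing a URI, reading the RDF that comes back, and offering every linked resource as a possible next destination. A minimal sketch, assuming the (hypothetical) URI serves an RDF representation directly:

```python
from rdflib import Graph, URIRef

def outgoing_links(uri):
    """Dereference a URI, parse the RDF that comes back, and collect
    every resource it links to: each one is a candidate next step when
    browsing the semantic web link by link."""
    g = Graph()
    g.parse(uri)  # assumes the URI serves RDF
    return {o for _, _, o in g.triples((URIRef(uri), None, None))
            if isinstance(o, URIRef)}

# Hypothetical starting point: a person found in the search results.
for link in sorted(outgoing_links("http://example.org/people/jane")):
    print(link)
```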

The main goal of Zitgist is to create a semantic web search engine for users who don’t even know the semantic web exists. A user shouldn’t need to be aware of the concepts supporting the semantic web to use it. The experience should be as close as possible to the one users currently have with the Web and the search engines they use daily.

What is the current state of Zitgist? Everything I described above is working. A first private version should be released in the coming months, and some demos to the semantic web community should be given as well. Slowly, over the following months, more and more features will be rolled out to the public. Be aware that I am not hyping the project here: I am proceeding this way to make sure the supporting architecture gives the best experience to all users. For this, we have to scale the architecture gradually, so that too many users do not make the service unusable.

In the coming weeks, I should give more technical information and write documentation about how web developers can make their data available to optimize its indexing by Zitgist. Best-practice documents describing how web developers should create their semantic web documents will eventually be put online.

Zitgist LLC is a company founded by me and OpenLink Software Inc. I would especially like to thank the OpenLink development team, which gave me all the support I needed to develop the first working version of this new search engine. I wouldn’t have been able to write about Zitgist today without them.


Revision 1.03 of the Music Ontology

I just published revision 1.03 of the Music Ontology. You will notice the addition of mo:Movement, mo:has_movement, mo:movementNum and mo:tempo. Half of the modifications have been made to enhance the descriptiveness of classical music; the other half have been made to make the distinction between production (MusicalExpression) and publication (MusicalManifestation) clearer.

All these modifications have been made possible by Ivan Herman and Yves Raimond, and I would especially like to thank them for the insightful comments and suggestions they made via the mailing list.

Here is the change log for revision 1.03 (a short data sketch using the new terms follows the list):

  • Changed the range of mo:key from rdfs:Literal to http://purl.org/NET/c4dm/music.owl#Key
  • Added property mo:opus
  • Added mo:MusicalWork to the domain of mo:bpm, mo:duration, mo:key and mo:pitch
  • Added class mo:Movement
  • Added property mo:has_movement
  • Added property mo:movementNum
  • Added property mo:tempo
  • Added mo:Instrument to the range of mo:title
  • Removed mo:MusicalWork and mo:MusicalExpression from, and added mo:MusicalManifestation to, the domain of mo:publisher, mo:producer, mo:engineer, mo:conductor, mo:arranger, mo:sampler and mo:compiler
  • Removed mo:MusicalWork and mo:MusicalExpression from, and added mo:MusicalManifestation to, the range of mo:published, mo:produced, mo:engineering, mo:conducted, mo:arranged, mo:sampled and mo:compiled
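
To give a feel for the new terms, here is a minimal sketch that describes a (hypothetical) work and one of its movements with Python’s rdflib. I am assuming the http://purl.org/ontology/mo/ namespace and plain literal values for mo:movementNum and mo:tempo:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

MO = Namespace("http://purl.org/ontology/mo/")  # assumed MO namespace

g = Graph()
g.bind("mo", MO)

work = URIRef("http://example.org/works/symphony-5")       # hypothetical URIs
movement = URIRef("http://example.org/works/symphony-5#mvt1")

# The work itself, with the new mo:opus property and a movement.
g.add((work, RDF.type, MO.MusicalWork))
g.add((work, MO.opus, Literal("Op. 67")))
g.add((work, MO.has_movement, movement))

# The movement, described with the new classical-music terms.
g.add((movement, RDF.type, MO.Movement))
g.add((movement, MO.movementNum, Literal(1)))
g.add((movement, MO.tempo, Literal("Allegro con brio")))

print(g.serialize(format="turtle"))
```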


Distribution of semantic web data

Three days ago I talked about the importance of both RDF data dumps and dereferenceable URIs for distributing RDF data over the Web. However, yesterday Marc from Geonames.org ran into problems with an impolite semantic web crawler. In his article he points out that:

“It simply does not make sense to download a huge database record by record if a full dump is available.”

In an ideal world it wouldn’t make sense, but unfortunately this is how the Web has always worked. Think about Google, Yahoo! and MSN Search: this is exactly what they do, and it doesn’t make sense either. The difference is that they are probably more polite. Marc did the only thing he could do: ban the belligerent crawler.
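
For the record, “being polite” mostly boils down to honouring robots.txt and pacing your requests. A minimal sketch using only the Python standard library; the host, user agent and delay are arbitrary placeholders:

```python
import time
from urllib import robotparser
from urllib.request import Request, urlopen

USER_AGENT = "ExampleBot/0.1 (+http://example.org/bot)"  # hypothetical bot

# Honour the target site's robots.txt (placeholder host).
rp = robotparser.RobotFileParser()
rp.set_url("http://example.org/robots.txt")
rp.read()

def polite_fetch(url, delay=2.0):
    """Fetch a URL only if robots.txt allows it, then pause so we
    never hammer the remote server."""
    if not rp.can_fetch(USER_AGENT, url):
        return None
    data = urlopen(Request(url, headers={"User-Agent": USER_AGENT})).read()
    time.sleep(delay)  # arbitrary politeness delay between requests
    return data
```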

The problem with data dumps is that they are generally not that easy to find (when they are available at all) on a service’s web site. So some developers won’t bother taking the time to find them and will instead fetch everything from the Web server, page by page.

All this raises a question, however: how could we make these data dumps more visible? Ecademy.com uses a <link> element on its home page to point to a dump of the URLs of its FOAF profiles. But if you don’t look at the HTML code of the page, you will never be aware of it. A first step would probably be to create a repository of these data dumps.
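
Autodiscovery along these lines is easy to support on the client side. Here is a sketch that scans a home page for <link> elements advertising an RDF document, in the spirit of FOAF autodiscovery; the type value it checks is my assumption, since sites vary in how (and whether) they advertise their dumps:

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class DumpLinkFinder(HTMLParser):
    """Collect <link> elements that advertise an RDF document."""
    def __init__(self):
        super().__init__()
        self.dumps = []

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            attrs = dict(attrs)
            if attrs.get("type") == "application/rdf+xml":
                self.dumps.append(attrs.get("href"))

# Hypothetical home page to scan.
html = urlopen("http://example.org/").read().decode("utf-8", "replace")
finder = DumpLinkFinder()
finder.feed(html)
print(finder.dumps)
```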

The SWEO Community Project started the “Linking Open Data on the Semantic Web” project, which is basically a list of RDF dumps from different web sites and projects.

Personally, what I will do to help people find these RDF dumps (and to make them aware of their existence) is create a repository of them on Pingthesemanticweb.com (it should be available later this week).

That way, developers using Pingthesemanticweb.com will probably check that list and download the data they need. After that, they will only use PTSW to keep their triple stores in sync with the remote service’s database (Geonames.org, for example).


RDF dumps vs. dereferenceable URIs

In a recent mail thread, someone asked what the best way was to get RDF data from a source [with more than a couple of thousand documents]: an RDF dump or a list of dereferenceable URIs?

Neither is better than the other. Personally, I prefer to use both.

If we take the example of Geonames.org, getting all 6.4 million RDF documents through dereferenceable URIs would take weeks: at one polite request per second, 6.4 million requests already add up to roughly 74 days. However, keeping your triple store up to date with new or updated RDF documents using only an RDF dump would force you to download and re-index it completely every month (or so), a task that would take some days.

So what is the best way then? Here is what I propose (and currently do):

The first time I indexed Geonames into a triple store, I requested an RDF dump from Marc. Then I asked him: would it be possible for you to ping Pingthesemanticweb.com each time a new document appears on Geonames, or each time a document is updated? He answered my mail in less than a couple of hours, and Geonames was soon pinging PTSW.

So, what does this mean? It means that the first time, I populated my triple store with Geonames data from an RDF dump; proceeding that way saved me one to two weeks of work. Now I update the triple store via Pingthesemanticweb.com, which saves me two or three days every month.

So what I suggest is to use both methods. The important point here is that Pingthesemanticweb.com acts as an agent that sends you the new and updated files of a specific service (Geonames in the example above). This simple infrastructure could save many semantic web developers precious time.
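
Here is a sketch of that combined method: bootstrap the store from the dump once, then periodically re-fetch only the documents PTSW reports as new or updated. The dump URL and the PTSW export format below are placeholders; the actual interfaces may differ:

```python
from urllib.request import urlopen
from rdflib import Graph

store = Graph()

# 1. Bootstrap: load the full RDF dump once (placeholder URL).
store.parse("http://example.org/geonames-dump.rdf")

# 2. Incremental updates: periodically ask PTSW which documents are new
#    or updated, and re-fetch only those. The export URL and its
#    one-URL-per-line format are hypothetical.
updates = urlopen("http://pingthesemanticweb.com/export/geonames.txt")
for line in updates.read().decode("utf-8").splitlines():
    url = line.strip()
    if not url:
        continue
    # A real updater would first drop the document's old triples;
    # for brevity we simply add the current ones.
    store.parse(url)
```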


Revision 1.02 of the Music Ontology: a first crystallization of the ontology


For about a month now, people have been looking at the Music Ontology: they talk about it, they revise it, and they give suggestions. Revision 1.02 of the ontology has been made possible by all their comments and suggestions. It should be a good compromise between all users’ visions.

As I previously explained in one of my blog posts, the FRBR Final Report has been used as the foundation of the ontology. It describes musical entities and their relationships.


Finally, revision 1.02 is the first “crystallization” of the Music Ontology. People can now start to use it in their FOAF profiles. Artists can start to describe themselves, their musical groups, their musical creations, etc. Music stores could start to export the content of their inventories using the ontology. Internet radios could certainly start to give information about the tracks they stream.

If you are interested in participating in the elaboration of the ontology, I encourage you to subscribe to its mailing list. I would like to thank everybody subscribed to this mailing list, since this new revision wouldn’t have been possible without their insightful comments and suggestions.
