UMBEL: Upper-level Mapping and Binding Exchange Layer


Mike Bergman has released the first draft of his UMBEL ontology. A few other people and I helped him develop this new ontology.

What is UMBEL? UMBEL is a lightweight subject reference structure. It can be seen as a pool of subjects. Subjects are related to each other at the level of synonymy, so subjects with related meanings are bound together.

The objectives

The objectives of this new ontology are:

  • A reference umbrella subject binding ontology, with its own pool of high-level binding subjects
  • Lightweight mechanisms for binding subject-specific ontologies to this structure
  • A standard listing of subjects that can be referenced by resources described by other ontologies (e.g., dc:subject)
  • Provision of a registration and look-up service for finding appropriate subject ontologies
  • Identification of existing data sets for high-level subject extraction
  • Codification of high-level subject structure extraction techniques
  • Identification and collation of tools to work with this subject structure, and
  • A public Web site for related information, collaboration and project coordination.

Main applications

Given these objectives, I see a few main applications where such an ontology could be used:

  • Helping systems find data sources for a given ontology. UMBEL is much more than a subject structure: it will bind subjects to related ontologies, and to data sources for those ontologies. So, for a given subject, people will be able to find related ontologies, and then related data sources.
  • Acting as a subject reference backbone. People could use it to link resources, using dc:subject, to their subject resource (the UMBEL subject proxy resource), etc.
  • Helping user interfaces handle subject (keyword) references in order to find related ontologies (those that have the power to describe these subjects).
  • Eventually, it should be used by PingtheSemanticWeb to bind pinged data to the subject reference structure.
  • And probably many others.

Creation of the Ontology

A procedure will be created to generate the ontology automatically. The rough idea is to reuse existing knowledge bases to create the set of subjects, and the relationships between them, that will make up the ontology. The goal is to come up with a representative set of subjects, neither too general nor too specialized. For that, we will experiment with knowledge bases such as WordNet, Wikipedia, Dmoz, etc. We will try to find out how to prune unnecessary subjects out of them, how to create such a subject reference framework by looking at the intersection of the data sets, etc. The procedure is not yet developed, but the first experiments will look like that.
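To make the intersection idea a bit more concrete, here is a minimal Python sketch. The three subject sets are tiny hand-made stand-ins (not real extracts of WordNet, Wikipedia or Dmoz), and the rule of keeping a subject when at least two sources agree on it is only an assumption made for illustration; the actual procedure has not been designed yet.

```python
from collections import Counter


def candidate_subjects(sources, min_sources=2):
    """Keep the subjects that appear in at least `min_sources` knowledge bases."""
    counts = Counter()
    for subjects in sources.values():
        # Normalize labels so that "Music" and "music " count as one subject.
        counts.update({s.strip().lower() for s in subjects})
    return sorted(s for s, n in counts.items() if n >= min_sources)


# Hypothetical, hand-made samples standing in for the real knowledge bases.
sources = {
    "wordnet": {"music", "law", "entity", "chemistry"},
    "wikipedia": {"Music", "Law", "Chemistry", "List of lists"},
    "dmoz": {"music", "chemistry", "shopping"},
}

print(candidate_subjects(sources))                 # shared by two or more sources
print(candidate_subjects(sources, min_sources=3))  # strict intersection
```

Raising or lowering min_sources is one simple way to move between a "too general" and a "too specialized" pool of subjects.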

As explained in the draft:

The acceptance of the actual subjects and their structure is one key to the acceptance — and thus use and usefulness — of the UMBEL ontology. (The other key is simplicity and ease-of-use or tools.) A suitable subject structure must be adaptable and self-defining. It should reflect expressions of actual social usage and practice, which of course changes over time as knowledge increases and technologies evolve.

A premise of the UMBEL project is that suitable subject content and structures already exist within widely embraced knowledge bases. A further premise is that the ongoing use of these popular knowledge bases will enable them to grow and evolve as societal needs and practices grow and evolve.

The major starting point for the core subject pool is WordNet. It is universally accepted, has complete noun and class coverage, has an excellent set of synonyms, and has frequency statistics. It also has data regarding hierarchies and relationships useful to the UMBEL look-up reference structure, the ‘unofficial’ complement to the core ontology.

A second obvious foundation to building a subject structure is Wikipedia. Wikipedia’s topic coverage has been built entirely from the bottom up by 75,000 active contributors writing articles on nearly 1.8 million subjects in English alone, with versions in other degrees of completeness for about 100 different languages. There is also a wealth of internal structure within Wikipedia’s templates.

These efforts suggest a starting formula for the UMBEL project of W + W + S + ? (for WordNet + Wikipedia + SKOS + other?). Other potential data sets with rich subject coverage include existing library classification systems, upper-level ontologies such as SUMO, Proton or DOLCE, the contributor-built Open Directory Project, subject ‘primitives’ in other languages such as Chinese, or the other sources listed in Appendix 2 – Candidate Subject Data Sets.

Though the choice of the contributing data sets from which the UMBEL subject structure is to be built will never be unanimous, using sources that have already been largely selected by large portions of the Web-using public will go a long ways to establishing authoritativeness. Moreover, since the subject structure is only intended as a lightweight reference — and not a complete closed-world definition — the UMBEL project is also setting realistic thresholds for acceptance.

Conclusion

If you are interested in this ontology project, please join us on the mailing list of the ontology’s development group to ask questions and to send comments and suggestions.

The next step is to start creating a first version of the subject proxies.

The Bibliographic Ontology: a first proposition

This document is about the creation of The Bibliographic Ontology. It is the first proposition from Bruce D’Arcus and me, and it should lead to the writing of the first draft of the ontology. Some things have been developed, many questions have been raised, and the discussion that will arise from this first proposition will set the basis for the first draft of the ontology.

The goal of this ontology is simple: creating a bibliographic ontology that sets the basis for describing a document, that is, a writing that provides information. If done well, it will enable other people or organizations to create extension modules that make it expressive enough to describe more specialized sub-domains such as law documents, etc. It also re-uses existing ontologies that already define some properties of documents.

Related materials

1. The proposed OWL/N3 file describing The Bibliographic Ontology (note: read the comments; those marked FG are from me, and those marked BD are from Bruce)
2. An enhanced version of the Zotero RDF dump of the book “Spinning the Semantic Web”, which shows the expressive power of the ontology by extending its content using the bibo:Part class and the locator properties (RDF/XML)
3. Other examples that show other possible descriptions, such as the description of events, places, etc. (RDF/N3)

Main concept of the ontology: a Document

The main concept of the ontology is bibo:Document. This class is described as “Writing that provides information” (from WordNet). So, basically, any writing is a Document. It is equivalent to a foaf:Document and a dcterms:BibliographicResource. These two links are quite important since they enable us to re-use two widely used ontologies: FOAF and DCTERMS.

Second main concept: Contributions to these Documents

The second main concept of the ontology is bibo:Contribution. This class is described as “A part played by a person in bringing about a resulting Document”. The goal of this concept is to relate people, by their contributions, to documents they wrote, or helped to write. For now, contributions are defined by three properties:

  1. bibo:role, that defines the role of the contributor: author, translator, publisher, distributor, etc.
  2. bibo:contributor, that links a contribution to its contributor
  3. bibo:position, that losslessly associates a “contribution” level with each contributor. This property is mainly used to sort multiple authors that worked on the writing of a document. More about that in the examples document.

With these two concepts, you can describe any Document and any Contribution to any document. So you can relate any piece of writing to its contributors.
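As a rough illustration of how bibo:position can be used to sort authors, the Python sketch below models contributions as plain records. The record layout and the sample names are assumptions made for this example; they are not taken from the ontology file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    contributor: str  # stands in for the bibo:contributor link
    role: str         # stands in for bibo:role (author, translator, ...)
    position: int     # stands in for bibo:position (1 = first)


def ordered_contributors(contributions, role):
    """Return the contributors having `role`, sorted by their position."""
    matching = [c for c in contributions if c.role == role]
    return [c.contributor for c in sorted(matching, key=lambda c: c.position)]


# Hypothetical contributions to a single document.
contributions = [
    Contribution("Bob Example", "author", 2),
    Contribution("Carol Example", "translator", 1),
    Contribution("Alice Example", "author", 1),
]

print(ordered_contributors(contributions, "author"))
```

Because the position is attached to the contribution and not to the person, the same person can be first author of one document and third author of another.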

What is really interesting with this concept (in my opinion) is that it opens the door to much more. In fact, by using this concept, we can now extend the idea and describe many more things about how people contributed to the writing of a document.

From these two concepts, we extended the idea to be able to cope with a larger range of use-cases.

Extensions of bibo:Document

The document class has been specialized into a series of more specific types of documents, with restrictions of their own:

  • Article
  • LegalCase
  • Manuscript
  • Book
  • Manual
  • Legislation
  • Patent
  • Report
  • Thesis
  • Transcript
  • Note
  • Law

Classes or individuals?

The development of this proposition has been made with Lee W. Lacy’s OWL book quote in mind:

Individuals often mirror “real world” objects. When you stop having different property attributes (and just have different values) you have often identified an object (individual)

This means that if a subclass of a class had no specific restrictions, and if no property used that class in its domain, then the class was dropped and replaced by an individual of the super-class.

One example of this is the type bibo_types:dissertation. A dissertation differs from a thesis only in its meaning, so instead of creating a subclass we created it as an individual of the class bibo:Thesis. Check the examples document to see what it means concretely.

Collections of documents

Another main concept of the ontology is bibo:Collection. Aggregation is inherent to this concept: its only purpose is to aggregate bibo:Document(s). Entities of this class act as hubs in the RDF graph (network) created out of bibliographic relations (properties).

Other types of collections, with some restrictions of their own, have also been created. These other collections, such as bibo:CourtReporter, are intended to be anchor points that can be extended by Bibliographic Ontology Extension Modules for particular specialized sub-domains such as Law documents.

Here is the current list of specialized collections:

  • InternetSite
  • Series
  • Periodical
    • Journal
    • Magazine
    • CourtReporter

Parts of Documents

Another important concept is bibo:Part. This concept, along with locators (more about them in the next section), enables us to specify the components of a Document. Sometimes documents are aggregated to create collections, such as journals, magazines or court reporters. Sometimes, however, documents are embedded within another document (embedded versus aggregated). This is the purpose of bibo:Part: a bibo:Part is a document that is itself part of another document. The special property of a bibo:Part is dcterms:hasPart; it is used to relate a bibo:Part to the document it is part of. Check the examples document to see how bibo:Part can be used.

Locating Parts

To support the concept of Parts, a set of properties called “locators” has been created. These locator properties help to describe the relation between a Part and its related Document.

Three of these locators are bibo:volume, bibo:chapter and bibo:page. These properties locate Parts inside documents. For example: a chapter within a book, or a volume within a document that is a set of volumes.

Check the example about the document “The Art of Computer Programming” by Donald Knuth for a good example of how locators can be used.

That said, we can now think of describing a document by its parts, recursively, from its volumes down to its pages.
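The recursive description can be pictured with a small Python sketch. The Part record below is only a stand-in for bibo:Part with one locator (volume, chapter or page); the field names and the sample work are assumptions made for this illustration, not part of the proposed ontology.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Part:
    title: str
    locator: Optional[str] = None  # e.g. "volume", "chapter" or "page"
    value: Optional[str] = None    # the locator value, e.g. "1" or "12-34"
    parts: list = field(default_factory=list)


def outline(part, depth=0):
    """Walk a document recursively and return one indented line per part."""
    label = f"{part.locator} {part.value}: " if part.locator else ""
    lines = ["  " * depth + label + part.title]
    for child in part.parts:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical multi-volume work; the titles are placeholders.
work = Part("A Multi-Volume Work", parts=[
    Part("First Volume", "volume", "1", parts=[
        Part("Opening Chapter", "chapter", "1"),
        Part("Second Chapter", "chapter", "2"),
    ]),
    Part("Second Volume", "volume", "2"),
])

print("\n".join(outline(work)))
```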

Open questions

  1. Should we develop the ontology such that we can describe the entire workflow that leads to the creation and (possibly) the publication of a document? This workflow would be supported by the FRBR principles. At the moment, the ontology only describes the manifestation of a work, and not the work itself or its expression. Take a look at The Music Ontology (its workflow) to see how it could be done for the bibliographic ontology.
  2. Is the creation of classes, and of individuals of these classes, the right way to describe types of documents?
  3. Is this the right way, or are there other ways, to describe the contributions of people to the elaboration of documents?

Re-used ontologies

  • DCTERMS: re-used to describe the main properties of documents.
  • FOAF: re-used to describe people and organizations.
  • EVENT: re-used to describe events (example: conferences)
  • TIME: re-used to describe temporal properties
  • wgs84_pos: re-used to describe geographical entities

Conclusion

Please send any feedback, suggestions or comments directly to the mailing list of the group that develops this ontology. This group is intended to create an ontology that reaches some kind of consensus between the people and organizations working with bibliographic data.

Note: I disabled comments on this post only, to make sure that people comment on the mailing list.

The Music Data Space

Kingsley has been talking about Data Spaces for a long time. But what is a Data Space? Nothing is better than an example to understand something, so I will try to explain it with a single data space that was created yesterday, the Music Data Space:

mbz_rdfview_uris.jpg

This is the Music Data Space. This Data Space contains information about musical things. These things are described mainly by using the Music Ontology, but also by using other ontologies like FOAF. Finally, the (musical) things belonging to this space are accessible, on the Web, via dereferenceable URIs.

So, the Music Data Space is a place where all musical things are defined on the Semantic Web, and accessible via the Web.

That is it, and it is what we created last Monday.

Now, some of you may wonder: why on earth does Amazon.com belong to the Music Data Space?

Amazon.com belongs to the Music Data Space too!

Amazon.com lives in the Music Data Space too, via its API. In fact, a simple experiment with the OpenLink RDF Browser clearly demonstrates that Amazon.com’s data belongs to the Music Data Space.

Open the RDF Browser by following that link

Now you will visualize RDF information about an album called “Chore of Enchantment”. Take a look at this line:

amazon_asin: http://amazon.com/exec/obidos/ASIN/B00003XAA7/searchcom07-20

Click on the link to Amazon. A window should pop up. Select the Get Data Set (dereference) option.

At this point, some magic will happen. In fact, the new information that is displayed in the RDF Browser is coming directly from Amazon.com’s web server.

This is why I consider that Amazon.com belongs to the Music Data Space too.

In fact, the Virtuoso Sponger will connect to Amazon.com via their API to get some information about that album. It will convert the data into RDF and will display it to the user via the RDF browser’s interface.

One step further: the JPG file also belongs to the Music Data Space!

Yes! Information about the JPG file, hosted on Amazon.com’s web servers, also belongs to the Music Data Space, and here is the proof:

Open that same RDF Browser page by following that link

Click on the image (JPG) representing the cover of this album. A window should pop up. Select the Get Data Set (dereference) option.

Check the triples that have been created from this image. The Virtuoso Sponger downloaded the JPG file, analyzed its header, RDFized everything and sent the information back to the RDF Browser so that the user can see the information available for that image.

Where is the end? I have no idea… probably at the same place where the imagination ends too.

Unifying everything

It is that simple. All data sources (relational databases, remote data accessible via APIs, native RDF data, etc.) are unified via the Music Data Space. And this Music Data Space is accessible, via URI dereferencing, at http://zitgist.com/music/

Other Data Spaces available

Conclusion

The Music Data Space is the starting point, and many other types of data spaces should emerge soon.

Browsing Musicbrainz’s dataset via URI dereferencing

Musicbrainz’s dataset can finally be browsed, node-by-node, using URI dereferencing.

What does this mean?

Since the Musicbrainz relational database has been converted into RDF using the Music Ontology, all the relations that exist between Musicbrainz entities (an entity can be a Music Artist, a Band, an Album, a Track, etc.) form a graph of musical relations. Each node of the graph is a resource and each arc is a property between two resources. Welcome to the world of RDF.

madonna-rdf-description.jpg

This means that from a resource such as “Madonna” we can browse the musical relations graph to find other entities such as Records, People, Bands, etc.
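To picture what browsing node-by-node means, here is a minimal Python sketch of a graph stored as triples. The identifiers are made-up placeholders rather than real Musicbrainz URIs, and an in-memory list is of course only a stand-in for dereferencing each URI on the Web.

```python
# Each triple is (subject, property, object): two nodes joined by a labeled arc.
# The identifiers below are made-up placeholders, not real Musicbrainz URIs.
triples = [
    ("artist/1", "foaf:name", "Example Artist"),
    ("artist/1", "mo:creatorOf", "record/10"),
    ("record/10", "dc:title", "Example Album"),
    ("band/5", "foaf:member", "artist/1"),
]


def outgoing(node):
    """Arcs that leave `node`, as (property, object) pairs."""
    return [(p, o) for s, p, o in triples if s == node]


def incoming(node):
    """Arcs that arrive at `node`, as (subject, property) pairs."""
    return [(s, p) for s, p, o in triples if o == node]


def reachable(start):
    """Every node that can be reached by following outgoing arcs from `start`."""
    seen, todo = set(), [start]
    while todo:
        node = todo.pop()
        for _, obj in outgoing(node):
            if obj not in seen:
                seen.add(obj)
                todo.append(obj)
    return seen


print(outgoing("artist/1"))
print(sorted(reachable("band/5")))
```

Starting from the band, the walk reaches the artist, then the record, and finally their literal values, which is the same path a person follows when clicking from one resource to the next.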

Kingsley, inspired by Diana Ross, said: “URI Everything, and Everything is Cool!”

This is cool! Now Diana Ross has her own URI on the semantic web: http://zitgist.com/music/artist/60d41417-feda-4734-bbbf-7dcc30e08a83

Paul McCartney, The Beatles and Madonna have their own too:

Paul McCartney:
http://zitgist.com/music/artist/ba550d0e-adac-4864-b88b-407cab5e76af

The Beatles:
http://zitgist.com/music/artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d

Madonna:
http://zitgist.com/music/artist/79239441-bfd5-4981-a70c-55c3f15c1287


URIs for Musical Things

These URIs are not only used to refer to Musicbrainz entities. In fact, they are used to refer to any musical entity that you can describe using the Music Ontology. In the near future, the Musicbrainz data will be integrated with data from Jamendo and Magnatune. Later, we will be able to integrate any sort of musical data in the same place (radio station data, relations between users’ FOAF profiles and musical things, etc.). So, from a single source (http://zitgist.com/music/), all these different sources of musical data will be queryable at once.

mbz-magnatune-jemendo-rdf.jpg

URI schemes

The URI schemes are defined in the Musicbrainz Virtuoso RDF View:

  • http://zitgist.com/music/artist/*******
  • http://zitgist.com/music/artist/birth/*******
  • http://zitgist.com/music/artist/death/*******
  • http://zitgist.com/music/artist/simlink/*******
  • http://zitgist.com/music/record/*******
  • http://zitgist.com/music/performance/*******
  • http://zitgist.com/music/composition/*******
  • http://zitgist.com/music/musicalwork/*******
  • http://zitgist.com/music/sound/*******
  • http://zitgist.com/music/recording/*******
  • http://zitgist.com/music/signal/*******
  • http://zitgist.com/music/track/*******
  • http://zitgist.com/music/track/duration/*******

The terms used in these URI schemes refer to the descriptions of the corresponding Music Ontology classes.
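The schemes listed above can be turned into a small helper. The following Python sketch assumes that the asterisks stand for an entity identifier (as in the artist URIs shown earlier); the function names are hypothetical and are not part of the RDF View.

```python
BASE = "http://zitgist.com/music/"

# Path prefixes copied from the list of URI schemes above.
SCHEMES = {
    "artist", "artist/birth", "artist/death", "artist/simlink",
    "record", "performance", "composition", "musicalwork",
    "sound", "recording", "signal", "track", "track/duration",
}


def music_uri(scheme, identifier):
    """Build the URI of a musical thing from its scheme and its identifier."""
    if scheme not in SCHEMES:
        raise ValueError(f"unknown URI scheme: {scheme}")
    return f"{BASE}{scheme}/{identifier}"


def split_music_uri(uri):
    """Return (scheme, identifier) for a URI built on one of the schemes."""
    if not uri.startswith(BASE):
        raise ValueError("not a Music Data Space URI")
    # The identifier is the last path segment; everything before is the scheme.
    scheme, _, identifier = uri[len(BASE):].rpartition("/")
    if scheme not in SCHEMES:
        raise ValueError(f"unknown URI scheme: {scheme}")
    return scheme, identifier


print(music_uri("artist", "60d41417-feda-4734-bbbf-7dcc30e08a83"))
```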

Conclusion

I am getting closer and closer to the first goal I set for myself when I first started to write the Music Ontology. This first goal was to make the Musicbrainz relational database available in RDF on the Web. Months later, and with the help of the Music Ontology community (especially Yves Raimond, who worked tirelessly on the project) and the OpenLink Software Inc. team, we have finally made this data available through URI dereferencing.

From there, we will build new music services, integrate more musical datasets into the Music Data Space, etc. It is just the beginning of something much bigger.

Free text search on Musicbrainz literals using Virtuoso RDF Views

A couple of weeks ago, I introduced a Virtuoso RDF View that maps the Musicbrainz relational database into RDF using the Music Ontology. Now I will show some query examples involving a special feature of these Virtuoso RDF Views: full text search on literals.

How RDF Views work

A Virtuoso RDF View can be seen as a layer between a relational database schema and its conceptualization in RDF. The role of this layer is to convert relational data into its RDF conceptualization.

That is it. You can see it as a conversion tool, or as a sort of lens for seeing RDF data out of relational data.
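The "lens" idea can be sketched in a few lines of Python. This is not how Virtuoso declares RDF Views; the table layout (an id and a name column) and the example.org base URI are assumptions made for this illustration only. The point is that triples are produced on demand from the rows, and nothing is copied into a triple store.

```python
def artist_view(rows, base="http://example.org/artist/"):
    """Yield RDF-like triples for the rows of a hypothetical `artist` table."""
    for row in rows:
        subject = f"{base}{row['id']}"
        # One row becomes several triples: a type and one triple per column.
        yield (subject, "rdf:type", "mo:SoloMusicArtist")
        yield (subject, "foaf:name", row["name"])


# Hypothetical relational rows.
rows = [
    {"id": 1, "name": "Example Artist"},
    {"id": 2, "name": "Another Artist"},
]

for triple in artist_view(rows):
    print(triple)
```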

How full text search over literals works

Recently, OpenLink Software introduced a full text feature in Virtuoso’s SPARQL processor through the “bif:contains” operator (it is used in the SPARQL syntax like a FILTER).

When a user sends a SPARQL query using the bif:contains operator against a Virtuoso triple store, the parser will use the triple store’s full text index to perform the full text search over the queried literal.

With a Virtuoso RDF View, instead of using the triple store’s full text index, the parser will use the relational database’s full text index (if the relational database supports full text indexes, naturally).

Some query examples

In this section I will show you how the full text feature of the Virtuoso RDF Views can be used to increase the performance of a query against the Musicbrainz RDF View modeled using the Music Ontology.

Note: if the system asks you for a login and a password to see the page, use the login name “demo” and the password “demo” to see the results of these SPARQL queries.

Example #1

A user remembers that the first name of a music artist is Paul, and that the title of one of this artist’s albums contains the words Press and Play. This user wants to get the full name of the artist with the following SPARQL query:

sparql
define input:storage virtrdf:MBZROOT
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?artist_name ?album_title
FROM <http://musicbrainz.org/>
WHERE
{
?artist rdf:type mo:SoloMusicArtist .
?artist foaf:name ?artist_name .
?artist mo:creatorOf ?album .

?album rdf:type mo:Record .
?album dc:title ?album_title .

FILTER bif:contains(?artist_name, "Paul") .
FILTER bif:contains(?album_title, "Press and Play") .
};

Results of this query against the musicbrainz virtuoso rdf view

As you can see, this query uses the full text capabilities of Virtuoso over two different literals: the objects of the two properties foaf:name and dc:title.

Example #2

In this example, the user wants to know the names of the albums published by Madonna between 1990 and 2000. The answer to this question is returned by the following SPARQL query:

sparql
define input:storage virtrdf:MBZROOT
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?albums_titles ?creation_date
FROM <http://musicbrainz.org/>
WHERE
{
?madonna rdf:type mo:SoloMusicArtist .
?madonna foaf:name ?madonna_name .
FILTER bif:contains(?madonna_name, "Madonna") .

?madonna mo:creatorOf ?albums .
?albums rdf:type mo:Record .
?albums dcterms:created ?creation_date .
FILTER ( xsd:dateTime(?creation_date) > "1990-01-01T00:00:00Z"^^xsd:dateTime ) .
FILTER ( xsd:dateTime(?creation_date) < "2000-01-01T00:00:00Z"^^xsd:dateTime ) .
?albums dc:title ?albums_titles .
};

Results of this query against the musicbrainz virtuoso rdf view

Here the user uses the full text capabilities of the Virtuoso RDF Views to find artists with the name Madonna, and two filters on xsd:dateTime objects to find the albums that were created between 1990 and 2000.

Example #3

In this last example, the user wants to know the names of the members of the music group U2.

sparql
define input:storage virtrdf:MBZROOT
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
SELECT ?band_name ?member_name
FROM <http://musicbrainz.org/>
WHERE
{
?band rdf:type mo:MusicGroup .
?band foaf:name ?band_name .
?band_name bif:contains '"U2"' .
?band foaf:member ?members .
?members rdf:type mo:SoloMusicArtist .
?members foaf:name ?member_name .
};

Results of this query against the musicbrainz virtuoso rdf view

Here the user uses the full text feature to find the music group by its name; the names of the members related to this (these) music group(s) are then returned as well.

Special operators of a full text search

Some full text operators can be used in the literal parameter of the bif:contains clause. The operators are the same as those used in the full text feature of Virtuoso’s relational database. A list and a description of the operators can be found on that page.

I would only add that the near operator is defined as +/- 100 characters from the searched literal, and that the wildcard ‘*’ operator must be preceded by at least three characters of the literal. So, “tes*t”, “tes*” and “test*” are legal usages of the wildcard operator, but “*test”, “t*” and “te*st” are illegal.
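The wildcard rule is easy to check before sending a query. The small Python function below is only my reading of the rule stated above (at least three characters before the first ‘*’); the function name is hypothetical and it is not part of Virtuoso.

```python
def wildcard_is_legal(term, min_prefix=3):
    """True when the first '*' in `term` has at least `min_prefix` characters before it."""
    first_star = term.find("*")
    # A term without any wildcard is always acceptable.
    return first_star == -1 or first_star >= min_prefix


for term in ["tes*t", "tes*", "test*", "*test", "t*", "te*st"]:
    print(term, wildcard_is_legal(term))
```

Checking only the first wildcard is enough, since any later one necessarily has even more characters in front of it.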

Conclusion

Finally, as you can see, the full text feature available with the Virtuoso RDF Views is an essential feature that people should use to increase the performance of their SPARQL queries. The only two other options are: (1) using a normal literal, which has to be spelled exactly and with the right case, an option that makes such queries nearly useless; and (2) using a FILTER with a regular expression and the “i” parameter, which is far too slow for normal usage.