Frederick Giasson – Page 33 – Machine Learning, Engineering & Data

Free text search on Musicbrainz literals using Virtuoso RDF Views

May 17, 2007 Frederick Giasson

A couple of weeks ago I introduced a Virtuoso RDF View that maps the Musicbrainz relational database into RDF using the Music Ontology. Now I will show some query examples involving a special feature of these Virtuoso RDF Views: full text search on literals.

How RDF Views work

A Virtuoso RDF View can be seen as a layer between a relational database schema and its conceptualization in RDF. The role of this layer is to convert relational data into its RDF conceptualization.

That is it. You can see it as a conversion tool or as a sort of lens to see RDF data out of relational data.
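To make the "lens" idea concrete, here is a minimal Python sketch of what such a mapping amounts to for a single row. It is only an analogy: a real RDF View is declared inside Virtuoso and evaluated at query time rather than by a script, and the column names and the example.org base IRI below are hypothetical, not taken from the real Musicbrainz schema.

```python
# Illustrative only: the row layout and base IRI are assumptions.
FOAF = "http://xmlns.com/foaf/0.1/"
MO = "http://purl.org/ontology/mo/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def artist_row_to_triples(row, base_iri="http://example.org/artist/"):
    """Map one relational 'artist' row to a list of (subject, predicate, object) triples."""
    subject = f"{base_iri}{row['id']}"
    return [
        (subject, RDF_TYPE, MO + "SoloMusicArtist"),
        (subject, FOAF + "name", row["name"]),
    ]


triples = artist_row_to_triples({"id": 1, "name": "Madonna"})
for triple in triples:
    print(triple)
```

The relational row stays where it is; the triples are just another way of looking at it.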

How full text search over literals works

OpenLink Software recently introduced a full text search feature in Virtuoso’s SPARQL processor through the “bif:contains” operator (it appears in the SPARQL syntax like a FILTER).

When a user sends a SPARQL query using the bif:contains operator against a Virtuoso triple store, the parser will use the triple store’s full text index to perform the full text search over the queried literal.

With a Virtuoso RDF View, the processor uses the relational database’s full text index instead of the triple store’s full text index (provided that the relational database supports full text indexes, naturally).

Some query examples

In this section I will show you how the full text feature of the Virtuoso RDF Views can be used to increase the performance of queries against the Musicbrainz RDF View modeled using the Music Ontology.

Note: if the system asks you for a login and a password to see the page, use the login name “demo” and the password “demo” to see the results of these SPARQL queries.

Example #1

A user remembers that the first name of a music artist is Paul, and that one of the albums composed by this artist is Press Play. This user wants to get the full name of the artist with the following SPARQL query:

sparql
define input:storage virtrdf:MBZROOT
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?artist_name ?album_title
FROM <http://musicbrainz.org/>
WHERE
{
?artist rdf:type mo:SoloMusicArtist .
?artist foaf:name ?artist_name .
?artist mo:creatorOf ?album .

?album rdf:type mo:Record .
?album dc:title ?album_title .

FILTER bif:contains(?artist_name, "Paul") .
FILTER bif:contains(?album_title, "Press and Play") .
};

Results of this query against the musicbrainz virtuoso rdf view

As you can see, this query uses the full text capabilities of Virtuoso over two different literals: the objects of the foaf:name and dc:title properties.

Example #2

In this example, the user wants to know the names of the albums published by Madonna between 1990 and 2000. The answer to this question is returned by the following SPARQL query:

sparql
define input:storage virtrdf:MBZROOT
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?albums_titles ?creation_date
FROM <http://musicbrainz.org/>
WHERE
{
?madonna rdf:type mo:SoloMusicArtist .
?madonna foaf:name ?madonna_name .
FILTER bif:contains(?madonna_name, “Madonna”) .

?madonna mo:creatorOf ?albums .
?albums rdf:type mo:Record .
?albums dcterms:created ?creation_date .
FILTER ( xsd:dateTime(?creation_date) > "1990-01-01T00:00:00Z"^^xsd:dateTime ) .
FILTER ( xsd:dateTime(?creation_date) < "2000-01-01T00:00:00Z"^^xsd:dateTime ) .
?albums dc:title ?albums_titles .
};

Results of this query against the musicbrainz virtuoso rdf view

Here the user uses the full text capabilities of the Virtuoso RDF Views to find artists with the name Madonna, and two filters on xsd:dateTime objects to find the albums that were created between 1990 and 2000.

Example #3

In this last example, the user wants to know the names of the members of the music group U2.

sparql
define input:storage virtrdf:MBZROOT
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX mo: <http://purl.org/ontology/mo/>
SELECT ?band_name ?member_name
FROM <http://musicbrainz.org/>
WHERE
{
?band rdf:type mo:MusicGroup .
?band foaf:name ?band_name .
?band_name bif:contains '"U2"' .
?band foaf:member ?members .
?members rdf:type mo:SoloMusicArtist .
?members foaf:name ?member_name .
};

Results of this query against the musicbrainz virtuoso rdf view

Here the user uses the full text feature to find the music group by its name; the names of the members related to this (these) music group(s) are then returned as well.

Special operators of a full text search

Some full text operators can be used in the literal parameter of the bif:contains clause. The operators are the same as those used in the full text feature of Virtuoso’s relational database. A list and a description of the operators can be found on that page.

I would only add that the near operator is defined as +/- 100 characters from the searched literal, and that the wildcard ‘*’ operator must be placed after at least the third character of the literal. So, “tes*t”, “tes*” and “test*” are legal usages of the wildcard operator, but “*test”, “t*” and “te*st” are illegal usages of the operator.
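The wildcard rule can be checked before a query is sent. The following Python sketch encodes only the rule as stated here (at least three characters before the first ‘*’); the function name is made up for illustration, and it does not claim to reproduce every check Virtuoso itself performs.

```python
def wildcard_is_legal(term, min_prefix=3):
    """Return True if term has no wildcard, or if its first '*' comes
    after at least min_prefix leading characters."""
    first_star = term.find("*")
    return first_star == -1 or first_star >= min_prefix


for term in ["tes*t", "tes*", "test*", "*test", "t*", "te*st"]:
    status = "legal" if wildcard_is_legal(term) else "illegal"
    print(f"{term}: {status}")
```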

Conclusion

Finally, as you can see, the full text feature available with the Virtuoso RDF Views is an essential feature that people should use to increase the performance of their SPARQL queries. The only two other options they have are: (1) using a plain literal, which has to be spelled exactly and with the right letter case, and which in practice renders such queries useless; and (2) using a FILTER with a regular expression and the “i” flag, which is far too slow for normal usage.
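To compare the three options side by side, here is a small Python sketch that prints the graph pattern fragment each one would use for the same artist name. The helper name and the ?artist and ?name variables are illustrative assumptions; the bif:contains fragment follows the FILTER form used in the examples of this post, and regex with the "i" flag is standard SPARQL. The sketch does no escaping, so it rejects input containing quote characters.

```python
def name_match_options(text):
    """Return three SPARQL fragments that look up an artist by name."""
    if '"' in text or "'" in text:
        raise ValueError("quote characters are not handled in this sketch")
    return {
        # (1) Plain literal: matches only the exact spelling and letter case.
        "plain_literal": f'?artist foaf:name "{text}" .',
        # (2) Standard SPARQL regex with the case-insensitive "i" flag.
        "regex": f'?artist foaf:name ?name . FILTER regex(?name, "{text}", "i") .',
        # (3) Virtuoso full text search through bif:contains.
        "full_text": f'?artist foaf:name ?name . FILTER bif:contains(?name, "{text}") .',
    }


for label, fragment in name_match_options("Madonna").items():
    print(f"{label}: {fragment}")
```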

Gone for the next 2 weeks

April 26, 2007 Frederick Giasson

I am leaving for California tomorrow morning and will be away until May 11. Until then I will be reachable via email, but with some latency, so please pardon me if I don’t answer the same day you send the email.

Do not hesitate to send me an email if you have questions, comments or suggestions about my work. It will be a pleasure for me to answer them; it will just take a little longer than usual.

Zitgist in one image

April 25, 2007 Frederick Giasson

Converting your WordPress and Mediawiki data into RDF on-the-fly

April 24, 2007 Frederick Giasson

Semantic Web (RDF) data won’t come from initiatives such as LiveJournal.com and Tribe.net exporting their user profiles into RDF using the FOAF ontology; at least not at first. These initiatives are marginal considering the current state of the Web: billions of web pages, most of them stored in relational databases and generated, on-the-fly, in HTML.

Semantic Web (RDF) data will come from the conversion of the relational databases of widely used web software such as WordPress, Mediawiki and phpBB into RDF using some ontologies. Two methods can be used:

Developing specialized scripts to perform the mapping between the database schema and its RDF representation (this is how the WordPress exportation plugin into SIOC is currently working). But this method is time consuming, and the scripts are hard to maintain.
Developing RDF Views. This is how we currently convert the Musicbrainz.org relational database into RDF using the Music Ontology.

This blog post will show you how we can do the same with your WordPress blog and your Mediawiki wiki using Virtuoso RDF Views.

This is quite powerful: by using these views any WordPress or Mediawiki instance could be queried using SPARQL. Other views could easily be created for phpBB (currently on the way), and virtually any relational database accessible from the Web.

Since developing these views is quick and simple, they are certainly one of the best tools to convert current relational data sources into RDF.

WordPress and Mediawiki RDF Views

Mitko Iliev developed these two RDF Views, which take the WordPress and Mediawiki database schemas and convert them into RDF. I added some comments in the code, but as you will notice, the views are quite simple and intuitive to understand (if you have some knowledge of SPARQL).

Installing these RDF Views

You have 3 possibilities to install these RDF Views.

If you have the commercial version of Virtuoso you only have to connect the remote MySQL database to Virtuoso via Conductor. That way you will see the MySQL databases as if they were local to Virtuoso.
If you have the open-source version of Virtuoso you have two choices:
1. You make a SQL dump of the MySQL database and import it into Virtuoso.
2. You install the upgraded version of WordPress or Mediawiki developed by OpenLink Software. These upgraded versions of WordPress and Mediawiki use Virtuoso as their DBMS instead of MySQL. These two versions should be made available to the public by OpenLink soon.

The idea here is to give access to the relational data to Virtuoso by using one of these three methods. After that, it is just a matter of sending SPARQL queries against the RDF View.
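As a rough illustration of what "sending a SPARQL query" involves on the client side, the Python sketch below builds a request URL following the standard SPARQL protocol convention of passing the query text in a `query` parameter. The endpoint address is a hypothetical placeholder and the query is a generic one; a query against an RDF View would presumably also name its storage with a `define input:storage` line, as the Musicbrainz examples do, but the storage name for these views is not given here.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the address of your own Virtuoso instance.
ENDPOINT = "http://example.org/sparql"

# Generic query used only for illustration.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def build_sparql_url(endpoint, query):
    """Return a GET URL that carries the query in the 'query' parameter."""
    return f"{endpoint}?{urlencode({'query': query})}"


url = build_sparql_url(ENDPOINT, QUERY)
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client.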

Querying a MediaWiki instance using SPARQL

I will use that MediaWiki instance to show you a couple of examples. This is a modified version of MediaWiki 1.7 that uses Virtuoso instead of MySQL as its DBMS. We then installed the RDF View I talked about above. From that point, we can query this Mediawiki wiki instance using SPARQL. Remember that it is still running on a relational database, but thanks to the RDF View, we can view its data in RDF too!

Listing all triples from the RDF view: See results
Listing the names of the Wikis hosted on this server: See results
Listing the wiki pages of the “DemoWiki” wiki instance: See results
Listing the wiki pages created by the “demo” user: See results

Etc.

We could continue like that endlessly. What I would suggest is to click on the results you get in these web pages, and then on the “explore” link. That way, you will jump from node to node and find interesting stuff.

Conclusion

I believe that it is the best way to push people to adopt the semantic web, and all its concepts, as the way to describe things on the Web. Once we get all that useful data from existing sources (musicbrainz, US census data, geonames, you name it) and people start to release services using all this data in a useful way, people will start to generate their content for the semantic web. This is why we should continue in that direction. Many people are already working to convert existing sources of data (relational databases, web APIs, etc.) into RDF: the linked-open-data community, Zitgist, OpenLink, and probably many others. I would guess (in fact I am sure) that in one year we will have several billion triples ready to be searched and browsed by Web users.

The XBRL Ontology: Financial and Economic Ontology based on XBRL Taxonomies

April 21, 2007 Frederick Giasson

A new ontology development group has been formed: the XBRL Ontology Specification Group. This new ontology will describe financial and economic data in RDF.

Introduction to the XBRL Ontology

As introduced by Kingsley:

The parallel evolution of the XBRL and the Semantic Web is one of the more puzzling of technology misnomers. The Semantic Web expresses a vision about a Web of Data connected by formal meaning (Context). Congruently, XBRL espouses a vision whereby by formally defined Financial Data is accessible via the *Web (and other networks). The Semantic Web uses Schemas and Ontologies for defining Data Domains while XBRL uses Taxonomies that are XML Schema Based. The Semantic Web uses XML as one of its Data Interchange formats (i.e RDF/XML) while XBRL is based on XML at all levels (model and instance data).

It is the goal of the XBRL Ontology project that we mesh the XBRL and Semantic Web realms by producing OWL based Ontologies of XBRL Schemas that facilitate the generation of RDF Instance Data for XBRL Data Sources (e.g. XBRL Documents). This effort is not intended to supercede the use of XML Schemas in XBRL in any way. It simply provides a mechanism for exposing XBRL based Financial Data to the Semantic Web.

What are the anticipated deliverables:

OWL Ontologies for XBRL Taxonomies such as the XBRL GL (and others)

RDF instance data for said Ontologies

SPARQL (Semantic Web Query Language) based Access Points for XBRL Instance Data

Benefits:

Transparent integration of disparate financial systems

Mapping of application data (e.g. SQL) to relevant XBRL Ontologies which are then exposed to WAN (Web) or LAN (Intranet) via SPARQL access points

Easy mechanism for plugging into burgeoning Semantic Data Web

People currently participating in the project

Some people have already started to talk about the development of the XBRL Ontology and are interested in joining (or have already joined) this new ontology development group. These people are:

Eric E. Cohen
Mark C. Bolgiano
Sehl Mellouli
Rubén Lara Hernández
Thierry Declerck
Kingsley Idehen
Frederick Giasson

Development communication infrastructure

Some systems are already up and running to help the development team to communicate their ideas, suggestions and questions vis-à-vis the XBRL Ontology.

Conclusion

This new ontology development project aims to describe financial and economic data for exchange and analysis. Some people have already started to work on the project, as you can see in the list above. The development of this ontology will be based on the XBRL initiative and existing XBRL taxonomies, but it won’t restrict its expressiveness to XBRL-related work only.