RDF Aggregates and Full Text Search on Steroids with Solr

Preamble

As I explained in my latest blog post, I am now starting to talk about a few things I have been working on over the last few months, things that will lead to a release by Structured Dynamics in the coming months. This blog post is the first step down that path. Enjoy!

Introduction

I have been working with RDF, SPARQL and triple stores for years now. I have created many prototypes and online services using these technologies. Being able to describe anything with RDF, and to index it all in a triple store that you can easily query however you want using SPARQL, is priceless. Using RDF saves development and maintenance costs because of the flexibility of the store (the triple store), the query language (SPARQL), and the associated schemas (ontologies).

However, even if this set of technologies can do nearly everything, quickly and efficiently, it is not necessarily optimal for every task you have to perform. As we will see in this blog post, we use RDF to describe, integrate and manage any kind of data (structured or unstructured) that exists out there. RDF plus ontologies are what we use as the canonical expression of any kind of data. The triple store is what we use to aggregate, index and manage that data, from one or multiple data sources. It is the same triple store that we use to feed any other system in our architecture. The triple store is the data orchestrator of any such architecture.

In this blog post I will show you how this orchestrator can be used to create Solr indexes that are used in the architecture to perform three functions that Solr has been built to perform optimally: full-text search, aggregates and filtering. While a triple store can also perform these functions, it is not optimal for what we have to do.

Overview

The idea is to use the RDF data model and a triple store to populate the Solr index. We leverage the powerful and flexible data representation framework (RDF), in conjunction with a piece of software that lets you do whatever you want with that data (Virtuoso), to feed a carefully tailored Solr index that optimally performs three things: full-text search, aggregates and filtering. We also want to leverage the ontologies used to describe this data to infer things about the resources indexed in Solr. This lets us use inference on full-text search, aggregates and filtering, in Solr! This is quite important, since you will be able to perform full-text searches filtered by types that are inferred!

Some people will tell me that they can do this with a traditional relational database management system: yes, they can. However, RDF + SPARQL + a triple store is so powerful for integrating any kind of data from any data source, and so flexible, that it saves precious development and maintenance resources, and thus money.

Solr

What we want to do is to create some kind of “RDF” Solr index. We want to be able to perform full-text searches on RDF literals; we want to be able to aggregate RDF resources by the properties that describe them, and their types; and finally we want to be able to do all the searches, aggregation and filtering using inference.

So the first step is to create the proper Solr schema that will let you do all these wonderful things.

The current Solr index schema can be downloaded here. (View source if simply clicking with your browser.)

Now, let’s discuss this schema.

Solr Index Schema

A Solr schema is basically composed of two things: fields and field types. For this schema, we only need two field types: string and text. If you want more information about these two types, I refer you to the Solr documentation for a complete explanation of how they work. For now, just think of them as strings and text.

What interests us is the list of fields defined in this schema (again, see download); an illustrative record using these fields is sketched after the list:

  • uri [1] – Unique resource identifier of the record
  • type [1-N] – Type of the record
  • inferred_type [0-N] – Inferred type of the record
  • property [0-N] – Property identifier used to describe the resource, whose object is a literal
  • text [0-N] (same number as property) – Text of the literal of the property
  • object_property [0-N] – Property identifier used to describe the resource where the object is a reference to another resource, which can itself be described by a literal
  • object_label [0-N] (same number as object_property) – Text used to refer to the resource referenced by the object_property
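
To make this schema more concrete, here is a minimal sketch, in Python, of what a single indexed record might look like before it is serialized into a Solr update message. The field names follow the schema above; the URIs and literal values are purely illustrative and are not taken from an actual dataset.

    # A hypothetical record: an article described by a title literal and by a
    # reference to its author, whose label is copied into the index.
    # Field names follow the Solr schema above; values are illustrative only.
    solr_document = {
        "uri": "http://example.org/articles/article-1",
        "type": ["http://purl.org/ontology/bibo/Article"],
        "inferred_type": ["http://purl.org/ontology/bibo/Document"],
        "property": ["http://purl.org/dc/terms/title"],
        "text": ["A test of RDF aggregates with Solr"],
        "object_property": ["http://purl.org/dc/terms/creator"],
        "object_label": ["Bob Carron"],
    }

    for field, values in solr_document.items():
        print(field, "=", values)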

Full Text Search

An RDF document is a set of multiple triples describing one or multiple resources. Saying that you are doing full-text searches on RDF documents is certainly not the same thing as saying that you are doing full-text searches on traditional text documents. When you describe a resource, you rarely have more than a couple of strings, with a couple of words each. It is generally the name of the entity, or a label that refers to it. You will have various numbers, and sometimes a description (a short biography, definition, or summary, for example). However, unless you index an entire text document, the “textual abundance” is quite poor compared to an indexed corpus of documents.

In any case, this doesn’t mean that there are no advantages in doing full-text searches on RDF documents (so, on RDF resource descriptions). But, if we are going to do so, let’s do so completely, and in a way that meets users’ expectations for full-text document search.  By applying this mindset, we can apply some cool new tricks!

Intuitively, the first implementation of a full-text search index on RDF documents would simply make a key-value assignment between a resource URI and its related literals. So, when you perform a full-text search for “Bob”, you get back references to all the resources that have “Bob” in one of the literals that describe them.

This is good, but it is not enough. It is not enough because it breaks the most basic behavior users expect from full-text search engines.

Let’s say that I know that the author of many articles is named “Bob Carron”. I have no idea what the titles of the articles he wrote are, so I want to search for them. With the system described above, if I search for “Bob Carron”, I will most likely get back a reference to “Bob Carron”, the author himself. This is good, but it is not enough.

On the results page, I want the list of all the articles that Bob wrote! Because of the nature of RDF, I don’t have this “full-text” information about “Bob” in the description of the articles he wrote. Most likely, in RDF, Bob will be related to the articles he wrote by reference (an object reference using Bob’s URI), i.e., <this-article> <author> <bob-uri>. As you can see, we won’t get back any articles in the resultset for the full-text query “Bob Carron”, because this textual information doesn’t exist in the index at the level of the articles he wrote!

So, what can we do?

A simple trick does the job beautifully. When we create the Solr index, what we want is to add the textual information of the resources referenced by the indexed resources. For example, when we create the Solr document that describes one of the articles written by Bob, we want to add the literals that refer to the resource(s) referenced by this article. In this case, we want to add the name of the author(s) to the full-text record of that article. So, with this simple enhancement, if we do a search for “Bob Carron”, we will now get the list of all resources that refer to Bob too (articles he wrote, other people who know him, etc.)!

So, this is the goal of the “object_property” and “object_label” fields of the Solr index. Using the schema above, the “object_property” would be “author” and the “object_label” would be “Bob Carron”. This information would belong to the Solr document for Article 1.
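
For the curious, here is a minimal sketch of the kind of SPARQL query that can gather this extra text at indexing time. The endpoint URL, the resource URI and the choice of rdfs:label as the labeling property are assumptions for illustration, not the exact queries used by Structured Dynamics.

    import requests

    # Hypothetical Virtuoso SPARQL endpoint; adjust to your installation.
    SPARQL_ENDPOINT = "http://localhost:8890/sparql"

    # For a given resource, fetch the properties whose objects are other
    # resources, together with a label for each referenced resource.
    query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?property ?label
    WHERE {
      <http://example.org/articles/article-1> ?property ?object .
      ?object rdfs:label ?label .
    }
    """

    results = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "application/sparql-results+json"},
    ).json()

    # These pairs populate the object_property and object_label fields of
    # the article's Solr document.
    for binding in results["results"]["bindings"]:
        print(binding["property"]["value"], "->", binding["label"]["value"])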

Full Text Search Prototype

Let’s take a look at the running prototype system (see screen capture below).



The dataset loaded in this prototype is Mike’s Sweet Tools. As you can see in the prototype screen, many things can be done with the simple Solr schema we published above. Let’s start with a search for the word “test”. First, we get a resultset of 17 things that have the word “test” in any of their text-indexed fields.

What is interesting about that list is the additional information we now have for each result, which comes from the RDF descriptions of these things and from the ontologies used to describe them.

For example, if we take a look at Result #4, we see that the word “test” has been found in the description of the Ontology project for the “TONES Ontology Repository” record. Isn’t that precision far more useful than saying: the word “test” has been found in “this webpage”? I’ll let you think about it.

Also, if we take a look at Result #1, we know that the word “test” has been found in the homepage of the Data Converter Project for the “Talis Semantic Converter” record.

Additionally, by leveraging this Solr index, we can do efficient aggregates on the types of the things returned in the resultset for further filtering. So, in the section “Filter by kinds” we know what kinds of things are returned for the query “test” against this dataset.

Finally, we can use the drop-down box at the right to do a new search (see screenshot), based on the specific kinds of things indexed in the system. So, I might want to make a new search, only for “Data specification projects”, with the keyword “rdf”. I already know from the user interface that there are 59 such projects.
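
For the technically curious, this is roughly what such a query looks like against the Solr schema described above, sketched here in Python against Solr's standard select handler. The core URL, the field values and the type URI are illustrative assumptions, not the actual configuration of the prototype.

    import requests

    # Hypothetical Solr core; adjust host, port and core name as needed.
    SOLR_SELECT = "http://localhost:8983/solr/select"

    params = {
        "q": "text:rdf OR object_label:rdf",  # the full-text part of the query
        "fq": 'type:"http://example.org/ontology/DataSpecificationProject"',
        "facet": "true",                      # ask Solr for aggregates
        "facet.field": "type",                # counts per kind of thing
        "rows": 10,
        "wt": "json",
    }

    response = requests.get(SOLR_SELECT, params=params).json()

    print("Matches:", response["response"]["numFound"])
    # Facet counts come back as a flat [value, count, value, count, ...] list.
    facets = response["facet_counts"]["facet_fields"]["type"]
    for value, count in zip(facets[::2], facets[1::2]):
        print(count, value)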

All this information comes from the Solr index at query time, and basically for free by virtue of how we set up the system. Everything is dynamically aggregated and displayed to the user.

However, there are a few things used here that you won’t notice: 1) SPARQL queries to the triple store to get additional information to display on that page; 2) the use of inference (more about it below); and 3) the leveraging of the ontology descriptions.

In any case, on one of SD’s test datasets of about 3 million resources, such a page is generated within a few hundred milliseconds: resultset, aggregates, inference and the descriptions of the things displayed on that page. And those results for 3 million resources came back in a few hundred milliseconds on a small Amazon EC2 server instance costing 10 cents per hour. How’s that for performance?!

Aggregates and Filtering on Properties and Types

But we don’t want to merely do full-text search on RDF data. We also want to do aggregates (how many records have this type, or this property, etc.) and filtering, at query time, in a couple of milliseconds. We already had a look at these two functions in the context of a full-text search. Now let’s see them in action in some prototype dataset browsing tools that use the same Sweet Tools dataset.

In a few milliseconds, we get the list of the different kinds of things that are indexed in a given dataset. We can know what the types are, and what the count is for each of these types. So, the ontologies drive the taxonomic display of the list of things indexed in the dataset, and Solr drives the aggregate counts for each of these types of things.

Additionally, the ontologies and the Virtuoso inference rules engine are used to compute the counts by inference. If we take the example of the type “RDF project”, we know there are 49 such projects. However, not all of these projects are explicitly typed with the “RDF project” type. In fact, 7 of these “RDF projects” are “RDF editor projects” and 6 are “RDF generator projects”.

This is where inference can play an important role: an article is a document, so if I browse documents, I want to include articles as well. This “broad context retrieval” is driven by the ontology descriptions and by inference; the same applies to these projects, and to everything else that is stored as structured RDF and characterized by an ontology.
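
Because the inferred types are materialized in the index, this “browse documents, get articles too” behavior becomes a simple filter query at search time. Here is a sketch, with a hypothetical class URI standing in for “RDF project”; it is illustrative only.

    import requests

    SOLR_SELECT = "http://localhost:8983/solr/select"  # hypothetical core

    # Records explicitly typed as "RDF editor project" or "RDF generator
    # project" also carry the parent class in their inferred_type field,
    # so one filter on the parent class returns all of the projects.
    params = {
        "q": "*:*",
        "fq": '(type:"http://example.org/ontology/RdfProject" '
              'OR inferred_type:"http://example.org/ontology/RdfProject")',
        "rows": 0,  # we only want the count, not the records themselves
        "wt": "json",
    }

    count = requests.get(SOLR_SELECT, params=params).json()["response"]["numFound"]
    print("RDF projects (asserted + inferred):", count)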

The screenshot above shows how these inferences and their nestings could present themselves in a user interface.

Once the user clicks on one of these types, he starts to browse all things of that type. On the next screenshot below, Solr is used to add filters based on the attributes used to describe these things.

In some cases, I may want to see all the Projects that have a review. To do so, I would simply add this filter criterion on the browsing page and display the “Projects” that have a “review”. And thanks to Solr, I already know how many such Projects have reviews, before even taking a look at them.
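
Under the hood, this kind of filter is again just a facet plus a filter query, this time on the “property” field. A sketch, with an assumed core URL and an assumed property URI for reviews:

    import requests

    SOLR_SELECT = "http://localhost:8983/solr/select"  # hypothetical core

    params = {
        "q": 'type:"http://example.org/ontology/Project"',
        "facet": "true",
        "facet.field": "property",  # which attributes describe these records?
        # Keep only the projects that are described by a review property.
        "fq": 'property:"http://example.org/ontology/review"',
        "rows": 0,
        "wt": "json",
    }

    response = requests.get(SOLR_SELECT, params=params).json()
    print("Projects with a review:", response["response"]["numFound"])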

Note, then, on this screenshot that the filters and counts come from Solr. The list of the actual items returned in the resultset comes from a SPARQL query, and the names of the types and properties (and their descriptions) come from the ontologies used.

This is what all this stuff is about: creating a symbiotic environment where all these wonderful systems live together to effectively manage structured data.

Populating the Solr Index

Now that we know how to use Solr to perform full-text search, aggregation and filtering of structured data, one question remains: how do we populate this index? As stated above, the goal is to manage all the structured data of the system using a triple store and ontologies, and then to use this triple store to populate the Solr index.

Structured Dynamics uses Virtuoso Open Source as the triple store to populate this index, for multiple reasons. One of the main ones is its performance and its capability to do efficient basic inference. The goal is to send the proper SPARQL queries to get the structured data that we then index into the Solr schema discussed above. Once this is done, all the things that I talked about in this blog post become possible, and efficient.
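
Here is a minimal sketch of such a population step, assuming a Virtuoso SPARQL endpoint on its default port and a Solr core that accepts XML update messages. The graph URI and the field mapping are simplified for illustration: only the uri, property and text fields are filled, while type, inferred_type and the object label fields would be gathered with similar queries.

    import requests
    from xml.sax.saxutils import escape

    SPARQL_ENDPOINT = "http://localhost:8890/sparql"    # hypothetical Virtuoso
    SOLR_UPDATE = "http://localhost:8983/solr/update"   # hypothetical Solr core

    # Grab every resource of a given graph along with its literal values.
    query = """
    SELECT ?s ?p ?literal
    FROM <http://example.org/dataset/>
    WHERE { ?s ?p ?literal . FILTER(isLiteral(?literal)) }
    """

    bindings = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "application/sparql-results+json"},
    ).json()["results"]["bindings"]

    # Group the triples into one Solr document per subject URI.
    docs = {}
    for b in bindings:
        doc = docs.setdefault(b["s"]["value"], {"uri": b["s"]["value"],
                                                "property": [], "text": []})
        doc["property"].append(b["p"]["value"])
        doc["text"].append(b["literal"]["value"])

    # Serialize the documents into a Solr <add> message and post it.
    parts = []
    for doc in docs.values():
        parts.append("<doc>")
        parts.append('<field name="uri">%s</field>' % escape(doc["uri"]))
        for prop, text in zip(doc["property"], doc["text"]):
            parts.append('<field name="property">%s</field>' % escape(prop))
            parts.append('<field name="text">%s</field>' % escape(text))
        parts.append("</doc>")
    add_message = "<add>%s</add>" % "".join(parts)

    requests.post(SOLR_UPDATE, data=add_message.encode("utf-8"),
                  headers={"Content-Type": "text/xml"})
    requests.post(SOLR_UPDATE, data="<commit/>",
                  headers={"Content-Type": "text/xml"})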

Syncing the Index

However, in such a setup, we have to keep one thing in mind: each time the triple store is updated (a resource is created, deleted or updated), we have to sync the Solr index according to these modifications.

What we have to do is detect any change in the triple store and reflect that change in the Solr index. In practice, this means re-creating the entire Solr document (for the resource that changed in the triple store) using the <add /> operation.

This design raises an issue with using Solr: we cannot simply modify one field of a record. We have to re-index the entire description of the document even if we want to modify a single field. This is a limitation of Solr that is being addressed in this new feature proposal, but it is not yet ready for prime time.
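
Concretely, the sync step therefore rebuilds the whole document for the resource that changed and sends it again; assuming uri is declared as the schema's unique key, the new <add /> simply replaces the old document. The following sketch uses hypothetical URLs and a trivial field set.

    import requests
    from xml.sax.saxutils import escape

    SOLR_UPDATE = "http://localhost:8983/solr/update"  # hypothetical core

    def reindex(doc):
        # Re-add the full document; Solr replaces any existing document that
        # shares the same unique key (assumed here to be the "uri" field).
        fields = ['<field name="uri">%s</field>' % escape(doc["uri"])]
        for prop, text in zip(doc["property"], doc["text"]):
            fields.append('<field name="property">%s</field>' % escape(prop))
            fields.append('<field name="text">%s</field>' % escape(text))
        message = "<add><doc>%s</doc></add>" % "".join(fields)
        requests.post(SOLR_UPDATE, data=message.encode("utf-8"),
                      headers={"Content-Type": "text/xml"})
        requests.post(SOLR_UPDATE, data="<commit/>",
                      headers={"Content-Type": "text/xml"})

    # Called whenever a resource changes in the triple store; in practice the
    # document would be rebuilt from a fresh SPARQL query on that resource.
    reindex({"uri": "http://example.org/articles/article-1",
             "property": ["http://purl.org/dc/terms/title"],
             "text": ["A revised article title"]})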

Another thing to consider here is properly syncing the Solr index with any ontology changes (at the level of the class descriptions) if you are using the inference feature. For example, assume you have an ontology that says that class A is a sub-class-of class B. Then, assume the ontology is refined to say that class A is now a sub-class-of class C, which itself is a sub-class-of class B. To keep the Solr index synced with the triple store, you will have to propagate the modifications to all the records of the affected types. This means that synchronization doesn’t only occur at the level of the description of a record, but also at the level of changes in the ontologies used to describe those records.
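
A sketch of what this ontology-driven resynchronization could look like: find every record whose asserted type is affected by the changed class hierarchy, then push each of them through the same full re-indexing step as above. The endpoint, the class URI and the re-indexing call are illustrative placeholders.

    import requests

    SPARQL_ENDPOINT = "http://localhost:8890/sparql"  # hypothetical Virtuoso

    # Class A moved in the hierarchy, so every instance of class A needs its
    # inferred_type field recomputed and its Solr document re-created.
    query = """
    SELECT ?record
    WHERE { ?record a <http://example.org/ontology/ClassA> . }
    """

    bindings = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query, "format": "application/sparql-results+json"},
    ).json()["results"]["bindings"]

    for b in bindings:
        # Placeholder: rebuild the full Solr document for this record, with its
        # new inferred types, and re-add it as shown in the previous sketch.
        print("to re-index:", b["record"]["value"])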

Conclusion

One of the main things to keep in mind here is that now, when we develop Web applications, we are not necessarily talking about a single software application, but about a group of software applications that together compose an architecture delivering one or more services. At the center of any such architecture is the data.

Describing, managing, leveraging and publishing this data is at the center of any Web service. That is why it is so important to have the right flexible data model (RDF), the right flexible query language (SPARQL), and the right data management system (triple store) in place. From there, you can use the right tools to make it available on the Web to your users.

The right data management system is what should be used to feed any other specific systems that compose the architecture of a Web service. This is what we demonstrated with Solr; but it is certainly not limited to it.

Starting of a New Era

More than three months ago I announced the creation of Structured Dynamics LLC. Since then I have stayed mute on my blog because I was too busy doing research, software development and business development for this new venture with Mike. However, things are moving fast here, and now is the time to start talking about some of the technical stuff we have been working on for about five months now.

I am not announcing anything right now, and won’t for the next couple of months either. As you probably expect, we are working on some products and services. I can say that everything will be released as open source under the Apache 2 license. However, I won’t say anything else about it for now.

But what I will do in the following days, weeks and months is start talking about the underlying technologies of this system and about the methods we used to solve some of our problems. There won’t be any particular order, so I will simply talk about interesting things we found as it pleases me. We think we are making some useful advances, especially on the architectural and design side, and look forward to sharing what we are learning with you.

Stay tuned!

Different World Views (TBox) for the same Structs (ABox)

Mike continues his series of blog posts about the distinction between ABoxes (the assertion box; the data instances box) and TBoxes (the terminology box; the data schemas box). Mike suggests that people make a distinction between the data instances (individuals), which belong to the ABox, and the vocabularies (schemas, ontologies, or whatever you call these formal specifications of conceptualizations), which belong to the TBox.

I wanted to hammer home an important point that emerged in our recent discussions of these questions: the TBox defines the language used to describe different kinds of things, and the ABox is the actual description of those things. However, there is an important distinction to make here: there is a difference between using some properties to describe a thing and understanding what the use of those properties means.

Let’s take the use case of two systems that exchange data. The data instances transmitted between the two systems will be exactly the same: their ABox description will be the same; they will use the same properties and the same values to describe the same things. However, nothing tells us how each of these properties will be processed, understood and managed by these two systems. Each system has its own Worldview. This means that their TBoxes (the meanings of the classes and properties used to describe data instances) will probably be different, and so the same descriptions will be interpreted and handled differently.

I think the fact that two systems may process the same information differently is the lesser evil. This is no different from how humans communicate. Different people have different Worldviews that dictate how they see and reason over things. One person can look at a book and think of it as a piece of art, where another person may say: “Great! I finally have something to start that damned fire!” The description of the thing (the book) didn’t change, but its meaning changed from one person to another. Exactly the same thing applies to systems exchanging data instances.
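
Here is a toy sketch of that book example in code, using rdflib: the ABox statement is identical for both systems, but each system's little TBox leads it to a different conclusion about the very same resource. The namespace and class names are made up for illustration.

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/")

    PREFIXES = """@prefix ex: <http://example.org/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    """

    # The same ABox assertion, shared by both systems.
    ABOX = "ex:book1 a ex:Book .\n"

    # Two different Worldviews (TBoxes) over the same vocabulary.
    TBOXES = {
        "System A": "ex:Book rdfs:subClassOf ex:WorkOfArt .\n",
        "System B": "ex:Book rdfs:subClassOf ex:FireStartingMaterial .\n",
    }

    for system, tbox in TBOXES.items():
        g = Graph()
        g.parse(data=PREFIXES + ABOX + tbox, format="turtle")
        # Naive one-step subsumption: what does this system conclude about
        # the very same instance?
        for cls in g.objects(EX.book1, RDF.type):
            for parent in g.objects(cls, RDFS.subClassOf):
                print(system, "sees", EX.book1, "as a", parent)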

This is really important, since considerations of the TBox (how data instances are interpreted) shouldn’t be bound to considerations of the ABox (the actual data instances that are transmitted). Otherwise, no systems would ever be able to exchange data, considering that they will more than likely always hold different Worldviews for the same data (they will handle and reason over data instances differently).

I think this is a really important thing to keep in mind going forward, because there won’t ever be a single set of ontologies to describe everything on the semantic web. There will be multiple ontologies that describe the same things, and there will be an endless number of versions of these ontologies (there are already many). And finally, the cherry on the cake: how these ontologies are handled and implemented differs from system to system!

But take care here: this doesn’t mean that we can’t exchange meaningful data between different systems. It only means that different Worldviews exist, and that care should be taken not to mix the data with the interpretation of the concepts. This is yet another reason why we have to split apart the concerns of the ABox and the TBox.

Structured Dynamics for the New Year

 

For this New Year, Mike and I wanted to introduce our new venture: Structured Dynamics LLC.

Structured Dynamics is dedicated to assisting enterprises, non-profit organizations and projects in adopting Web-accessible and interoperable data. The basic premise is that the data itself becomes the application: by virtue of its structure, information can be combined, inferred, analyzed, filtered by tag or facet, queried, searched, reported, templated or visualized. A suite of Web services provides these capabilities, generalized to be driven by the structure of the input data itself.

Structured Dynamics supports both open and proprietary data, including the extraction of structure from fully structured data (RDF), from conventional structured data (such as relational databases), from unstructured (text) data, and from semi-structured (metadata, tags and mark-up) sources.  SD’s professional services include:

  • Linked data training and education
  • Project evaluation and planning
  • Legacy data conversions
  • Vocabulary (ontology) development and mapping
  • Named entity (instance) dictionary creation
  • Information extraction, and
  • Architectural design, development and deployment assistance.

Structured Dynamics is platform- and language-neutral, though all of our services are based on open source software. Mike and I have been advocates of linked data done right, as our frequent and oft-cited blog posts attest.

You can read the whole story here and here.

Structured Dynamics and Zitgist

For the past two and a half years I put all my time, energy, effort and knowledge into developing Zitgist LLC’s products and services. During all that time I had the opportunity to work with a great company (OpenLink Software Inc.); with Kingsley and his dedicated team; and with the best database management system that I have had the chance to use (a Swiss Army knife that lets you do anything with any kind of data): Virtuoso.

However, life is full of events, and it is these events that forge someone’s life. The creation of Structured Dynamics is one of these events, just as Zitgist was.

I am really grateful to OpenLink and Kingsley.

The Next Bibliographic Ontology: OWL

The Bibliographic Ontology’s aim is to be expressive and flexible enough that any existing bibliographic legacy schema (such as BibTeX and its extensions, MARC, Elsevier’s SDOS & CITADEL citation schemas, etc.) or RDFS/OWL ontology can be converted to it.

This new BIBO version 1.2 is the result of more than a year of thinking and discussion among 101 community members, spread across 1254 mail messages. The project’s first aim of expressiveness and flexibility is nearly reached. BIBO’s ongoing development is now pointing to a series of methods and best practices for mature ontology development.

Some mappings between BIBO and legacy schemas have been developed, and this trend will now accelerate. More people are getting interested in BIBO’s ability to describe bibliographic resources. Some people are interested in it to describe bibliographic citations; others are interested in using it to integrate data from different bibliographic data sources, using different schemas, into a single, normalized data source. This single data source (in RDF) can then easily be queried, managed and published. Finally, other people are interested in it as a standard agreed upon by an open community that helps them describe bibliographic data meant to be published and consumed by different kinds of data consumers (such as standalone software like Zotero, or citation aggregation Web services like Scirus or Connotea).

With this BIBO 1.2 release, much has changed and been improved. Now, it is time for the community to start implementing BIBO in different systems; to create more mappings; and to complete more converters.

Design Redux

As you may recall from its early definition, BIBO has been designed for both: (1) a core system with extensions relevant to specific domains and uses, and (2) a collaborative development environment governed by the community process.

These design imperatives have guided much of what we have done in this new version 1.2 release to aid these objectives.

BIBO in OWL 2

The new version of BIBO is now described using OWL 2. In the next sections you will see why we chose to use OWL 2 to describe BIBO going forward. However, saying that it is OWL 2 doesn’t mean that it becomes incompatible with everything else that exists. In fact, it still validates as OWL 1.1 and its DL expressivity is SHOIN(D); this means that fundamentally nothing has changed, but that we are now leveraging a couple of new tools and concepts introduced by OWL 2.

As you will see below this decision results in much more than a single update of the ontology. We are introducing an updated, and more efficient, architecture to develop open source ontologies such as The Bibliographic Ontology.

New Versioning System

OWL 2 introduces a new versioning and import system for OWL ontologies. This feature alone strongly argued for the adoption of OWL 2 as the way to develop BIBO going forward.

This new versioning system consists of two things: an ontologyURI and a versionURI. The heuristics to define, check, and cache an ontology that has an ontologyURI and possibly a versionURI are described here.

BIBO has an ontologyURI and multiple versionURIs such as http://purl.org/ontology/bibo/1.0/, http://purl.org/ontology/bibo/1.1/, and http://purl.org/ontology/bibo/1.2/.

The current version of the ontology is 1.2. This means that the current version of BIBO is located in two places: http://purl.org/ontology/bibo/ and http://purl.org/ontology/bibo/1.2/.

The location logic of ontologies is described here. What we have to be careful about here is that if someone dereferences any class or property of BIBO, they will always get the description of that class or property from the latest version of the ontology. This is why the caching logic is quite important: the user agent has to make sure that it caches the version of the ontology that it knows.

What is really important to understand is that the URI of the ontology won’t change over time when we introduce new versions of the same ontology. Only the location of these versions will change.

Finally, the OWL 2 mapping to RDF document tells us that we have to use the owl:versionInfo OWL property to define the versionURI of an ontology. This is the reason why the use of this OWL 2 versioning system doesn’t affect the validity of BIBO as an OWL 1.1 ontology: owl:versionInfo is also an OWL 1.1 property.
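
As a small sketch, this is how such an ontology header could be emitted with rdflib, following the convention described above of carrying the version URI in owl:versionInfo. The snippet is illustrative and is not the actual header of bibo.xml.

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import OWL, RDF

    ONTOLOGY_URI = URIRef("http://purl.org/ontology/bibo/")

    g = Graph()
    g.add((ONTOLOGY_URI, RDF.type, OWL.Ontology))
    # The ontologyURI stays stable across releases; the version is carried by
    # owl:versionInfo, which keeps the document valid as OWL 1.1 as well.
    g.add((ONTOLOGY_URI, OWL.versionInfo,
           Literal("http://purl.org/ontology/bibo/1.2/")))

    print(g.serialize(format="xml"))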

Now, let’s take a look at the tools that we will use to continue the development of BIBO.

Protégé 4 for Developing BIBO

We chose to rely on Protégé 4 to develop BIBO from now on. We wanted to start using a tool that would help the community develop the ontology. Considering that the Protégé 4 beta was released in August, that it supports OWL 2 through the OWL API library, and that many plugins are already supported, it is the best free and open source option available.

What I have done is add some SKOS annotation properties to annotate the BIBO classes and properties, to help us edit and comment on the ontology. Here is the list of annotation properties we introduced (a small sketch showing how they can be attached follows below):

  • skos:note is used to write general notes
  • skos:historyNote is used to write historical comments
  • skos:scopeNote is really important. It is the new way to flag the classes and properties, imported from external ontologies, that we recommend using to describe one aspect of BIBO. The scope note tells users the expected usage of these external resources.
  • skos:example is used to give examples that show how to use a given class or property. Think of RDF/XML or RDF/N3 code examples.

Finally, all these annotations are included in BIBO’s namespace.
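
As a small sketch, here is how such an annotation could be attached to a class with rdflib; the class chosen and the wording of the notes are illustrative only, since the real annotations are maintained in bibo.xml.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import SKOS

    BIBO = Namespace("http://purl.org/ontology/bibo/")

    g = Graph()
    # Illustrative annotations; the actual notes live in bibo.xml.
    g.add((BIBO.Article, SKOS.scopeNote,
           Literal("The expected usage of this class within BIBO is noted here.")))
    g.add((BIBO.Article, SKOS.example,
           Literal("<http://example.org/article-1> a bibo:Article .")))

    print(g.serialize(format="xml"))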

OWLDoc for Generating Documentation

OWLDoc is a plugin for Protégé that generates documentation for OWL ontologies. In a single click, we can now get the complete documentation of an ontology. This makes generating the documentation for an ontology much, much more efficient. Users can easily see which ontologies are imported, and then they can easily browse the structure of the ontology. Many facets of the ontology can be explored: all the imported ontologies, the classes, the object/data properties, the individuals, etc.

You can have a look at the new documentation page for BIBO here. In the top-left corner you have a list of all imported ontologies. Then you can click on facet links to display related classes, properties or individuals, and read the description of each of these resources, their usage, and their annotations (scope notes, etc.).

Please note there are still some issues to fix and improvements to make in the template used to generate the pages, such as multiple resource descriptions not yet being adequately distinguished. We are in the process of cleaning up these minor issues. But, all in all, this is a major update to the workflow, since any user can easily re-create the documentation pages.

Collaborative Protégé for Community Development

Now that it is available for Protégé 4, we will shortly set up a Protégé server and make it available to the community to support BIBO’s collaborative development. We will announce the availability of this Collaborative Protégé installation soon.

In the meantime, I suggest using the file “bibo.xml” from the “trunk” branch of the SVN repository (see Google Code below). The Bibliographic Ontology can easily be opened either with the “Open…” option, to open the local file from the SVN folder, or with the “Open URI…” option, to open the bibo.xml file from the Google Code servers. That way, each modification to the ontology can easily be committed back to the SVN repository.

Google Code to Track Development

As noted above, the BIBO Google Code SVN is used to keep track of the evolution of the ontology. All modifications are tracked and can easily be recovered. This is probably one of the most important features for such a collaborative ontology development effort.

But this is not the only use of this SVN repository. In fact, it has an even more central role: it is the SVN repository that serves the description of the ontology for any location query, by any user, for any version. Below we will see the workflow of a user query that leads the SVN repository to send back the description of the ontology.

Google Groups to Discuss Changes

The best tool to discuss ontology development is certainly a mailing list. A Google Group is an easy way to create and manage an ontology development mailing list. It is also a good way to archive and search the discussions that have an impact on the development (and the history) of the ontology.

Purl.org to Access the Ontology

Another important piece of the puzzle is to have a permanent URI for an ontology that is hosted by an independent organization. That way, even if anything happens with the ontology development group, hopefully, the URI will remain the same over time.

This is what Purl.org is about. It adds one more step to the querying workflow (as you will notice in the querying schema below), but this additional step is worth it.

General Query Workflow

There is one remaining thing that I have to talk about: the general querying workflow. I have been talking about the new OWL 2 versioning system, Purl.org redirection and using the SVN repository to deliver ontology descriptions. So, here is what the workflow looks like:

[click to enlarge this schema]

In the first step, the user requests the rdf+xml serialization of http://purl.org/ontology/bibo/. As we discussed above, this permanent URI is hosted by Purl.org; what this service does is redirect the user to the location of the content negotiation script.

In the second step, the user requests the rdf+xml serialization of the ontology description at the location sent back by the Purl.org server: http://conneg.com/script/. One of the challenges with this architecture is that neither Purl.org nor Google Code handles content negotiation with the user.

Thus, it is also necessary to create a “middle-man” content negotiation script that performs the content negotiation with the user and redirects it to the proper file hosted in the SVN repository. (If Purl.org or the SVN repository could handle the content negotiation part of the workflow, we could remove step #2 from the schema above and improve the general architecture. For the present, however, this step is necessary.)
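
For illustration, here is a minimal sketch of what such a middle-man script could look like, written as a small WSGI application in Python. The redirect targets are placeholders, not the actual locations used by the BIBO project.

    from wsgiref.simple_server import make_server

    # Placeholder locations; the real script points at the tagged bibo.xml in
    # the project's SVN repository and at the generated OWLDoc pages.
    RDFXML_LOCATION = "http://example.org/svn/tags/1.2/bibo.xml"
    HTML_LOCATION = "http://example.org/bibo/documentation/"

    def conneg_app(environ, start_response):
        accept = environ.get("HTTP_ACCEPT", "")
        # RDF-aware agents are redirected to the latest tagged ontology file;
        # everyone else is sent to the human-readable documentation.
        location = RDFXML_LOCATION if "application/rdf+xml" in accept else HTML_LOCATION
        start_response("303 See Other", [("Location", location)])
        return [b""]

    if __name__ == "__main__":
        make_server("", 8000, conneg_app).serve_forever()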

Note 1: Take a special look at the redirection location sent back by the content negotiation script: http://…/tags/1.2/bibo.xml. This is a direct consequence of the new versioning system, where the versionURI is http://purl.org/ontology/bibo/1.2/. Given this versioning system, the content negotiation script redirects the user to the description of the latest version of the ontologyURI (currently version 1.2).

Note 2: Purl.org currently doesn’t strictly conform to the TAG resolution on httpRange-14. However, this should be resolved in an upgrade of the Purl.org system that is underway (the current system dates from the early 1990s).

In the third step, the SVN repository returns the document requested by the user with the proper Content-Type.

Conclusion

Developing open source ontologies is not an easy task. Development is made difficult by the complexity of some ontologies, by the different ways to describe the same thing, and by the level of community involvement needed. Thus, open source ontology development needs the proper development architecture to succeed.

I have had the good fortune to work on this kind of ontology development with Yves Raimond on the Music Ontology, with Bruce D’Arcus on the Bibliographic Ontology, and with Mike Bergman on UMBEL. Each of these projects has led to an improvement of this architecture. After two years, these are the latest tools and methods I can now personally recommend for collectively creating, developing and maintaining ontologies.