Archive for the 'conStruct' Category

structWSF Web Services Tutorial

One thing that was hard to do with structWSF was explaining what structWSF is and how users can interact with it. For most people, structWSF was abstracted behind conStruct, and they didn’t know that every single feature of conStruct was bound to one or more queries sent to one or more structWSF instances.

That is why we took the time to write a complete structWSF interaction tutorial. This tutorial explains the general structWSF architecture and describes a series of general interaction use cases. We hope that this tutorial will help developers and system implementers understand the capabilities of structWSF and how they can use it.

You can read the complete structWSF Web Services Tutorial here.

Additionally, we released new versions of structWSF, conStruct and the irJSON Parser, which are products of this tutorial.

New versions of structWSF and conStruct

We just released a new (major) version of both structWSF and conStruct. Some months have passed since we last released this software, but we finally got the time and opportunity to make these important upgrades. Many things have changed in both packages. I don’t want to enumerate all the changes in this blog post, so I suggest you read the change log files here:

These new versions have been greatly shaped by the needs of our clients. We also started to introduce some new concepts we have written about over the last few months.

A really good addition to this release is a brand new Installation Manual. Hopefully people will be able to “easily” and properly install and set up a Web server to host these two packages.

All documentation files have been updated:

You can download both software packages from here:

An Amazon EC2/EBS Architecture

Some of the changes in these new versions have been made to help create, set up and maintain Web servers that host structWSF and conStruct instances.

At Structured Dynamics, we have developed and use a server architecture that leverages Amazon cloud computing services such as EC2, EBS and Elastic IP. Such an architecture gives us the flexibility to easily maintain and upgrade server instances, to instantly create new structWSF instances in one click (without repeating all the installation steps every time), etc.

You can contact us for more information about these EC2 AMIs and EBS Volumes that we developed for this purpose. Here is an overview of the architecture that is now in place:

[Figure: overview of the structWSF architecture on Amazon EC2 and EBS]

There is a clear separation of concerns between three major things:

  • Software & libraries
  • Configuration files
  • Data files.

We chose to put all software and libraries needed to create a stand-alone structWSF instance in an EC2 AMI. This means that all needed software to run a structWSF instance is present on the Virtuoso server running Ubuntu server.

Then we chose to put all configuration and data files on an EBS volume that we attach to, and mount on, the EC2 instance. You can think of an EBS volume as a physical hard drive: it can be mounted on a server instance, but it can’t be shared between multiple instances.

By splitting the software & libraries, configuration and data files, we make sure that we can easily upgrade a structWSF server in production with the latest version of structWSF (its code base and all related software such as Virtuoso, Solr, etc). Since the configuration and data files are not on the EC2 instance, we can easily create a new EC2 instance from the latest structWSF AMI we produced, and then mount the EBS volume holding the configuration and data files on the new (and upgraded) structWSF instance. That way, in a few clicks, we can fully upgrade a server in production without fear of disturbing the configuration or data files.

Additionally, we can easily create backups of configuration and data files at different intervals by using Amazon’s Snapshot technology.

Finally, we chose to put all related software and configuration files needed to run a conStruct instance on another, separate, EBS volume. That way, we have a clean structWSF AMI instance that can be upgraded at any time, and we can plug (mount) a conStruct instance (EBS volume) into a structWSF server at any time. This means that we can easily have structWSF instances with or without a conStruct instance. The same strategy can easily be used to create plugin packages that can be mounted on, and unmounted from, any structWSF instance at any time, depending on the needs.

All this makes maintaining structWSF server instances easier, simpler and faster.

conStruct: a skin for structWSF

As I said in my previous blog post, a conStruct instance is nothing more than a skin for one or multiple structWSF instances. conStruct is a user of a structWSF network.

But… what does that mean?

That means that each conStruct tool communicates with one or multiple structWSF instances. Each feature of conStruct comes from structWSF. The only thing conStruct does is present information to users and give them some tools to manipulate the data.

A structWSF instances network

A structWSF instance is a set of web service endpoints. Each endpoint gets registered in a network. Each query sent to any web service endpoint of the network gets authenticated (and possibly rejected) by the network.

All structWSF instances share the same basic web service endpoints. However, some specialized structWSF instances can add new functionality to the framework by developing new endpoints that do special things. Others can un-register services that have nothing to do with the mission of the instance, etc.

Not all structWSF instances are the same, but all of them share the same interface.

Individual people or organizations can choose to create structWSF nodes. The purposes can be quite different. Some organizations could choose to create structWSF nodes for internal purposes only: to help their departments share different kinds of data, for example. Some people could want to set up a structWSF node where they can archive and share all data specific to their hobbies. Whatever the use case is, they want a platform to ingest, manage, interact with and publish data, publicly or privately.

In the schema above, we can notice that different structWSF instances have been created and are maintained by different organizations, for different purposes. Some of the clients will communicate with these structWSF instances as public users of the datasets published on the node(s), and other users will access datasets that only they have access to.

As you can see, some users communicate with multiple structWSF instances. This means that these users care about data from different datasets, maintained by different organizations. Why and what for? We don’t know. It can be for any reason. It can be a web portal that aggregates all the information about a specific domain that is shared amongst multiple nodes, or it can be because the user gets information from his clients’ networks to get things done.

What is important to keep in mind with the schema above is that any kind of person, organization or system can leverage the structured data they have access to, even when that data is hosted by different organizations that expose different datasets and different web service endpoints. Some organizations could even create a web service endpoint that works with their own dataset and exposes special algorithms they use to disambiguate or tag entities, etc.

A network in action

You are probably telling yourself: well, the grand vision is good… but where is the meat around the bone?

Let’s take a look at the conStructSCS sandbox demo. You have two datasets in there: (1) Sweet Tools and (2) RePEc. There is one thing that you probably won’t notice: the two datasets live on two different structWSF instances (each structWSF instance is hosted on a different web server). This means that if you perform a search, or a browse query, the results you get in the conStruct user interface come from two totally different servers, with different data maintainers, hosted by different organizations, etc. Still, all results are displayed in the same user interface, which is the conStructSCS demo sandbox.

Under the curtain

Let’s take a look at what is happening. First, run this search query for “rdf”. Do you see what appears in the yellow box? This is a list of the queries exchanged between conStruct and two structWSF instances. You want more? Try this other search query for “rdf”. Now you also have access to the body of the messages.

For this demo sandbox, we enabled the “wsf_debug” parameter so that users of the sandbox can see how a conStruct node interacts with structWSF instances. If the value of this URL parameter is “1”, then the header and the body of each query are displayed to the user. If the value is “2”, only the header is displayed.

This means that you can append the “&wsf_debug=1” parameter to any URL of the demo sandbox and you will be able to see the messages exchanged between the systems. Why? Because all conStruct tools communicate with one or multiple web service endpoints on one or multiple structWSF instances.
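As a small illustration of appending that parameter, here is a sketch in Python. The helper name `with_wsf_debug` and the example URL are made up for this illustration; only the “wsf_debug” parameter and its two values come from the description above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_wsf_debug(url: str, level: int = 1) -> str:
    """Return `url` with wsf_debug set to `level` (1 = header + body, 2 = header only)."""
    if level not in (1, 2):
        raise ValueError("only the values 1 and 2 are described for wsf_debug")
    parts = urlsplit(url)
    # Drop any existing wsf_debug value and keep every other parameter in order.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "wsf_debug"]
    query.append(("wsf_debug", str(level)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical page address, not an actual URL of the demo sandbox.
print(with_wsf_debug("http://example.org/sandbox/search/?query=rdf"))
```

The helper replaces an existing “wsf_debug” value instead of adding a second one, so it can be applied to a URL that already carries the parameter.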

Now, let’s take a look at the output of the search query above.

  • Web service query: [[url: http://localhost/ws/search/] [method: post] [mime: text/xml] [parameters: ] [execution time: 0.279745101929]] (status: 200) OK – .
  • Web service query: [[url: http://bknetwork.org/ws/search/] [method: post] [mime: text/xml] [parameters: query=rdf&types=all&datasets=http%3A%2F%2Fbknetwork.org%2Fwsf%2Fdatasets%2F283%2F%3Bhttp%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F160%2F&items=10&page=0&inference=on&include_aggregates=true&registered_ip=self%3A%3A0] [execution time: 0.289397001266]] (status: 200) OK – .
  • Web service query: [[url: http://localhost/ws/dataset/read/] [method: get] [mime: text/xml] [parameters: uri=all&registered_ip=self%3A%3A0] [execution time: 0.123399972916]] (status: 200) OK – .
  • Web service query: [[url: /ws/dataset/read/] [method: get] [mime: text/xml] [parameters: uri=all&registered_ip=self%3A%3A0] [execution time: 0.18315911293]] (status: 200) OK – .

Each bullet above is a query sent to a specific structWSF instance. For each query, you have this information:

  • URL of the web service endpoint where the query has been sent.
  • HTTP method used to send the query
  • MIME type requested (the value of the Accept HTTP header)
  • Parameters of the query
  • Time it took to execute the query (including network latency & query processing)
  • Status of the query from the web service endpoint
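If you want to process these debug lines yourself, the fields listed above can be pulled out with a couple of regular expressions. This is only a sketch based on the four sample lines shown in this post; the function name `parse_debug_line` is my own, and the real output may contain variations that these patterns do not cover.

```python
import re
from urllib.parse import parse_qs

# One bracketed "[name: value]" pair per field of a debug line.
FIELD_RE = re.compile(r"\[(url|method|mime|parameters|execution time): ?([^\]]*)\]")
STATUS_RE = re.compile(r"\(status: (\d+)\)")


def parse_debug_line(line: str) -> dict:
    """Split one 'Web service query: [[...]] (status: N)' line into its fields."""
    fields = {name: value.strip() for name, value in FIELD_RE.findall(line)}
    status = STATUS_RE.search(line)
    return {
        "url": fields.get("url", ""),
        "method": fields.get("method", "").upper(),
        "mime": fields.get("mime", ""),
        # parse_qs also decodes the percent-encoded values.
        "parameters": parse_qs(fields.get("parameters", "")),
        "seconds": float(fields["execution time"]) if "execution time" in fields else None,
        "status": int(status.group(1)) if status else None,
    }


sample = ("Web service query: [[url: http://localhost/ws/dataset/read/] [method: get] "
          "[mime: text/xml] [parameters: uri=all&registered_ip=self%3A%3A0] "
          "[execution time: 0.123399972916]] (status: 200) OK – .")
print(parse_debug_line(sample))
```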

Since this conStruct instance is linked to two different structWSF instances, the search tool will send a search query to two different search web service endpoints. Additionally, it will query these structWSF instances to get the description of the searched dataset (to display the proper name of the datasets in the user interface).
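To make this fan-out concrete, here is a rough sketch of the list of queries a search tool would have to prepare for two instances. It is not the conStruct code: the function name `plan_search` is hypothetical, the parameter names are copied from the debug output above, and the “registered_ip” parameter seen in that output is left out on purpose because this post does not explain how it is produced.

```python
from urllib.parse import urlencode


def plan_search(instances, query, datasets, items=10, page=0):
    """Return (method, url, encoded parameters) tuples: one search and one dataset read per instance."""
    search_params = urlencode({
        "query": query,
        "types": "all",
        # The debug output above shows the dataset URIs separated by ';'.
        "datasets": ";".join(datasets),
        "items": items,
        "page": page,
        "inference": "on",
        "include_aggregates": "true",
    })
    read_params = urlencode({"uri": "all"})
    plan = []
    for base in instances:
        plan.append(("POST", base.rstrip("/") + "/ws/search/", search_params))
    for base in instances:
        plan.append(("GET", base.rstrip("/") + "/ws/dataset/read/", read_params))
    return plan


for step in plan_search(
        ["http://localhost", "http://bknetwork.org"],
        "rdf",
        ["http://bknetwork.org/wsf/datasets/283/",
         "http://constructscs.com/wsf/datasets/160/"]):
    print(step)
```

Nothing is sent over the network here; the sketch only shows that one user action turns into four web service queries.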

Each query is validated by the structWSF instances to make sure that they are legitimate queries. If they are, then results are returned. Once these queries are sent and answers received, the structSearch tool can then generate the page and display it to the user.

Do you want more? Here is a list of queries sent by different conStruct tools to different web services endpoints:

(Note: this debug info tab has been added so that people can see what is happening under the hood. However, this information is only accessible to the registered conStruct instance and the administrator of that instance.)

Do it by yourself, from your desktop computer

I said that people or organizations that create content on these structWSF instances can manage and manipulate their data from anywhere, not only from within conStruct. Let’s test this.

I changed the permissions on the Sweet Tools List dataset so that it is publicly available for reading. That way, anyone will be able to send Curl queries against the dataset, to that structWSF instance.

Now, let’s try a couple of queries to different web service endpoints. Let’s start with a query for the keyword “rdf” on the Sweet Tools dataset:

curl -H "Accept: text/xml" "http://constructscs.com/ws/search/" -d "query=rdf&types=all&datasets=http%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F122%2F&items=10&inference=on"

What you will get for this query is a list of 10 instance records that match this query. You don’t like the internal XML representation of the system? Then try the internal JSON representation by running this query:

Maybe this is not good enough for you? Then let’s try RDF+XML:

curl -H "Accept: application/rdf+xml" "http://constructscs.com/ws/search/" -d "query=rdf&types=all&datasets=http%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F122%2F&items=10&inference=on"

I think you understood the point here, so I won’t continue.
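The same search can be prepared from a program instead of Curl. Here is a minimal sketch using only Python’s standard library; the function name `build_search_request` is mine, and the request is only built, not sent, since I can’t promise that the sandbox endpoint will always be reachable.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(accept: str = "text/xml") -> Request:
    """Build the POST request that mirrors the curl search queries above."""
    params = {
        "query": "rdf",
        "types": "all",
        "datasets": "http://constructscs.com/wsf/datasets/122/",
        "items": "10",
        "inference": "on",
    }
    # Passing `data` turns the request into a POST, like curl's -d option.
    return Request(
        "http://constructscs.com/ws/search/",
        data=urlencode(params).encode("ascii"),
        headers={"Accept": accept},
    )


request = build_search_request("application/rdf+xml")
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
# To actually send it: urllib.request.urlopen(request).read()
```

Changing the `accept` argument is all it takes to ask for another serialization, which is exactly what the two Curl commands above do with the Accept header.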

Now, let’s send a query to get all the datasets accessible by you:

curl -H "Accept: application/rdf+xml" "http://constructscs.com/ws/auth/lister/" -d "mode=adataset"

If you can query all these things with Curl, this means that anything can query these services. Standalone software, as well as other online applications, can be developed to leverage these content nodes.

Conclusion

As you probably learned from this blog post, one of the powers of structWSF is that it creates networks of structured content nodes that can be accessed by anything, from anywhere, publicly or privately.

As you noticed, all this is not only about integrating any kind of data, but also about publishing it in a flexible way.

Re-Introduction

I haven’t been active on this blog for more than half a year now. I was telling myself that I was too busy coding to write anything meaningful for my readers. I did publish announcements and a couple of other things, but I didn’t really take the time to write about what I was working on. A lot has been done and published recently, but little has been said. So, let’s correct course: I want to share more about what I am currently working on, the concepts I am playing with, the systems I am releasing, etc. Let’s restart writing about the things that I really believe in, that are valuable to me, and that I put all my time, effort and energy into.

As you probably know, my company Structured Dynamics released a series of products: structWSF and conStruct. I spent the last six months developing these two products. However, what are they? Why did I spend all my time working on these products? Why do they matter? Why do I think that they are valuable?

Let me outline what they are, what they do and what they are useful for. Then consider whether they could be of any value to you, your organizations, your enterprises, etc.

StructWSF

StructWSF is a web services framework (WSF) that basically does four things: it ingests, manages, interacts with and publishes data. What kind of data? Any kind of data.

Ingesting: the aim is to be able to ingest data from any data source (so data formatted using any language, or described using any vocabulary or schema technique). The framework has to be able to ingest any data that comes from any data source with a single conversion step.

Managing: the aim is to be able to manage the data. Managing the data means being able to collectively (with permissions and authentication) manage the datasets available in a framework instance. It means being able to create, modify, delete or update data. It also means being able to browse and search the data. It means making it publicly available, or restricting its access to a user or group of users. It means merging datasets together too.

Interacting: but there is another facet to data management. We don’t only want to be able to manage data in a locked system. What we want is to be able to manage our data from anywhere. It can be from my browser, from my website, from some other application on my desktop, from my home, from my office: from anywhere. All functions of a structWSF instance are accessible as web service endpoints. This means that you can perform any action, on your data, from anywhere you want: from a conStruct node or from a local Curl query. This is, I think, how people and organizations want to manage the data they create and curate.

Publishing: like ingesting, we want to be able to publish, to communicate the data we create to other people, other organizations or other entities. We want to do this in such a way that these external entities don’t have to reinvent themselves. We want to be able to communicate data the way they understand it: using any format and any vocabulary/schema.

The mindset behind structWSF is the following: we can ingest any kind of data, we can manage that data in multiple ways, we can interact with that data from anywhere and we can publish this data back in any way. structWSF is frictionless in the sense of data communication between systems, users and entities.

conStruct

conStruct is just a skin over one, or multiple, structWSF instances. The conStruct software is an example of how a system can interact with a structWSF data provider. conStruct is a suite of generic tools that can be used to search, browse, visualize (template), import, export, create, delete and update data. All these tools interact with one or multiple structWSF functions by using their web service endpoints.

Since conStruct can interact with a single structWSF instance, it can also interact with multiple structWSF instances. That means that conStruct can be a user interface that communicates with multiple data providers (structWSF instances) and displays all the results, from all these providers, in a canonical user interface.

But as I said, conStruct is one skin over structWSF instances. We could think about the integration of structWSF into other CMS systems. We could even think about having different CMS systems integrating with the same structWSF instance(s) so that if one user updates, creates or deletes some data, the change appears in the other CMS systems as well.

The Magic Twist

However, all this is done with a twist: everything is structured. This means that everything in the system has a structure and is described using some vocabulary (full-blown ontologies or naive vocabularies). This enables all kinds of valuable functionality: inferencing capabilities in search and browse activities, filtering on types and attributes, and help integrating different datasets from different systems and organizations.

This is the magic twist that makes this system different: everything in there is structured in such a way that everything can be ingested and published in any format, and in such a way that basic inferencing or more complex reasoning is possible. It integrates data and lets users use it the way they want, from where they are. The capabilities are there; use them if you need them.

Next steps

The next steps for me will be to describe the features of the system: how the data is managed, how permissions work, what granularity of permissions is available, etc. These will be more technical blog posts, but they will show you the full potential of the systems and concepts I have been talking about in this blog post.

Release of structWSF, conStruct and the Community Web Site

The last few months have been challenging in terms of the amount of work to get done, of focusing on deliverables and of getting ready for the release of the conStruct and structWSF source code, documentation, tutorials, web sites and demos.

I am now really happy to finally announce the release of the source code of both packages, along with a new development community website where users and developers can exchange ideas about these two new projects.

The biggest milestone of the last months is now behind us. However, this is just the beginning of everything!

I think that many things have been written about these two projects already. I don’t want to write any tutorial at this point. So the only thing I will do right now is point you to the most relevant documentation, web sites, blog posts and demos about each project. The next step will be to write about specific use cases, features, etc.

Community Web Site

The community Web site is a place where developers and users of structWSF and conStruct can meet to talk about both projects, to report bugs and issues, to submit new enhancements, to find tips and tricks, etc.

I suggest you create a user profile on the community Web site if you are interested in communicating with other members.

structWSF

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies).

The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import. All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and optionally a document of resultsets. Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.

conStruct

conStruct is a distro of the Drupal framework that aims to set a new standard in data integration and as a structured content system (SCS). With conStruct, you can let your data and its structure drive your applications. You can easily interoperate your diverse internal information with public content on the Web. And you can leverage a platform designed from the ground up for knowledge management and collaboration.

structWSF and conStruct websites unveiled

I am proud to announce the release of the websites of two of our upcoming products: structWSF and conStruct. Both products will be available in open source under the Apache 2 license. Mike just unveiled and demoed the two projects in his talk at SemTech 2009.

As we describe them on Structured Dynamics‘ website:

structWSF

structWSF is a platform-independent Web services framework for accessing and exposing structured  RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies).

The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import.

All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and optionally a document of resultsets. Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.

In initial release, structWSF has direct interfaces to the Virtuoso RDF triple store (via ODBC, and later HTTP) and the Solr faceted, full-text search engine (via HTTP). However, structWSF has been designed to be fully platform-independent. Support for additional datastores and engines is planned. The design also allows other specialized systems to be included, such as analysis or advanced inference engines.

The framework is open source (Apache 2 license) and designed for extensibility. structWSF and its extensions and enhancements are distributed and documented on the OpenStructs Web site.

conStruct

conStruct SCS is a structured content system that extends the basic Drupal content management framework. conStruct  enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces.

Users and groups can flexibly access and manage any or all datasets exposed by the system depending on roles and permissions. Report and presentation templates are easily defined, styled or modified based on the underlying datasets and structure. Collaboration networks can readily be established across multiple installations and non-Drupal endpoints. Powerful linked data integration can be included to embrace data anywhere on the Web.

Depending on roles and permissions, a given user may or may not see specific datasets or tools within the Drupal interface. Search and browse results are similarly sequestered depending on access rights.

conStruct provides Drupal-level CRUD (create – read – update – delete), data display templating, faceted browsing, full-text search, and import and export over structured data stores based on RDF. It also provides a system for additional tools additions and expansions for this structured data. conStruct SCS is built on the platform-independent structWSF Web services framework.

Like Drupal and structWSF, conStruct is free and open source (GPL license). Versions of conStruct SCS are planned to adapt it to other content management systems (CMS).

Next

The alpha version of the code with all the proper documentation will be released later this summer. Everybody will be able to contribute to the project by enhancing/developing the core code or by extending it with new modules and web services. Stay tuned!

RDF Aggregates and Full Text Search on Steroids with Solr

Preamble

As I explained in my latest blog post, I am now starting to talk about a couple of things I have been working on in the last few months that will lead to a release, by Structured Dynamics, in the coming months. This blog post is the first step on that path. Enjoy!

Introduction

I have been working with RDF, SPARQL and triple stores for years now. I have created many prototypes and online services using these technologies. Being able to describe everything with RDF, and to index everything in a triple store that you can easily query the way you want using SPARQL, is priceless. Using RDF saves development and maintenance costs because of the flexibility of the store (triple store), the query language (SPARQL), and the associated schemas (ontologies).

However, even if this set of technologies can do everything, quickly and efficiently, it is not necessarily optimal for all tasks you have to do. As we will see in this blog post, we use RDF for describing, integrating and managing any kind of data (structured or unstructured) that exists out there. RDF + Ontologies are what we use as the canonical expression of any kind of data. It is the triple store that we use to aggregate, index and manage that data, from one or multiple data sources. It is the same triple store that we use to feed any other system that can be used in our architecture. The triple store is the data orchestrator in any such architecture.

In this blog post I will show you how this orchestrator can be used to create Solr indexes that are used in the architecture to perform three functions that Solr has been built to perform optimally: full-text search, aggregates and filtering. So, while a triple store can perform these functions, it is not optimal for what we have to do.

Overview

The idea is to use the RDF data model and a triple store to populate the Solr schema index. We leverage the powerful and flexible data representation framework (RDF), in conjunction with the piece of software that lets you do whatever you want with that data (Virtuoso), to feed a carefully tailored Solr schema index that optimally performs three things: full-text search, aggregates and filtering. Also, we want to leverage the ontologies used to describe this data to be able to infer things about the resources indexed in Solr. This enables us to use inference on full-text search, aggregates and filtering, in Solr! This is quite important since you will be able to perform full-text searches, filtered by types that are inferred!

Some people will tell me that they can do this with a traditional relational database management system: yes. However, RDF + SPARQL + a triple store is so powerful for integrating any kind of data, from any data source, and so flexible, that it saves precious development and maintenance resources, and so money.

Solr

What we want to do is to create some kind of “RDF” Solr index. We want to be able to perform full-text searches on RDF literals; we want to be able to aggregate RDF resources by the properties that describe them, and their types; and finally we want to be able to do all the searches, aggregation and filtering using inference.

So the first step is to create the proper Solr schema that will let you do all these wonderful things.

The current Solr index schema can be downloaded here. (View source if simply clicking with your browser.)

Now, let’s discuss this schema.

Solr Index Schema

A Solr schema is composed of basically two things: fields and type of fields. For this schema, we only need two types of fields: string and text. If you want more information about these two types, I would refer you to the Solr documentation for a complete explanation of how they work. For now, just consider them as strings and texts.

What interests us is the list of defined fields of this schema (again, see download):

  • uri [1] – Unique resource identifier of the record
  • type [1-N] – Type of the record
  • inferred_type [0-N] – Inferred type of the record
  • property [0-N] – Property identifier used to describe the resource and that has a literal as object
  • text [0-N] (same number as property) – Text of the literal of the property
  • object_property [0-N] – Property identifier used to describe the resource where the object is a reference to another resource and that this other resource can be described by a literal
  • object_label [0-N] (same number as object_property) – Text used to refer to the resource referenced by the object_property
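To make the cardinalities in this list concrete, here is a small sketch of one document expressed as a Python dictionary, with a check of the constraints given in brackets. The sample values and the `check_solr_doc` helper are invented for illustration; only the field names and their cardinalities come from the list above.

```python
def check_solr_doc(doc: dict) -> list:
    """Return a list of cardinality problems; an empty list means the document fits."""
    def values(field):
        value = doc.get(field, [])
        return value if isinstance(value, list) else [value]

    problems = []
    if len(values("uri")) != 1:
        problems.append("uri must appear exactly once")
    if len(values("type")) < 1:
        problems.append("type must appear at least once")
    if len(values("property")) != len(values("text")):
        problems.append("property and text must have the same number of values")
    if len(values("object_property")) != len(values("object_label")):
        problems.append("object_property and object_label must have the same number of values")
    return problems


# Invented sample record; the URIs are placeholders.
doc = {
    "uri": "http://example.org/article-1",
    "type": ["http://example.org/Article"],
    "inferred_type": ["http://example.org/Document"],
    "property": ["title"],
    "text": ["An article about RDF"],
    "object_property": ["author"],
    "object_label": ["Bob Carron"],
}
print(check_solr_doc(doc))
```

The pairing by position (the first “text” value belongs to the first “property” value, and so on) is my reading of “same number as property” in the list.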

Full Text Search

An RDF document is a set of multiple triples describing one or multiple resources. Saying that you are doing full-text searches on RDF documents is certainly not the same thing as saying that you are doing full-text searches on traditional text documents. When you describe a resource, you rarely have more than a couple of strings, with a couple of words each. It is generally the name of the entity, or a label that refers to it. You will have different numbers, and sometimes some description (a short biography, or definition, or summary, as examples). However, unless you index an entire text document, the “textual abundance” is quite poor compared to an indexed corpus of documents.

In any case, this doesn’t mean that there are no advantages in doing full-text searches on RDF documents (so, on RDF resource descriptions). But, if we are going to do so, let’s do so completely, and in a way that meets users’ expectations for full-text document search.  By applying this mindset, we can apply some cool new tricks!

Intuitively, the first implementation of a full-text search index on RDF documents would simply make a key-value pair assignment between a resource URI and its related literals. So, when you perform a full-text search for “Bob”, you get a reference to all the resources that have “Bob” in one of the literals that describe them.

This is good, but this is not enough, because it breaks the most basic behavior that any user of a full-text search engine expects.

Let’s say that I know that the author of many articles is named “Bob Carron”. I have no idea what the titles of the articles he wrote are, so I want to search for them. With the system described above, if I do a search for “Bob Carron”, I will most likely get back as a result the reference to “Bob Carron”, the author. This is good, but this is not enough.

On the results page, I want the list of all articles that Bob wrote! Because of the nature of RDF, I don’t have this “full-text” information of “Bob” in the description of the articles he wrote. Most likely, in RDF, the articles will be related to Bob by reference (an object reference to Bob’s URI), i.e., <this-article> <author> <bob-uri>. As you can see, we won’t get back any articles in the resultset for the full-text query “Bob Carron” because this textual information doesn’t exist in the index at the level of the articles he wrote!

So, what can we do?

A simple trick does the job beautifully. When we create the Solr index, we add the textual information of the resources that are referenced by the indexed resources. For example, when we create the Solr document that describes one of the articles written by Bob, we add the literals that label the resource(s) referenced by this article. In this case, we add the name of the author(s) to the full-text record of that article. With this simple enhancement, a search for “Bob Carron” now also returns the list of all resources that refer to Bob (articles he wrote, other people who know him, etc.).

So, this is the goal of the “object_property” and “object_label” fields of the Solr index. In the schema above, the “object_property” would be “author” and the “object_label” would be “Bob Carron”. This information would belong to the Solr document of Article 1.
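The enrichment described above can be sketched in a few lines of Python. This is a hedged illustration only: the tuple layout, the `labels` lookup table and the “uri” key are assumptions for the example; only the “property”, “text”, “object_property” and “object_label” field names come from the schema.

```python
def build_solr_document(uri, triples, labels):
    """Build one Solr-style document for `uri`, copying in the labels of referenced resources."""
    doc = {
        "uri": uri,             # assumed unique key field
        "property": [],         # literal-valued properties
        "text": [],             # literals, parallel to "property"
        "object_property": [],  # properties whose object is another resource
        "object_label": [],     # labels of those resources, parallel to "object_property"
    }
    for subject, prop, value, is_literal in triples:
        if subject != uri:
            continue
        if is_literal:
            doc["property"].append(prop)
            doc["text"].append(value)
        elif value in labels:
            # Only index the reference when the referenced resource has a label,
            # so the two parallel lists always keep the same length.
            doc["object_property"].append(prop)
            doc["object_label"].append(labels[value])
    return doc


# Hypothetical sample data.
triples = [
    ("urn:article:1", "title", "Indexing RDF with Solr", True),
    ("urn:article:1", "author", "urn:person:bob", False),
]
labels = {"urn:person:bob": "Bob Carron"}

article_doc = build_solr_document("urn:article:1", triples, labels)
print(article_doc["object_property"], article_doc["object_label"])
```

Because “Bob Carron” now appears in the document of the article itself, a full-text query on that name can match the article as well as the person.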

Full Text Search Prototype

Let’s take a look at the running prototype (see screen capture below).



The dataset loaded in this prototype is Mike’s Sweet Tools. As you notice in the prototype screen, many things can be done with the simple Solr schema we published above. Let’s start with a search for the word “test”. First, we are getting a resultset of 17 things that have the “test” word in any of their text-indexed fields.

What is interesting with that list is the additional information we now have for each of these results, which comes from the RDF description of these things and from the ontologies that have been used to describe them.

For example, if we take a look at Result #4, we see that the word “test” has been found in the description of the Ontology project for the “TONES Ontology Repository” record. Isn’t that precision far more useful than saying: the word “test” has been found in “this webpage”? I’ll let you think about it.

Also, if we take a look at Result #1, we know that the word “test” has been found in the homepage of the Data Converter Project for the “Talis Semantic Converter” record.

Additionally, by leveraging this Solr index, we can do efficient aggregates on the types of the things returned in the resultset for further filtering. So, in the section “Filter by kinds” we know what kinds of things are returned for the query “test” against this dataset.

Finally, we can use the drop-down box at the right to do a new search (see screenshot), based on the specific kind of things indexed in the system. So, I could want to make a new search, only for “Data specification projects” with the keyword “rdf”. I already know from the user interface that there are 59 such projects.

All this information comes from the Solr index at query time, and basically for free by virtue of how we set up the system. Everything is dynamically aggregated and displayed to the user.

However, there are a few things at work here that you won’t notice: 1) SPARQL queries to the triple store to get some more information to display on that page; 2) the use of inference (more about it below); and 3) the use of the ontology descriptions.

In any case, on one of SD’s test datasets of about 3 million resources, such a page is generated within a few hundred milliseconds: resultset, aggregates, inference and description of things displayed on that page. These results over 3 million resources were obtained on a small Amazon EC2 server instance that costs 10 cents per hour. How’s that for performance?!

Aggregates and Filtering on Properties and Types

But we don’t want to merely do full-text search on RDF data. We also want to do aggregates (how many records have this type, or this property, etc.) and filtering, at query time, in a couple of milliseconds. We already had a look at these two functions in the context of a full-text search. Now let’s see them in action in some prototype dataset browsing tools that use the same Sweet Tools dataset.

In a few milliseconds, we get the list of the different kinds of things that are indexed in a given dataset. We know what the types are, and what the count is for each of them. So, the ontologies drive the taxonomic display of the list of things indexed in the dataset, and Solr drives the aggregation counts for each of these types of things.

Additionally, the ontologies and the Virtuoso inference rules engine are used to compute the counts by inference. If we take the example of the type “RDF project”, we know there are 49 such projects. However, not all these projects are explicitly typed with the “RDF project” type. In fact, 7 of these “RDF project” records are typed as “RDF editor project” and 6 as “RDF generator project”.

This is where inference can play an important role: an article is a document. If I browse documents, I want to include articles as well. This “broad context retrieval” is driven by the description of the ontologies, and by inference; this is the same thing for these projects; and this is the same thing for everything else that is stored as structured RDF and characterized by an ontology.
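The idea of counting “by inference” can be illustrated with a small Python sketch. It does not reproduce the Virtuoso inference engine; the class names and the counts below are toy values chosen for the example, and the sketch assumes each record carries exactly one explicit type.

```python
from collections import Counter


def count_with_inference(explicit_counts, subclass_of):
    """Roll explicit per-type counts up to every super-class (transitively)."""
    totals = Counter()
    for cls, count in explicit_counts.items():
        seen = set()
        stack = [cls]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            # A record of type `cls` is also a record of each of its super-classes.
            totals[current] += count
            stack.extend(subclass_of.get(current, ()))
    return totals


# Toy ontology: an article is a document.
subclass_of = {"Article": ["Document"]}
# Toy explicit counts: 2 records typed "Document", 3 typed "Article".
explicit_counts = {"Document": 2, "Article": 3}

totals = count_with_inference(explicit_counts, subclass_of)
print(totals["Document"], totals["Article"])
```

Here the count for “Document” includes the articles, while the count for “Article” does not include the plain documents.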

The screenshot above shows how these inferences and their nestings could present themselves in a user interface.

Once the user clicks on one of these types, he starts to browse all things of that type. In the next screenshot, Solr is used to add filters based on the attributes used to describe these things.

In some cases, I may want to see all the Projects that have a review. To do so, I would simply add this filter criterion on the browsing page and display the “Projects” that have a “review”. And thanks to Solr, I already know how many such Projects have reviews, even before taking a look at them.

Note, then, in this screenshot that the filters and counts come from Solr. The list of the actual items returned in the resultset comes from a SPARQL query, and the names of the types and properties (and their descriptions) come from the descriptions of the ontologies used.

This is what all this stuff is about: creating a symbiotic environment where all these wonderful systems live together to do the effective management of the structured data.

Populating the Solr Index

Now that we know how to use Solr to perform full-text searches, and the aggregating and filtering of structured data, one question still remains: how do we populate this index? As stated above, the goal is to manage all the structured data of the system using a triple store and ontologies, and then to use this triple store to populate the Solr index.

Structured Dynamics uses Virtuoso Open Source as the triple store to populate this index for multiple reasons. One of the main ones is its performance and its capability to do efficient basic inference. The goal is to send the proper SPARQL queries to get the structured data that we will index using the Solr schema that we talked about above. Once this is done, all the things that I talked about in this blog post become possible, and efficient.
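As an illustration of the kind of SPARQL query involved, the Python sketch below builds a query that fetches every property and value of one resource, plus the label of the value when the value is itself a resource. The function name, the variable names and the choice of rdfs:label as the labelling property are assumptions for this example, not the queries actually used by Structured Dynamics.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def build_index_query(resource_uri, label_property=RDFS_LABEL):
    """Return a SPARQL query listing the properties, values and value labels of a resource."""
    return (
        "SELECT ?property ?value ?label\n"
        "WHERE {\n"
        f"  <{resource_uri}> ?property ?value .\n"
        # ?label stays unbound when ?value is a literal or has no label.
        f"  OPTIONAL {{ ?value <{label_property}> ?label }}\n"
        "}"
    )


# Hypothetical resource identifier.
sparql = build_index_query("urn:article:1")
print(sparql)
```

Rows where the value is a literal would feed the “property” and “text” fields, and rows with a bound label would feed “object_property” and “object_label”.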

Syncing the Index

However, in such a setup, we have to keep one thing in mind: each time the triple store is updated (a resource is created, deleted or updated), we have to sync the Solr index according to these modifications.

What we have to do is detect any change in the triple store and reflect it in the Solr index. To do so, we re-create the entire Solr document (for the resource that changed in the triple store) using the <add /> operation.

This design raises an issue with using Solr: we cannot simply modify one field of a record. We have to re-index the entire description of a document even if we only want to modify a single field. This is a limitation of Solr that is currently being addressed in this new feature proposition, but the feature is not yet ready for prime time.
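The full re-index of one record can be sketched as follows, using Python’s standard XML library to produce a Solr XML update message. The document content and the “uri” key are assumptions for the example; the `<add>`, `<doc>` and `<field name="…">` elements are those of Solr’s XML update format.

```python
import xml.etree.ElementTree as ET


def build_add_message(doc):
    """Serialize a whole document as a Solr <add> message (one <field> per value)."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, values in doc.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            # Multi-valued fields are expressed by repeating the <field> element.
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


# Hypothetical document, rebuilt in full from the triple store after a change.
changed_doc = {
    "uri": "urn:article:1",  # assumed unique key field
    "property": ["title"],
    "text": ["Indexing RDF with Solr"],
    "object_property": ["author"],
    "object_label": ["Bob Carron"],
}

message = build_add_message(changed_doc)
print(message)
```

When the schema defines a unique key, adding a document whose key already exists replaces the previous version, which is why the whole description has to be sent again even for a single changed field.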

Another thing to consider here is to properly sync the Solr index with any ontology changes (at the level of the class description) if you are using the inference feature. For example, assume you have an ontology that says that class A is a sub-class-of class B. Then, assume the ontology is refined to say that class A is now a sub-class-of class C, which itself is a sub-class-of class B. To keep the Solr index synced with the triple store, you will have to perform all modifications that affect all the records of these types. This means that the synchronization doesn’t only occur at the level of the description of a record; but also at the level of the changes in the ontologies used to describe those records.
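One way to find which records need re-indexing after such an ontology change is to compare the inferred super-classes of each class before and after the change. The Python sketch below does this for the example above; it is a simplified illustration with hypothetical function names, not the synchronization code of the actual system.

```python
def ancestors(cls, subclass_of):
    """Return every super-class of `cls`, following sub-class-of links transitively."""
    seen = set()
    stack = list(subclass_of.get(cls, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def classes_to_reindex(before, after):
    """Return the classes whose inferred super-classes differ between two ontology versions."""
    classes = set(before) | set(after)
    return {c for c in classes if ancestors(c, before) != ancestors(c, after)}


# The example from the text: A sub-class-of B becomes A sub-class-of C sub-class-of B.
before = {"A": ["B"]}
after = {"A": ["C"], "C": ["B"]}

affected = classes_to_reindex(before, after)
print(sorted(affected))
```

Every record typed with one of the returned classes would then have its Solr document rebuilt, since its inferred types have changed.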

Conclusion

One of the main things to keep in mind here is that now, when we develop Web applications, we are not necessarily talking about a single software application, but about a group of software applications that compose an architecture to deliver one or more services. In any such architecture, what is at the center is Data.

Describing, managing, leveraging and publishing this data is at the center of any Web service. This is why it is so important to have the right flexible data model (RDF), with the right flexible query language (SPARQL), and the right data management system (triple store) in place. From there, you can use the right tools to make it available on the Web to your users.

The right data management system is what should be used to feed any other specific systems that compose the architecture of a Web service. This is what we demonstrated with Solr; but it is certainly not limited to it.