Investigating Options to Serialize RDF data as Clojure Code

My initial intuition is that I could serialize RDF data into Clojure code such that the OWL semantics of the RDF data is embedded, in some way, into that code. I want to test how the general saying about homoiconic languages, Data as Code. Code as Data, fits with RDF & OWL.

Another intuition I have is the concept of Portable Data: stateful RDF data which embeds its own semantics and which doesn’t rely on external (mostly stateless, since we can rarely rely on their stated versions) ontologies. My intuition is that it would be possible to serialize RDF data in such a way that it would be self-aware of its own semantics, which means that it would know how it can be interpreted, how it can be used, and how it should be validated. The idea is to end up with Portable Data snippets that could be exchanged between systems without requiring prior, or post, schemas (ontologies) to interpret that information. Then web service endpoints such as OSF, or any other kind of application, could emit such Portable Data structures without requiring any subsequent ontology analysis on their part.

However, before being able to implement and demonstrate these intuitions, the first step is to check what such an RDF serialization may look like. This is the goal of this blog post.

Serializing RDF Data as Clojure Code

Where to start? There are probably multiple ways to do that. Do we want to do that using a map, a structmap, a record, or…? What I wanted to use (at least initially) is a basic data structure that would give me the flexibility I need to serialize RDF data. I wanted a core structure such that existing Clojure developers could easily manipulate it using the existing Clojure functions and techniques that they are used to.

The collection I chose to start with is the map. This key/value pair structure is ideal for serializing RDF data. It looks like JSON code, but is even simpler since it doesn’t require commas or colons in its syntax.

The crux of the map structure is that in a map, the keys can be: keywords, symbols, strings, characters, booleans and numbers. The only things they cannot be are regular expressions and the nil value. What should be stated here is that symbols can be a lot of different things. They are names for vars, functions, etc.

This opens a world of possibilities for serializing RDF data as Clojure code. In fact, the keys of the map can be virtually anything: and this is almost too nice to be true!
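
To illustrate this flexibility, here is a small sketch (not RDF yet) showing a map whose keys are of different types; the names used are purely illustrative:

[cc lang='lisp' line_numbers='false']
[raw];; A map whose keys are a keyword, a string, a symbol and a number
(def example-map {:a-keyword "value 1"
                  "a string" "value 2"
                  'a-symbol  "value 3"
                  42         "value 4"})

;; Keys are looked up with get, or by using the map itself as a function
(get example-map 42)    ;; => "value 4"
(example-map 'a-symbol) ;; => "value 3"[/raw]
[/cc]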

What we will investigate in the remainder of this blog post are different ways to serialize RDF data as Clojure code. These are the initial tests I did to test my intuitions. All of them work, but only the last one really opens up a world of possibilities and enables me to implement my early intuitions.

Quick Introduction to RDF Data

RDF is nothing more than a bunch of triples of the form:

  • <subject> <predicate> <object>

Where the <subject> is the thing (resource, record, entity, etc.) being described, the <predicate> is the property (attribute, etc.) that describes the subject, and the <object> is the value of the predicate, which can be a reference to another subject, a literal value, etc.

Each <subject> has at least one type. A type is nothing more than a class of things which is defined in an RDFS schema or an OWL ontology.

Then if you wire these triples together, you get a directed graph which we often refer to as a dataset. It is as simple as that. However, I won’t state that RDF is necessarily simple, since its expressivity (a double-edged sword) can make things much more complex.

The semantics of the data lies in the <predicate> and the type. It is the predicate and the type that tell us how to interpret, and use, the data. They are what is used to validate the data, for example. That is exactly where Clojure, and its map structure, can help us create this kind of portable data.
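
Before looking at any serialization format, the triple model itself can already be written down as plain Clojure data. Here is a minimal sketch, using made-up URIs, just to fix the idea:

[cc lang='lisp' line_numbers='false']
[raw];; Three triples about a single subject, written as plain
;; [subject predicate object] vectors. The URIs are illustrative only.
(def triples
  [["http://foo.com/1" "rdf:type"       "http://xmlns.com/foaf/0.1/Person"]
   ["http://foo.com/1" "iron:prefLabel" "Fred"]
   ["http://foo.com/1" "foaf:knows"     "http://foo.com/2"]])

;; The type(s) of the subject
(for [[s p o] triples :when (= p "rdf:type")] o)
;; => ("http://xmlns.com/foaf/0.1/Person")[/raw]
[/cc]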

As you will see below, the serialization of RDF data as Clojure code looks like the structJSON RDF serialization format developed by Structured Dynamics and used at the core of the Open Semantic Framework. It is not a coincidence since that simple structure has been highly effective to serialize and transmit RDF information between OSF web services and other applications such as OSF for Drupal and other JavaScript applications.

Leveraging Serialization’s Hierarchy to Create Triples

Before jumping into Clojure, let’s take a quick look at a really simple structJSON record. What I want to show you is how triples can be extracted from such a data structure. It is the same principle that will be used to extract triples from the Clojure serialization:

[cc lang='javascript' line_numbers='false']
[raw]"subject": [
  {
    "uri": "http://dataset1.com/record-a/",
    "predicate": [
      {
        "rdf:type": "http://umbel.org/umbel/rc/Person"
      },
      {
        "iron:prefLabel": "Bob"
      },
      {
        "foaf:knows": {
          "uri": "http://dataset2.com/record-b/"
        }
      }
    ]
  }
][/raw]
[/cc]

What we leverage here to extract triples is the hierarchical nature of the serialization. Here the "subject" key introduces an array of objects. Each object has a "uri" key which is the identifier (<subject> of a triple). Then the "predicate" key introduces a series of attributes for that record. Each element of the array is a predicate, where the key is the prefixed version of the RDF <predicate>. Then you have a value for each of these predicate keys. If you read the documentation, you will see that you can get to another level called the reification of that triple (not to be confused with Clojure’s reification mechanism) which is used to define extra information related to a triple statement. That structJSON code would produce the following ntriples:

[cc lang='text' line_numbers='false']
[raw]http://dataset1.com/record-a/ rdf:type http://umbel.org/umbel/rc/Person .
http://dataset1.com/record-a/ iron:prefLabel "Bob" .
http://dataset1.com/record-a/ foaf:knows http://dataset2.com/record-b/ .[/raw]
[/cc]
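
To make the extraction principle explicit, here is a minimal Clojure sketch that assumes the structJSON record above has already been parsed into Clojure data structures (with any JSON parser); the function name is illustrative only:

[cc lang='lisp' line_numbers='false']
[raw];; The record above, parsed into plain Clojure maps and vectors
(def record {"uri" "http://dataset1.com/record-a/"
             "predicate" [{"rdf:type" "http://umbel.org/umbel/rc/Person"}
                          {"iron:prefLabel" "Bob"}
                          {"foaf:knows" {"uri" "http://dataset2.com/record-b/"}}]})

(defn record->triples
  "Walk the subject/predicate hierarchy and return [subject predicate object] triples."
  [{uri "uri" predicates "predicate"}]
  (for [pred predicates
        [p o] pred]
    [uri p (if (map? o) (get o "uri") o)]))

(record->triples record)
;; => (["http://dataset1.com/record-a/" "rdf:type" "http://umbel.org/umbel/rc/Person"]
;;     ["http://dataset1.com/record-a/" "iron:prefLabel" "Bob"]
;;     ["http://dataset1.com/record-a/" "foaf:knows" "http://dataset2.com/record-b/"])[/raw]
[/cc]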

Serializing RDF using Maps and Keyword Keys

The most intuitive way to serialize RDF data as Clojure maps would be to create a map where all the keys are keywords. An initial test would be:

[cc lang='lisp' line_numbers='false']
[raw](def resource {:uri "http://foo.com/1"
               :rdf/type [foaf/Person owl/Thing]
               :iron/prefLabel {:value "Fred"
                                :lang nil
                                :datatype xsd/string}
               :foaf/knows [{:uri "http://foo.com/2"
                             :rei [{:iron/prefLabel [{:value "Bob"
                                                      :lang "en"}
                                                     {:value "Robert"
                                                      :lang "fr"}]}]}
                            {:uri "http://foo.com/3"
                             :rei [:iron/prefLabel "Mike"]}]})[/raw]
[/cc]

What we did here is to define a map bound to the symbol resource. This map is composed of a series of keys and values where the keys are keywords, and where the values can be strings, vectors or maps. The basic serialization rules are:

  • Each map has a :uri key that defines the URI of the resource being described
  • Each key is a namespaced key where the root of the namespace is the prefix of the ontology where the <predicate> or type is defined
  • If the predicate is an owl:DatatypeProperty, then its value can be:
    • A vector with one or multiple maps and/or strings
    • A map which can have four keys:
      • :value which specifies the actual string value
      • :lang which specifies the language of that string
      • :datatype which specifies the datatype of the string
      • :rei which specifies reification statements for the triple
    • A string which is the actual value without any additional information about that Literal
  • If the predicate is an owl:ObjectProperty, then its value can be:
    • A vector with one or multiple maps, strings and/or symbols
    • A map which can have two keys:
      • :uri which specifies the actual URI of the referenced resource
      • :rei which specifies reification statements for the triple
    • A string which represents the URI of the resource to be referenced
    • A symbol which represents the URI string of the resource to be referenced

Namespacing Keywords

One of the important notions is that the keywords used as map keys are namespaced. This means that they are defined, and live, in their own namespace. This is an essential requirement for an RDF serialization since we re-use multiple ontologies that may share the same name for some of their predicates, and we don’t want these keywords to clash. That is why, by convention, we create each of these keywords in their respective ontology’s namespace. An ontology namespace is defined as the prefix used to refer to the ontology (for example, the Bibliographic Ontology‘s prefix is bibo, so :bibo/shortTitle would be the key referring to the property http://purl.org/ontology/bibo/shortTitle).
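
Here is a small sketch of what these namespaced keywords give us; the :skos/prefLabel keyword is only there to show that two properties sharing a local name do not clash:

[cc lang='lisp' line_numbers='false']
[raw];; A namespaced keyword: the ontology prefix is the keyword's namespace
(namespace :bibo/shortTitle) ;; => "bibo"
(name :bibo/shortTitle)      ;; => "shortTitle"

;; Two properties that share the same local name never clash,
;; because their ontology prefixes differ
(= :iron/prefLabel :skos/prefLabel) ;; => false[/raw]
[/cc]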

Usage

Now let’s see how we can work with such a structure in Clojure:

[cc lang='lisp' line_numbers='false']
[raw];; Return the values of the :rdf/type property
(:rdf/type resource)
(resource :rdf/type)
(get resource :rdf/type)

;; Return all the properties that describe the resource
(keys resource)

;; Get the URI of the first person known by Fred
(:uri (first (:foaf/knows resource)))

;; Get the French name of the first person known by Fred
(:value (second (:iron/prefLabel (first (:rei (first (:foaf/knows resource)))))))

;; Update the name of Fred to Frederick
(update-in resource [:iron/prefLabel :value] str "erick")

;; Output the difference between the original resource and the updated one
;; (diff comes from the clojure.data namespace)
(require '[clojure.data :refer [diff]])
(diff resource (update-in resource [:iron/prefLabel :value] str "erick"))

;; Find the key/value pair (map entry) for a key
(find resource :iron/prefLabel)

;; Select values of multiple keys
(select-keys resource [:iron/prefLabel :foaf/knows])

;; Merge a resource into another resource. The URI and properties of the
;; latter resource are kept in the merged resource
(def res-1 {:uri "http://foo.com/datasets/test/1"
            :rdf/type owl/Thing
            :iron/prefLabel "Preferred Label"})

(def res-2 {:uri "http://foo.com/datasets/test/2"
            :rdf/type owl/Thing
            :iron/altLabel "Alternative Label"})

(merge res-1 res-2)[/raw]
[/cc]

That is all good and easy. We use Clojure’s core functions and mechanisms to easily manipulate RDF data in our application.

However, is this implementing the intuitions I started with? Definitely not. This is more like a conventional serialization format for RDF, just like structJSON. The thing here is that if we want to do any kind of validation on this data, if we want the data to be self-aware of its own semantics, then it is not possible when the keys are keywords. We would need external mechanisms to create that map structure, then to check what it refers to (the properties, the types, etc.). Then we would have to look these up in their respective ontologies, and finally we would have to validate the data structure according to what these ontologies are saying by re-processing that map structure.

This is not quite what I had in mind and what my intuition was telling me.

Serializing RDF using Maps and Symbol Keys

Let’s push this idea further. What if the keys of the map that represents our RDF data are not keywords, but symbols? Symbols in Clojure name things like vars, functions, etc. Initially, let’s use symbols that refer to the URI (string) of the <predicate> and the types.

The serialization would look like:

[cc lang='lisp' line_numbers='false']
[raw](def resource {uri "http://foo.com/1"
               rdf/type [foaf/Person owl/Thing]
               iron/prefLabel {value "Fred"
                               datatype xsd/string}
               foaf/knows [{uri "http://foo.com/2"
                            rei [{iron/prefLabel [{value "Bob"
                                                   lang "en"}
                                                  {value "Robert"
                                                   lang "fr"}]}]}
                           {uri "http://foo.com/3"
                            rei [iron/prefLabel "Mike"]}]})[/raw]
[/cc]

Now our resource is defined with the same structure, except that the keys are actual symbols. In this second iteration, we will consider that the symbols we defined here represent a string which is the URI of the predicate or the type.

The real advantage of using symbols over keywords for what we are doing with this RDF serialization is that a symbol can:

  • Have a docstring
  • Have meta-data
  • Be evaluated, which results in the actual full URI of the predicate/type

These are obvious enhancements over using keywords. First, being able to define docstrings means that we will be able to document these properties and types such that Clojure IDEs can display the documentation of these symbols while you are writing/editing RDF data in Clojure.
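
To give an idea of what this could look like, here is a minimal sketch where each predicate/type is simply a var defined in the namespace of its ontology prefix; the namespaces and docstrings are illustrative only:

[cc lang='lisp' line_numbers='false']
[raw];; Each predicate/type is a var whose value is its full URI and whose
;; docstring documents it.
(ns iron)

(def ^{:doc "iron:prefLabel -- the preferred label of a resource."}
  prefLabel
  "http://purl.org/ontology/iron#prefLabel")

(ns rdf)

;; note: this shadows clojure.core/type inside the rdf namespace
(def ^{:doc "rdf:type -- relates a resource to one of its classes."}
  type
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type")

(in-ns 'user)
(require 'clojure.repl)

;; Evaluating a symbol yields the full URI of the predicate/type
iron/prefLabel ;; => "http://purl.org/ontology/iron#prefLabel"
rdf/type       ;; => "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

;; and its docstring is available to IDEs and at the REPL
(clojure.repl/doc rdf/type)[/raw]
[/cc]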

Clojure’s meta-data system will be highly leveraged in the final candidate serialization format that I will cover in another blog post, so I won’t discuss it further for the moment.

Finally, once we evaluate such a map, we get the map along with all the evaluated properties/types, which refer to their full URIs. The evaluation of such a structure [(eval resource)] looks like:

[cc lang='lisp' line_numbers='false']
[raw]{uri "http://foo.com/1",
 "http://www.w3.org/1999/02/22-rdf-syntax-ns#type" ["http://xmlns.com/foaf/0.1/Person"
                                                    "http://www.w3.org/2002/07/owl#Thing"],
 "http://purl.org/ontology/iron#prefLabel" {value "Fred",
                                            datatype "http://www.w3.org/2001/XMLSchema#string"},
 "http://xmlns.com/foaf/0.1/knows" [{uri "http://foo.com/2",
                                     rei [{"http://purl.org/ontology/iron#prefLabel" [{value "Bob", lang "en"}
                                                                                      {value "Robert", lang "fr"}]}]}
                                    {uri "http://foo.com/3",
                                     rei ["http://purl.org/ontology/iron#prefLabel" "Mike"]}]}[/raw]
[/cc]

As you can see, we can get the full description of this resource with the full expansion of the URIs referenced by the symbols.

The same parsing rules defined in the previous section apply to this new format that uses symbols instead of keywords. The same comments regarding namespaces apply here too.

The usage is nearly identical, except that a symbol, unlike a keyword, does not act as a function of the map. This means that you cannot get the value of a key like this when the key is a symbol:

[cc lang=’lisp’ line_numbers=’false’]
[raw](rdf/type resource)[/raw]
[/cc]

What you have to do is to access it using one of the following two methods:

[cc lang=’lisp’ line_numbers=’false’]
[raw](resource rdf/type)
(get resource rdf/type)[/raw]
[/cc]

Even if we improved upon using keywords as keys for the map, we still don’t have any kind of embedded semantics or auto-validation capabilities, as my intuition was telling me we should. It remains the same kind of structure, without much significant improvement.

Serializing RDF using Maps and Symbol Keys Referring to Functions

Let’s change our minds and let this idea of symbols evolve: what if the symbols we define in the map refer to functions instead of strings?

What!?!?

A function could be the key of a map in Clojure?

Well, not directly, but yes. In Clojure, symbols name different things, such as functions. This is quite an important feature of Clojure: it makes the distinction between how things are named and the actual things themselves.

This means that what is really used as a key in our map structure is a symbol. However, that symbol happens to refer to a function. So it is not the function itself that is used as the key, but the thing that refers to it, which is the symbol.

However, the result is the same: if we evaluate the map, we will get a series of symbols that evaluate to functions. That is exactly what we were looking for: that little gem, hanging around, just waiting to be picked up.

This opens an overwhelming number of possibilities. This means that we have a data structure that can be evaluated to a series of functions and that can be executed. That is exactly what should enable us to define that Portable [RDF] Data serialization format.

That means that we won’t only be able to define RDF triples as Clojure code, but that we could even execute that Clojure code to do different things with the data, such as auto-validating itself, etc.

Finally, what if we consider RDF predicates as Clojure functions? Predicates have all kinds of properties and semantics. They can be specified to be used to describe only certain kinds of resources, or to refer to specific types of values. Predicates can be symmetric, functional, transitive, etc. What if we simply implement these characteristics as Clojure functions? This is what this whole thing is meant to be. When evaluating and “running” that RDF map structure, we would simply execute these functions that define the semantics and characteristics of these predicates. That is exactly where my intuition lies: we would end up with an RDF serialization format that “embeds” its own semantics and that can validate itself by executing the structure. That is what I would refer to as Portable Data: stateful data with embedded stateful semantics.
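
To make this idea a bit more concrete before the next post, here is a minimal sketch, with purely illustrative function names and checks, of predicates implemented as validating Clojure functions; this is not the final serialization format, only a taste of the mechanism:

[cc lang='lisp' line_numbers='false']
[raw](ns iron)

(defn prefLabel
  "iron:prefLabel is a datatype property: its value must be a literal."
  [v]
  (let [literal (if (map? v) (get v :value) v)]
    (if (string? literal)
      v
      (throw (ex-info "iron:prefLabel expects a literal value" {:value v})))))

(ns foaf)

(defn knows
  "foaf:knows is an object property: its value must reference a resource."
  [v]
  (if (or (string? v) (and (map? v) (get v :uri)))
    v
    (throw (ex-info "foaf:knows expects a resource reference" {:value v}))))

(in-ns 'user)

;; The resource is plain data: its predicate keys are (unevaluated)
;; symbols that name the functions defined above.
(def resource '{:uri "http://foo.com/1"
                iron/prefLabel {:value "Fred"}
                foaf/knows {:uri "http://foo.com/2"}})

;; "Running" the data: each predicate symbol resolves to its function,
;; and applying it to its value validates that part of the description.
(doseq [[k v] (dissoc resource :uri)]
  ((resolve k) v))[/raw]
[/cc]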

The initial version of this other revision of the RDF serialization as Clojure code will be outlined in the next blog post, since its discussion warrants a full blog post in itself. However, I think that you can start understanding where I am heading with these intuitions and why I am using Clojure to test them.

Once an initial version of this serialization is outlined, we will see how it can be used, what its benefits are, how the idea of Portable Data could be leveraged, and how it can help create and manage data using traditional IDEs such as Emacs. Once the basics are outlined, we will have all the leisure to explore the benefits of this concept.

Data as Code. Code as Data: Tighter Semantic Web Development Using Clojure

I have been professionally working in the field of the Semantic Web for more than 7 years now. I have been developing all kinds of ontologies. I have been integrating all kinds of datasets from various sources. I have been working with all kinds of tools and technologies using all kinds of technology stacks. I have been developing services and user interfaces of all kinds. I have been developing a set of 27 web services packaged as the Open Semantic Framework and re-implemented the core Drupal modules to work with RDF data as I wanted them to. I did write hundreds of thousands of lines of code with one goal in mind: leveraging the ideas and concepts of the Semantic Web to make me, other developers, ontologists and data scientists work more accurately and efficiently with any kind of data.

However, even after doing all that, I was still feeling a void: a disconnection between how I was thinking about data and how I was manipulating it using the programming languages I was using, the libraries I was leveraging and the web services that I was developing. Everything is working, and is working really well; I did gain a lot of productivity in all these years. However, I was still feeling that void, that disconnection between the data and the programming language.

Every time I want to work with data, I have to get that data serialized using some format, then I have to parse it using a parser available in the language I am working with. Then the data needs to be converted into an internal structure by the parser. Then I have to use all kind of specialized APIs to work with the data represented by that structure. Then if I want to validate the data that I am working with, I will probably have to use another library that will perform the validation for me which may force me to migrate that data to another system that will make it available to these reasoners and validators. Etc, etc, etc…

All this is working: I have been doing this for years. However, the level of interaction between all these systems is big, and the integration takes time and resources. Is there a way to do things differently?

The Pink Book

Once I realized that, I started a quest to try to change that situation. I had no idea where I was heading, and what I would find, but I had to change my mind, to change my viewpoint, to start getting influenced by new ideas and concepts.

What I realized is how disconnected mainstream programming languages may be from the data I was working with. That made a natural first step for my investigation. I turned my chair and started to stare at my bookshelves. Then, like the One Ring, there was this little Pink (really pink) book that was staring at me: Lambda-calcul, types et modèles. I bought that book probably 10 years ago, then I forgot about it. I always found its cover page weird, and its color awkward. But, because of these uncommon features, I got attracted by it.

Re-reading about lambda calculus opened my eyes. It led me to take a particular interest in homoiconic programming languages such as Lisp and some of its dialects.

Code as Data. Data as Code.

Is this not what I was looking for? Could this not fill the void I was feeling? Is this not where my intuition was heading?

What if the “data” I manipulate is the same as the code I am writing? What if the data that I publish could be the code of a module of an application? What if writing code is no different than creating data? What if data could be self-aware of its own semantic? What if by evaluating data structures, I would validate that data at the same time? What if “parsing” my data is in fact evaluating the code of my application? What if I could reuse the tools and IDEs I use for programming, but for creating, editing and validating data? Won’t all these things make things simpler and make me even more productive to work with data?

My intuition tells me: yes!

We have a saying at Structured Dynamics: the right tool for the right job.

That seems to be the kind of tool I need to fill that void I was feeling. I had the feeling that the distinction between the code and the data should be as minimal as possible and homoiconic languages seems to be the right tool for that job.

Code as Data. Data as Code.

That is all good, but what does that really mean? What are the advantages and benefits?

That is the start of a journey, and this is what we will discover in the coming weeks and months. Structured Dynamics is starting to invest resources into that new project. We chose to do our work using Clojure instead of other Lisp dialects such as Common Lisp. We chose Clojure for many reasons: it is compiled into JVM bytecode, which means that you can re-use any of this code in any other Java application, and that you can re-use any Java library natively from Clojure. But we also chose it because of its native way of handling concurrency and parallelism, its unique way of managing metadata within data structures, and its meta-programming capabilities using its macro system, which enables us to create DSLs, etc.

The goal was to create a new serialization format for RDF and to serialize RDF data as Clojure code. The intuition was that RDF data would then become an integral part of Clojure applications because the data would be the code as well.

The data would be self-aware of its own semantics, which means that by evaluating the Clojure “RDF” code it would also auto-validate itself using its embedded semantics. The RDF data would be in itself a [Clojure] application that would be self-aware of its own semantics and that would know how to validate itself.

That is the crux of my thinking. Then, how could this be implemented?

That is what I will cover in the coming weeks and months. We chose to use Clojure because it seems to be a perfect fit for that job. We will discover the reasons over time. However, the goal of these blog posts is to show how RDF can be serialized into [Clojure] code and the benefits of doing so. It is not about showing all the neat features of, and the wonderful thinking behind, Clojure. For that, I would strongly suggest you get started with Clojure by reading the material covered in Tips for Clojure Beginners, and particularly take a few hours to listen to Rich Hickey’s great videos.

 

 

3.5 Million DBpedia Entities in Drupal 7

In the previous article, Loading DBpedia into the Open Semantic Framework, we explained how we could load the 3.5 million DBpedia entities into an Open Semantic Framework instance. In this article, we will show how these millions of entities can be used in Drupal for searching, browsing, mapping and templating.

Installing and Configuring OSF for Drupal

This article doesn’t cover how OSF for Drupal can be installed and configured. If you want to properly install and configure OSF for Drupal, you should install it using the OSF Installer by running this command:

[cc lang='bash' line_numbers='false']
[raw]
./osf-installer --install-osf-drupal
[/raw]
[/cc]

Then you should configure it using the first section of the OSF for Drupal user manual.

Once this is done, you will have to register the OSF instance that hosts the DBpedia dataset, and then register the DBpedia dataset into the Drupal instance. The only other thing you will have to do is to make sure that Drupal’s administrator role has access to the DBpedia dataset. It can be done by using the PMT (Permissions Management Tool) by running the following command:

[cc lang='bash' line_numbers='false']
[raw]
pmt --create-access --access-dataset="http://dbpedia.org" --access-group="http://YOUR-DRUPAL-DOMAIN/role/3/administrator" --access-perm-create="true" --access-perm-read="true" --access-perm-delete="true" --access-perm-update="true" --access-all-ws
[/raw]
[/cc]

Searching Entities using the Search API

All the DBpedia entities are searchable via the SearchAPI. This is possible because of the OSF SearchAPI connector module that interfaces the SearchAPI with OSF.

Here is an example of such a SearchAPI search query. Each of these results comes from the OSF Search endpoint. Each result is templated using the generic search result template, or other entity-type search templates.

What is interesting is that depending on the type of the entity to display in the results, its display can be different. So instead of having an endless list of results with titles and descriptions, we can have different displays depending on the type of the record, and the information we have about that record.

[Image: dbpedia_search_3]

In this example, only the generic search template got used to display these results. Here is the generic search results template code:

[gist id=”8560305″]

Manipulating Entities using the Entity API

The Entity API is a powerful Drupal API that lets developers and designers load and manipulate entities that are indexed in the data store (in this case, OSF). The full Entity API is operational on the DBpedia entities because of the OSF Entities connector module.

As you can see in the template above (and in the other templates to follow), we can easily use the Entity API to load DBpedia entities. In these template examples, what we are doing is using this API to load the entities referenced by an entity. In this case, we do this to get their labels. Once we have loaded the entity, we end up with an Entity object that we can use like any other Drupal entity:

[gist id=”8585440″]

Mapping Entities using the sWebMap OSF Widget

Because a big number of DBpedia entities have geolocation data, we wanted to test the sWebMap OSF Widget to be able to search, browse and locate all the geolocalized entities. What we did is create a new Content Type. Then we created a new template for that content type that implements the sWebMap widget. The simple template we created for this purpose is available here:

[gist id=”8561181″]

Then, once we load a page of that Content Type, we can see the sWebMap widget populated with the geolocalized DBpedia entities. In the example below, we see the top 20 records in that region (USA):

[Image: dbpeida_swebmap_2]

Then what we do is to filter these entities by type and attribute/values. In the following example, we filtered by RadioStation, and then we are selecting a filter to define the type of radio station we are looking for:

[Image: dbpeida_swebmap_3]

Finally, we add even more filtering options to drill down into the geolocalized information we are looking for.

[Image: dbpeida_swebmap_4]

We end up with all the classical radio stations that broadcast in the region of Pittsburgh.

[Image: dbpeida_swebmap_5]

Templating Entities using Drupal’s Templating Engine

Another thing we get out of the box with Drupal and OSF for Drupal is the possibility to template the entity view pages and the search resultsets. In both cases, the selection of the template is done depending on the type of the entity to display.

With OSF for Drupal, we created a template selection mechanism that uses the ontologies’ structure to select the proper templates. For example, if we have a Broadcaster template, then it could be used to template information about a RadioStation or a TelevisionStation, even if templates for these types do not exist.

Here is an example of a search resultset that displays information about different types of entities:

[Image: dbpedia_search_2]

The first entity is an organization that has an image. It uses the generic template. The second one is a person, which also uses the generic template, but has no image. Both use the generic template because neither the Organization nor the Person template has been created. However, the third result, a RadioStation, uses a different template: the Broadcaster template, since the RadioStation class is a sub-class of Broadcaster and the Broadcaster template exists in the Drupal instance.

Here is the code of the Broadcaster search result template:

[gist id=”8581527″]

Now let’s take a look at the template that displays information about a specific Entity type:

[Image: dbpedia_entity_view]

This minimal record page displays some information about this radio station. The code of this template is:

[gist id=”8585895″]

Building Complex Search Queries using the OSF Query Builder

A system administrator can also use the OSF Query Builder to create more complex search queries. In the following query, we are doing a search for the keyword “radio”, we are filtering by type RadioStation, and we are boosting the scoring value of all the results that have the word “life” in their slogan.

[Image: dbpeida_querybuilder_1]

The top result is a radio station in Moscow that has “Life in Motion!” as its slogan. We can also see the impact of the scoring booster on the score of that result.

Conclusion

As we can see with these two articles, it is relatively easy and fast to import the DBpedia dataset into an OSF instance. By doing so, we end up with a series of tools to access, manage and publish this information. Then we can leverage the OSF platform to create all kinds of web portals or other web services. All the tools are there, out of the box.

This being said, this is not where the challenge lies. The thing is that there are more than 500 classes and 2000 properties that describe all the content present in the DBpedia Ontology. This means that more than 2000 filters may exist for the Search API, the sWebMap widget, etc. This also means that more than 500 Drupal bundles can be created, with hundreds of fields, etc.

All this needs to be properly configured and managed by the Drupal site developer. However, mechanisms have been developed to help them manage this amount of information, such as the entity template selection mechanism that uses the ontologies’ structure to select the display templates to use. For example, you could focus on the entity Broadcaster, and create a single template for it. Automatically, this template could be used by sub-classes such as BroadcastNetwork, RadioStation, TelevisionStation and many others.

The Open Semantic Framework is really flexible and powerful, as you may have noticed with this series of two articles. However, the challenge and most of the work lie in creating and configuring the portal that will use this information: creating the search and entity templates, properly defining and managing the bundles and fields, etc.

Open Semantic Framework version 3.0 Released!

I am really proud to announce the release of the Open Semantic Framework version 3.0. This is a major milestone for the OSF platform and it includes important new features and improvements.

The updated platform has just emerged from more than a year and a half of full-time development sponsored by one of Structured Dynamics’ clients: Healthdirect Australia. OSF’s development has been highly influenced by the big enterprise requirements of the HDA sponsor, resulting in two portals fully operated by OSF: healthinsite and Pregnancy, Birth and Baby. OSF 3.0 is already in production with these two portals, but it will continue to constantly evolve in the coming months and years.

The OSF release is major in a number of ways. The first thing you will notice is that we re-branded the entire project, which includes all of its moving parts, around the OSF name. OSF for Drupal (previously known as conStruct) was migrated to Drupal 7 and about 80% of its code was re-written. Seven new OSF Web Services (previously known as structWSF) were created. The old IP-based security layer was completely replaced by a new key-based security layer. A new revisioning system has been put in place to revision every record as it changes. A new caching layer has been added to the OSF Web Services to improve performance and decrease the load on the other pieces of the OSF stack (about 80% of the non-search queries will hit the cache). A set of command line tools has been developed to help system administrators manage and automate tasks on OSF instances. A set of system integration tests, which is composed of 746 tests and 4139 assertions, tests all of the functionalities of the system to make sure it is properly deployed on a server. The OSF Wiki has been completely rewritten and re-organized to help users and developers find answers to their questions.

You can check the list of all the OSF 3.0 features, and the list of all the new features in OSF 3.0. Now let’s see what this new release is really all about.


The OSF (Open Semantic Framework) Brand

The first thing you will notice with this new OSF 3.0 release is that the whole project got re-branded around the OSF terminology. The Open Semantic Framework (OSF) stack is now composed of the components covered in the sections below.

OSF Web Services

The OSF Web Services have changed drastically since version 1.1. Most of the code got re-written, a new structure has been put in place, new features and new web service endpoints got created, etc. In this section, we will cover what changed in the OSF Web Services and what these new features are.

New Security Layer

Initially, we created a simple and effective security layer for the OSF Web Services. It was based on the IP of the requester, nothing more, nothing less. That was five years ago. This simple security layer was quite effective, but it was a nightmare to manage.

What we did for OSF 3.0 is to ditch this old security layer, and to replace it by something secure and much easier to manage.

The new security layer does two things:

  1. Validates the web service call
  2. Validates the data access of the user

To validate the web service call, the new security system uses a secret key authentication system. Every HTTP query that is sent to any web service endpoint needs to comply with the security protocol. If it doesn’t, the request will be refused.

Then if the call is authenticated, the web service endpoint will make sure that the requesting user has proper access to the datasets that are being queried. This second authentication step makes sure that the user can only access the data to which he has access rights.

The real improvement of the new security layer is how the users are managed. In the past, we were managing individual IP addresses. Now, we are managing groups of users. All dataset access permissions to records are related to a group. Each group is composed of one or multiple users. Then, when a web service endpoint checks if a requesting user has access to the content of a certain dataset, it checks if the requesting user belongs to a group that has access to the content of that dataset.

It is now much easier to manage groups of users at the level of the dataset than individual IP addresses.

New Revisioning System

A new records revisioning system is now available in OSF. If required, every change to a record can be revisioned. This means that if someone makes an error when editing a record, all changes can be rolled back at any time using the new revisioning system.

A new set of web service endpoints has been created to manage the revisions. You can list, read, update, delete, and compare revisions with these new endpoints.

New Web Service Endpoints

A series of new web service endpoints has been created.

Multi-Language Support

All of the web services that create, update or read data from OSF now have multi-lingual capabilities. If you are creating data, the only thing you have to do is to specify the language for each literal you are defining in the RDF documents you are indexing in OSF. If you are reading or searching data, you only have to specify the language you want to use for each web service query you are creating.

New Caching Layer

OSF is a stack that includes a multitude of underlying systems such as Virtuoso, Solr, OWLAPI, GATE, etc. Depending on the web service endpoints that are used, and depending on how they are used, the same query can be requested again and again, and each of these background services may be queried again and again too.

To improve the performance of each of the OSF Web Services, and to minimize the usage of these underlying systems as much as possible, we added a caching layer at the level of the web service endpoints. The result is that every OSF Web Services query is cached in the caching layer. This means that every time the same query is requested twice, the second time the results will come from the caching layer.

The caching system that is used by OSF is Memcached. More information about the OSF Cache can be read on the OSF Wiki.

Improved Search

The Search web service endpoint, which is by far the most used OSF web service endpoint, also improved quite a lot in this new version.

First, the Search endpoint is now using the eDisMax query parser. In itself, this changes everything in the endpoint and leads to the creation of multiple new search functionalities.

It is now possible to change the ranking of the search results by boosting the scoring of the results based on different things such as their dataset provenance, their types or any of their attribute/values. This enables the possibility to improve the quality of the results returned on an OSF web portal depending on the context of a search and the semantics of the records being searched.

It is also now possible to add restrictions to the search queries. This means that search keywords will be restricted to a set of attributes. Then it is also possible to boost the scoring of the returned results depending on where the search keywords appeared.

There is a new spell-checker function for the search queries. This means that if no results are returned for a specific search query, then the system will return a series of possible keywords that the user may want to use to re-initiate the search query.

Finally, an extended search query syntax is now supported by the Search endpoint. This enables more complex search queries to be sent to the Search endpoint, opening the door to the creation of more complex contextual search profiles queries.

New Interfacing Mechanism

A new interfacing mechanism has been put in place for the OSF Web Services. An interface is the code that is run by the web service endpoint for a given query.

An interface corresponds to a specific version of a web service endpoint. Two different interfaces, for the same endpoint, may comply with different versions of its API. However, these two interfaces can work side-by-side using the same data.

If two interfaces comply with the same endpoint API version, it means that their processing of the query will be different (like querying Solr 4.0 instead of 3.6). If two interfaces don’t comply with the same endpoint API version, then it means that each interface supports a different version of the endpoint.

This new interfacing mechanism comes in handy to support more than one triple store, or when the same OSF instance needs to use different Solr query parsers, or when some of the endpoints have to be backward compatible for some portals/users that still need to be supported by the OSF instance, etc.

The new interfacing mechanism gives the flexibility to run different code or support different web service API versions on the same OSF instance.

OSF for Drupal

OSF for Drupal now runs on Drupal 7. About 80% of the Drupal-related code got rewritten and we can now state that OSF is fully integrated into Drupal.

Drupal Connectors

A series of OSF connectors have been developed in the last year and a half that basically let Drupal’s core features use OSF instead of MySQL: Entity & Entity API, FieldAPI & FieldStorage and the SearchAPI. These connectors mean that if OSF for Drupal is installed and configured on a Drupal 7 instance, developers will be able to use these core APIs to query registered decentralized OSF instances instead of local MySQL/Solr instances.

OSF Entities

The OSF Entities connector module implements the Drupal Entity API. This means that if OSF for Drupal is properly installed and configured on a Drupal instance, the Entity API can be used to read, create, update and delete content from registered external OSF Web Services networks. Under this scenario, no information about these Drupal entities will be local to the Drupal instance. All of the content will be hosted externally on a dedicated OSF instance. All of the data manipulated by the Entity API is RDF data. What that means is that the Entity API may now interface with an RDF data management system, with communication happening via web service endpoint queries.

In short, this connector makes OSF records visible to Drupal via the Entity API.

OSF FieldStorage

The OSF FieldStorage connector module creates a new FieldStorage type that enables Drupal users to save Drupal content into an OSF instance instead of saving the content in the default storage system (namely MySQL). This means that if someone starts using OSF as the backend of a Drupal portal, then all the Drupal content that is created will be available via the OSF web service endpoints. This means that other external applications that know how to talk to OSF web service endpoints are able to leverage the content that has been created from the Drupal instance. Also, all of the content will be available as RDF.

In the end, what this connector does is save Drupal entities into OSF instead of the default storage system (MySQL).

OSF SearchAPI

The OSF SearchAPI connector module creates a new service for the SearchAPI module. It enables the SearchAPI to send search queries to an OSF Search web service endpoint instead of the default search service. This means that the Drupal search engine is now fully powered by the OSF Search endpoint, and gives access to all the datasets hosted on one, or multiple, remote OSF instances.

Better Configuration & Management

Registering, configuring and managing OSF instances and datasets in Drupal has never been easier. The new OSF Configure module centralizes all of the features and options that are required to register, configure and maintain OSF instances and datasets.

QueryBuilder & Search Profiles

A new kind of tool has been developed in OSF for Drupal 3.x: OSF Search Profiles. A search profile is a predefined search query whose search results are displayed in a block positioned on some Drupal pages. These search profiles are normally used to display lists of information that match a search query. Search profiles are also, to some extent, aware of their context. For example, if the main topic of a page is cancer and we have a search profile that displays a list of events, then when the search profile is used in the context of that page about cancer, cancer-related events should be displayed. That is one of the core purposes of the search profiles.

The search profiles’ underlying search queries are being created using the new OSF Query Builder module. This powerful user interface enables site administrators to create complex search queries that will be used within a search profile.

OSF Web Services PHP API

In prior versions, knowing how to query the OSF web services was not an easy task. That is the reason why the OSF Web Services PHP API was developed: to help developers easily query OSF web service endpoints. This PHP API is a set of classes, each of which has a series of methods that can be used to query a particular web service endpoint. Let’s take this example of some OSF WS PHP API code that sends a query to the OSF Search web service endpoint:

[cc lang='php' line_numbers='false']
[raw]
//
// Step #1: Instantiate the class of the web service you want to query
//

// Create the SearchQuery object
$search = new SearchQuery('http://localhost/ws/', 'some-app-id', 'some-api-key', 'http://localhost/users/foo');

//
// Step #2: Define all the parameters/features/behaviors of the web service by invoking different methods of the class
//

$resultset = $search->enableInference()
                    ->excludeAggregates()
                    ->items(20)
                    ->page(40)
                    ->query("forest")
                    ->send()
                    ->getResultset();

// Print the PHP array serialization for that resultset
print_r($resultset->getResultset());
[/raw]
[/cc]

OSF Management Tools

A new set of command line tools has been developed for OSF version 3.0. These tools’ focus has been to help OSF instance administrators by giving them command line tools that they can use in their scripts, cron jobs, or any other middleware tooling that may perform different tasks on an OSF instance.

Datasets Management Tool

The Datasets Management Tool (DMT) is a command line tool used to manage the datasets of an OSF instance. With this tool, you may create, delete, update, import and export datasets directly from the command line.

Ontologies Management Tool

The Ontologies Management Tool (OMT) is a command line tool used to manage the ontologies of an OSF Web Services network instance. It can be used to list the ontologies of an OSF Web Services instance, to manage those ontologies, to create/import new ones, to delete existing ones, and to generate the underlying ontological structures.

Permissions Management Tool

The Permissions Management Tool (PMT) is a command line tool used to manage access permissions on an OSF Web Services network instance. This tool is used to list, create and delete access permissions, groups and users.

Data Validator Tool

The Data Validator Tool (DVT) is a command line tool used to perform a series of post-indexation data validation tests. What this tool does is to run a series of pre-configured tests, and return validation errors if any are found.

OSF Widgets

All the OSF Widgets (formerly the Semantic Components) have been updated to work with OSF 3.0. The big difference with this update is that all of the OSF Widgets now have access to an OSF for Drupal proxy. This proxy enables them to communicate with an OSF Web Services instance without having to authenticate themselves to the endpoints.

OSF Wiki

The OSF Wiki has been completely rewritten and re-organized. It is the go-to place to find more information about the Open Semantic Framework project, and all pieces of the stack.

Installing and Configuring OSF

OSF Installer

Installing and configuring OSF has never been easier. The OSF Installer utility has been improved to ease the deployment of OSF on a new Ubuntu 12.10 server. The installation tool will install and configure all the pieces required by the OSF stack. Once everything is installed and configured, it will run the OSF Tests Suites to make sure that all the OSF functionalities are fully operational on the new server.

Then, once the OSF stack is installed, the user is able to use the OSF Installer tool to install, deploy and configure Drupal 7 with OSF for Drupal.

OSF EC2

Additionally, we created a new public Amazon AWS EC2 image that includes the full OSF stack version 3.0. This new public image is available in all the zones:

Region arch root store AMI
us-east-1 64-bit EBS ami-afe4d1c6
us-west-1 64-bit EBS ami-d01b2895
us-west-2 64-bit EBS ami-c6f691f6
eu-west-1 64-bit EBS ami-883fd4ff
sa-east-1 64-bit EBS ami-6515b478
ap-southeast-2 64-bit EBS ami-4734ab7d
ap-southeast-1 64-bit EBS ami-364d1a64
ap-northeast-1 64-bit EBS ami-476a0646

Once you create a new instance from that image, you will have to properly configure it to make it secure and fully operational. The only thing you have to do is to follow the steps outlined in the Creating and Configuring an Amazon EC2 AMI OSF Instance manual.

System Integration Tests

A complete suite of integration tests has been created for OSF 3.0. The tests suites are composed of 746 tests and 4139 assertions. These integration tests make sure that all of the functionality of an OSF instance is working. These tests are run every time an OSF instance is deployed using the OSF Installer script. Then, they can be re-run anytime thereafter. Normally, every time an update is made on an OSF instance, the tests should be run as well to make sure that the update didn’t break anything.

These tests are testing:

  • All of the input parameters of each endpoint
  • All of the combinations of all the input parameters of each endpoint
  • All of the mime types supported by each endpoint
  • All of the expected errors returned by each endpoint.

Conclusion

We have been working on this new Open Semantic Framework version 3.0 for almost two full years now. We have been quiet during that time since we had no time for anything other than coding, documenting, testing and deploying the code that we are releasing today.

This new version is a major leap forward for the Open Semantic Framework open source project. Five years ago, Mike and I set as a goal to have a complete OSF stack in place that could be leveraged by anybody to fulfill the requirements of any kind of project. I think that with this OSF version 3.0, we have reached the mid-term goal that we set for ourselves five years ago.

Neighbourhoods of Winnipeg: A Community Semantic Portal

Introduction

I am proud to announce the new NOW (Neighbourhoods Of Winnipeg) semantic web portal! This new and innovative semantic web portal was publicly announced by the Mayor of the City of Winnipeg last week.

The NOW (Neighbourhoods of Winnipeg) portal is “a new Web portal (the “Portal”) produced by the City of Winnipeg to provide broad, dynamic and interactive access to local and neighbourhood information. Designed for easy access and use by all citizens, businesses, community organizations and Governments, the information on the site includes municipal data, census and demographic information, economic development information, historical data, much spatial and mapping information, and facilities for including and sharing data by external groups and constituencies.”

I would suggest you read Mike Bergman’s blog post about this new semantic web portal to get the proper background on this initiative by the City of Winnipeg and how it uses the OSF (Open Semantic Framework) as its foundational technology stack.

This project has been the springboard that led to the Open Semantic Framework version 1.1. Multiple pieces of the framework have been developed in relation to this project, and more particularly pieces like the sWebMap semantic component and several improvements to the structWSF web services endpoints and conStruct modules for Drupal 6.

Development of the Portal

The development plan of this portal is composed of four major areas:

  1. Development of the data structure of the municipal domain by creating a series of ontologies
  2. Conversion of existing data asset using this new data structure
  3. Creation of the web portal by creating its design and by developing all the display templates
  4. Creation of new tools to let users interact with the data available on the portal

Structured Dynamics has been involved in #1, #2 and #4 by providing design and development resources, technology transfer sessions and material, and by supporting internal teams to create, maintain and deploy their 57 publicly available datasets.

The Data Structure

This technology stack does not have any meaning without the proper data and data structures (ontologies) in place. This gold mine of information is what drives the functionality of the portal.

The portal is driven by 12 ontologies: 2 internal and 10 external. The content of the 57 publicly available datasets is defined by the classes and properties defined in one of these ontologies.

The two internal ontologies have been created jointly by Structured Dynamics and the City of Winnipeg, but they are extended and maintained by the city only.

These ontologies are maintained using two different kinds of tools:

  1. Protege
  2. structOntology

Protege is used for the big development tasks, such as creating a big number of classes and properties, doing a big reorganization of the class structure, etc.

structOntology is used for quick ontological changes that have an immediate impact on the behavior of the portal, such as label changes, SCO ontology property assignments to change the behavior of some of the tools that exist in the portal, etc.

structOntology can also be used by portal users to understand the underlying data structure used to define the data available on the portal. All users have access to the reading mode of the tool, which lets them browse, search and export the ontologies loaded on the portal.

The Data

Except for rare exceptions such as the historical photos, no new data has been created by the City of Winnipeg to populate this NOW portal. Most of its content comes from existing internal sources of data such as:

  • Conventional relational databases
  • GIS (Geographic Information System) on-top of relational databases
  • Spreadsheets

All of the conventional relational databases and the legacy data from the GIS systems have been converted into RDF using the FME Workbench ETL system. All of the FME Workbench templates map the relational data into RDF using the ontologies loaded into the portal. All of the geolocated records that exist in the portal come from this ETL process and have been converted using FME.

Some smaller datasets come from internal spreadsheets that got modified to comply with the commON spreadsheet format that is used to convert spreadsheet (CSV/TSV) data files into RDF.

All of the dataset creation and maintenance is managed internally by the City of Winnipeg using one of these two data conversion and importation processes.

Here are some internal statistics of the content that is currently accessible on the NOW portal.

General Portal

These are statistics related to different functionalities of the portal.

  • Number of neighbourhoods: 236
  • Number of community areas: 14
  • Number of wards: 15
  • Number of neighbourhood clusters: 23
  • Number of major site sections: 7
  • Total number of site pages: 428,019
    • Static pages: 2,245
    • Record-oriented pages: 425,874
    • Dynamic (search-based) pages: infinite
  • Number of documents: 1,017
  • Number of images: 2,683
  • Number of search facets: 1,392
  • Number of display templates: 54
  • Number of links: 1,067
    • External links: 784
    • Internal links: 283

Site Data

These statistics show the things that are available via the portal, what are their types, their properties, what is the quantity of data that is searchable, manipulable and exportable from the portal.

  • Number of datasets: 57
  • Number of records: 425,874
    • Number of geolocational records: 418,869
      • Point of interest (POI) records: 193,272
      • Polygon records: 218,602
      • Path (route) records: 6,995
  • Number of classes (types): 84
  • Number of properties: 1,308
  • Number of triple assertions: 8,683,103

Sharing Content

An important aspect of this portal is that all of the content is contextually available, in different formats, to all of the users of the portal. Whether you are browsing content within datasets, searching for specific pieces of content, or looking at a specific record page, you always have the possibility to get your hands on the content that is being displayed to you, the user, with a choice of five different data formats:

Export Page Content

All content pages can be exported in one of the formats outlined above. In the bottom right corner of these pages, you will see an Export button that you can click to get the content of that page in one of these formats.

[Image: record_export]

Export Search Content

Every time you do a search on the portal, you can export the results of that search in one of the formats outlined above. You can do that by selecting the Export tab, and by selecting one of the formats you want to use for exporting the data.

[Image: browse_export]

Export Datasets

You can export any publicly available dataset from the portal. These datasets have to be exported in slices if they are too big to fit in a single slice. The datasets can be exported in one of the formats mentioned above.

[Image: datasets_export]

Export Census

Users also have the possibility to export census data, from the census section of the portal, in spreadsheets. They only have to select the Tables tab, and then to click the Export Spreadsheet button.

[Image: export_census]

Export Ontologies

The export functionality would not be complete without the ability to consult and export the ontologies that are used to describe the content exposed by the portal. These ontologies can be read from the ontologies reader user interface, or can be exported from the portal to be read by external ontologies management tools such as Protege.

[Image: ontologies_export]

Portal Design

The portal is using Drupal 6 as its CMS (Content Management System). The Drupal 6 instance communicates with structWSF using the conStruct module, which acts as a bridge between a Drupal portal and a structWSF web service network.

Here are the main design phases that have been required to create the portal:

  1. Creation of the portal’s design, and the Drupal 6 theme that implements it
  2. Creation of the Search and Browse results templates
  3. Creation of the individual records’ page design and templates based on their type
  4. Creation of the sWebMap search results templates.

The portal’s design has been created internally by the City of Winnipeg and by Tactica, based on the Citizen DAN demo. Tactica also worked on another Citizen DAN-like portal called MyPeg.ca.

Semantic Components

The NOW Web portal is using a series of tools that are called the Semantic Components. These are a set of Flash and JavaScript tools that can be embedded within any web page and that can easily communicate with structWSF instance(s). They display information in all kinds of charts, they can display document reading widgets, they can create dashboards of structured data, etc. The initial set of Semantic Components was developed for the MyPeg.ca project back in November 2010. This was before Steve Jobs announced that Apple would not support Adobe Flash, and far before Google announced that it would drop support for it as well.

Since the NOW portal wanted to re-use as much as possible to lower the development cost of the portal, they chose to use the complete OSF stack, which includes these Semantic Components.

However, when we participated in developing this new NOW portal, we did extend the set of Semantic Components by creating the most complex Semantic Component: the sWebMap. Because of the two announcements mentioned above, we chose to move forward and create the sWebMap Semantic Component using JavaScript instead of Flash. The other Semantic Component tools that were developed in Flash have not yet been ported to JavaScript.

Conclusion

The new NOW semantic web portal’s main asset is its data: how it can be searched (with traditional search engines or using a semantic component to search, browse, filter and localize results), displayed and exported. This portal has been developed using a completely free and open source semantic platform that has been developed from previous projects that open sourced their code.

I consider this portal a pioneer in the way municipal organizations will provide new online services to their citizens and to commercial enterprises, based on the quality of the data that will be exposed via such Web portals.