Revision of Serializing RDF Data as Clojure Code Specification

In my previous blog post, RDF Code: Serializing RDF Data as Clojure Code, I outlined a first version of what an RDF serialization could look like if it were expressed as Clojure code. However, after working with this proposal for two weeks, I found that a few of my initial assumptions turned out to be bad design decisions in terms of Clojure code.

This blog post will discuss these issues, and I will update the initial set of rules that I defined in my previous blog post. Going forward, I will use the current rules as the way to serialize RDF data as Clojure code.

What Was Wrong

After two weeks of using the previous set of serialization rules and developing all kinds of functions that use that code in the context of UMBEL graph traversal and analysis, I found the following issues:

Keys and values should be Vars
Ontologies should all be in the same namespace (and not in different namespaces)
The prefix/entity separator for the RDF resources should be a colon and not a slash

These are the three serialization rules that changed after working with the previous version of the proposal. Now, let’s see what caused these changes to occur.

Keys and Values as Vars

The major change is that when we serialize RDF data as Clojure map structures, the keys, and the values that are not strings, should be Vars.

There are three things that I didn’t properly evaluate when I first outlined the specification:

The immutable nature of the Clojure data structures
The dependency between ontologies
The non-cyclical namespaces dependency rule imposed by Clojure

In the previous proposal, every RDF property was a Clojure function, and these functions were also the keys of the Clojure maps used to serialize the RDF resources. That was working well. However, there was a side effect to this decision: everything was fine until the function’s internal ID changed.

The issue here is that when we work with Clojure maps, we are working with immutable data structures. This means that even if I create an RDF record like this:

[cc lang=’lisp’ line_numbers=’false’]
[raw](def mike {uri "http://foo.com/datasets/people/mike"
           rdf/type foaf/+person
           iron/pref-label "Mike"
           foaf/knows ["http://foo.com/datasets/people/fred"]})[/raw]
[/cc]

Now suppose that, somewhere in the compilation process, the RDF ontology file gets re-compiled: the internal ID of the rdf/type property (function) will then change. That means that if I create another record like this:

[cc lang=’lisp’ line_numbers=’false’]
[raw](def mike-2 {uri "http://foo.com/datasets/people/mike"
             rdf/type foaf/+person
             iron/pref-label "Mike"
             foaf/knows ["http://foo.com/datasets/people/fred"]})[/raw]
[/cc]

that uses the same rdf/type function, then these two records would refer to different rdf/type functions, since the function changed between the time I created the mike and the mike-2 resources. That may not look like an issue since both functions do exactly the same thing. However, it is an issue because multiple tasks that manipulate and query RDF data rely on comparing these keys (that is, these functions). That means that unexpected behaviors can happen, and they may even look random.

The issue here was that we were not referring to the Var that points to the function, but to the function itself. By using Vars as the keys and values of the map, we fix this inconsistency issue. All the immutable data structures we create now refer to the Var, which points to the function. That way, when we evaluate the Var, we get a reference to the same function no matter when it was created (before or after the creation of mike and/or mike-2). Here is what the mike record looks like with this modification:

[cc lang=’lisp’ line_numbers=’false’]
[raw](def mike {#'uri "http://foo.com/datasets/people/mike"
           #'rdf/type #'foaf/+person
           #'iron/pref-label "Mike"
           #'foaf/knows ["http://foo.com/datasets/people/fred"]})[/raw]
[/cc]

We use the #' reader macro to specify that the keys and values of the map are the Vars themselves, and not the actual functions or other values referenced by those Vars.
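The difference can be reproduced in a few lines. The following is a minimal sketch, not part of the specification: pref-label is a hypothetical stand-in for a property function, and evaluating its defn a second time simulates the ontology file being re-compiled.

```clojure
;; Hypothetical stand-in for an RDF property function.
(defn pref-label [] "label")

;; One record keyed by the function object, one keyed by the Var.
(def by-fn  {pref-label "Mike"})
(def by-var {#'pref-label "Mike"})

;; Simulate a re-compilation: the Var now points to a new function object.
(defn pref-label [] "label")

;; The new function object is not the one stored as a key, so this lookup fails.
(get by-fn pref-label)

;; The Var itself did not change, so this lookup still succeeds.
(get by-var #'pref-label)
```

The record keyed by the Var keeps working across redefinitions, which is the behavior the revised rules rely on.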

The second and third issues I mentioned are tightly related. In the RDF & OWL world, there are multiple examples of ontologies that re-use external ontologies to describe their own semantics. There are cases where an ontology A uses classes and properties from an ontology B, and where the ontology B uses classes and properties from the ontology A. They cross-use each other. Such usage cycles exist in RDF & OWL and are not that uncommon either.

The problem is that, at first, I was considering that each OWL ontology defined as Clojure code would be in its own Clojure namespace. However, if you are a Clojure coder, you can envision the issue that is coming: if two ontologies cross-use each other, then you have to create a namespace dependency cycle in your Clojure code… and you know that this is not possible because this is restricted by the compiler. This means that everything works fine until this happens.

To overcome that issue, we have to consider that all the ontologies belong to the same namespace (like clojure.core). In my next blog post, which will focus on the description of these ontologies, I will show how we can split the ontologies into multiple files while keeping them in the same namespace.

Now that all the ontologies have to be in the same namespace, and that we cannot use Clojure’s namespaced symbols anymore, I made the decision to use the more conventional way of writing namespaced properties and classes in other RDF serializations, which is to delimit the ontology’s prefix with a colon, like this:
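To see why a single namespace removes the problem, here is a small sketch under stated assumptions: the a: and b: prefixes, the a:related-to property and the example.org URIs are invented for illustration, and declare is just one way to let two definitions refer to each other. It is not necessarily how the ontologies will end up being organized.

```clojure
;; Everything below is assumed to live in one and the same namespace.

;; Forward declarations: the Vars exist before they get a value, so definitions
;; in the same namespace can reference each other in any order.
(declare uri a:related-to a:+widget b:+gadget)

;; Hypothetical ontology A refers to a class of ontology B ...
(def a:+widget {#'uri "http://example.org/a#Widget"
                #'a:related-to #'b:+gadget})

;; ... and hypothetical ontology B refers back to a class of ontology A.
(def b:+gadget {#'uri "http://example.org/b#Gadget"
                #'a:related-to #'a:+widget})

;; Following a reference means dereferencing the Var stored as the value.
(get @(get a:+widget #'a:related-to) #'uri)
```

Because the maps hold Vars rather than values, the cross-reference is only resolved when it is dereferenced, so the order of the two definitions does not matter.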

[cc lang=’lisp’ line_numbers=’false’]
[raw](def mike {#'uri "http://foo.com/datasets/people/mike"
           #'rdf:type #'foaf:+person
           #'iron:pref-label "Mike"
           #'foaf:knows ["http://foo.com/datasets/people/fred"]})[/raw]
[/cc]

Revision of the RDF Code Rules

Now let’s revise the set of rules that I defined in the previous blog post:

An RDF resource is defined as a Clojure map where:
1. Every key is a Var that points to a function
2. Every value is a:
  1. string
    1. A string is considered a literal if the key is an owl:DatatypeProperty
    2. A string is considered a URI if the key is an owl:ObjectProperty
  2. map
    1. A map represents a literal if the value key is present
    2. A map represents a reference to another resource if the uri key is present
    3. A map is invalid if it has neither a uri nor a value key
  3. vector
    1. A vector refers to multiple values. Values of a vector can be strings, maps, symbols or Vars
  4. symbol
    1. A symbol can be created to simplify the serialization. However, these symbols have to reference a string or a Var object
  5. Var
    1. A Var references another entity
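The rules for map values can be checked mechanically. This is only a sketch: uri and value are local placeholder Vars standing in for the ones in rdf.core, and map-kind is a hypothetical helper name that does not appear in the specification.

```clojure
;; Placeholder Vars standing in for rdf.core/uri and rdf.core/value.
(declare uri value)

(defn map-kind
  "Classifies a map value: a uri key makes it a reference to another resource,
  a value key makes it a literal, and anything else is invalid."
  [m]
  (cond
    (contains? m #'uri)   :reference
    (contains? m #'value) :literal
    :else                 :invalid))

(map-kind {#'uri "http://foo.com/datasets/people/bob"}) ; => :reference
(map-kind {#'value "Frederick"})                        ; => :literal
(map-kind {})                                           ; => :invalid
```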

In addition to these rules, there are some more specific rules such as:

The value of a uri key is always a string
If the #’rdf:type key is not defined for a resource, then the resource is considered to be of type #’owl:+thing (since everything is at least an instance of the owl:Thing class in OWL)
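The default type rule can be expressed with the not-found argument of get. The sketch below assumes placeholder Vars declared locally, and resource-types is a hypothetical helper introduced only for this example.

```clojure
;; Placeholder Vars; in the post these live in the shared ontology namespace.
(declare uri rdf:type owl:+thing foaf:+person)

(defn resource-types
  "Returns the types of a resource as a vector, falling back to #'owl:+thing
  when the resource has no #'rdf:type key."
  [resource]
  (let [t (get resource #'rdf:type #'owl:+thing)]
    (if (vector? t) t [t])))

(resource-types {#'uri "http://foo.com/datasets/people/mike"})
(resource-types {#'rdf:type #'foaf:+person})
```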

Finally, there are two additional classes and datatypes creation conventions:

The name of the classes starts with a + sign, like: #’owl:+thing
The name of the datatypes starts with a * sign, like: #’xsd:*string
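Since the convention is encoded in the name of the Var, it can be read back from the Var's metadata. This is a sketch under assumptions: entity-kind is a hypothetical helper, the Vars are declared locally only so the example runs on its own, and anything that is neither a class nor a datatype is simply reported as a property.

```clojure
(require '[clojure.string :as str])

;; Placeholder Vars following the naming conventions above.
(declare owl:+thing xsd:*string rdf:type)

(defn entity-kind
  "Guesses whether a Var names a class, a datatype or a property by looking
  at the first character that follows the ontology prefix."
  [v]
  (let [var-name (name (:name (meta v)))            ; e.g. "owl:+thing"
        local    (last (str/split var-name #":"))]  ; e.g. "+thing"
    (cond
      (str/starts-with? local "+") :class
      (str/starts-with? local "*") :datatype
      :else                        :property)))

(entity-kind #'owl:+thing)  ; => :class
(entity-kind #'xsd:*string) ; => :datatype
(entity-kind #'rdf:type)    ; => :property
```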

As you can see, the rules that govern the serialization of RDF data as Clojure code are minimal and should be simple to understand for someone who is used to Clojure code and who has tried to write a few resource examples using this format. Now, let’s apply these rules with a series of examples.

Note 1: in the examples of this blog post, I am referring to Vars like #'uri, #'value, #'lang, #'datatype, etc. To make the rules simpler to read and understand, consider that these Vars are defined in the user namespace. However, they are actually Vars defined in the rdf.core namespace, which will be made publicly available later.

Note 2: All the property and class resource Vars have been defined in the same namespace. They should be included with :require or :use, like (:use [ontologies.core]), from the ns declaration of the Clojure source code file that defines the RDF resource. We will discuss these namespaces in a subsequent blog post.

Revision of Serializing RDF Code in N-Triples

The serialize-ntriples function was modified to comply with the new set of rules:

[cc lang=’lisp’ line_numbers=’false’]
[raw](declare serialize-ntriples-map-value serialize-ntriples-string-value is-datatype-property?)

(defn serialize-ntriples
  [resource]
  (let [n3 (atom "")
        iri (get resource #'rdf.core/uri)]
    (doseq [[property prop-vals] resource]
      (let [property-uri (get (meta property) #'rdf.core/uri)]
        ; Don't do anything with the "uri" key
        (if (not= property #'rdf.core/uri)
          (if (vector? prop-vals)
            ; Here the value is a vector of maps or values
            (doseq [v prop-vals]
              (let [val (if (var? v) @v v)]
                (if (map? val)
                  ; The value of the vector is a map
                  (reset! n3 (str @n3 (serialize-ntriples-map-value val iri property-uri)))
                  (if (string? val)
                    ; The value of the vector is a string
                    (reset! n3 (str @n3 (serialize-ntriples-string-value val iri property-uri property)))))))
            (let [vals (if (var? prop-vals) @prop-vals prop-vals)]
              (if (map? vals)
                ; The value of the property is a map
                (reset! n3 (str @n3 (serialize-ntriples-map-value vals iri property-uri)))
                (if (string? vals)
                  ; The value of the property is some kind of literal
                  (reset! n3 (str @n3 (serialize-ntriples-string-value vals iri property-uri property))))))))))
    @n3))

(defn- serialize-ntriples-map-value
  [m iri property-uri]
  (if (not (nil? (get m #'rdf.core/uri)))
    ; The value is a reference to another resource
    (format "<%s> <%s> <%s> .\n" iri property-uri (get m #'rdf.core/uri))
    (if (not (nil? (get m #'rdf.core/value)))
      ; The value is some kind of literal
      (let [value (get m #'rdf.core/value)
            lang (if (get m #'rdf.core/lang) (str "@" (get m #'rdf.core/lang)) "")
            datatype (if (get m #'rdf.core/datatype) (str "^^<" (get (deref (get m #'rdf.core/datatype)) #'rdf.core/uri) ">") "")]
        (format "<%s> <%s> \"\"\"%s\"\"\"%s%s .\n" iri property-uri value lang datatype))
      (if (string? m)
        ; The value of the vector is some kind of literal
        (format "<%s> <%s> \"\"\"%s\"\"\" .\n" iri property-uri m)))))

(defn- serialize-ntriples-string-value
  [s iri property-uri property]
  ; The value of the vector is a string
  (if (true? (is-datatype-property? property))
    ; The property referring to this value is a owl:DatatypeProperty
    (format "<%s> <%s> \"\"\"%s\"\"\" .\n" iri property-uri s)
    ; The property referring to this value is a owl:ObjectProperty
    (format "<%s> <%s> <%s> .\n" iri property-uri s)))

(defn is-datatype-property?
  [property]
  ; True when the URI of the property's rdf:type is the URI of owl:DatatypeProperty
  (= (-> property
         meta
         (get #'ontologies.core/rdf:type)
         deref
         (get #'rdf.core/uri))
     (-> #'ontologies.core/owl:+datatype-property
         deref
         (get #'rdf.core/uri))))
[/raw]
[/cc]
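Note that serialize-ntriples reads the URI of a property from the metadata of the property's Var, using #'rdf.core/uri as the metadata key. Here is a minimal sketch of that lookup. It assumes a local uri Var instead of rdf.core/uri, a hypothetical object-triple helper, and alter-meta! as one possible way to attach the URI, which may differ from how the ontologies will actually be defined.

```clojure
;; Local placeholder for rdf.core/uri.
(declare uri)

;; Hypothetical property definition; the function body is irrelevant here.
(defn foaf:knows [])

;; Attach the property's URI to the Var's metadata, keyed by the uri Var.
(alter-meta! #'foaf:knows assoc #'uri "http://xmlns.com/foaf/0.1/knows")

(defn object-triple
  "Builds one N-Triples line that links subject-iri to object-iri, reading
  the predicate URI from the metadata of the property Var."
  [subject-iri property object-iri]
  (format "<%s> <%s> <%s> .\n"
          subject-iri
          (get (meta property) #'uri)
          object-iri))

(object-triple "http://foo.com/datasets/people/mike"
               #'foaf:knows
               "http://foo.com/datasets/people/fred")
```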

Serializing a RDF Resource

Now let’s serialize a new RDF resource using the new set of rules:

[cc lang=’lisp’ line_numbers=’false’]
[raw](def fred {#'uri "http://foo.com/datasets/people/fred"
           #'rdf:type [#'foaf:+person #'owl:+thing]
           #'iron:pref-label "Fred"
           #'iron:alt-label {#'value "Frederick"
                             #'lang "en"}
           #'foaf:skypeID {#'value "frederick.giasson"
                           #'datatype #'xsd:*string}
           #'foaf:knows [{#'uri "http://foo.com/datasets/people/bob"}
                         mike
                         "http://foo.com/datasets/people/teo"]})[/raw]
[/cc]

One drawback of these new rules (even if they are essential) is that they make the RDF resources more complex to write because of the (heavy) usage of the #' reader macro.

On the other hand, they may look more familiar to people used to RDF serializations because of the usage of the colon instead of the slash to separate the ontology prefix from the ending of the URI.

What we have above is how the RDF data is represented in Clojure. However, it is possible to make this serialization less verbose by creating a macro that changes the input map and automatically injects the #' reader macro into the map structures that define the RDF resources.

Here is the r macro (“r” stands for Resource) that does exactly this:

[cc lang=’lisp’ line_numbers=’false’]
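[raw](require '[clojure.walk :as walk])

(defmacro r
  [form]
  ; Wrap every symbol in (var ...), except the symbols that evaluate to a string
  (walk/postwalk
    (fn [x]
      (if (and (symbol? x) (-> x
                               eval
                               string?
                               not))
        `(var ~x)
        x))
    form))[/raw]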
[/cc]

Then you can use it to define all the RDF resources you want to create:

[cc lang=’lisp’ line_numbers=’false’]
[raw](def fred (r {uri "http://foo.com/datasets/people/fred"
              rdf:type [foaf:+person owl:+thing]
              iron:pref-label "Fred"
              iron:alt-label {value "Frederick"
                              lang "en"}
              foaf:skypeID {value "frederick.giasson"
                            datatype xsd:*string}
              foaf:knows [{uri "http://foo.com/datasets/people/bob"}
                          mike
                          "http://foo.com/datasets/people/teo"]}))[/raw]
[/cc]

That structure is equivalent to the other one because the r macro will add the #' reader macro calls to change the input map before creating the resource’s Var.
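The equivalence can be checked directly. The sketch below repeats the r macro so that it is self-contained, and uses two placeholder Vars (uri and label) that are assumptions made for this example only.

```clojure
(require '[clojure.walk :as walk])

(defmacro r
  [form]
  ;; Wrap every symbol in (var ...), except the symbols that evaluate to a string.
  (walk/postwalk
    (fn [x]
      (if (and (symbol? x) (not (string? (eval x))))
        `(var ~x)
        x))
    form))

;; Placeholder property Vars (hypothetical; the real ones point to functions).
(def uri nil)
(def label nil)

;; The same resource, written with the macro and written by hand.
(def with-macro (r {uri "http://foo.com/datasets/people/fred"
                    label "Fred"}))

(def by-hand {#'uri "http://foo.com/datasets/people/fred"
              #'label "Fred"})

;; Both maps have the same Vars as keys and the same strings as values.
(= with-macro by-hand)
```

Keep in mind that the macro calls eval on every symbol while it is being expanded, so each symbol used inside r has to be defined before the resource itself is defined.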

By using the r macro, we can see that the serialization is made much simpler, and that at the end, it is more natural to people used to other RDF serializations.

Conclusion

I used the initial specification in the context of creating a new series of web services for the UMBEL project. This heavy usage of this kind of RDF data led me to discover the issues I covered in this blog post. Now that these issues are resolved, I am confident that we can move forward in the series of blog posts that cover how (and why!) to use Clojure code to serialize RDF data.

The next blog post will cover how to manage the ontologies used to instantiate these RDF resources.

5 thoughts on “Revision of Serializing RDF Data as Clojure Code Specification”

Someone

June 18, 2014 — 2:37 pm

Vars => no ClojureScript.
What about defining a macro instead of a function (that calls eval) to add the #’ to RDF resource definitions?

1. Frederick Giasson
  
  June 18, 2014 — 3:13 pm
  
  Hi!
  
  About the macro: no specific reason. In fact, it should have been a macro right from the beginning, since it simplifies the syntax even more (no need to quote/eval), as you mentioned. So I updated the post accordingly. Thanks.
  
  Interesting comment about ClojureScript. It is right, and I didn’t think about it, probably because I never coded in ClojureScript before.
  
  Could you think of a mechanism or a change in this proposal that would make it compatible with CS?
  
Paul Gearon

June 19, 2014 — 11:26 am

I’ve been representing RDF in Clojure as well, so here’s my approach to compare against.

While I agree with much of your proposed structure, my own preference is to use keywords instead of vars. I directly convert between QNames and keywords, so a term like rdf:type becomes :rdf/type.

This has a few advantages:

Keywords evaluate to themselves, so the equality issues are handled.
They are idiomatic Clojure, and efficient as keys in maps.
They do not need to be declared before use, which simplifies the use of large vocabularies and allows ad-hoc minting of IRIs. It also avoids the circular dependency problem of vars.
The #' reader macro is a workaround that appears to take you further from what you were trying to achieve while also making it impossible to express RDF in EDN. Keywords are direct, and can be used unmodified in EDN (and ClojureScript, as is pointed out above).

There are tradeoffs:

Keywords cannot have metadata.
Keywords do not provide direct access to conversion functions or auxiliary data (such as ObjectProperty or DatatypeProperty types).

These issues can both be mitigated by using private structures in the Clojure namespace that you use for RDF with functions for accessing them. This is the approach taken by functions like clojure.core/derive and clojure.core/isa?. This would allow you to set up prefix mappings, so keywords like :rdf/type can be fully resolved to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.

Also, I’m inclined to allow either IRIs (as strings) or keywords wherever you are expecting an IRI (should your “uri” property be changed to “iri”?). This mimics N-triples or Turtle which allows either for all three positions.

I noted your use of + and * to annotate QName-like values that refer to classes and datatypes. My thought is that they are an unnecessary inconsistency. One potential problem could be if any of these terms are used in punning. If you have this information in a schema or ontology, then this is easy to store and index on loading, as opposed to parsing the name.

My own approach also makes heavier use of literals than yours. You are already using string literals for DatatypeProperties, but I like to push that further.

Not all predicates have sufficient schema information to describe their range. However, the prevalence of datatype properties suggests that it would be useful to presume this type. This means inferring basic data types from literal forms, again like Turtle. So strings are interpreted as xsd:string (since SimpleLiterals now use that datatype), longs are xsd:integer, doubles are xsd:decimal (could choose xsd:double), and java.util.Date (reader tag #inst) is xsd:dateTime. The main issue is that you can’t represent IRIs with strings any more (again, presuming that ObjectProperty or DatatypeProperty types are not necessarily available), but you can still represent them with {:uri "http://foo.com/bar"}.

(An alternative I’ve considered is a #uri reader macro, which would create a java.net.URI instead of a record. However URIs take more space than a string, which is why I’m not so convinced about using them)

Summarizing, my suggestions would change your “fred” entity to look like this:

(prefix :rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#")
(prefix :owl "http://www.w3.org/2002/07/owl#")
(prefix :foaf "http://xmlns.com/foaf/0.1/")
(prefix :iron "http://wiki.opensemanticframework.org/index.php/Instance_Record_and_Object_Notation_(irON)_Specification#")

(def mike …)

(def fred (r {:uri "http://foo.com/datasets/people/fred"
              :rdf/type [:foaf/Person :owl/Thing]
              :iron/pref-label "Fred"
              :iron/alt-label {:value "Frederick"
                               :lang "en"}
              :foaf/skypeID {:value "frederick.giasson"
                             :datatype :xsd/string}
              :foaf/knows [{:uri "http://foo.com/datasets/people/bob"}
                           mike
                           {:uri "http://foo.com/datasets/people/teo"}]})

Thoughts?

Paul Gearon

June 19, 2014 — 2:22 pm

oops, drop the “(r ” in the definition of “fred”. Copy/paste error 🙂

Frederick Giasson

June 20, 2014 — 8:15 am

Hi Paul!

First, for the benefit of the readers, I exchanged private emails with Paul since he had issues posting his comment on my blog, so I will summarize a few things that were said during these exchanges.

I totally agree about the keywords. The absence of dependency cycles and the EDN compatibility are real benefits. As I said in earlier blog posts, I did try that, but I did not feel able to do what I wanted to do, considering that keywords cannot have meta-data attached to them, and because they couldn’t be evaluated “in-place” as a function.

However, Paul suggested that the function that starts the evaluation of the structure does a few more things, like using the keys as a lookup to get a reference to a function that would then be evaluated. That is certainly a workable solution. The only drawback I would have with this is that every time we need to have information about the function (so to get its meta-data) we would need to do this lookup between the keyword and the function. However, this is probably worth it because of the additional advantages we would gain.

But there is one missing piece with my stuff, which is the blog post that explains what I mean by a “self-evaluating” data structure, or a data structure that can validate itself just by being evaluated. So my next step is to write it with the proper examples such that people can understand where I am headed. Then from there, I think I will re-do the same but with Paul’s suggestion.

Otherwise, about “+” and “*”, the goal was really to have a hint about whether it was a datatype, a class or a property. In my current model, it was not an issue since the URIs of these Vars are defined in the meta-data of the properties or the datatypes and in the values of the classes. However, if we use your design, then yes, this becomes an issue and we would have to get rid of this convention.

About the usage of strings, I think we both use strings equally. The only problem is that you don’t see where they are defined in the examples I put. All the descriptions of the properties and datatypes are in their meta-data (using the same conventions) and in the values of the classes. In my current design, URIs are really strings, like yours.

I won’t comment much on your observations about the predicates since this will be at the heart of my next blog post. What I will say for now is that datatypes are defined like new datatypes in ontologies. These are basically rules that you define and embed as a datatype. Then, when the properties validate the value they are assigned, if a datatype is specified for a datatype property, then the property will read the definition of the datatype and validate the value according to this description. If a datatype validation error occurs, then an exception will be raised. That is the same mechanism for the domain and range of the properties.

So, what I will do is write this blog post using the current Vars method. Then I will change it to check what it would look like using keywords and this lookup mechanism, and then we will be in a position to figure out the best way to move forward with that.

Thanks for this thoughtful comment!


Frederick Giasson

Machine Learning, Engineering & Data
