New UMBEL 1.50 Ships With 20 Linked Ontologies

I am proud to announce the immediate release of UMBEL version 1.50. This release is a major effort that took a year to complete.

What is UMBEL?

Let’s start by explaining what UMBEL is for those who have never encountered this project before. UMBEL stands for “Upper Mapping and Binding Exchange Layer”. It is a conceptual structure designed to help content interoperate between systems.

UMBEL is a coherent general structure of 34 000 reference concepts which provides a scaffolding to link and interoperate other datasets and domain vocabularies. The conceptual structure is organized into 31 mostly disjoint SuperTypes.

UMBEL is written in OWL 2 and SKOS.

What are UMBEL’s Objectives?

UMBEL’s main goals are:

  • To create a scaffolding for defining knowledge graphs
  • To create rich semantics to identify and help disambiguate entities
  • To help expand queries to semantic search engines
  • To help interlink ontologies to create a coherent ontological environment
  • To help structure and federate information silos

What is new in UMBEL version 1.50?

Many things changed in UMBEL 1.50: the addition of new concepts, multiple structural fixes and improvements, etc. However, there are 3 major changes in this release:

  1. Complete update and addition of linkages between UMBEL reference concepts and related classes existing in external ontologies
  2. Removal of all the named individuals from UMBEL. UMBEL is now composed only of class reference concepts
  3. Reshaping of the SuperType upper structure by adding new SuperTypes and removing some existing ones

For the complete list of UMBEL changes, I strongly suggest you read Mike’s blog post about this UMBEL release.

UMBEL Mapping to External Ontologies

One interesting use of UMBEL is to rely on its coherent structure to federate information silos. We can do that by linking the ontologies and vocabularies used to describe the entities indexed in these silos directly into UMBEL.

But what does that mean? Let’s take a look at a portion of the UMBEL structure related to actors, authors and their relations to humans:

[Figure: actors-authors-humans]

Now let’s assume that we have two data sources:

  1. DBpedia from which we want to use its Journalist entities, and
  2. Musicbrainz from which we want to use its solo musical artist entities

The journalist entities of the DBpedia data source belong to the dbpedia:Journalist class of the DBpedia ontology. The Musicbrainz solo musical artists belong to the mo:SoloMusicArtist class of the Music Ontology. If you check each of these ontologies, you won’t find any connections between these two classes. They appear to be living in two different [conceptual] worlds.

However, what happens if these two classes get connected to some UMBEL reference concepts? Let’s take a look:

[Figure: dbpedia-mo-connections]

What we did here is connect the two classes to the UMBEL reference structure using the “equivalent to” property. What we are stating with these assertions is that these two classes are equivalent to these other classes in UMBEL. This seems harmless, but when we start thinking about it, something special is happening.

The special thing is that we can now query the different datasets (Musicbrainz and DBpedia) on new ground. If I ask for the list of all humans, then I will get all soloists and all journalists. If I ask the data store for all authors, then I will get all DBpedia journalists, and maybe authors from other datasets that may be linked to the UMBEL reference structure.

This is an illustration of how UMBEL can be used to federate information silos.
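The querying idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the umbel:* concept names, the hierarchy between them and the two entities are made up for illustration (the real reference concepts are the ones shown in the figures), and a real deployment would rely on an OWL reasoner or a triple store rather than on dictionaries.

```python
# Hypothetical slice of the UMBEL hierarchy: child concept -> parent concept.
SUBCLASS_OF = {
    "umbel:Journalist": "umbel:Author",
    "umbel:Author": "umbel:Human",
    "umbel:SoloMusician": "umbel:Human",
}

# "Equivalent to" links from external ontology classes to (assumed) UMBEL concepts.
EQUIVALENT_TO = {
    "dbpedia:Journalist": "umbel:Journalist",
    "mo:SoloMusicArtist": "umbel:SoloMusician",
}

# Made-up entities, each typed only with the class of its own source ontology.
ENTITIES = {
    "dbpedia:Some_Journalist": "dbpedia:Journalist",
    "musicbrainz:some-soloist": "mo:SoloMusicArtist",
}


def ancestors(concept):
    """Return the concept followed by every concept above it in the hierarchy."""
    chain = [concept]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def entities_of(concept):
    """Return the entities whose mapped UMBEL concept is `concept` or lies below it."""
    found = []
    for entity, external_class in ENTITIES.items():
        mapped = EQUIVALENT_TO.get(external_class)
        if mapped is not None and concept in ancestors(mapped):
            found.append(entity)
    return sorted(found)


print(entities_of("umbel:Human"))   # entities from both silos
print(entities_of("umbel:Author"))  # only the journalist
```

Neither source ontology knows about the other; the two mappings into the shared hierarchy are what make the combined query possible.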

The good news is that the UMBEL reference structure is already linked to 20 ontologies used by different organizations to define their data sources:

  1. DBpedia Ontology – Links between the DBpedia Ontology classes and the UMBEL Reference Concepts. Half of them come from the linkage between Proton and UMBEL, and the other half come from hand mapping
  2. Geonames – Geonames
  3. OpenCyc – OpenCyc Ontology
  4. Schema.org – Schema.org ontology defines entities known by Google and other search engines
  5. Wikipedia – Links between the Wikipedia pages and the UMBEL Reference Concepts
  6. DOAP – DOAP (Description of a Project) is a vocabulary for project description
  7. ORG – The ORG (Core Organization) Ontology is a vocabulary for describing organizational structures for a broad variety of types of organization
  8. OO – OO (Open Organizations) is a vocabulary providing supplementary terms for organizations that wish to publish open data about themselves
  9. TRANSIT – TRANSIT (Transit) is a vocabulary for describing transit systems and routes
  10. TIME – The TIME (Time Ontology) defines temporal entities
  11. BIBO – BIBO (Bibliographic Ontology)
  12. CC – CC (CreativeCommons Ontology)
  13. Event – Event Ontology
  14. FOAF – FOAF (Friend Of A Friend Ontology) used to describe people and organizations
  15. GEO – WGS84 Geographic Ontology
  16. MO – MO (Music Ontology)
  17. PO – PO (Programmes Ontology)
  18. RSS – RSS (Really Simple Syndication Ontology)
  19. SIOC – SIOC (Semantically-Interlinked Online Communities Ontology)
  20. FRBR – FRBR (Functional Requirements for Bibliographic Records)

According to the Linked Open Vocabularies (LOV) service, the UMBEL reference structure, together with its linkages to these 20 ontologies, would enable you to reach 504 datasets tracked by LOV.

My Literate Programming Commitment

From now on, I commit to writing everything new I produce as literate programming code.

The Open Source Revolution

For about a decade, we have been experiencing a kind of Open Source revolution with the rise of Git (and all its free online hosting services such as GitHub, BitBucket and GitLab). At the same time, tech juggernauts like Google, Microsoft, Facebook, Baidu, Twitter and probably all the others are increasing their commitment to releasing several of their internal projects as open source software. There is also a myriad of young and vibrant communities forming around new programming languages such as Clojure, Scala, R and many others. More and more code is available for people to look at and for developers to use.

My company Structured Dynamics and I have been participants in that movement for more than fifteen years, producing open source software, ontologies and datasets, and participating in other open source projects by fixing bugs and adding functionality.

Human Experience of Coding

However, in a [Brave] New World driven by technologies and social networks that encourage their users to write quickly, succinctly and often in a hurry, where the latest is valued more than the best, it is normal to find code that reflects that reality.

I am often confronted with the human experience of digging into a new computer project created by others, or even an older project of my own. The code is hard to read, poorly commented or completely uncommented, the project itself is undocumented, and the names of variables, functions, classes, source files, packages, etc. are dubious. Debugging, improving and upgrading such a project is nearly impossible without dedicating a substantial amount of time just to understanding what the code is supposed to be doing. Everything is created for machines, not humans. The problem is that it is still humans who have to create and maintain these things, so we had better make sure that humans can read and understand such a computer project.

In a World where Open Source computer projects are becoming the norm, we had better make sure that we create projects that are as easy as possible for more than a few people to maintain and extend (note that the same applies to proprietary projects).

My Commitment

I am not a particularly good writer. The last thing I wanted to do at school was certainly writing. The only writing experience I have is writing this blog for the last 12 years in English, which is not my native language. However, as a developer I always thought that it was important to have clean and well-documented code. It was important since I wanted to be able to re-read that code a few months after I wrote it and still know what it was doing, but it was also important since I wanted to make sure that the code I was publishing was as easy as possible for other developers to read and reuse.

It was after I started working with my partner Mike, who is a scientist and a writer but not a software developer, 8 years ago, that I began paying particular attention to the code I was writing, to properly comment it and to document the whole development process. I had to make sure that a non-coder could review the work I was doing from a higher level, to understand the workflows and the processing. I tried my best over the years to commit to that.

I was happy with this commitment until now. I am now starting to feel that it was an undercommitment, and that I could do much better.

Literate Programming

Donald Knuth wrote his Literate Programming paper in 1984, 3 years after I was born, but he started to work on the idea as early as 1978 and first released WEB in 1981. I read about what literate programming was about a decade ago. It had an immediate impact on how I thought about my code, and on how I commented and documented it. However, I never really wrote literate code.

I commented my code, I wrote external documentation on Wikis and other mediums, I wrote API documentation with Doxygen and tried to generate some documentation pages using some of its features, but the process was always siloed.

I started to question my early commitment when I started to write all our applications in Clojure. Clojure led me to read a lot of code in Clojure, Java and Scala. I often had to reuse poorly documented and poorly commented code. I had to find and fix bugs, and I had to spend too much time in a debugger for my taste. The problem I had is that I couldn’t easily read the code, and I had nothing written to help me understand the general data structures, workflows and processing of the applications.

At the same time, Mike is always looking at my code to try to understand the data processing workflows that I write in Clojure to process the data the way he wants. He often tells me “you really write quite beautiful code”, but I am not convinced that he is right. Maybe I write “beautiful code”, but I am not sure that it is “readable” code, or that I always write readable, well commented and documented code.

This is why I am once again committing a substantial amount of time to exploring the process of writing literate code.

Do Everything at the Same Time

The problem with writing readable code which is well commented, well documented and well tested is that ideally we would have to focus on all these aspects at the same time, but given the development environments used by most people, this is not possible. You will plan an aspect of your program and write the code. Then, if you are really lucky, you will find (or take) the time to write some documentation and create some unit tests. The problem is that each of these tasks is siloed: they are performed in isolation with 4 different states of mind, at 4 different times and hopefully within 4 weeks. The worst happens when you start fixing bugs or improving the code: comments, documentation and unit tests will often remain unchanged.

This is what Literate Programming is for me: a way to perform all these tasks at once, with the same state of mind, at the same time. The difficulty is to put in place a process, a new way to work, that enables you to do all this at once.

In the past I could never commit to Literate Programming for that reason: I couldn’t find a way to put such a process in place. However, this recently changed. About a year and a half ago I started to work with Emacs for programming in Clojure. And then I got introduced to Org-mode a few months ago. Since then, I started to create a new development process that would enable me to finally write my software in a literate way.

The learning curve is steep and the time to invest is important, but the reward is big and satisfying. There is nothing free in this World even if many try to convince you otherwise. This is why I still marvel at coding, because there is always a way to learn new things and to improve the quality of your work. I don’t think the process and experience are any different from what professional writers experience.

Conclusion

I have the feeling that it will become more and more important to write readable code. Much of the code we are writing these days is code that manipulates and transforms data, code that implements [machine learning] workflows and such. This is the kind of code that would benefit from being readable by many people other than the ones who wrote it.

This is why I am now making the commitment to develop all my software as literate code. I have yet to find my style, and things will evolve over the next few months and years, but this is the commitment I am making to become a better developer, a better writer, a better communicator and a better contributor [of open source software]. I can’t force people to do what I think is best, but I can force myself, in the hope of influencing others to do it.

In the coming weeks and months I will write a series of blog posts about literate programming and more particularly my process of doing so. I will write about the development environment I am using, the way I am using it and how I customized it to work the way I need.

Using Clojure in Org-mode and Implementing Asynchronous Processing

I recently became interested in Org-mode, which was still unknown to me just a few weeks ago, until I read this great article from Howard Abrams about literate programming using Org-mode. Initially I was wondering what this Emacs package was really about: it does all kinds of things like document outlining (à la Markdown), task management and planning, agenda generation, time clocking… and it has a series of features related to literate programming that let you embed and run code blocks using sub-processes and display the results directly in the Org-mode [text] document.

What I was really interested in are the code block related features of Org-mode. Initially I wanted to test Org-mode as a Notebook application, but I also wanted to start coding in a literate programming format again. I will expand on the latter in my next blog post; for now I will concentrate on why I want to use Org-mode as a notebook style programming user interface. Since everything I code these days is in the Clojure programming language, I wanted to be able to use Org-mode’s code blocks with Clojure.

Finally, I will describe a few issues I experienced in the process and how I updated the Org-babel-clojure package to fix them.

Notebook Creations Using Org-mode

My partner Mike Bergman got me interested in notebook style programming user interfaces maybe a year ago. We wanted to find a way to easily experiment with the different data management structures and frameworks we are developing at Structured Dynamics. The idea behind a Notebook is quite interesting: you run code snippets anywhere in a document, you see the results within that document, and you can document the process. Then, if something changes in the data or in the code, each code snippet within a Notebook can be rerun at any time, and the results updated. This is a great way to do experimentation, to keep track of the tests you are doing and to document the whole process.

The idea is really interesting for the kind of work we are doing. I tested the Gorilla REPL, which is an implementation of this style of user interface in Clojure. Other such interfaces exist for other programming languages, like IPython, Wolfram, etc. However, I always had an issue with what I was using: I had a hard time re-purposing the content I was creating, and I couldn’t easily export this information in different formats (blog posts, papers, etc.). Saving, reloading and re-running in different environments was often too much trouble, until I found Org-mode.

I am not sure why I didn’t come across Org-mode before, maybe because it was not advertised as a “notebook style programming user interface”, but this is really what it is (mostly) all about, at least to me. As far as I know, this is the only such software that lets you work with any kind of programming language in the same notebook. It can also export the notebooks to virtually any format (several formats are supported by Org-mode itself, others can be exported using Pandoc).

This being said, I started experimenting with Org-mode to create different kinds of Notebooks using Clojure. I am using notebooks that show how to use the different APIs we are creating, ones that show how different data processing workflows actually work, or ones that show how some structures (like UMBEL) have been created and how they can be leveraged. I am also creating notebooks to research and experiment with different kinds of algorithms that we are trying to implement in our products, or to do bug investigation reports for our clients, or… the possibilities are probably endless. But the core idea is almost always the same: communication. We write these notebooks to communicate information for other people to consume (or, more importantly, for our future selves).

Given the kind of tasks that I am performing in a notebook, I often have to run procedures that may take minutes or even hours before their processing is finished. However, as you will see below, running procedures that take minutes to finish is a show stopper with the current Org-babel-clojure (ob-clojure.el) package, which lets Org-mode run Clojure code.

Installing & Configuring Org-mode

Before outlining the issues I had with the current implementation of the Org-babel-clojure package, let me explain how I installed and configured Org-mode locally.

First of all I installed Org-mode contribs from ELPA, then I configured it this way in my .emacs file. Note that I made multiple little changes here and there to end up with the kind of editor I am comfortable using. So this is about installing, enabling and tweaking Org-mode in Emacs:

;; Configure Org-mode with Cider

;; Load Org-mode
(add-to-list 'load-path "~/.emacs.d/lib/org-mode/")
(require 'org)

;; Here I specify the languages I want to be able to use with Org-babel.
;; Note: Org-mode 9 and later name the shell back-end `shell' instead of `sh'.
(org-babel-do-load-languages
 'org-babel-load-languages
 '((clojure . t)
   (sh . t)
   (emacs-lisp . t)))

;; Specify the Clojure back-end we want to use in Org-mode.
;; I personally use Cider, but one could specify Slime
(setq org-babel-clojure-backend 'cider)

;; Let's have pretty source code blocks
(setq org-edit-src-content-indentation 0
      org-src-tab-acts-natively t
      org-src-fontify-natively t
      org-confirm-babel-evaluate nil
      org-support-shift-select 'always)

;; Useful keybindings when using Clojure from Org
(org-defkey org-mode-map "\C-x\C-e" 'cider-eval-last-sexp)
(org-defkey org-mode-map "\C-c\C-d" 'cider-doc)      

(require 'cider)

;; Remove the markup characters, i.e., "/text/" becomes (italized) "text"
(setq org-hide-emphasis-markers t)

;; No timeout when executing calls on Cider via nrepl
(setq org-babel-clojure-nrepl-timeout nil)

;; Turn on visual-line-mode for Org-mode only
;; Note: you have to install "adaptive-wrap" from elpa
(add-hook 'org-mode-hook 'turn-on-visual-line-mode)

;; Enable Confluence export (or any other contributed export formats)
(require 'ox-confluence)

Note that most of these configurations come from the Org-babel-clojure webpage.

Timeout issues

The first issue I encountered was when I started to run code that was taking longer than 10 seconds. Every time I was running such code, I ran into the following error:

“nrepl-send-sync-request: Sync nREPL request timed out”

What this error means is that the synchronous request to nREPL (the Clojure back-end that runs the actual code written in Org-mode) timed out. I was really not expecting a query to time out that way. This led me to start reading the Org-babel-clojure code to see where such an error may be coming from. However, I have to make a disclaimer here: I never really looked into Elisp code until now. The only other work I did with Elisp was to configure Emacs, so be indulgent with me and report all awkward code I may be writing here.

My journey started with ob-clojure.el, which is the file that makes the bridge between Org-mode and the Clojure back-end (Cider/nREPL in this case). It is after reading that code that I noticed the following function: org-babel-execute:clojure, which appeared to be the thing that is run when we run a Clojure code block in Org-mode. Then I noticed the call to the function nrepl-sync-request:eval. That had to be the culprit, the thing sending this Sync timeout error. I found this function in the Cider code. But then I found this other function that is called by the latter: nrepl-send-sync-request. It is when I read this function that I noticed the nrepl-sync-request-timeout variable. Looking back at org-babel-execute:clojure, I couldn’t see where I could define this timeout parameter. It looks like it was not possible to define it, which was a big issue for me since I needed to be able to run procedures that take minutes to run.

It is at that time that I chose to hack the ob-clojure.el code to expose that timeout setting such that I could set it properly for my own needs. The code I created for that purpose is:

; Addition of the org-babel-clojure-nrepl-timeout setting
(defvar org-babel-clojure-nrepl-timeout nil
  "Timeout, in seconds, for synchronous nREPL requests; nil means no timeout.")

(defun org-babel-execute:clojure (body params)
  "Execute a block of Clojure code with Babel."
  (let ((expanded (org-babel-expand-body:clojure body params))
        result)
    (case org-babel-clojure-backend
      (cider
       (require 'cider)
       (let ((result-params (cdr (assoc :result-params params))))
         (setq result
               (nrepl-dict-get
                ; Addition of the org-babel-clojure-nrepl-timeout setting
                (let ((nrepl-sync-request-timeout org-babel-clojure-nrepl-timeout))
                  (nrepl-sync-request:eval
                   expanded (cider-current-connection) (cider-current-session)))
                (if (or (member "output" result-params)
                        (member "pp" result-params))
                    "out"
                  "value")))))
      (slime
       (require 'slime)
       (with-temp-buffer
         (insert expanded)
         (setq result
               (slime-eval
                `(swank:eval-and-grab-output
                  ,(buffer-substring-no-properties (point-min) (point-max)))
                (cdr (assoc :package params)))))))
    (org-babel-result-cond (cdr (assoc :result-params params))
      result
      (condition-case nil (org-babel-script-escape result)
        (error result)))))

What I modified in this code is the addition of a new global setting, org-babel-clojure-nrepl-timeout. If this setting is nil then there won’t be any timeout, otherwise the timeout value is in seconds. What I did is simply bind its value to the nREPL setting nrepl-sync-request-timeout around the synchronous call, and be done with it.
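As an aside, the “nil means wait forever, a number means wait at most that many seconds” convention can be modeled outside Emacs. The Python sketch below is only an analogy under that assumption: sync_request and slow_job are hypothetical names, and nothing here is part of the nREPL or Cider API.

```python
import queue
import threading
import time


def sync_request(work, timeout=None):
    """Run work() on a worker thread and block until its result is available.

    timeout=None waits forever; otherwise wait at most `timeout` seconds.
    """
    replies = queue.Queue()
    threading.Thread(target=lambda: replies.put(work()), daemon=True).start()
    try:
        return replies.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError("Sync request timed out") from None


def slow_job():
    """Stand-in for a long-running evaluation."""
    time.sleep(0.2)
    return "done"


print(sync_request(slow_job))  # no timeout: waits for the job to finish
try:
    sync_request(slow_job, timeout=0.05)
except TimeoutError as err:
    print(err)  # the wait is abandoned before the job completes
```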

That solved this issue. After I updated ob-clojure.el accordingly, I could run Clojure code that may take several minutes in Org-mode! That was fantastic, but it was not optimal.

In fact, when I am running workflows that may take 30 minutes to finalize, I normally output processing steps in the REPL such that I know where the process is and what it is currently processing.

The problem with the current implementation of Org-babel-clojure is that it uses the synchronous API of nREPL. What I want is to be able to run Clojure code asynchronously such that I can get some feedback (via the REPL) from the procedure I am running. This opened a kind of Pandora’s box, and something that looked simple turned out to be more complex than anticipated for someone without any knowledge of Elisp or of the internal mechanisms and APIs of Emacs.

Making Org-babel-clojure Asynchronous

The next goal I had was to try to make Org-babel-clojure asynchronous. What I wanted was to be able to get, somehow, the output of a Clojure procedure while that procedure was outputting something to the REPL. My second journey started after reading John Kitchin’s blog post about asynchronously running Python code in Org-mode code blocks. What I found out is that the Python code was run via a sub-process which runs the Python interpreter. John’s solution was to use a local file to write what the interpreter is outputting and then to feed that output to a new window created by John’s function.

I took that example as a given, and then I tried to implement the same solution, but for Clojure (without knowing what I was really doing). It is in this process that I found that the Clojure solution to that problem would be quite different from John’s. There is an asynchronous API in nREPL, it is just that it is not used in Org-babel-clojure. What I ended up using from John’s example is not his code, but his core idea: using a new window to display the output of the asynchronous process, then killing it once the processing is finished and before populating the #+RESULTS section of the Org-mode file.

After much testing and debugging I ended up with the following solution to my problem:

(defun org-babel-execute:clojure (body params)
  "Execute a block of Clojure code with Babel."
  (lexical-let* ((expanded (org-babel-expand-body:clojure body params))
                 ; name of the buffer that will receive the async output
                 (sbuffer "*Clojure Sub Buffer*")
                 ; determine if the :async option is specified for this block
                 (async (if (assoc :async params) t nil))
                 ; generate the full response from the REPL
                 (response (cons 'dict nil))
                 ; keep track of the status of the output in async mode
                 status
                 ; result to return to Babel
                 result)
    (case org-babel-clojure-backend
      (cider
       (require 'cider)
       (let ((result-params (cdr (assoc :result-params params))))
         ; Check if the user wants to run code asynchronously
         (when async
           ; Create a new window with the async output buffer
           (switch-to-buffer-other-window sbuffer)

           ; Run the Clojure code asynchronously in nREPL
           (nrepl-request:eval
            expanded 
            (lambda (resp) 
              (when (member "out" resp)
                ; Print the output of the nREPL in the async output buffer
                (princ (nrepl-dict-get resp "out") (get-buffer sbuffer)))
              (nrepl--merge response resp)
              ; Update the status of the nREPL output session
              (setq status (nrepl-dict-get response "status")))
            (cider-current-connection) 
            (cider-current-session))

           ; Wait until nREPL has finished processing the code
           (while (not (member "done" status))
             (nrepl-dict-put response "status" (remove "need-input" status))
             (accept-process-output nil 0.01)
             (redisplay))

           ; Delete the async buffer & window when the processing is finalized
           (let ((wins (get-buffer-window-list sbuffer nil t)))
             (dolist (win wins)
               (delete-window win))
             (kill-buffer sbuffer))

           ; Put the output or the value in the result section of the code block
           (setq result (nrepl-dict-get response 
                                        (if (or (member "output" result-params)
                                                (member "pp" result-params))
                                            "out"
                                          "value"))))
         ; Check if the user wants to run code synchronously
         (when (not async)
           (setq result
                 (nrepl-dict-get
                  (let ((nrepl-sync-request-timeout 
                         org-babel-clojure-nrepl-timeout))
                    (nrepl-sync-request:eval
                     expanded (cider-current-connection) (cider-current-session)))
                  (if (or (member "output" result-params)
                          (member "pp" result-params))
                      "out"
                    "value"))))))
      (slime
       (require 'slime)
       (with-temp-buffer
         (insert expanded)
         (setq result
               (slime-eval
                `(swank:eval-and-grab-output
                  ,(buffer-substring-no-properties (point-min) (point-max)))
                (cdr (assoc :package params)))))))
    (org-babel-result-cond (cdr (assoc :result-params params))
      result
      (condition-case nil (org-babel-script-escape result)
        (error result)))))

The first thing this code does is expose a new #+BEGIN_SRC option called :async. If the new :async option is specified in a code block for the Clojure language, then that code block will be processed asynchronously. What this means is that a new window will be created in Emacs, it will be populated with anything that is output to the REPL, and then it will be closed once the processing is finished.

Here is an example of a code block that would use that new option:

#+BEGIN_SRC clojure :results output :async

(dotimes [n 10]
  (println n ".")
  (Thread/sleep 500))

#+END_SRC

This code would output “0 .”, “1 .”, and so on up to “9 .”, one per line, into a new window. Once the loop finishes, the window would be closed and the #+RESULTS section populated with the output of the code.

This code works with the :results options output, value and silent. If output is specified, then everything that was outputted into the window will be added into the results section of the code block. If value is specified, then all output will still be displayed into the window, but only the resulting value will be added to the results section of the code block. If silent is specified, then all the output will still be displayed into the window, but nothing will be displayed in the results section of the code block.

If the :async option is omitted, then the normal behavior of Org-babel-clojure will be used, with the new timeout setting org-babel-clojure-nrepl-timeout.
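For readers who do not read Elisp, the callback-and-wait pattern of the :async branch can be modeled in a few lines of Python. This is a simplified sketch, not a port: fake_eval is a hypothetical stand-in for an asynchronous evaluator that sends partial responses, the live list plays the role of the temporary output window, and the shape of the response dictionaries is an assumption made for the example.

```python
import threading
import time


def fake_eval(callback):
    """Hypothetical asynchronous evaluator: sends partial responses to `callback`."""
    def run():
        for n in range(3):
            callback({"out": f"{n} .\n"})   # output produced while running
            time.sleep(0.01)
        callback({"value": "nil"})          # final value of the evaluation
        callback({"status": ["done"]})      # signals the end of the evaluation
    threading.Thread(target=run, daemon=True).start()


def run_async(evaluate, results="output"):
    """Collect responses as they arrive, then return the output or the value."""
    response = {"out": "", "value": None, "status": []}
    live = []  # stands in for the buffer shown while the code is running

    def on_response(resp):
        if "out" in resp:
            live.append(resp["out"])        # display output as soon as it arrives
            response["out"] += resp["out"]  # and keep it for the results section
        if "value" in resp:
            response["value"] = resp["value"]
        if "status" in resp:
            response["status"] = resp["status"]

    evaluate(on_response)
    # Wait until the evaluator reports that it is done.
    while "done" not in response["status"]:
        time.sleep(0.005)
    return response["out"] if results in ("output", "pp") else response["value"]


print(run_async(fake_eval, "output"))
print(run_async(fake_eval, "value"))
```

As in the Elisp version, the output arrives incrementally, but the caller still sits in a wait loop until the evaluation reports that it is done.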

Call for help!

As I mentioned above, this is my first attempt at coding something for Emacs using Elisp. There are certainly things that should be done differently. So if you have any Elisp and/or Cider/nREPL knowledge, and if you have some time to review this code, I am sure we could improve this function. The only thing I know is that such asynchronous capabilities for Clojure code blocks are essential.

There is one major area of improvement that I noted. Right now, the results come in asynchronously, but we still can’t use the Emacs instance to do other things (like writing in the Org-mode file) while the process is running in the background and the results are reported in this other buffer. Until this other issue is resolved, I don’t think we can say that this makes Org-babel-clojure 100% asynchronous. If this can be done (I did not have time to look into this yet), then I think the :async feature would be fully and properly integrated, but I am not yet sure if this is possible.

Sources

For those interested in this update of Org-babel-clojure, here are:

  • The Org-mode file of this blogpost which you can run to test the updated org-babel-execute:clojure function
  • The diff file if you want to update your local ob-clojure.el file


