Tag Archive for 'clojure-2'

My Literate Programming Commitment

From now on, I am committing to writing everything new I produce as literate programming code.

The Open Source Revolution

For about a decade, we have been experiencing a kind of Open Source revolution with the rise of Git (and all its free online hosting services such as GitHub, BitBucket and GitLab). At the same time, tech juggernauts such as Google, Microsoft, Facebook, Baidu, Twitter and probably all the others are increasing their commitment to releasing several of their internal projects as open source software. There is also a myriad of young and vibrant communities forming around new programming languages such as Clojure, Scala, R and many others. More and more code is available for people to look at and for developers to use.

My company Structured Dynamics and I have been participants in that movement for more than fifteen years, producing open source software, ontologies and datasets, and participating in other open source projects by fixing bugs and adding functionality.

Human Experience of Coding

However, in a [Brave] New World driven by technologies and social networks that encourage their users to write quickly, succinctly and often in a hurry, where the latest is valued more than the best, it is normal to find code that reflects that reality.

I am often confronted with the human experience of digging into a new computer project created by others, or even an older project of my own. The code is hard to read, poorly commented or completely uncommented, the project itself is undocumented, and variables, functions, classes, source files, packages, etc. are dubiously named. Debugging, improving and upgrading such a project is nearly impossible without dedicating a substantial amount of time just to understanding what the code is supposed to be doing. Everything is created for machines, not humans. The problem is that it is still humans who have to create and maintain these things, so we had better make sure that humans can read and understand such a computer project.

In a World where Open Source computer projects are becoming the norm, we had better make sure that we create projects that are as easy as possible for more than a few people to maintain and extend (note that the same applies to proprietary projects).

My Commitment

I am not a particularly good writer. The last thing I wanted to do at school was certainly writing. The only writing experience I have is writing this blog for the last 12 years in English, which is not my native language. However, as a developer I always thought that it was important to have clean and well-documented code. It was important because I wanted to be able to re-read that code a few months after I wrote it and still know what it was doing, but also because I wanted to make sure that the code I was publishing would be as easy as possible for other developers to read and reuse.

It was after I started working with my partner Mike, who is a scientist and a writer but not a software developer, 8 years ago that I began paying particular attention to the code I was writing, to comment it properly and to document the whole development process. I had to make sure that a non-coder could review the work I was doing from a higher level, to understand the workflows and the processing. I tried my best over the years to commit to that.

I was happy with this commitment until now. I am now starting to feel that it was an undercommitment, that I could do much better.

Literate Programming

Donald Knuth wrote his Literate Programming paper in 1984, 3 years after I was born, but he started to work on the idea as early as 1978 and first released WEB in 1981. I read about what literate programming was about a decade ago. It had an immediate impact on how I thought about my code, and on how I commented and documented it. However, I never really wrote literate code.

I commented my code, I wrote external documentation on Wikis and other mediums, I wrote API documentation with Doxygen and tried to generate some documentation pages using some of its features, but the process was always siloed.

I started to question my early commitment when I started to write all our applications in Clojure. Clojure led me to read a lot of code in Clojure, Java and Scala. I often had to re-use ill-documented and poorly commented code. I had to find and fix bugs, and I had to spend too much time in a debugger for my taste. The problem was that I couldn’t easily read the code, and I had nothing written to help me understand the general data structures, workflows and processing of the applications.

At the same time, Mike always looks at my code to try to understand the data processing workflows that I write in Clojure to process the data the way he wants. He often tells me “you really write quite beautiful code”, but I am not convinced that he is right. Maybe I write “beautiful code”, but I am not sure that I write “readable” code, or that I always write well commented and documented code.

This is why I am once again committing a substantial amount of time to exploring the process of writing literate code.

Do Everything at the Same Time

The problem with writing readable code that is well commented, well documented and well tested is that ideally we would focus on all these aspects at the same time, but given the development environments used by most people, that is not possible. You plan an aspect of your program and write the code. Then, if you are really lucky, you will find (or take) the time to write some documentation and create some unit tests. The problem is that each of these tasks is siloed: they are performed in isolation with 4 different states of mind, at 4 different times and hopefully within 4 weeks. The worst happens when you start fixing bugs or improving the code: comments, documentation and unit tests will often remain unchanged.

This is what Literate Programming is for me: a way to perform all these tasks at once, with the same state of mind, at the same time. The challenge is to put in place a process, a new way to work, that enables you to do all this at once.

In the past I could never commit to Literate Programming for that reason: I couldn’t find a way to put such a process in place. However, this recently changed. About a year and a half ago I started to work with Emacs for programming in Clojure. Then I was introduced to Org-mode a few months ago. Since then, I have started to create a new development process that will enable me to finally write my software in a literate way.

The learning curve is steep and the time to invest is significant, but the reward is big and satisfying. There is nothing free in this World even if many try to convince you otherwise. This is why I still marvel at coding, because there is always a way to learn new things and to improve the quality of your work. I don’t think the process and experience are any different from what professional writers experience.

Conclusion

I have the feeling that it will become more and more important to write readable code. Much of the code we are writing these days is code that manipulates and transforms data, code that implements [machine learning] workflows and such. This is the kind of code that would benefit from being readable by many people other than the ones who write it.

This is why I am now making the commitment to develop all my software as literate code. I have yet to find my style, and things will evolve over the next few months and years, but this is the commitment I am making to become a better developer, a better writer, a better communicator and a better contributor [of open source software]. I can’t force people to do what I think is best, but I can force myself, in the hope of influencing others to do the same.

In the coming weeks and months I will write a series of blog posts about literate programming and more particularly my process of doing so. I will write about the development environment I am using, the way I am using it and how I customized it to work the way I need.

clj-fst: Finite State Transducers (FST) for Clojure

clj-fst is a Clojure wrapper around the Lucene FST API. Finite state transducers are finite state machines with two tapes: an input and an output tape. The automaton maps an input string to an output. The output can be a vector of strings or a vector of integers. There are more profound mathematical implications to FSTs, but those are the basics for now.

Why Use FSTs?

Considering that basic definition of an FST, one could legitimately wonder why they should care about FSTs. FSTs could be seen as simple Clojure maps, so why bother with FSTs?

Everything is a matter of scale. Using a map, or a similar generic structure, to handle millions or billions of values is far from efficient, if it is possible at all.

That is why we need some specialized structures like FSTs: to be able to create such huge associative structures that are lightning fast to query and that use a minimum of memory.

There are two general use cases for using FSTs:

  1. When you want to know if an instance A exists in a really huge set X (where the set X is the FST)
  2. When you want to get a list of outputs from a given input from a really huge set.
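To make the two use cases concrete, here is a minimal sketch of the questions being asked. A plain `sorted-map` stands in for the FST, and the names `index`, `member?` and `outputs` are hypothetical and are not part of clj-fst; the point is only the shape of the queries, not the scale at which an FST answers them.

```clojure
;; Illustration only: a small sorted-map standing in for a huge FST.
(def index (sorted-map "cat" [1]
                       "dog" [2 7]
                       "mice" [3]))

;; Use case 1: does the input exist in the set?
(defn member? [idx input]
  (contains? idx input))

;; Use case 2: which outputs are attached to the input?
;; Returns an empty vector when the input is unknown.
(defn outputs [idx input]
  (get idx input []))
```

With millions of entries the questions stay the same, only the structure that answers them changes.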

Lucene FSTs

There are multiple FST implementations out there, but I chose to go with Lucene’s implementation, developed by Michael McCandless. The main reason for using the Lucene FST API is the way it implements the FST. It implements the work of Stoyan Mihov and Denis Maurel to create a minimal unweighted FST from pre-sorted inputs. The implementation results in lightning fast querying of the structure with a really efficient use of memory. Considering the size of the structures we manipulate at Structured Dynamics, these were the two main characteristics to look for and the reason why we chose that implementation.

Limitations

However, there are two things to keep in mind when working with FSTs:

  1. The FSTs are static. This means that you cannot add to them once they are created. You have to re-create them from the beginning if you want to change their content.
  2. The entries have to be pre-sorted. If your entries are not sorted when you create the FST, then unexpected results will happen.
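Since unsorted entries fail silently, it can help to sort and check them before handing them to the builder. The following is a minimal sketch in core Clojure only; `prepare-entries` and `sorted-inputs?` are hypothetical helper names, and the sketch assumes that Clojure's default string ordering (the one `sorted-map` relies on) is the order the builder expects.

```clojure
(defn prepare-entries
  "Returns the [input output] pairs ordered by input."
  [entries]
  (sort-by first entries))

(defn sorted-inputs?
  "True when no input string comes before its predecessor.
  Repeated inputs are accepted."
  [inputs]
  (every? (fn [[a b]] (not (pos? (compare a b))))
          (partition 2 1 inputs)))
```

For example, `(prepare-entries [["mice" 3] ["cat" 1] ["dog" 2]])` yields the pairs in the order cat, dog, mice, which can then be added to the builder one by one.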

clj-fst

The clj-fst project is nothing more than a wrapper around the Lucene FST API. However, one of the goals of this project is to make this specific Lucene functionality stand out and to make it easier to use from Clojure.

If you take the time to analyze the clj-fst wrapper and the Lucene API code, you will notice that not all of the functionality of the API is wrapped. The thing is that the API is somewhat complex and doesn’t have much documentation. What clj-fst tries to do is to simplify the usage of the API and to create more documentation and code usage examples around it. Finally, it tries to create an abstraction layer over the API to manipulate the FSTs in the Clojure way…

Basic Usage

Creating an FST is really simple. It takes three basic steps and one optional step:

  1. Create the FST builder
  2. Populate the FST using the builder
  3. Create the actual FST from the builder
  4. Optionally, save the FST on the file system to reload it later in memory.

Note that the complete clj-fst documentation is available here.

The simplest code looks like:

;; The first thing to do is to create the Builder
(def builder (create-builder! :type :int))

;; This small sorted-map defines the things
;; to add to the FST
(def values (into (sorted-map) {"cat" 1
                                "dog" 2
                                "mice" 3}))

;; Populate the FST using that sorted-map
(doseq [[input output] values]
  (add! builder {input output}))

;; Creating a new FST
(def fst (create-fst! builder))

;; Save a FST on the file system
(save! "resources/fst.srz" fst)

Once the FST is saved on the file system, you can easily reload it later:

;; Load a FST from the file system
(load! "resources/fst.srz")

You can easily get the output related to an input:

;; Query the FST
(get-output "cat" fst)

You can iterate over the content of the FST:

;; Create the FST enumeration
(def enum (create-enum! fst))

;; Get the first item in the FST
(next! enum)

;; Get the current FST item pointed by the enumerator
(current! enum)

Finally you have other ways to query the FST using the enumerator:

;; Search for different input terms
(get-ceil-term! "cat" enum)

(get-floor-term! "cat" enum)

(get-exact-term! "cat" enum)
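The post does not show what these three calls return, so the following is only an analogy in core Clojure, assuming the usual ceiling/floor semantics: the ceiling is the smallest term greater than or equal to the input, the floor is the largest term less than or equal to it, and the exact match is the term itself or nothing. The names `terms`, `ceil-term`, `floor-term` and `exact-term` are hypothetical and operate on a `sorted-map`, not on an FST enumerator.

```clojure
(def terms (sorted-map "cat" 1 "dog" 2 "mice" 3))

;; Smallest entry whose key is >= input, or nil.
(defn ceil-term [m input]
  (first (subseq m >= input)))

;; Largest entry whose key is <= input, or nil.
(defn floor-term [m input]
  (first (rsubseq m <= input)))

;; The entry for exactly this key, or nil.
(defn exact-term [m input]
  (find m input))
```

With these definitions, an input such as "cow" has "dog" as its ceiling, "cat" as its floor, and no exact match.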

More Complex Example

Let’s take a look at a more complex example. Here we will create an FST that will be used as a high performance inference index for UMBEL reference concepts (classes). To populate the super-types index, we query the UMBEL super classes web service endpoint.

The process is:

  1. Get the number of concepts in the UMBEL structure
  2. Get the list of all the UMBEL concepts using the UMBEL search endpoint
  3. Sort the list of UMBEL concepts URIs
  4. Get the super-classes, by inference, for each of the concepts
  5. Populate the FST with each concept as input and its super-classes as output
  6. Save the FST on the file system.

To simplify the example, I simply list all of the UMBEL reference concepts in a CSV file. However, you could have created that list using the UMBEL search web service endpoint.

The function that creates the UMBEL reference concepts super-classes index is:

(ns foo.core
  (:require [clojure.string :as string]
            [clj-http.client :as http]
            [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            [clj-fst.core :as fst]))

(defn get-umbel-reference-concepts []
  (->> (with-open [in-file (io/reader "http://fgiasson.com/blog/wp-content/uploads/2015/04/umbel-reference-concepts.csv")]
         (doall
          (csv/read-csv in-file)))
       flatten
       (into [])))

(defn create-umbel-super-classes-fst []
  (let [ref-concepts (->> (get-umbel-reference-concepts)
                          (map (fn [ref-concept]
                                 [(string/replace ref-concept "http://umbel.org/umbel/rc/" "")]))
                          (apply concat)
                          (into [])
                          distinct
                          sort)
        builder (fst/create-builder! :type :char :pack true)]
    (doseq [ref-concept ref-concepts]
      (println ref-concept)
      (let [resultset (http/get (str "http://umbel.org/ws/super-classes/" ref-concept)
                                {:accept "application/clojure"
                                 :throw-exceptions false})]
        (when (= (get resultset :status) 200)
          (doseq [super-class (->> resultset
                                   :body
                                   read-string
                                   (into []))]
            (fst/add! builder {(str "http://umbel.org/umbel/rc/" ref-concept) super-class})))))
    (let [fst (fst/create-fst! builder)]
      (fst/save! "resources/umbel-super-classes.fst" fst))))

After running the (create-umbel-super-classes-fst) function, an umbel-super-classes.fst file will be created in the resources/ folder of your project. This process should take about 5 to 10 minutes to complete. All the latency comes from the fact that you have to issue a web service query for every concept. From the standpoint of the FST, you could populate one with millions of inputs within a few seconds.

Eventually you will be able to reload that index in any context:

(def umbel-super-classes (fst/load! "resources/umbel-super-classes.fst"))

 

Conclusion

As you can see, an FST is a really interesting structure that lets you query really huge sets in an efficient way. The goal of this new Clojure library is to make its usage as simple as possible. It is intended to be used by any developer who has to query very large sets of data in a computationally efficient and memory-efficient way.

Open Semantic Framework 3.1 Released

Structured Dynamics is happy to announce the immediate availability of the Open Semantic Framework version 3.1. This new version includes a set of fixes made to different components of the framework over the last few months. The biggest change is the deployment of OSF using Virtuoso Open Source version 7.1.0.

We also created a new API for Clojure developers called: clj-osf. Finally we created a new Open Semantic Framework web portal that better describes the project and is hopefully easier to use and more modern.

Quick Introduction to the Open Semantic Framework

What is the Open Semantic Framework?

The Open Semantic Framework (OSF) is an integrated software stack using semantic technologies for knowledge management. It has a layered architecture that combines existing open source software with additional open source components. OSF is designed as an integrated content platform accessible via the Web, which provides needed knowledge management capabilities to enterprises. OSF is made available under the Apache 2 license.

OSF can integrate and manage all types of content – unstructured documents, semi-structured files, spreadsheets, and structured databases – using a variety of best-of-breed data indexing and management engines. All external content is converted to the canonical RDF data model, enabling common tools and methods for tagging and managing all content. Ontologies provide the schema and common vocabularies for integrating across diverse datasets. These capabilities can be layered over existing information assets for unprecedented levels of integration and connectivity. All information within OSF may be powerfully searched and faceted, with results datasets available for export in a variety of formats and as linked data.

A new Open Semantic Framework website

The OSF 3.1 release also triggered the creation of a new website for the project. We wanted something leaner and more modern, and that is what I think we delivered. We also reworked the content, wrote up a series of six use cases, and better aggregated and presented the information for each web service endpoint.

A new OSF sandbox

We also created an OSF sandbox where people can test each web service endpoint and see how each piece of functionality works. All of the web services are open to users. The sandbox is not meant to be stable, considering that everybody has access to all endpoints. However, the sandbox server will be recreated on a periodic basis. If the sandbox is totally broken and users experience issues, they can always request a re-creation of the server directly on the OSF mailing list.

Each of the web service pages on the new OSF portal has a Sandbox section where you see some code examples of how to use the endpoint and how to send requests to the sandbox. Here are the instructions to use the sandbox server.

A new OSF API for Clojure: clj-osf

The OSF release 3.1 also includes a new API for Clojure developers: clj-osf.

clj-osf is a Domain Specific Language (DSL) that should lower the threshold to use the Open Semantic Framework.

To use the DSL, you only have to configure your application to use a specific OSF endpoint. Here is an example of how to do this for the Sandbox server:

;; Define the OSF Sandbox credentials (or your own):
(require '[clj-osf.core :as osf])

(osf/defosf osf-test-endpoint {:protocol :http
                               :domain "sandbox.opensemanticframework.org"
                               :api-key "EDC33DA4D977CFDF7B90545565E07324"
                               :app-id "administer"})

(osf/defuser osf-test-user {:uri "http://sandbox.opensemanticframework.org/wsf/users/admin"})

Then you can send simple OSF web service queries. Here is an example that sends a search query to return records of type foaf:Person that also match the keyword “bob”:

(require '[clj-osf.search :as search])

(search/search
 (search/query "bob")
 (search/type-filters ["http://xmlns.com/foaf/0.1/Person"]))

A complete set of clj-osf examples is available on the OSF wiki.

Finally the complete clj-osf DSL documentation is available here.

A community effort

This new release of the OSF Installer is another effort of the growing Open Semantic Framework community. The upgrade of the installer to deploy the OSF stack using Virtuoso Open Source version 7.1.0 has been created by William (Bill) Anderson.

Deploying a new OSF 3.1 Server

Using the OSF Installer

OSF 3.1 can easily be deployed on an Ubuntu 14.04 LTS server using the osf-installer application. This is done by executing the following commands in your terminal:

mkdir -p /usr/share/osf-installer/

cd /usr/share/osf-installer/

wget https://raw.github.com/structureddynamics/Open-Semantic-Framework-Installer/3.1/install.sh

chmod 755 install.sh

./install.sh

./osf-installer --install-osf -v

Using an Amazon AMI

If you are an Amazon AWS user, you also have access to a free AMI that you can use to create your own OSF instance. The full documentation for using the OSF AMI is available here.

Upgrading Existing Installations

Existing OSF installations can be upgraded using the OSF Installer. However, note that the upgrade won’t deploy Virtuoso Open Source 7.1.0 for you. All the code will be upgraded, but Virtuoso will remain at the version you were last using on your instance. All the code of OSF 3.1 is compatible with previous versions of Virtuoso, but you won’t benefit from the latest improvements to Virtuoso (in terms of performance) or from its latest SPARQL 1.1 implementation. If you want to upgrade Virtuoso to version 7.1.0 on an existing OSF instance, you will have to do this by hand.

To upgrade the OSF codebase, the first thing is to upgrade the installer itself:

# Upgrade the OSF Installer
/usr/share/osf-installer/upgrade.sh

Then you can upgrade the components using the following commands:

# Upgrade the OSF Web Services
/usr/share/osf-installer/osf --upgrade-osf-web-services="3.1.0"

# Upgrade the OSF WS PHP API
/usr/share/osf-installer/osf --upgrade-osf-ws-php-api="3.1.0"

# Upgrade the OSF Tests Suites
/usr/share/osf-installer/osf --upgrade-osf-tests-suites="3.1.0"

# Upgrade the Datasets Management Tool
/usr/share/osf-installer/osf --upgrade-osf-datasets-management-tool="3.1.0"

# Upgrade the Data Validator Tool
/usr/share/osf-installer/osf --upgrade-osf-data-validator-tool="3.1.0"



This blog is a regularly updated collection of my thoughts, tips, tricks and ideas about data mining, data integration, data publishing, the semantic Web, my researches and other related software development.

