clj-fst: Finite State Transducers (FST) for Clojure

clj-fst is a Clojure wrapper around the Lucene FST API. Finite state transducers are finite state machines with two tapes: an input and an output tape. The automaton maps an input string to an output. The output can be a vector of strings or a vector of integers. There are more profound mathematical implications to FSTs, but those are the basics for now.

Why Use FSTs?

Considering that basic definition of a FST, one could legitimately wonder why he should care about FSTs. FSTs could be seen as simple Clojure maps, so why bother with FSTs?

Everything is a matter of scale. Using a map, or such generic structures, for efficiently handling millions or billions of values is far from effective, if even possible.

That is why we need some specialized structures like FSTs: to be able to create such huge associative structures that are lightning fast to query and that use a minimum of memory.

There are two general use cases for using FSTs:

  1. When you want to know if an instance A exists in a really huge set X (where the set X is the FST)
  2. When you want to get a list of outputs from a given input from a really huge set.

Lucene FSTs

There are multiple FSTs implementations out there, however I choose to go with Lucene’s implementation development by Micheal McCandless. The main reason for using the Lucene FST API is because of their implementation of the FST. It implements the work of Stoyan Mihov and Denis Maurel1 to create a minimal unweighted FST from pre-sorted inputs. The implementation results in lightning fast querying of the structure with a really efficient use of memory. Considering the size of the structures we manipulate at Structured Dynamics, these were the two main characteristics to look for and the reason why we choose that implementation.

Limitations

However, there are two things to keep in mind when working with FSTs:

  1. The FSTs are static. This means that you cannot add to them once they are created. You have to re-create them from the beginning if you want to change their content.
  2. The entries have to be pre-sorted. If your entries are not sorted when you create the FST ,then unexpected results will happen.

clj-fst

The clj-fst project is nothing more than a wrapper around the Lucene FST API. However, one of the goals of this project is to make this specific Lucene function outstanding and to liberalize its usage in Clojure.

If you take the time to analyze the clj-fst wrapper, and the Lucene API code, you will notice that not all the of functionality of the API is wrapped. The thing is that the API is somewhat complex and doesn’t have much documentation. What clj-fst tries to do is to simplify the usage of the API and to create more documentation and code usage examples around it. Finally, it tries create an abstraction layer over the API to manipulate the FSTs in the Clojure way…

Basic Usage

Creating an FST is really simple, it has 3 basic, and one optional, steps:

  1. Create the FST builder
  2. Populate the FST using the builder
  3. Create the actual FST from the builder
  4. Optionally, save the FST on the file system to reload it later in memory.

Note that the complete clj-fst documentation is available here.

The simplest code looks like:

;; The first thing to do is to create the Builder
(def builder (create-builder! :type :int))

;; This small sorted-map defines the things
;; to add to the FST
(def values (into (sorted-map) {"cat" 1
                                "dog" 2
                                "mice" 3}))

;; Populate the FST using that sorted-map
(doseq [[input output] values]
  (add! builder {input output}))

;; Creating a new FST
(def fst (create-fst! builder))

;; Save a FST on the file system
(save! "resources/fst.srz" fst)

Once the FST is saved on the file system, you can easily reload it later:

;; Load a FST from the file system
(load! "resources/fst.srz)

You can easily get the output related to an input:

;; Query the FST
(get-output "cat" fst)

You can iterate the content of FST:

;; Create the FST enumeration
(def enum (create-enum! fst))

;; Get the first item in the FST
(next! enum)

;; Get the current FST item pointed by the enumerator
(current! enum)

Finally you have other ways to query the FST using the enumerator:

;; Search for different input terms
(get-ceil-term! "cat" enum)

(get-floor-term! "cat" enum)

(get-exact-term! "cat" enum)

More Complex Example

Let’s take a look at a more complex example. What we will be doing here is to create a FST that will be used as a high performance inference index for UMBEL reference concepts (classes). What we are doing is to query the UMBEL super classes web service endpoint to populate the super-types index.

The process is:

  1. Get the number of concepts in the UMBEL structure
  2. Get the list of all the UMBEL concepts using the UMBEL search endpoint
  3. Sort the list of UMBEL concepts URIs
  4. Get the super-classes, by inference, for each of the concepts
  5. Populate the FST with the concepts as input and its super-classes as output
  6. Save the FST on the file system.

To simplify the example, I simply list all of the UMBEL reference concepts in a CSV file. However, you could have created that list using the UMBEL search web service endpoint.

The function that creates the UMBEL reference concepts super-classes index is:

(ns foo.core
  (:require [clojure.string :as string]
            [clj-http.client :as http]
            [clojure.data.csv :as csv]
            [clojure.java.io :as io]
            [clj-fst.core :as fst]))

(defn get-umbel-reference-concepts []
  (->> (with-open [in-file (io/reader "http://fgiasson.com/blog/wp-content/uploads/2015/04/umbel-reference-concepts.csv")]
         (doall
          (csv/read-csv in-file)))
       flatten
       (into [])))

(defn create-umbel-super-classes-fst []
  (let [ref-concepts (->> (get-umbel-reference-concepts)
                          (map (fn [ref-concept]
                                 [(string/replace ref-concept "http://umbel.org/umbel/rc/" "")]))
                          (apply concat)
                          (into [])
                          distinct
                          sort)
        builder (fst/create-builder! :type :char :pack true)]
    (doseq [ref-concept ref-concepts]
      (println ref-concept)
      (let [resultset (http/get (str "http://umbel.org/ws/super-classes/" ref-concept)
                                {:accept "application/clojure"
                                 :throw-exceptions false})]
        (when (= (get resultset :status) 200)
          (doseq [super-class (->> resultset
                                   :body
                                   read-string
                                   (into []))]
            (fst/add! builder {(str "http://umbel.org/umbel/rc/" ref-concept) super-class})))))
    (let [fst (fst/create-fst! builder)]
      (fst/save! "resources/umbel-super-classes.fst" fst))))

After running the (create-umbel-super-classes-fst) function, a umbel-super-classes.fst file will be created in the resources/ folder of your project. This process should take about 5 to 10 minutes to complete. All the latency comes from the fact that you have to issue a web service query for every concept. From the standpoint of the FST, you could populate one with millions of inputs within a few seconds.

Eventually you will be able to reload that index in any context:

(def umbel-super-classes (fst/load! "resources/umbel-super-classes.fst"))

 

Conclusion

As you can see, an FST is a really interesting structure that lets you query really huge arrays in an effective way. The goal of this new Clojure library is to make its usage as simple as possible. It is intended to be used by any developer that has to query very large sets of data with a computational- and memory-effective way.

Open Semantic Framework 3.3 Released

Structured Dynamics is happy to announce the immediate availability of the Open Semantic Framework version 3.3. This new release of OSF lets system administrators choose between two different communication channels to send SPARQL queries to the triple store: triple_120
  1. HTTP
  2. ODBC

In OSF 3.1, the only communication channel available was a ODBC channel using the iODBC drivers. In OSF 3.2, the only communication channel available was a HTTP channel. What we did with OSF 3.3 is to let the system administrator choose between the two.

Quick Introduction to the Open Semantic Framework

What is the Open Semantic Framework?

The Open Semantic Framework (OSF) is an integrated software stack using semantic technologies for knowledge management. It has a layered architecture that combines existing open source software with additional open source components. OSF is designed as an integrated content platform accessible via the Web, which provides needed knowledge management capabilities to enterprises. OSF is made available under the Apache 2 license.

OSF can integrate and manage all types of content â€" unstructured documents, semi-structured files, spreadsheets, and structured databases â€" using a variety of best-of-breed data indexing and management engines. All external content is converted to the canonical RDF data model, enabling common tools and methods for tagging and managing all content. Ontologies provide the schema and common vocabularies for integrating across diverse datasets. These capabilities can be layered over existing information assets for unprecedented levels of integration and connectivity. All information within OSF may be powerfully searched and faceted, with results datasets available for export in a variety of formats and as linked data.

Why Multiple Channels in OSF?

Historically, OSF only used the ODBC channel to communicate with Virtuoso, and it was using the iODBC drivers. As explained in a previous blog post, the fact that we were using the iODBC drivers in Ubuntu was adding a lot of complexity into the system since we had to recompile most of the PHP packages to use that other ODBC driver.

With OSF 3.2, we refactored the code such that we could query any SPARQL HTTP endpoint. The goal of this current improvement is to be able to use any triple store that has a compatible SPARQL HTTP endpoint with OSF, and not just Virtuoso.

With OSF 3.3, what we choose to do is to make both options a possibility. However, what we did is to make sure that the latest version of Virtuoso was now properly working with the unixODBC drivers, which are shipped by default with Ubuntu.

This means that people can now use the ODBC channel, but using the unixODBC drivers instead. The end result of this enhancement is that it makes the maintenance of a Ubuntu/OSF instance much easier since no packages are on hold, and that the PHP5 packages can be updated at any time without needing to be recompiled using the iODBC drivers.

Deploying a New OSF 3.3 Server

Using the OSF Installer

OSF 3.3 can easily be deployed on a Ubuntu 14.04 LTS server using the osf-installer application. The deployment is done by executing the following commands in your terminal:

mkdir -p /usr/share/osf-installer/

cd /usr/share/osf-installer/

wget https://raw.github.com/structureddynamics/Open-Semantic-Framework-Installer/3.3/install.sh

chmod 755 install.sh

./install.sh

./osf-installer --install-osf -v

Using an Amazon AMI

If you are an Amazon AWS user, you also have access to a free AMI that you can use to create your own OSF instance. The full documentation for using the OSF AMI is available here.

Upgrading Existing Installations

It is not possible to automatically upgrade previous versions of OSF to OSF 3.3. It is possible to upgrade an older instance of OSF to OSF version 3.3, but only manually. If you have this requirement, just let me know and I will write about the upgrade steps that are required to upgrade these instances to OSF version 3.3.

Conclusion

This new version of the Open Semantic Framework should be even simpler to install, deploy and maintain. Several additional small updates have also provided in this new version to other aspects of installation simpler and faster.

Open Semantic Framework 3.2 Released

Structured Dynamics is happy to announce the immediate availability of the Open Semantic Framework version 3.2. This is the second important OSF release in a month and a half. triple_120

This new major release of OSF changes the way the web services communicate with the triple store. Originally, OSF web services were using a ODBC channel to communicate with the triple store (Virtuoso). This new release uses the SPARQL HTTP endpoints of the triple store to send queries to it. This is the only changes that occurs in this new version, but as you will see bellow, this is a major one.

Why switching to HTTP?

The problem with using ODBC as the primary communication channel between the OSF web services and the triple store is that it was adding a lot of complexity into OSF. Because the UnixODBC drivers that are shipped with Ubuntu had issues with Virtuoso, we had to use the iODBC drivers to make sure that everything was working properly. This situation forced us to recompile PHP5 such that it uses iODBC instead of UnixODBC as the ODBC drivers for PHP5.

This was greatly complexifying the deployment of OSF since we couldn’t use the default PHP5 packages that shipped with Ubuntu, but had to maintain our own ones that were working with iODBC.

The side effect of this is that system administrators couldn’t upgrade their Ubuntu instances normally since PHP5 needed to be upgraded using particular packages created for that purpose.

Now that OSF doesn’t use ODBC to communicate with the triple store, all this complexity goes away since no special handling is now required. All of the default Ubuntu packages can be used like system administrators normally do.

With this new version, the installation and deployment of a OSF instance has been greatly simplified.

Supports New Triple Stores

Another problem with using ODBC is that it was limiting the number of different triple stores that could be used for operating OSF. In fact, people could only use Virtuoso with their OSF instance.

This new release opens new opportunities. OSF still ships with Virtuoso Open Source as its default triple store, however any triple store that has the following characteristics could replace Virtuoso in OSF:

  1. It has a SPARQL HTTP endpoint
  2. It supports SPARQL 1.1 and SPARQL Update 1.1
  3. It supports SPARQL Update queries that can be sent to the SPARQL HTTP endpoint
  4. It supports the SPARQL 1.1 Query Results JSON Format
  5. It supports the SPARQL 1.1 Graph Store HTTP Protocol via a HTTP endpoint (optional, only required by the Datasets Management Tool)

Deploying a new OSF 3.2 Server

Using the OSF Installer

OSF 3.2 can easily be deployed on a Ubuntu 14.04 LTS server using the osf-installer application. It can easily be done by executing the following commands in your terminal:

mkdir -p /usr/share/osf-installer/

cd /usr/share/osf-installer/

wget https://raw.github.com/structureddynamics/Open-Semantic-Framework-Installer/3.2/install.sh

chmod 755 install.sh

./install.sh

./osf-installer --install-osf -v

Using a Amazon AMI

If you are an Amazon AWS user, you also have access to a free AMI that you can use to create your own OSF instance. The full documentation for using the OSF AMI is available here.

Upgrading Existing Installations

It is not possible to automatically upgrade previous versions of OSF to OSF 3.2. It is possible to upgrade a older instance of OSF to OSF version 3.2, but only manually. If you have this requirement, just let me know and I will write about the upgrade steps that are required to upgrade these instances to OSF version 3.2.

Security

Now that the triple store’s SPARQL HTTP endpoint requires it to be enabled with SPARQL Update rights, it is more important than ever to make sure that the SPARQL HTTP endpoint of the triple store is only available to the OSF web services.

This can be done by properly configuring your firewall or proxy such that only local traffic, or traffic coming from the OSF web service processes, can reach the endpoint.

The SPARQL endpoint that should be exposed to the outside World is OSF’s SPARQL endpoint, which adds an authentication layer above the triple store’s endpoint, and restricts potentially armful SPARQL queries.

Conclusion

This new version of the Open Semantic Framework greatly simplifies its deployment and its maintenance. It also enables other triple stores that exist on the market to be used for OSF instead of Virtuoso Open Source.




This blog is a regularly updated collection of my thoughts, tips, tricks and ideas about data mining, data integration, data publishing, the semantic Web, my researches and other related software development.


RSS Twitter LinkedIN


Follow

Get every new post on this blog delivered to your Inbox.

Join 87 other followers:

Or subscribe to the RSS feed by clicking on the counter:




RSS Twitter LinkedIN