clj-fst
is a Clojure wrapper around the Lucene FST API. Finite state transducers are finite state machines with two tapes: an input and an output tape. The automaton maps an input string to an output. The output can be a vector of strings or a vector of integers. There are more profound mathematical implications to FSTs, but those are the basics for now.
Why Use FSTs?
Considering that basic definition of a FST, one could legitimately wonder why he should care about FSTs. FSTs could be seen as simple Clojure maps, so why bother with FSTs?
Everything is a matter of scale. Using a map, or such generic structures, for efficiently handling millions or billions of values is far from effective, if even possible.
That is why we need some specialized structures like FSTs: to be able to create such huge associative structures that are lightning fast to query and that use a minimum of memory.
There are two general use cases for using FSTs:
- When you want to know if an instance
A
exists in a really huge setX
(where the setX
is the FST) - When you want to get a list of outputs from a given input from a really huge set.
Lucene FSTs
There are multiple FSTs implementations out there, however I choose to go with Lucene’s implementation development by Micheal McCandless. The main reason for using the Lucene FST API is because of their implementation of the FST. It implements the work of Stoyan Mihov and Denis Maurel1 to create a minimal unweighted FST from pre-sorted inputs. The implementation results in lightning fast querying of the structure with a really efficient use of memory. Considering the size of the structures we manipulate at Structured Dynamics, these were the two main characteristics to look for and the reason why we choose that implementation.
Limitations
However, there are two things to keep in mind when working with FSTs:
- The FSTs are static. This means that you cannot add to them once they are created. You have to re-create them from the beginning if you want to change their content.
- The entries have to be pre-sorted. If your entries are not sorted when you create the FST ,then unexpected results will happen.
clj-fst
The clj-fst
project is nothing more than a wrapper around the Lucene FST API. However, one of the goals of this project is to make this specific Lucene function outstanding and to liberalize its usage in Clojure.
If you take the time to analyze the clj-fst
wrapper, and the Lucene API code, you will notice that not all the of functionality of the API is wrapped. The thing is that the API is somewhat complex and doesn’t have much documentation. What clj-fst
tries to do is to simplify the usage of the API and to create more documentation and code usage examples around it. Finally, it tries create an abstraction layer over the API to manipulate the FSTs in the Clojure way…
Basic Usage
Creating an FST is really simple, it has 3 basic, and one optional, steps:
- Create the FST builder
- Populate the FST using the builder
- Create the actual FST from the builder
- Optionally, save the FST on the file system to reload it later in memory.
Note that the complete clj-fst
documentation is available here.
The simplest code looks like:
[cc lang=”lisp”]
[raw]
;; The first thing to do is to create the Builder
(def builder (create-builder! :type :int))
;; This small sorted-map defines the things
;; to add to the FST
(def values (into (sorted-map) {“cat” 1
“dog” 2
“mice” 3}))
;; Populate the FST using that sorted-map
(doseq [[input output] values]
(add! builder {input output}))
;; Creating a new FST
(def fst (create-fst! builder))
;; Save a FST on the file system
(save! “resources/fst.srz” fst)
[/raw]
[/cc]
Once the FST is saved on the file system, you can easily reload it later:
[cc lang=”lisp”]
[raw]
;; Load a FST from the file system
(load! “resources/fst.srz)
[/raw]
[/cc]
You can easily get the output related to an input:
[cc lang=”lisp”]
[raw]
;; Query the FST
(get-output “cat” fst)
[/raw]
[/cc]
You can iterate the content of FST:
[cc lang=”lisp”]
[raw]
;; Create the FST enumeration
(def enum (create-enum! fst))
;; Get the first item in the FST
(next! enum)
;; Get the current FST item pointed by the enumerator
(current! enum)
[/raw]
[/cc]
Finally you have other ways to query the FST using the enumerator:
[cc lang=”lisp”]
[raw]
;; Search for different input terms
(get-ceil-term! “cat” enum)
(get-floor-term! “cat” enum)
(get-exact-term! “cat” enum)
[/raw]
[/cc]
More Complex Example
Let’s take a look at a more complex example. What we will be doing here is to create a FST that will be used as a high performance inference index for UMBEL reference concepts (classes). What we are doing is to query the UMBEL super classes web service endpoint to populate the super-types index.
The process is:
- Get the number of concepts in the UMBEL structure
- Get the list of all the UMBEL concepts using the UMBEL search endpoint
- Sort the list of UMBEL concepts URIs
- Get the super-classes, by inference, for each of the concepts
- Populate the FST with the concepts as input and its super-classes as output
- Save the FST on the file system.
To simplify the example, I simply list all of the UMBEL reference concepts in a CSV file. However, you could have created that list using the UMBEL search web service endpoint.
The function that creates the UMBEL reference concepts super-classes index is:
[cc lang=”lisp”]
[raw]
(ns foo.core
(:require [clojure.string :as string]
[clj-http.client :as http]
[clojure.data.csv :as csv]
[clojure.java.io :as io]
[clj-fst.core :as fst]))
(defn get-umbel-reference-concepts []
(->> (with-open [in-file (io/reader “https://fgiasson.com/blog/wp-content/uploads/2015/04/umbel-reference-concepts.csv”)]
(doall
(csv/read-csv in-file)))
flatten
(into [])))
(defn create-umbel-super-classes-fst []
(let [ref-concepts (->> (get-umbel-reference-concepts)
(map (fn [ref-concept]
[(string/replace ref-concept “http://umbel.org/umbel/rc/” “”)]))
(apply concat)
(into [])
distinct
sort)
builder (fst/create-builder! :type :char :pack true)]
(doseq [ref-concept ref-concepts]
(println ref-concept)
(let [resultset (http/get (str “http://umbel.org/ws/super-classes/” ref-concept)
{:accept “application/clojure”
:throw-exceptions false})]
(when (= (get resultset :status) 200)
(doseq [super-class (->> resultset
:body
read-string
(into []))]
(fst/add! builder {(str “http://umbel.org/umbel/rc/” ref-concept) super-class})))))
(let [fst (fst/create-fst! builder)]
(fst/save! “resources/umbel-super-classes.fst” fst))))
[/raw]
[/cc]
After running the (create-umbel-super-classes-fst)
function, a umbel-super-classes.fst
file will be created in the resources/
folder of your project. This process should take about 5 to 10 minutes to complete. All the latency comes from the fact that you have to issue a web service query for every concept. From the standpoint of the FST, you could populate one with millions of inputs within a few seconds.
Eventually you will be able to reload that index in any context:
[cc lang=”lisp”]
[raw]
(def umbel-super-classes (fst/load! “resources/umbel-super-classes.fst”))
[/raw]
[/cc]
Conclusion
As you can see, an FST is a really interesting structure that lets you query really huge arrays in an effective way. The goal of this new Clojure library is to make its usage as simple as possible. It is intended to be used by any developer that has to query very large sets of data with a computational- and memory-effective way.