clj-fst is a Clojure wrapper around the Lucene FST API. Finite state transducers are finite state machines with two tapes: an input and an output tape. The automaton maps an input string to an output. The output can be a vector of strings or a vector of integers. There are more profound mathematical implications to FSTs, but those are the basics for now.
Why Use FSTs?
Considering that basic definition of a FST, one could legitimately wonder why he should care about FSTs. FSTs could be seen as simple Clojure maps, so why bother with FSTs?
Everything is a matter of scale. Using a map, or such generic structures, for efficiently handling millions or billions of values is far from effective, if even possible.
That is why we need some specialized structures like FSTs: to be able to create such huge associative structures that are lightning fast to query and that use a minimum of memory.
There are two general use cases for using FSTs:
- When you want to know if an instance
Aexists in a really huge set
X(where the set
Xis the FST)
- When you want to get a list of outputs from a given input from a really huge set.
There are multiple FSTs implementations out there, however I choose to go with Lucene’s implementation development by Micheal McCandless. The main reason for using the Lucene FST API is because of their implementation of the FST. It implements the work of Stoyan Mihov and Denis Maurel1 to create a minimal unweighted FST from pre-sorted inputs. The implementation results in lightning fast querying of the structure with a really efficient use of memory. Considering the size of the structures we manipulate at Structured Dynamics, these were the two main characteristics to look for and the reason why we choose that implementation.
However, there are two things to keep in mind when working with FSTs:
- The FSTs are static. This means that you cannot add to them once they are created. You have to re-create them from the beginning if you want to change their content.
- The entries have to be pre-sorted. If your entries are not sorted when you create the FST ,then unexpected results will happen.
clj-fst project is nothing more than a wrapper around the Lucene FST API. However, one of the goals of this project is to make this specific Lucene function outstanding and to liberalize its usage in Clojure.
If you take the time to analyze the
clj-fst wrapper, and the Lucene API code, you will notice that not all the of functionality of the API is wrapped. The thing is that the API is somewhat complex and doesn’t have much documentation. What
clj-fst tries to do is to simplify the usage of the API and to create more documentation and code usage examples around it. Finally, it tries create an abstraction layer over the API to manipulate the FSTs in the Clojure way…
Creating an FST is really simple, it has 3 basic, and one optional, steps:
- Create the FST builder
- Populate the FST using the builder
- Create the actual FST from the builder
- Optionally, save the FST on the file system to reload it later in memory.
The simplest code looks like:
Once the FST is saved on the file system, you can easily reload it later:
You can easily get the output related to an input:
You can iterate the content of FST:
Finally you have other ways to query the FST using the enumerator:
More Complex Example
Let’s take a look at a more complex example. What we will be doing here is to create a FST that will be used as a high performance inference index for UMBEL reference concepts (classes). What we are doing is to query the UMBEL super classes web service endpoint to populate the super-types index.
The process is:
- Get the number of concepts in the UMBEL structure
- Get the list of all the UMBEL concepts using the UMBEL search endpoint
- Sort the list of UMBEL concepts URIs
- Get the super-classes, by inference, for each of the concepts
- Populate the FST with the concepts as input and its super-classes as output
- Save the FST on the file system.
To simplify the example, I simply list all of the UMBEL reference concepts in a CSV file. However, you could have created that list using the UMBEL search web service endpoint.
The function that creates the UMBEL reference concepts super-classes index is:
After running the
(create-umbel-super-classes-fst) function, a
umbel-super-classes.fst file will be created in the
resources/ folder of your project. This process should take about 5 to 10 minutes to complete. All the latency comes from the fact that you have to issue a web service query for every concept. From the standpoint of the FST, you could populate one with millions of inputs within a few seconds.
Eventually you will be able to reload that index in any context:
As you can see, an FST is a really interesting structure that lets you query really huge arrays in an effective way. The goal of this new Clojure library is to make its usage as simple as possible. It is intended to be used by any developer that has to query very large sets of data with a computational- and memory-effective way.