Clojure, Cognonto, Semantic Web

Using Cognonto to Generate Domain Specific word2vec Models

word2vec is a two layer artificial neural network used to process text to learn relationships between words within a text corpus to create a model of all the relationships between the words of that corpus. The text corpus that a word2vec process uses to learn the relationships between words is called the training corpus.

In this article I will show you how Cognonto‘s knowledge base can be used to automatically create highly accurate domain specific training corpuses that can be used by word2vec to generate word relationship models. However you have to understand that what is being discussed here is not only applicable to word2vec, but to any method that uses corpuses of text for training. For example, in another article, I will show how this can be done with another algorithm called ESA (Explicit Semantic Analysis).

It is said about word2vec that “given enough data, usage and contexts, word2vec can make highly accurate guesses about a word’s meaning based on past appearances.” What I will show in this article is how to determine the context and we will see how this impacts the results.

Training Corpus

A training corpus is really just a set of text used to train unsupervised machine learning algorithms. Any kind of text can be used by word2vec. The only thing it does is to learn the relationships between the words that exist in the text. However, not all training corpuses are equal. Training corpuses are often dirty, biaised and ambiguous. Depending on the task at hand, it may be exactly what is required, but more often than not, their errors need to be fixed. Cognonto has the advantage of starting with clean text.

When we want to create a new training corpus, the first step is to find a source of text that could work to create that corpus. The second step is to select the text we want to add to it. The third step is to pre-process that corpus of text to perform different operations on the text, such as: removing HTML elements; removing punctuation; normalizing text; detecting named entities; etc. The final step is to train word2vec to generate the model.

word2vec is somewhat dumb. It only learns what exists in the training corpus. It does not do anything other than “reading” the text and analyzing the relationships between the words (which are really just group of characters separated by spaces). The word2vec process is highly subject to the Garbage In, Garbage Out principle, which means that if the training set is dirty, biaised and ambiguous, then the learned relationship will end-up being of little or no value.

Domain-specific Training Corpus

A domain-specific training corpus is a specialized training corpus where its text is related to a specific domain. Examples of domains are music, mathematics, cars, healthcare, etc. In contrast, a general training corpus is a corpus of text that may contain text that discusses totally different domains. By creating a corpus of text that covers a specific domain of interest, we limit the usage of words (that is, their co-occurrences) to texts that are meaningful to that domain.

As we will see in this article, a domain-specific training corpus can be quite useful, and much more powerful, than general ones, if the task at hand is in relation to a specific domain of expertise. The major problem with domain-specific training corpuses is that they are really costly to create. We not only have to find the source of data to use, but we also have to select each document that we want to include in the training corpus. This can work if we want a corpus with 100 or 200 documents, but what if you want a training corpus of 100,000 or 200,000 documents? Then it becomes a problem.

It is the kind of problem that Cognonto helps to resolve. Cognonto and KBpedia, its knowledge base, is a set of ~39,000 reference concepts that have ~138,000 links to schema of external data sources such as Wikipedia, Wikidata and USPTO. It is that structure and these links to external data sources that we use to create domain-specific training corpuses on the fly. We leverage the reference concept structure to select all of the concepts that should be part of the domain that is being defined. Then we use Cognonto’s inference capabilities to infer all the other hundred or thousands of concepts that define the full scope of the domain. Then we analyze the hundreds or thousands of concepts we selected that way to get all of the links to external data sources. Finally we use these references to create the training corpus. All of this is done automatically once the initial few concepts that define my domain got selected. The workflow looks like:

cognonto-workflow

The Process

To show you how this process works, I will create a domain-specific training set about musicians using Cognonto. Then I will use the Google News word2vec model created by Google and that has about 100 billion words. The Google model contains 300-dimensional vectors for 3 million words and phrases. I will use the Google News model as the general model to compare the results/performance between a domain specific and a general model.

Determining the Domain

The first step is to define the scope of the domain we want to create. For this article, I want a domain that is somewhat constrained to create a training corpus that is not too large for demo purposes. The domain I have chosen is musicians. This domain is related to people and bands that play music. It is also related to musical genres, instruments, music industry, etc.

To create my domain, I select a single KBpedia reference concept: Musician. If I wanted to broaden the scope of the domain, I could have included other concepts such as: Music, Musical Group, Musical Instrument, etc.

Aggregating the Domain-specific Training Corpus

Once we have determined the scope of the domain, the next step is to query the KBpedia knowledge base to aggregate all of the text that will belong to that training corpus. The end result of this operation is to create a training corpus with text that is only related to the scope of the domain we defined.

(defn create-domain-specific-training-set
  [target-kbpedia-class corpus-file]
  (let [step 1000
        entities-dataset "http://kbpedia.org/knowledge-base/"
        kbpedia-dataset "http://kbpedia.org/kko/"
        nb-entities (get-nb-entities-for-class-ws target-kbpedia-class entities-dataset kbpedia-dataset)]
    (loop [nb 0
           nb-processed 1]
      (when (< nb nb-entities)
        (doseq [entity (get-entities-slice target-kbpedia-class entities-dataset kbpedia-dataset :limit step :offset @nb-processed)]          
          (spit corpus-file (str (get-entity-content entity) "\n") :append true)
          (println (str nb-processed "/" nb-entities)))
        (recur (+ nb step)
               (inc nb-processed))))))

(create-domain-specific-training-set "http://kbpedia.org/kko/rc/Musician" "resources/musicians-corpus.txt")

What this code does is to query the KBpedia knowledge base to get all the named entities that are linked to it, for the scope of the domain we defined. Then the text related to each entity is appended to a text file where each line is the text of a single entity.

Given the scope of the current demo, the musicians training corpus is composed of 47,263 documents. This is the crux of the demo. With a simple function, we are able to aggregate 47,263 text documents highly related to a conceptual domain we defined on the fly. All of the hard work has been delegated to the knowledge base and its conceptual structure (in fact, this simple function leverages 8 years of hard work).

Normalizing Text

The next step is a natural step related to any NLP pipeline. Before learning from the training corpus, we should clean and normalize the text of its raw form.

(defn normalize-proper-name
  [name]
  (-> name
      (string/replace #" " "_")      
      (string/lower-case)))

(defn pre-process-line
  [line]  
  (-> (let [line (-> line
                     ;; 1. remove all underscores
                     (string/replace "_" " "))]
        ;; 2. detect named entities and change them with their underscore form, like: Fred Giasson -> fred_giasson
        (loop [entities (into [] (re-seq #"[\p{Lu}]([\p{Ll}]+|\.)(?:\s+[\p{Lu}]([\p{Ll}]+|\.))*(?:\s+[\p{Ll}][\p{Ll}\-]{1,3}){0,1}\s+[\p{Lu}]([\p{Ll}]+|\.)" line))
               line line]
          (if (empty? entities)
            line
            (let [entity (first (first entities))]
              (recur (rest entities)                     
                     (string/replace line entity (normalize-proper-name entity)))))))
      (string/replace (re-pattern stop-list) " ")
      ;; 4. remove everything between brackets like: [1] [edit] [show]
      (string/replace #"\[.*\]" " ")
      ;; 5. punctuation characters except the dot and the single quote, replace by nothing: (),[]-={}/\~!?%$@&*+:;<>
      (string/replace #"[\^\(\)\,\[\]\=\{\}\/\\\~\!\?\%\$\@\&\*\+:\;\<\>\"\p{Pd}]" " ")
      ;; 6. remove all numbers
      (string/replace #"[0-9]" " ")
      ;; 7. remove all words with 2 characters or less
      (string/replace #"\b[\p{L}]{0,2}\b" " ")
      ;; 10. normalize spaces
      (string/replace #"\s{2,}" " ")
      ;; 11. normalize dots with spaces
      (string/replace #"\s\." ".")
      ;; 12. normalize dots
      (string/replace #"\.{1,}" ".")
      ;; 13. normalize underscores
      (string/replace #"\_{1,}" "_")
      ;; 14. remove standalone single quotes
      (string/replace " ' " " ")
      ;; 15. re-normalize spaces
      (string/replace #"\s{2,}" " ")        
      ;; 16. put everything lowercase
      (string/lower-case)

      (str "\n")))

(defn pre-process-corpus
  [in-file out-file]
  (spit out-file "" :append true)
  (with-open [file (clojure.java.io/reader in-file)]
    (doseq [line (line-seq file)]
      (spit out-file (pre-process-line line) :append true))))

(pre-process-corpus "resources/musicians-corpus.txt" "resources/musicians-corpus.clean.txt")

We remove all of the characters that may cause issues to the tokenizer used by the word2vec implementation. We also remove unnecessary words and other words that appear too often or that add nothing to the model we want to generate (like the listing of days and months). We also drop all numbers.

Training word2vec

The last step is to train word2vec on our clean domain-specific training corpus to generate the model we will use. For this demo, I will use the DL4J (Deep Learning for Java) library that is a Java implementation of the word2vec algorithm. Training word2vec is as simple as using the DL4J API like this:

(defn train
  [training-set-file model-file]
  (let [sentence-iterator (new LineSentenceIterator (clojure.java.io/file training-set-file))
        tokenizer (new DefaultTokenizerFactory)
        vec (.. (new Word2Vec$Builder)
                (minWordFrequency 1)
                (windowSize 5)
                (layerSize 100)
                (iterate sentence-iterator)
                (tokenizerFactory tokenizer)
                build)]
    (.fit vec)
    (SerializationUtils/saveObject vec (io/file model-file))
    vec))

(def musicians-model (train "resources/musicians-corpus.clean.txt" "resources/musicians-corpus.model"))

What is important to notice here is the number of parameters that can be defined to train word2vec on a corpus. In fact, that algorithm can be sensitive to parametrization.

Importing the General Model

The goal of this demo is to demonstrate the difference between a domain-specific model and a general model. Remember that the general model we chose was the Google News model that is composed of billion of words, but which is highly general. DL4J can import that model without having to generate it ourselves (in fact, only the model is distributed by Google, not the training corpus):

(defn import-google-news-model
  []
  (org.deeplearning4j.models.embeddings.loader.WordVectorSerializer/loadGoogleModel (clojure.java.io/file "GoogleNews-vectors-negative300.bin.gz") true))

(def google-news-model (import-google-news-model))

Playing With Models

Now that we have a domain-specific model related to musicians and a general model related to news processed by Google, let’s start playing with both to see how they perform on different tasks. In the following examples, we will always compare the domain-specific training corpus with the general one.

Ambiguous Words

A characteristic of words is that their surface form can be ambiguous; they can have multiple meanings. An ambiguous word can co-occur with multiple other words that may not have any shared meaning. But all of this depends on the context. If we are in a general context, then this situation will happen more often than we think and will impact the similarity score of these ambiguous terms. However, as we will see, this phenomenum is greatly diminished when we use domain-specific models.

Similarity Between Piano, Organ and Violin

What we want to check is the relationship between 3 different music instruments: piano, organ and violin. We want to check the relationship between each of them.

(.similarity musicians-model "piano" "violin")
0.8422856330871582
(.similarity musicians-model "piano" "organ")
0.8573281764984131

As we can see, both tuples have a high likelihood of co-occurrence. This suggests that these terms of each tuple are probably highly related. In this case, it is probably because violins are often played along with a piano. And, it is probably that an organ looks like a piano (at least it has a keyboard).

Now let’s take a look at what the general model has to say about that:

(.similarity google-news-model "piano" "violin")
0.8228187561035156
(.similarity google-news-model "piano" "organ")
0.46168726682662964

The surprising fact here is the apparent dissimilarity between piano and organ compared with the results we got with the musicians domain-specific model. If we think a bit about this use case, we will probably conclude that these results makes sense. In fact, organ is an ambiguous word in a general context. An organ can be a musical instrument, but it can also be a part of an anatomy. This means that the word organ will co-occur beside piano, but also all kind of other words related to human and animal biology. This is why they are less similar in the general model than in the domain one, because it is an ambiguous word in a general context.

Similarity Between Album and Track

Now let’s see another similarity example between two other words album and track where track is an ambiguous word depending on the context.

(.similarity musicians-model "album" "track")
0.7943775653839111
(.similarity google-news-model "album" "track")
0.18461623787879944

As expected, because track is ambiguous, there is a big difference in terms of co-occurence probabilities depending on the context (domain-specific or general).

Similarity Between Pianist and Violinist

However, are domain-specific and general differences always the case? Let’s take a look at two words that are domain specific and unambiguous: pianist and violinist.

(.similarity musicians-model "pianist" "violinist")
0.8430571556091309
(.similarity google-news-model "pianist" "violinist")
0.8616064190864563

In this case, the similarity score between the two terms is almost the same. In both contexts (generals and domain specific), their co-occurrence is similar.

Nearest Words

Now let’s look at the similarity between two distinct words in two new and distinct contexts. Let’s take a look at a few words and see what other words occur most often with them.

Music

(.wordsNearest musicians-model ["music"] [] 7)
music revol samoilovich bunin musical amalgamating assam. voice dance.
(.wordsNearest google-news-model ["music"] [] 8)
music classical music jazz Music Without Donny Kirshner songs musicians tunes

One observation we can make is that the terms from the musicians model are more general than the ones from the general model.

Track

(.wordsNearest musicians-model ["track"] [] 10)
track released. album latest entitled released debut year. titled positive
(.wordsNearest google-news-model ["track"] [] 5)
track tracks Track racetrack horseshoe shaped section

As we know, track is ambiguous. The difference between these two sets of nearest related words is striking. There is a clear conceptual correlation in the musicians’ domain-specific model. But in the general model, it is really going in all directions.

Year

Now let’s take a look at a really general word: year

(.wordsNearest musicians-model ["year"] [] 11)
year ghantous. he was grammy naacap grammy award for best luces del alma year. grammy award grammy for best sitorai sol nominated
(.wordsNearest google-news-model ["year"] [] 10)
year month week months decade years summer year.The September weeks

This one is quite interesting too. Both groups of words makes sense, but only in their respective contexts. With the musicians’ model, year is mostly related to awards (like the Grammy Awards 2016), categories like “song of the year”, etc.

In the context of the general model, year is really related to time concepts: months, seasons, etc.

Playing With Co-Occurrences Vectors

Finally we will play with manipulating the co-occurrences vectors by manipulating them. A really popular word2vec equation is king - man + women = queen. What is happening under the hood with this equation is that we are adding and substracting the co-occurences vectors for each of these words, and we check the nearest word of the resulting co-occurence vector.

Now, let’s take a look at a few of these equations.

Pianist + Renowned = ?

(.wordsNearest musicians-model ["pianist" "renowned"] [] 9)
pianist renowned teacher. composer. prolific virtuoso teacher leading educator.
(.wordsNearest google-news-model ["pianist" "renowned"] [] 7)
renowned pianist pianist composer jazz pianist classical pianists composer pianist virtuoso pianist

These kind of operations are kind of interesting. If we add the two co-occurrence vectors for pianist and renowned then we get that a teacher, an educator, a composer or a virtuoso is a renowned pianist.

For unambiguous surface forms like pianist, then the two models score quite well. The difference between the two examples comes from the way the general training corpus has been created (pre-processed) compared to the musicians corpus.

Metal + Death = ?

(.wordsNearest musicians-model ["metal" "death"] [] 10)
metal death thrash deathcore melodic doom grindcore metalcore mathcore heavy
(.wordsNearest google-news-model ["metal" "death"] [] 5)
death metal Tunstallbled steel Death

This example uses two quite general words with no apparent relationship between them. The results with the musicians’ model are all the highly similar genre of music like trash metal, deathcore metal, etc.

However with the general model, it is a mix of multiple unrelated concepts.

Metal – Death + Smooth = ?

Let’s play some more with these equations. What if we want some kind of smooth metal?

(.wordsNearest musicians-model ["metal" "smooth"] ["death"] 5)
smooth fusion funk hard neo

This one is quite interesting. We substracted the death co-occurrence vector to the metal one, and then we added the smooth vector. What we end-up with is a bunch of music genres that are much smoother than death metal.

(.wordsNearest google-news-model ["metal" "smooth"] ["death"] 5)
smooth metal Brushed aluminum durable polycarbonate chromed steel

In the case of the general model, we end-up with “smooth metal”. The removal of the death vector has no effect on the results, probably since these are three ambiguous and really general terms.

What Is Next

The demo I presented in this article uses public datasets currently linked to KBpedia. You may wonder what are the other possibilities? Another possibility is to link your own private datasets to KBpedia. That way, these private datasets would then become usable, exactly in the same way, to create domain-specific training corpuses on the fly. Another possibility would be to take totally unstructured text like local text documents, or semi-structured text like a set of HTML web pages. Then, tag them using the Cognonto topics analyzer to tag each of the text document using KBpedia reference concepts. Then we could use the KBpedia structure exactly the same way to choose which of these documents we want to include in the domain-specific training corpus.

Conclusion

As we saw, creating domain-specific training corpuses to use with word2vec can have a dramatic impact on the results and how results can be much more meaningful within the scope of that domain. Another advantage of the domain-specific training corpuses is that they create much smaller models. This is quite an interesting characteristic since smaller models means they are faster to generate, faster to download/upload, faster to query, consumes less memory, etc.

Of the concepts in KBpedia, roughly 33,000 of them correspond to types (or classes) of various sorts. These pre-determined slices are available across all needs and domains to generate such domain-specific corpuses. Further, KBpedia is designed for rapid incorporation of your own domain information to add further to this discriminatory power.

Clojure, Literate Programming, Programming

Literate [Clojure] Programming: Anatomy of a Org-mode file

This blog post is the second of a series of blog posts about Literate [Clojure] Programming where I explain how I develop my [Clojure] applications using literate programming concepts and principles. In the previous blog post I outlined a project’s structure. In this blog post I will demonstrate how I normally structure an Org-mode file to discuss the problem I am trying to solve, to code it and to test it.

One of the benefits of Literate Programming is that the tools that implement its concepts (in this case Org-mode) give to the developer the possibility to write its code in the order (normally more human friendly) he wants. This is one of the aspects I will cover in this article.

If you want to look at a really simple [Clojure] literate application I created for my Creating And Running Unit Tests Directly In Source Files With Org-mode blog post, take a look at the org-mode-clj-tests-utils (for the rendered version). It should give you a good example of what a literate file that follows the structure discussed here looks like.

This series of blog posts about literate programming is composed of the following articles:

  1. Project folder structure
  2. Anatomy of a Org-mode file (this post)
  3. Tangling all project files
  4. Publishing documentation in multiple formats
  5. Unit Testing.

Structure

A literate programming file can have any kind of structure. Depending on the task at hand, it can take the form of a laboratory notebook or a software documentation file. The structure I will explain here is the structure I use to develop normal applications using the Clojure programming language. In other blog posts I will explain other styles, but I will stick to that one for now.

The usual structure of a literate programming file is composed of the following sections:

  1. introduction
  2. main section
    1. sub-section
      1. introduction
      2. code/explanation/…/code/explanation
      3. unit tests
    2. sub-section
  3. complete namespace definition
    1. unit tests

Each of the sub-sections has the same outline, but multiple levels of sub-sections can be created depending on the needs. Every code block is uniquely named (identified) and belongs to a section or a subsection. The portion of that outline that lets you write your application in the order more friendly to a compiler is the complete namespace definition section which is where we “reconstruct” the code to be tangled (written in a standard source code file).

Introduction

Every file starts with a title and a description of the problem you are trying to solve and an overview of how you are trying to solve it. If required, subsections can always be added to the introduction to properly describe the problem and the solution to that problem. In any case, no code blocks are defined in the introduction; only text, images, tables of data or anything else that helps define a problem and its solution are included.

Note that the title of the file is defined using the #+TITLE: Org-mode markup.

Main & Sub Sections

The main and sub-sections have the same outline. They only differ in the level of the details. You could have a series of main sections without any sub-sections in them. Or you could have a single main section with multiple levels of sub-sections. This split really depends on how you want to formulate the solution to the problem exposed in the introduction.

A section should define a portion of the solution you are developing to fix the problem. The scope of the section is only defined by the developer and it depends on how things are being solved. A more complex solution may require more refined solutions which would require subsections (or multiple levels of them).

In any case, for each of these sections, I almost always define the following portions:

  1. introduction
  2. code/explanation/…/code/explanation
  3. unit tests

For each section, I try to introduce the portion of the solution with some text, images or data tables. Then I start to code my application by adding the code of my application in code blocks intertwined with some text iteratively until that portion of the overall solution is completed. Then I define a third section where I create some unit tests where I iteratively test the functions I created in the section. The unit tests are also used to document how the API can be used by acting as usage examples.

It is also possible that you may want to define a section in your file, but that you don’t want to weave that section in the resulting documentation. This can easily be done by adding the :noexport: markup at the end of a section title.

Code Blocks

Each code block should be named with a unique name across all org-mode files of your project. A name is defined using the #NAME: markup before a code block starts. The name is quite important since it helps understand the flow of your application. It should be written as a short description of what the code block does.

Code blocks in Org-mode have numerous options. However we will only cover the few key ones here. Note that most of the other options will be defined in the Complete Namespace Definition section of the file. This is where we will reference the name of the code block and where we will order the code to tangle from this literate file.

One of the key options of a code block is the :results option which can have one of the following values: silent, value or output. Depending on what you want to output in the literate document, you can display the value returned by the code block, the output of the code processed in the code block, or you can make it silent. The value or the output of a code block’s execution will appear in an EXAMPLE block underneath the code block.

Another important header option of the code block is :export which is used to tell Org-mode how to weave the code block and its results. It has 4 options: code (default), which only exports the code box, results which only exports the results box, both which export both and none which exports nothing when weaving a document.

As I said, many other header options exist like the possibility to use the result of a code block to assign to a variable that can be passed to another block in your literate file which lets you create workflows within your literate files. However I won’t cover these options here since they are more used in a laboratory notebook style.

Unit Tests Blocks

At the end of each section, I usually define a Unit tests section which is where I define different unit tests for the code defined in a section or one of its sub-sections. These tests are defined in a named code block. They are used to unit test the functions I created in a document and they are also used as API usage examples. Each of the unit tests blocks is aggregated in a test suites in the Unit Tests sub-section of the Complete Namespace Definition section (see below).

These unit tests are executed directly into the Org-mode file while I am developing them. This means that any issues will be caught right away and fixed in the code of that section. Also, if the code is updated in the future, these unit tests will be re-executed right away and any issues will be output directly in the Org-mode file without having to switch to any other testing facilities.

Complete Namespace Definition

A the end of each literate programming file, I do create a Complete Namespace Definition section where I outline how the tangled code will be ordered in the generated source file. Generally we don’t want to export this section into the weaved document, so I define it with the :noexport: markup at the end of the section name.

This section is where I define the header of my source files (usually the namespace declaration, import statements and such), where I order code to be tangled and where we define the code block header parameters related to tangling the code into the source code files.

It is in this section that you will understand why it is important to spend some time properly naming your code blocks in your file: since it is these names that will appear in this section that will make the outline of our code understandable.

There are 3 header parameters that I normally use for that code block:

  1. :tangle ../../../
  2. :mkdirp yes
  3. :noweb yes
  4. :results silent

First we don’t want to output anything in the Org-mode file after executing the code block, so we put :results to silent. Then we want to use the WEB markup in the code block, so we define the parameter :noweb to yes. Then if one of the folders specified by :tangle is not existing, we want Org-mode to create it for us instead of failing. Finally, the :tangle parameter is defined with the path where the tangled document will be written to the file system. The location of the source file will comply with the structure of your application.

The code block of the Unit tests sub section will be tangled in the unit test folder of your project.

Structure Navigation & Conclusion

One of the benefits of writing application using Literate Programming principles is that what we end-up doing is to create a much more human readable outline of an application. We end-up creating sections, subsections, etc. just like when you write an article, or a book, where you create sections and subsections where each of the sections focus on the thing your are writing about. To me, this is a much more natural work style to solve a problem. It is also much easier to share with non-developers who need to understand how your applications behave. To these people, it is like reading a scientific article grounded into a mathematic framework: if you are not a mathematician (and even if you are but are not familiar with the concepts discussed in the article) you will most likely (at least for a first read) read the article, understand its structure and idea, but you will skip the boxes where you have the equations. Here the same minding applies, it is just that non-developers will skip the boxes were there is the code. But in any case, he should be able to understand what you are doing, the problem you are trying to solve, and how you are trying to solve it.

This is why the structure that gets created when developing applications is quite interesting and beneficial. This structure is what I really like with Org-mode, which at its core is nothing other than a plain text outliner. This means that Org-mode has several features to help you manipulate and navigate the outline structure of a text file. Like conventional programming with an IDE where you can extend and collapse blocks of code, with Org-mode you can extend and collapse the outline of the document (created by the sections and subsections of your files).

This is quite powerful since you can focus on a series of functions that solve a particular problem just by extending its section and collapsing all others. You can even display only the content of that section into the Emacs buffer by using C-x n s to focus on a Org-mode region and C-x n w to unfocus that region. This means that even if you have a single file with several thousand of lines it doesn’t really matter since you can see any section of that file like if it was its own tiny file. This may be appealing to developers that don’t like a proliferation of files in their projects (in fact, they could end up with a single master well structured Org-mode file that gets tangled as multiple source code files).

Clojure, Emacs, Literate Programming, Programming

Literate [Clojure] Programming Using Org-mode

Literate Programming is a great way to write computer software, particularly in fields like data science where data processing workflows are complex and often need much background information. I started to write about Literate Programming a few months ago, and now it is the time to formalize how I create Literate Programming applications.

This is the first post of a series of blog posts that will cover the full workflow. I will demonstrate how I do Literate Programming for developing a Clojure application, but exactly the same workflow would work for any other programming language supported by Org-mode (Python, R, etc.). The only thing that is required is to adapt the principles to the project structures in these other languages. The series of blog posts will cover:

  1. Project folder structure (this post)
  2. Anatomy of a Org-mode file
  3. Tangling all project files
  4. Publishing documentation in multiple formats
  5. Unit Testing

Clojure Project Folder Structure

The structure of a programming project can vary a lot. The structure I am using when developing Clojure is the one created by Leiningen which I use for creating and managing my Clojure projects. The structure of a simple project (in this case, the org-mode-clj-tests-utils project that I created for another blog post) looks like this:

- CHANGELOG.md
- LICENCE
- README.md
- resources
- pom.xml
- project.clj
- src
  - org_mode_clj_tests_utils
    - core.clj
- target
- test
  - org_mode_clj_tests_utils
    - core_test.clj

There are 4 main components to this structure:

  1. the project.clj file which is used by Leiningen to configure the project
  2. the src folder where the project’s code files [to be compiled] are located
  3. the target folder is where the compiled files will be available, and
  4. the test folder where the unit tests for the code sources are located

This kind of project outline is really simple and typical. Now let’s see what the structure would look like if this project would be created using Literate [Clojure] Programming.

Literate Clojure Folder Structure

The best way and cleanest way I found to create and manage the Org-mode files is to create a org directory at the same level as the src one. Then to replicate the same folder structure that exists in the src folder. The names of the source files should be the same except that they have the .org file extension. For example, the src/core.clj file would become org/core.org in the Org-mode folder, and the org/core.org file is used to tangle (create) the src/core.clj file.

The new structure would look like that:

- CHANGELOG.md
- LICENCE
- README.md
- resources
- org
  - project.org
  - org_mode_clj_tests_utils
    - core.org
- pom.xml
- project.clj
- src
  - org_mode_clj_tests_utils
    - core.clj
- target
- test
  - org_mode_clj_tests_utils
    - core_test.clj

The idea here is that all the files that needs to be modified related to the project would become a Org-mode file. Such files are the code source files, the test files, possible other documentation files and the project.clj file. When the Org-mode files will be tangled, then all the appropriate files, required by the Clojure project would be generated.

Anything I am writing for this project comes from a Org-mode file. All the development occurs in Org-mode. If someone would want to modify such a Literate Clojure application, then they would have to modify the Org-mode file and not the source files otherwise the changes would be overwritten by the next tangling operation.

Utilities Org-mode Files

Finally, I created a series of Org-mode files that are used to perform special tasks such as:

  1. Tangling all project files at once, and
  2. Publishing documentation in multiple formats

These are Org-mode files that can be executed to perform these tasks. In the case of tangling all project files at once, it would be necessary to use it if you haven’t changed the behavior of your Emacs to automatically tangle files on save.

The second file is to publish weaved documentation in multiple different formats (HTML, LaTeX, etc.) as required, all at once.

These two files are directly located into the /org/ folder. I will explain how they work in a subsequent post in that series. The final structure of a Literate Clojure project is:

- CHANGELOG.md
- LICENCE
- README.md
- resources
- org
  - project.org
  - publish.org
  - tangle-all.org
  - setup.org
  - org_mode_clj_tests_utils
    - core.org
- pom.xml
- project.clj
- src
  - org_mode_clj_tests_utils
    - core.clj
- target
- test
  - org_mode_clj_tests_utils
    - core_test.clj

Conclusion

As you can see, a Literate Clojure application is not much different. The way to program such an application is more profound than the small changes that occur at the level of the folder structure.

There is still an open question related to publishing this kind of Literate work on repositories such as Git: should only the org folder be added to a Git repository, or should we also add the files that get tangled as well? In an ideal World, only the org files would need to go into the repository. However, depending on the nature of work (work only accessible by you, or work accessible by a group of people that know Org-mode, or making the project public on GitHub, etc.) we may have to commit the tangled files too. In the case of an open source project, I think it is required since many people unfamiliar with Org-mode won’t be able to use the codebase because they won’t be able to tangle it from the Org files. For this specific reason, I tend to publish the org files along with all the files that get tangled from them. That way I am sure that even if the users of the library doesn’t know anything about Org-mode or Literate Programming they could still use the code. The only thing I try to take care of is to commit the Org file and the tangled file related to a specific change in the same commit, and I try not to create two commits, one for each file.

The next blog post of that series will explain how the Org-mode source files are actually created, what is their internal structure, how they are organized and used.