{"id":3344,"date":"2016-09-28T15:27:28","date_gmt":"2016-09-28T19:27:28","guid":{"rendered":"http:\/\/fgiasson.com\/blog\/?p=3344"},"modified":"2016-11-17T13:42:14","modified_gmt":"2016-11-17T18:42:14","slug":"using-cognonto-to-generate-domain-specific-word2vec-models","status":"publish","type":"post","link":"https:\/\/fgiasson.com\/blog\/index.php\/2016\/09\/28\/using-cognonto-to-generate-domain-specific-word2vec-models\/","title":{"rendered":"Using Cognonto to Generate Domain Specific word2vec Models"},"content":{"rendered":"<p><a href=\"https:\/\/en.wikipedia.org\/wiki\/Word2vec\">word2vec<\/a> is a two-layer <a href=\"https:\/\/en.wikipedia.org\/wiki\/Artificial_neural_network\">artificial neural network<\/a> that processes a text corpus to learn the relationships between its words and to create a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Mathematical_model\">model<\/a> of all of those relationships. The text corpus that a word2vec process uses to learn the relationships between words is called the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Test_set\">training corpus<\/a>.<\/p>\n<p>In this article I will show you how <a href=\"http:\/\/cognonto.com\/\">Cognonto<\/a>&#8217;s knowledge base can be used to automatically create highly accurate domain-specific training corpuses that word2vec can use to generate word relationship models. Note, however, that what is discussed here applies not only to word2vec but to any method that uses corpuses of text for training. 
For example, in another article, I will show how this can be done with another algorithm called ESA (Explicit Semantic Analysis).<\/p>\n<p>It is said about word2vec that &#8220;given enough data, usage and contexts, word2vec can make highly accurate guesses about a word&#8217;s meaning based on past appearances.&#8221; What I will show in this article is how to determine that context, and we will see how this impacts the results.<\/p>\n<p><!--more--><\/p>\n<div id=\"outline-container-orgheadline1\" class=\"outline-2\">\n<h2 id=\"orgheadline1\">Training Corpus<\/h2>\n<div id=\"text-orgheadline1\" class=\"outline-text-2\">\n<p>A training corpus is really just a set of text used to train unsupervised machine learning algorithms. Any kind of text can be used by word2vec. The only thing it does is to learn the relationships between the words that exist in the text. However, not all training corpuses are equal. Training corpuses are often dirty, biased and ambiguous. Depending on the task at hand, that may be exactly what is required, but more often than not, their errors need to be fixed. Cognonto has the advantage of starting with clean text.<\/p>\n<p>When we want to create a new training corpus, the first step is to find a source of text that could work to create that corpus. The second step is to select the text we want to add to it. The third step is to pre-process that corpus of text with operations such as: removing HTML elements; removing punctuation; normalizing text; detecting named entities; etc. The final step is to train word2vec to generate the model.<\/p>\n<p>word2vec is somewhat dumb. It only learns what exists in the training corpus. It does not do anything other than &#8220;reading&#8221; the text and analyzing the relationships between the words (which are really just groups of characters separated by spaces). 
The word2vec process is highly subject to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Garbage_in,_garbage_out\">Garbage In, Garbage Out<\/a> principle, which means that if the training set is dirty, biased and ambiguous, then the learned relationships will end up being of little or no value.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline2\" class=\"outline-2\">\n<h2 id=\"orgheadline2\">Domain-specific Training Corpus<\/h2>\n<div id=\"text-orgheadline2\" class=\"outline-text-2\">\n<p>A domain-specific training corpus is a specialized training corpus whose text is related to a specific domain. Examples of domains are music, mathematics, cars, healthcare, etc. In contrast, a general training corpus is a corpus of text that may contain text that discusses totally different domains. By creating a corpus of text that covers a specific domain of interest, we limit the usage of words (that is, their co-occurrences) to texts that are meaningful to that domain.<\/p>\n<p>As we will see in this article, a domain-specific training corpus can be quite useful, and much more powerful than a general one, if the task at hand relates to a specific domain of expertise. The major problem with domain-specific training corpuses is that they are really costly to create. We not only have to find the source of data to use, but we also have to select each document that we want to include in the training corpus. This can work if we want a corpus with 100 or 200 documents, but what if we want a training corpus of 100,000 or 200,000 documents? Then it becomes a problem.<\/p>\n<p>It is the kind of problem that Cognonto helps to resolve. KBpedia, Cognonto&#8217;s knowledge base, is a set of ~39,000 reference concepts that have ~138,000 links to the schemas of external data sources such as Wikipedia, Wikidata and USPTO. It is that structure and these links to external data sources that we use to create domain-specific training corpuses <i>on the fly<\/i>. 
We leverage the reference concept structure to select all of the concepts that should be part of the domain that is being defined. Then we use Cognonto&#8217;s inference capabilities to infer all the other hundreds or thousands of concepts that define the full scope of the domain. Then we analyze the hundreds or thousands of concepts we selected that way to get all of the links to external data sources. Finally, we use these references to create the training corpus. All of this is done automatically once the initial few concepts that define the domain have been selected. The workflow looks like this:<\/p>\n<div class=\"figure\">\n<p><a href=\"https:\/\/fgiasson.com\/blog\/wp-content\/uploads\/2016\/09\/cognonto-workflow.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-3347\" src=\"https:\/\/fgiasson.com\/blog\/wp-content\/uploads\/2016\/09\/cognonto-workflow.png\" alt=\"cognonto-workflow\" width=\"488\" height=\"440\" srcset=\"https:\/\/fgiasson.com\/blog\/wp-content\/uploads\/2016\/09\/cognonto-workflow.png 488w, https:\/\/fgiasson.com\/blog\/wp-content\/uploads\/2016\/09\/cognonto-workflow-300x270.png 300w\" sizes=\"auto, (max-width: 488px) 100vw, 488px\" \/><\/a><\/p>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline3\" class=\"outline-2\">\n<h2 id=\"orgheadline3\">The Process<\/h2>\n<div id=\"text-orgheadline3\" class=\"outline-text-2\">\n<p>To show you how this process works, I will create a domain-specific training set about musicians using Cognonto. Then I will use the Google News word2vec model, which Google trained on about 100 billion words. The Google model contains 300-dimensional vectors for 3 million words and phrases. 
I will use the Google News model as the general model and compare its results and performance with those of the domain-specific model.<\/p>\n<\/div>\n<div id=\"outline-container-orgheadline4\" class=\"outline-3\">\n<h3 id=\"orgheadline4\">Determining the Domain<\/h3>\n<div id=\"text-orgheadline4\" class=\"outline-text-3\">\n<p>The first step is to define the scope of the domain we want to create. For this article, I want a domain that is somewhat constrained, so that the training corpus is not too large for demo purposes. The domain I have chosen is <code>musicians<\/code>. This domain is related to people and bands that play music. It is also related to musical genres, instruments, the music industry, etc.<\/p>\n<p>To create my domain, I select a single KBpedia reference concept: <a href=\"http:\/\/cognonto.com\/knowledge-graph\/reference-concept\/?uri=Musician\">Musician<\/a>. If I had wanted to broaden the scope of the domain, I could have included other concepts such as: <a href=\"http:\/\/cognonto.com\/knowledge-graph\/reference-concept\/?uri=Music\">Music<\/a>, <a href=\"http:\/\/cognonto.com\/knowledge-graph\/reference-concept\/?uri=MusicPerformanceOrganization\">Musical Group<\/a>, <a href=\"http:\/\/cognonto.com\/knowledge-graph\/reference-concept\/?uri=MusicalInstrument\">Musical Instrument<\/a>, etc.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline5\" class=\"outline-3\">\n<h3 id=\"orgheadline5\">Aggregating the Domain-specific Training Corpus<\/h3>\n<div id=\"text-orgheadline5\" class=\"outline-text-3\">\n<p>Once we have determined the scope of the domain, the next step is to query the KBpedia knowledge base to aggregate all of the text that will belong to that training corpus. 
The end result of this operation is a training corpus whose text is related only to the scope of the domain we defined.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">defn<\/span> <span style=\"color: #a6e22e;\">create-domain-specific-training-set<\/span>\n  <span style=\"color: #66d9ef;\">[<\/span>target-kbpedia-class corpus-file<span style=\"color: #66d9ef;\">]<\/span>\n  <span style=\"color: #66d9ef;\">(<\/span><span style=\"color: #f92672;\">let<\/span> <span style=\"color: #a6e22e;\">[<\/span>step 1000\n        entities-dataset <span style=\"color: #e6db74;\">\"http:\/\/kbpedia.org\/knowledge-base\/\"<\/span>\n        kbpedia-dataset <span style=\"color: #e6db74;\">\"http:\/\/kbpedia.org\/kko\/\"<\/span>\n        nb-entities <span style=\"color: #e6db74;\">(<\/span>get-nb-entities-for-class-ws target-kbpedia-class entities-dataset kbpedia-dataset<span style=\"color: #e6db74;\">)<\/span><span style=\"color: #a6e22e;\">]<\/span>\n    <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #f92672;\">loop<\/span> <span style=\"color: #e6db74;\">[<\/span>nb 0\n           nb-processed 1<span style=\"color: #e6db74;\">]<\/span>\n      <span style=\"color: #e6db74;\">(<\/span><span style=\"color: #f92672;\">when<\/span> <span style=\"color: #fd971f;\">(<\/span>&lt; nb nb-entities<span style=\"color: #fd971f;\">)<\/span>\n        <span style=\"color: #fd971f;\">(<\/span><span style=\"color: #f92672;\">doseq<\/span> <span style=\"color: #f92672;\">[<\/span>entity <span style=\"color: #ae81ff;\">(<\/span>get-entities-slice target-kbpedia-class entities-dataset kbpedia-dataset <span style=\"color: #ae81ff;\">:limit<\/span> step <span style=\"color: #ae81ff;\">:offset<\/span> nb<span style=\"color: #ae81ff;\">)<\/span><span style=\"color: #f92672;\">]<\/span>          \n          <span style=\"color: #f92672;\">(<\/span>spit corpus-file <span 
style=\"color: #ae81ff;\">(<\/span>str <span style=\"color: #66d9ef;\">(<\/span>get-entity-content entity<span style=\"color: #66d9ef;\">)<\/span> <span style=\"color: #e6db74;\">\"\\n\"<\/span><span style=\"color: #ae81ff;\">)<\/span> <span style=\"color: #ae81ff;\">:append<\/span> <span style=\"color: #ae81ff;\">true<\/span><span style=\"color: #f92672;\">)<\/span>\n          <span style=\"color: #f92672;\">(<\/span>println <span style=\"color: #ae81ff;\">(<\/span>str nb-processed <span style=\"color: #e6db74;\">\"\/\"<\/span> nb-entities<span style=\"color: #ae81ff;\">)<\/span><span style=\"color: #f92672;\">)<\/span><span style=\"color: #fd971f;\">)<\/span>\n        <span style=\"color: #fd971f;\">(<\/span><span style=\"color: #f92672;\">recur<\/span> <span style=\"color: #f92672;\">(<\/span>+ nb step<span style=\"color: #f92672;\">)<\/span>\n               <span style=\"color: #f92672;\">(<\/span>inc nb-processed<span style=\"color: #f92672;\">)<\/span><span style=\"color: #fd971f;\">)<\/span><span style=\"color: #e6db74;\">)<\/span><span style=\"color: #a6e22e;\">)<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n\n<span style=\"color: #ae81ff;\">(<\/span>create-domain-specific-training-set <span style=\"color: #e6db74;\">\"http:\/\/kbpedia.org\/kko\/rc\/Musician\"<\/span> <span style=\"color: #e6db74;\">\"resources\/musicians-corpus.txt\"<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<p>What this code does is to query the KBpedia knowledge base to get all the named entities that are linked to it, for the scope of the domain we defined. Then the text related to each entity is appended to a text file where each line is the text of a single entity.<\/p>\n<p>Given the scope of the current demo, the musicians training corpus is composed of <code>47,263<\/code> documents. This is the crux of the demo. 
With a simple function, we are able to aggregate 47,263 text documents highly related to a conceptual domain we defined on the fly. All of the hard work has been delegated to the knowledge base and its conceptual structure (in fact, this simple function <a href=\"https:\/\/fgiasson.com\/blog\/index.php\/2016\/09\/21\/cognonto\/\">leverages 8 years of hard work<\/a>).<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline6\" class=\"outline-3\">\n<h3 id=\"orgheadline6\">Normalizing Text<\/h3>\n<div id=\"text-orgheadline6\" class=\"outline-text-3\">\n<p>The next step is a natural one in any <a href=\"https:\/\/en.wikipedia.org\/wiki\/Natural_language_processing\">NLP<\/a> pipeline. Before learning from the training corpus, we should clean and normalize the text from its raw form.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">defn<\/span> <span style=\"color: #a6e22e;\">normalize-proper-name<\/span>\n  <span style=\"color: #66d9ef;\">[<\/span>name<span style=\"color: #66d9ef;\">]<\/span>\n  <span style=\"color: #66d9ef;\">(<\/span><span style=\"color: #f92672;\">-&gt;<\/span> name\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace #<span style=\"color: #e6db74;\">\" \"<\/span> <span style=\"color: #e6db74;\">\"_\"<\/span><span style=\"color: #a6e22e;\">)<\/span>      \n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>lower-case<span style=\"color: #a6e22e;\">)<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n\n<span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">defn<\/span> <span style=\"color: #a6e22e;\">pre-process-line<\/span>\n  <span style=\"color: #66d9ef;\">[<\/span>line<span style=\"color: #66d9ef;\">]<\/span> 
 \n  <span style=\"color: #66d9ef;\">(<\/span><span style=\"color: #f92672;\">-&gt;<\/span> <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #f92672;\">let<\/span> <span style=\"color: #e6db74;\">[<\/span>line <span style=\"color: #fd971f;\">(<\/span><span style=\"color: #f92672;\">-&gt;<\/span> line\n                     <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">1. remove all underscores<\/span>\n                     <span style=\"color: #f92672;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace <span style=\"color: #e6db74;\">\"_\"<\/span> <span style=\"color: #e6db74;\">\" \"<\/span><span style=\"color: #f92672;\">)<\/span><span style=\"color: #fd971f;\">)<\/span><span style=\"color: #e6db74;\">]<\/span>\n        <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">2. detect named entities and change them with their underscore form, like: Fred Giasson -&gt; fred_giasson<\/span>\n        <span style=\"color: #e6db74;\">(<\/span><span style=\"color: #f92672;\">loop<\/span> <span style=\"color: #fd971f;\">[<\/span>entities <span style=\"color: #f92672;\">(<\/span>into <span style=\"color: #ae81ff;\">[]<\/span> <span style=\"color: #ae81ff;\">(<\/span>re-seq #<span style=\"color: #e6db74;\">\"[\\p{Lu}]<\/span><span style=\"color: #e6db74;\">(<\/span><span style=\"color: #e6db74;\">[\\p{Ll}]+<\/span><span style=\"color: #e6db74;\">|<\/span><span style=\"color: #e6db74;\">\\.<\/span><span style=\"color: #e6db74;\">)(?:<\/span><span style=\"color: #e6db74;\">\\s+[\\p{Lu}]<\/span><span style=\"color: #e6db74;\">(<\/span><span style=\"color: #e6db74;\">[\\p{Ll}]+<\/span><span style=\"color: #e6db74;\">|<\/span><span style=\"color: #e6db74;\">\\.<\/span><span style=\"color: #e6db74;\">))<\/span><span style=\"color: #e6db74;\">*<\/span><span style=\"color: 
#e6db74;\">(?:<\/span><span style=\"color: #e6db74;\">\\s+[\\p{Ll}][\\p{Ll}\\-]{1,3}<\/span><span style=\"color: #e6db74;\">)<\/span><span style=\"color: #e6db74;\">{0,1}\\s+[\\p{Lu}]<\/span><span style=\"color: #e6db74;\">(<\/span><span style=\"color: #e6db74;\">[\\p{Ll}]+<\/span><span style=\"color: #e6db74;\">|<\/span><span style=\"color: #e6db74;\">\\.<\/span><span style=\"color: #e6db74;\">)<\/span><span style=\"color: #e6db74;\">\"<\/span> line<span style=\"color: #ae81ff;\">)<\/span><span style=\"color: #f92672;\">)<\/span>\n               line line<span style=\"color: #fd971f;\">]<\/span>\n          <span style=\"color: #fd971f;\">(<\/span><span style=\"color: #f92672;\">if<\/span> <span style=\"color: #f92672;\">(<\/span>empty? entities<span style=\"color: #f92672;\">)<\/span>\n            line\n            <span style=\"color: #f92672;\">(<\/span><span style=\"color: #f92672;\">let<\/span> <span style=\"color: #ae81ff;\">[<\/span>entity <span style=\"color: #66d9ef;\">(<\/span>first <span style=\"color: #a6e22e;\">(<\/span>first entities<span style=\"color: #a6e22e;\">)<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">]<\/span>\n              <span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">recur<\/span> <span style=\"color: #66d9ef;\">(<\/span>rest entities<span style=\"color: #66d9ef;\">)<\/span>                     \n                     <span style=\"color: #66d9ef;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace line entity <span style=\"color: #a6e22e;\">(<\/span>normalize-proper-name entity<span style=\"color: #a6e22e;\">)<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span><span style=\"color: #f92672;\">)<\/span><span style=\"color: #fd971f;\">)<\/span><span style=\"color: #e6db74;\">)<\/span><span style=\"color: #a6e22e;\">)<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span 
style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace <span style=\"color: #e6db74;\">(<\/span>re-pattern stop-list<span style=\"color: #e6db74;\">)<\/span> <span style=\"color: #e6db74;\">\" \"<\/span><span style=\"color: #a6e22e;\">)<\/span>\n      <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">4. remove everything between brackets like: [1] [edit] [show]<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace #<span style=\"color: #e6db74;\">\"\\[.*\\]\"<\/span> <span style=\"color: #e6db74;\">\" \"<\/span><span style=\"color: #a6e22e;\">)<\/span>\n      <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">5. punctuation characters except the dot and the single quote, replace by nothing: (),[]-={}\/\\~!?%$@&amp;*+:;&lt;&gt;<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace #<span style=\"color: #e6db74;\">\"[\\^\\<\/span><span style=\"color: #e6db74;\">(<\/span><span style=\"color: #e6db74;\">\\<\/span><span style=\"color: #e6db74;\">)<\/span><span style=\"color: #e6db74;\">\\,\\[\\]\\=\\{\\}\\\/\\\\\\~\\!\\?\\%\\$\\@\\&amp;\\*\\+:\\;\\&lt;\\&gt;\\\"\\p{Pd}]\"<\/span> <span style=\"color: #e6db74;\">\" \"<\/span><span style=\"color: #a6e22e;\">)<\/span>\n      <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">6. 
remove all numbers<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace #<span style=\"color: #e6db74;\">\"[0-9]\"<\/span> <span style=\"color: #e6db74;\">\" \"<\/span><span style=\"color: #a6e22e;\">)<\/span>\n      <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">7. remove all words with 2 characters or less<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace #<span style=\"color: #e6db74;\">\"\\b[\\p{L}]{0,2}\\b\"<\/span> <span style=\"color: #e6db74;\">\" \"<\/span><span style=\"color: #a6e22e;\">)<\/span>\n      <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">10. normalize spaces<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace #<span style=\"color: #e6db74;\">\"\\s{2,}\"<\/span> <span style=\"color: #e6db74;\">\" \"<\/span><span style=\"color: #a6e22e;\">)<\/span>\n      <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">11. normalize dots with spaces<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace #<span style=\"color: #e6db74;\">\"\\s\\.\"<\/span> <span style=\"color: #e6db74;\">\".\"<\/span><span style=\"color: #a6e22e;\">)<\/span>\n      <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">12. 
normalize dots<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace #<span style=\"color: #e6db74;\">\"\\.{1,}\"<\/span> <span style=\"color: #e6db74;\">\".\"<\/span><span style=\"color: #a6e22e;\">)<\/span>\n      <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">13. normalize underscores<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace #<span style=\"color: #e6db74;\">\"\\_{1,}\"<\/span> <span style=\"color: #e6db74;\">\"_\"<\/span><span style=\"color: #a6e22e;\">)<\/span>\n      <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">14. remove standalone single quotes<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace <span style=\"color: #e6db74;\">\" ' \"<\/span> <span style=\"color: #e6db74;\">\" \"<\/span><span style=\"color: #a6e22e;\">)<\/span>\n      <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">15. re-normalize spaces<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>replace #<span style=\"color: #e6db74;\">\"\\s{2,}\"<\/span> <span style=\"color: #e6db74;\">\" \"<\/span><span style=\"color: #a6e22e;\">)<\/span>        \n      <span style=\"color: #75715e; font-style: italic;\">;; <\/span><span style=\"color: #75715e; font-style: italic;\">16. 
put everything lowercase<\/span>\n      <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">string<\/span><span style=\"color: #66d9ef;\">\/<\/span>lower-case<span style=\"color: #a6e22e;\">)<\/span>\n\n      <span style=\"color: #a6e22e;\">(<\/span>str <span style=\"color: #e6db74;\">\"\\n\"<\/span><span style=\"color: #a6e22e;\">)<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n\n<span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">defn<\/span> <span style=\"color: #a6e22e;\">pre-process-corpus<\/span>\n  <span style=\"color: #66d9ef;\">[<\/span>in-file out-file<span style=\"color: #66d9ef;\">]<\/span>\n  <span style=\"color: #66d9ef;\">(<\/span>spit out-file <span style=\"color: #e6db74;\">\"\"<\/span> <span style=\"color: #ae81ff;\">:append<\/span> <span style=\"color: #ae81ff;\">true<\/span><span style=\"color: #66d9ef;\">)<\/span>\n  <span style=\"color: #66d9ef;\">(<\/span><span style=\"color: #f92672;\">with-open<\/span> <span style=\"color: #a6e22e;\">[<\/span>file <span style=\"color: #e6db74;\">(<\/span><span style=\"color: #66d9ef;\">clojure.java.io<\/span><span style=\"color: #66d9ef;\">\/<\/span>reader in-file<span style=\"color: #e6db74;\">)<\/span><span style=\"color: #a6e22e;\">]<\/span>\n    <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #f92672;\">doseq<\/span> <span style=\"color: #e6db74;\">[<\/span>line <span style=\"color: #fd971f;\">(<\/span>line-seq file<span style=\"color: #fd971f;\">)<\/span><span style=\"color: #e6db74;\">]<\/span>\n      <span style=\"color: #e6db74;\">(<\/span>spit out-file <span style=\"color: #fd971f;\">(<\/span>pre-process-line line<span style=\"color: #fd971f;\">)<\/span> <span style=\"color: #ae81ff;\">:append<\/span> <span style=\"color: #ae81ff;\">true<\/span><span style=\"color: #e6db74;\">)<\/span><span style=\"color: #a6e22e;\">)<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: 
#ae81ff;\">)<\/span>\n\n<span style=\"color: #ae81ff;\">(<\/span>pre-process-corpus <span style=\"color: #e6db74;\">\"resources\/musicians-corpus.txt\"<\/span> <span style=\"color: #e6db74;\">\"resources\/musicians-corpus.clean.txt\"<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<p>We remove all of the characters that may cause issues for the tokenizer used by the word2vec implementation. We also remove unnecessary words and other words that appear too often or that add nothing to the model we want to generate (such as the names of days and months). We also drop all numbers.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline7\" class=\"outline-3\">\n<h3 id=\"orgheadline7\">Training word2vec<\/h3>\n<div id=\"text-orgheadline7\" class=\"outline-text-3\">\n<p>The last step is to train word2vec on our clean domain-specific training corpus to generate the model we will use. For this demo, I will use the <a href=\"http:\/\/deeplearning4j.org\/\">DL4J<\/a> (Deep Learning for Java) library, which includes a Java implementation of the word2vec algorithm. 
Training word2vec is as simple as using the DL4J API like this:<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">defn<\/span> <span style=\"color: #a6e22e;\">train<\/span>\n  <span style=\"color: #66d9ef;\">[<\/span>training-set-file model-file<span style=\"color: #66d9ef;\">]<\/span>\n  <span style=\"color: #66d9ef;\">(<\/span><span style=\"color: #f92672;\">let<\/span> <span style=\"color: #a6e22e;\">[<\/span>sentence-iterator <span style=\"color: #e6db74;\">(<\/span><span style=\"color: #f92672;\">new<\/span> <span style=\"color: #66d9ef;\">LineSentenceIterator<\/span> <span style=\"color: #fd971f;\">(<\/span><span style=\"color: #66d9ef;\">clojure.java.io<\/span><span style=\"color: #66d9ef;\">\/<\/span>file training-set-file<span style=\"color: #fd971f;\">)<\/span><span style=\"color: #e6db74;\">)<\/span>\n        tokenizer <span style=\"color: #e6db74;\">(<\/span><span style=\"color: #f92672;\">new<\/span> <span style=\"color: #66d9ef;\">DefaultTokenizerFactory<\/span><span style=\"color: #e6db74;\">)<\/span>\n        vec <span style=\"color: #e6db74;\">(<\/span><span style=\"color: #f92672;\">..<\/span> <span style=\"color: #fd971f;\">(<\/span><span style=\"color: #f92672;\">new<\/span> <span style=\"color: #66d9ef;\">Word2Vec$Builder<\/span><span style=\"color: #fd971f;\">)<\/span>\n                <span style=\"color: #fd971f;\">(<\/span><span style=\"color: #f92672;\">minWordFrequency<\/span> 1<span style=\"color: #fd971f;\">)<\/span>\n                <span style=\"color: #fd971f;\">(<\/span><span style=\"color: #f92672;\">windowSize<\/span> 5<span style=\"color: #fd971f;\">)<\/span>\n                <span style=\"color: #fd971f;\">(<\/span><span style=\"color: #f92672;\">layerSize<\/span> 100<span style=\"color: #fd971f;\">)<\/span>\n                <span style=\"color: #fd971f;\">(<\/span>iterate sentence-iterator<span style=\"color: #fd971f;\">)<\/span>\n    
            <span style=\"color: #fd971f;\">(<\/span><span style=\"color: #f92672;\">tokenizerFactory<\/span> tokenizer<span style=\"color: #fd971f;\">)<\/span>\n                build<span style=\"color: #e6db74;\">)<\/span><span style=\"color: #a6e22e;\">]<\/span>\n    <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #f92672;\">.fit<\/span> vec<span style=\"color: #a6e22e;\">)<\/span>\n    <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">SerializationUtils<\/span><span style=\"color: #66d9ef;\">\/<\/span><span style=\"color: #f92672;\">saveObject<\/span> vec <span style=\"color: #e6db74;\">(<\/span><span style=\"color: #66d9ef;\">clojure.java.io<\/span><span style=\"color: #66d9ef;\">\/<\/span>file model-file<span style=\"color: #e6db74;\">)<\/span><span style=\"color: #a6e22e;\">)<\/span>\n    vec<span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n\n<span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">def<\/span> <span style=\"color: #fd971f;\">musicians-model<\/span> <span style=\"color: #66d9ef;\">(<\/span>train <span style=\"color: #e6db74;\">\"resources\/musicians-corpus.clean.txt\"<\/span> <span style=\"color: #e6db74;\">\"resources\/musicians-corpus.model\"<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<p>What is important to notice here is the number of parameters that can be defined to train word2vec on a corpus. In fact, that algorithm can be sensitive to parametrization.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline8\" class=\"outline-3\">\n<h3 id=\"orgheadline8\">Importing the General Model<\/h3>\n<div id=\"text-orgheadline8\" class=\"outline-text-3\">\n<p>The goal of this demo is to demonstrate the difference between a domain-specific model and a general model. Remember that the general model we chose is the Google News model, which was trained on billions of words but is highly general. 
DL4J can import that model without having to generate it ourselves (in fact, only the model is distributed by Google, not the training corpus):<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">defn<\/span> <span style=\"color: #a6e22e;\">import-google-news-model<\/span>\n  <span style=\"color: #66d9ef;\">[]<\/span>\n  <span style=\"color: #66d9ef;\">(<\/span><span style=\"color: #66d9ef;\">org.deeplearning4j.models.embeddings.loader.WordVectorSerializer<\/span><span style=\"color: #66d9ef;\">\/<\/span><span style=\"color: #f92672;\">loadGoogleModel<\/span> <span style=\"color: #a6e22e;\">(<\/span><span style=\"color: #66d9ef;\">clojure.java.io<\/span><span style=\"color: #66d9ef;\">\/<\/span>file <span style=\"color: #e6db74;\">\"GoogleNews-vectors-negative300.bin.gz\"<\/span><span style=\"color: #a6e22e;\">)<\/span> <span style=\"color: #ae81ff;\">true<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n\n<span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">def<\/span> <span style=\"color: #fd971f;\">google-news-model<\/span> <span style=\"color: #66d9ef;\">(<\/span>import-google-news-model<span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline9\" class=\"outline-2\">\n<h2 id=\"orgheadline9\">Playing With Models<\/h2>\n<div id=\"text-orgheadline9\" class=\"outline-text-2\">\n<p>Now that we have a domain-specific model related to <code>musicians<\/code> and a general model related to news processed by Google, let&#8217;s start playing with both to see how they perform on different tasks. 
In the following examples, we will always compare the domain-specific model with the general one.<\/p>\n<\/div>\n<div id=\"outline-container-orgheadline10\" class=\"outline-3\">\n<h3 id=\"orgheadline10\">Ambiguous Words<\/h3>\n<div id=\"text-orgheadline10\" class=\"outline-text-3\">\n<p>A characteristic of words is that their surface form can be ambiguous; they can have multiple meanings. An ambiguous word can co-occur with multiple other words that may not have any shared meaning. But all of this depends on the context. In a general context, this situation happens more often than we think, and it impacts the similarity score of these ambiguous terms. However, as we will see, this phenomenon is greatly diminished when we use domain-specific models.<\/p>\n<\/div>\n<div id=\"outline-container-orgheadline11\" class=\"outline-4\">\n<h4 id=\"orgheadline11\">Similarity Between Piano, Organ and Violin<\/h4>\n<div id=\"text-orgheadline11\" class=\"outline-text-4\">\n<p>What we want to check is the relationship between three musical instruments: <code>piano<\/code>, <code>organ<\/code> and <code>violin<\/code>. 
We will compare them two at a time.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.similarity<\/span> musicians-model <span style=\"color: #e6db74;\">\"piano\"<\/span> <span style=\"color: #e6db74;\">\"violin\"<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">0.8422856330871582\n<\/pre>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.similarity<\/span> musicians-model <span style=\"color: #e6db74;\">\"piano\"<\/span> <span style=\"color: #e6db74;\">\"organ\"<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">0.8573281764984131\n<\/pre>\n<p>As we can see, both pairs have a high likelihood of co-occurrence. This suggests that the terms of each pair are probably highly related. In this case, it is probably because violins are often played together with a piano. 
And it is probably because an organ looks like a piano (at least it has a keyboard).<\/p>\n<p>Now let&#8217;s take a look at what the general model has to say about that:<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.similarity<\/span> google-news-model <span style=\"color: #e6db74;\">\"piano\"<\/span> <span style=\"color: #e6db74;\">\"violin\"<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">0.8228187561035156\n<\/pre>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.similarity<\/span> google-news-model <span style=\"color: #e6db74;\">\"piano\"<\/span> <span style=\"color: #e6db74;\">\"organ\"<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">0.46168726682662964\n<\/pre>\n<p>The surprising fact here is the apparent dissimilarity between <code>piano<\/code> and <code>organ<\/code> compared with the results we got with the musicians domain-specific model. If we think a bit about this use case, we will probably conclude that these results make sense. In fact, <code>organ<\/code> is an ambiguous word in a general context. An organ can be a musical instrument, but it can also be a part of the body. This means that the word <code>organ<\/code> will co-occur with <code>piano<\/code>, but also with all kinds of other words related to human and animal biology. 
This is why the two words are less similar in the general model than in the domain-specific one: <code>organ<\/code> is ambiguous in a general context.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline12\" class=\"outline-4\">\n<h4 id=\"orgheadline12\">Similarity Between Album and Track<\/h4>\n<div id=\"text-orgheadline12\" class=\"outline-text-4\">\n<p>Now let&#8217;s see another similarity example with two other words, <code>album<\/code> and <code>track<\/code>, where <code>track<\/code> is ambiguous depending on the context.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.similarity<\/span> musicians-model <span style=\"color: #e6db74;\">\"album\"<\/span> <span style=\"color: #e6db74;\">\"track\"<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">0.7943775653839111\n<\/pre>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.similarity<\/span> google-news-model <span style=\"color: #e6db74;\">\"album\"<\/span> <span style=\"color: #e6db74;\">\"track\"<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">0.18461623787879944\n<\/pre>\n<p>As expected, because <code>track<\/code> is ambiguous, there is a big difference in terms of co-occurrence probabilities depending on the context (domain-specific or general).<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline13\" class=\"outline-4\">\n<h4 id=\"orgheadline13\">Similarity Between Pianist and Violinist<\/h4>\n<div id=\"text-orgheadline13\" class=\"outline-text-4\">\n<p>However, do the domain-specific and general models always differ? 
Let&#8217;s take a look at two words that are domain-specific and unambiguous: <code>pianist<\/code> and <code>violinist<\/code>.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.similarity<\/span> musicians-model <span style=\"color: #e6db74;\">\"pianist\"<\/span> <span style=\"color: #e6db74;\">\"violinist\"<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">0.8430571556091309\n<\/pre>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.similarity<\/span> google-news-model <span style=\"color: #e6db74;\">\"pianist\"<\/span> <span style=\"color: #e6db74;\">\"violinist\"<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">0.8616064190864563\n<\/pre>\n<p>In this case, the similarity score between the two terms is almost the same. In both contexts (general and domain-specific), their co-occurrence is similar.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline14\" class=\"outline-3\">\n<h3 id=\"orgheadline14\">Nearest Words<\/h3>\n<div id=\"text-orgheadline14\" class=\"outline-text-3\">\n<p>Now let&#8217;s look at a different kind of query: the words nearest to a given word in each of the two models. 
Let&#8217;s take a look at a few words and see what other words occur most often with them.<\/p>\n<\/div>\n<div id=\"outline-container-orgheadline15\" class=\"outline-4\">\n<h4 id=\"orgheadline15\">Music<\/h4>\n<div id=\"text-orgheadline15\" class=\"outline-text-4\">\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> musicians-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"music\"<\/span><span style=\"color: #66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[]<\/span> 7<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">music<\/td>\n<td class=\"org-left\">revol samoilovich bunin<\/td>\n<td class=\"org-left\">musical<\/td>\n<td class=\"org-left\">amalgamating<\/td>\n<td class=\"org-left\">assam.<\/td>\n<td class=\"org-left\">voice<\/td>\n<td class=\"org-left\">dance.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> google-news-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"music\"<\/span><span style=\"color: 
#66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[]<\/span> 8<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">music<\/td>\n<td class=\"org-left\">classical music<\/td>\n<td class=\"org-left\">jazz<\/td>\n<td class=\"org-left\">Music<\/td>\n<td class=\"org-left\">Without Donny Kirshner<\/td>\n<td class=\"org-left\">songs<\/td>\n<td class=\"org-left\">musicians<\/td>\n<td class=\"org-left\">tunes<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>One observation we can make is that the terms from the musicians model are more general than the ones from the general model.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline16\" class=\"outline-4\">\n<h4 id=\"orgheadline16\">Track<\/h4>\n<div id=\"text-orgheadline16\" class=\"outline-text-4\">\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> musicians-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"track\"<\/span><span style=\"color: #66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[]<\/span> 10<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" 
cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">track<\/td>\n<td class=\"org-left\">released.<\/td>\n<td class=\"org-left\">album<\/td>\n<td class=\"org-left\">latest<\/td>\n<td class=\"org-left\">entitled<\/td>\n<td class=\"org-left\">released<\/td>\n<td class=\"org-left\">debut<\/td>\n<td class=\"org-left\">year.<\/td>\n<td class=\"org-left\">titled<\/td>\n<td class=\"org-left\">positive<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> google-news-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"track\"<\/span><span style=\"color: #66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[]<\/span> 5<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" 
\/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">track<\/td>\n<td class=\"org-left\">tracks<\/td>\n<td class=\"org-left\">Track<\/td>\n<td class=\"org-left\">racetrack<\/td>\n<td class=\"org-left\">horseshoe shaped section<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As we know, <code>track<\/code> is ambiguous. The difference between these two sets of nearest related words is striking. There is a clear conceptual correlation in the musicians&#8217; domain-specific model. But in the general model, it is really going in all directions.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline17\" class=\"outline-4\">\n<h4 id=\"orgheadline17\">Year<\/h4>\n<div id=\"text-orgheadline17\" class=\"outline-text-4\">\n<p>Now let&#8217;s take a look at a really general word: <code>year<\/code><\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> musicians-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"year\"<\/span><span style=\"color: #66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[]<\/span> 11<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col 
class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">year<\/td>\n<td class=\"org-left\">ghantous.<\/td>\n<td class=\"org-left\">he was grammy<\/td>\n<td class=\"org-left\">naacap<\/td>\n<td class=\"org-left\">grammy award for best<\/td>\n<td class=\"org-left\">luces del alma<\/td>\n<td class=\"org-left\">year.<\/td>\n<td class=\"org-left\">grammy award<\/td>\n<td class=\"org-left\">grammy for best<\/td>\n<td class=\"org-left\">sitorai sol<\/td>\n<td class=\"org-left\">nominated<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> google-news-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"year\"<\/span><span style=\"color: #66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[]<\/span> 10<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">year<\/td>\n<td class=\"org-left\">month<\/td>\n<td class=\"org-left\">week<\/td>\n<td class=\"org-left\">months<\/td>\n<td class=\"org-left\">decade<\/td>\n<td class=\"org-left\">years<\/td>\n<td class=\"org-left\">summer<\/td>\n<td class=\"org-left\">year.The<\/td>\n<td 
class=\"org-left\">September<\/td>\n<td class=\"org-left\">weeks<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>This one is quite interesting too. Both groups of words make sense, but only in their respective contexts. With the musicians&#8217; model, <code>year<\/code> is mostly related to awards (like the Grammy Awards 2016), categories like &#8220;song of the year&#8221;, etc.<\/p>\n<p>In the context of the general model, <code>year<\/code> is really related to time concepts: months, seasons, etc.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline18\" class=\"outline-3\">\n<h3 id=\"orgheadline18\">Playing With Co-Occurrence Vectors<\/h3>\n<div id=\"text-orgheadline18\" class=\"outline-text-3\">\n<p>Finally, we will play with the co-occurrence vectors by adding and subtracting them. A really popular word2vec equation is <code>king - man + woman = queen<\/code>. What is happening under the hood with this equation is that we are adding and subtracting the co-occurrence vectors of these words, and then we look for the word nearest to the resulting vector.<\/p>\n<p>Now, let&#8217;s take a look at a few of these equations.<\/p>\n<\/div>\n<div id=\"outline-container-orgheadline19\" class=\"outline-4\">\n<h4 id=\"orgheadline19\">Pianist + Renowned = ?<\/h4>\n<div id=\"text-orgheadline19\" class=\"outline-text-4\">\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> musicians-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"pianist\"<\/span> <span style=\"color: #e6db74;\">\"renowned\"<\/span><span style=\"color: #66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[]<\/span> 9<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col 
class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">pianist<\/td>\n<td class=\"org-left\">renowned<\/td>\n<td class=\"org-left\">teacher.<\/td>\n<td class=\"org-left\">composer.<\/td>\n<td class=\"org-left\">prolific<\/td>\n<td class=\"org-left\">virtuoso<\/td>\n<td class=\"org-left\">teacher<\/td>\n<td class=\"org-left\">leading<\/td>\n<td class=\"org-left\">educator.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> google-news-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"pianist\"<\/span> <span style=\"color: #e6db74;\">\"renowned\"<\/span><span style=\"color: #66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[]<\/span> 7<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">renowned<\/td>\n<td class=\"org-left\">pianist<\/td>\n<td class=\"org-left\">pianist composer<\/td>\n<td class=\"org-left\">jazz pianist<\/td>\n<td class=\"org-left\">classical pianists<\/td>\n<td class=\"org-left\">composer pianist<\/td>\n<td class=\"org-left\">virtuoso pianist<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>These kinds of operations are quite interesting. 
If we add the two co-occurrence vectors for <code>pianist<\/code> and <code>renowned<\/code>, the musicians model suggests that a renowned pianist is a <code>teacher<\/code>, an <code>educator<\/code>, a <code>composer<\/code> or a <code>virtuoso<\/code>.<\/p>\n<p>For unambiguous surface forms like <code>pianist<\/code>, the two models score quite well. The difference between the two sets of results comes from the way the general training corpus was created (pre-processed) compared to the musicians corpus.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline20\" class=\"outline-4\">\n<h4 id=\"orgheadline20\">Metal + Death = ?<\/h4>\n<div id=\"text-orgheadline20\" class=\"outline-text-4\">\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> musicians-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"metal\"<\/span> <span style=\"color: #e6db74;\">\"death\"<\/span><span style=\"color: #66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[]<\/span> 10<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">metal<\/td>\n<td class=\"org-left\">death<\/td>\n<td class=\"org-left\">thrash<\/td>\n<td class=\"org-left\">deathcore<\/td>\n<td class=\"org-left\">melodic<\/td>\n<td class=\"org-left\">doom<\/td>\n<td class=\"org-left\">grindcore<\/td>\n<td class=\"org-left\">metalcore<\/td>\n<td class=\"org-left\">mathcore<\/td>\n<td class=\"org-left\">heavy<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<div 
class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> google-news-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"metal\"<\/span> <span style=\"color: #e6db74;\">\"death\"<\/span><span style=\"color: #66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[]<\/span> 5<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">death<\/td>\n<td class=\"org-left\">metal<\/td>\n<td class=\"org-left\">Tunstallbled<\/td>\n<td class=\"org-left\">steel<\/td>\n<td class=\"org-left\">Death<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>This example uses two quite general words with no apparent relationship between them. The results with the musicians&#8217; model are all closely related genres of music, like <code>thrash metal<\/code>, <code>deathcore metal<\/code>, etc.<\/p>\n<p>However, with the general model, the result is a mix of multiple unrelated concepts.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline21\" class=\"outline-4\">\n<h4 id=\"orgheadline21\">Metal &#8211; Death + Smooth = ?<\/h4>\n<div id=\"text-orgheadline21\" class=\"outline-text-4\">\n<p>Let&#8217;s play some more with these equations. 
What if we want some kind of smooth metal?<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> musicians-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"metal\"<\/span> <span style=\"color: #e6db74;\">\"smooth\"<\/span><span style=\"color: #66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"death\"<\/span><span style=\"color: #66d9ef;\">]<\/span> 5<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">smooth<\/td>\n<td class=\"org-left\">fusion<\/td>\n<td class=\"org-left\">funk<\/td>\n<td class=\"org-left\">hard<\/td>\n<td class=\"org-left\">neo<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>This one is quite interesting. We subtracted the <code>death<\/code> co-occurrence vector from the <code>metal<\/code> one, and then we added the <code>smooth<\/code> vector. 
What we end up with is a bunch of music genres that are much smoother than <code>death metal<\/code>.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span><span style=\"color: #f92672;\">.wordsNearest<\/span> google-news-model <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"metal\"<\/span> <span style=\"color: #e6db74;\">\"smooth\"<\/span><span style=\"color: #66d9ef;\">]<\/span> <span style=\"color: #66d9ef;\">[<\/span><span style=\"color: #e6db74;\">\"death\"<\/span><span style=\"color: #66d9ef;\">]<\/span> 5<span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/>\n<col class=\"org-left\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td class=\"org-left\">smooth<\/td>\n<td class=\"org-left\">metal<\/td>\n<td class=\"org-left\">Brushed aluminum<\/td>\n<td class=\"org-left\">durable polycarbonate<\/td>\n<td class=\"org-left\">chromed steel<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>In the case of the general model, we end up with &#8220;smooth metal&#8221;. The removal of the <code>death<\/code> vector has no effect on the results, probably because these are three ambiguous and really general terms.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline22\" class=\"outline-2\">\n<h2 id=\"orgheadline22\">What Is Next<\/h2>\n<div id=\"text-orgheadline22\" class=\"outline-text-2\">\n<p>The demo I presented in this article uses public datasets currently linked to KBpedia. You may wonder what the other possibilities are. One possibility is to <a href=\"http:\/\/cognonto.com\/services\/data-integration-and-mapping\/\">link your own private datasets to KBpedia<\/a>. 
That way, these private datasets would become usable, in exactly the same way, to create domain-specific training corpuses on the fly. Another possibility would be to take totally unstructured text like local text documents, or semi-structured text like a set of HTML web pages, and then use the <a href=\"http:\/\/cognonto.com\/docs\/about-the-demo\/#topic-analysis-sub-pabel\">Cognonto topics analyzer<\/a> to tag each text document with KBpedia reference concepts. We could then use the KBpedia structure in exactly the same way to choose which of these documents we want to include in the domain-specific training corpus.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline23\" class=\"outline-2\">\n<h2 id=\"orgheadline23\">Conclusion<\/h2>\n<div id=\"text-orgheadline23\" class=\"outline-text-2\">\n<p>As we saw, creating domain-specific training corpuses to use with word2vec can have a dramatic impact on the results, which can be much more meaningful within the scope of that domain. Another advantage of domain-specific training corpuses is that they create much smaller models. This is quite an interesting characteristic, since smaller models are faster to generate, faster to download\/upload, faster to query, consume less memory, etc.<\/p>\n<p>Of the concepts in KBpedia, roughly 33,000 correspond to types (or classes) of various sorts. These pre-determined slices are available across all needs and domains to generate such domain-specific corpuses. Further, KBpedia is designed for rapid incorporation of your own domain information to add further to this discriminatory power.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>word2vec is a two layer artificial neural network used to process text to learn relationships between words within a text corpus to create a model of all the relationships between the words of that corpus. 
The text corpus that a word2vec process uses to learn the relationships between words is called the training corpus. In [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[293,251,287,84],"tags":[288,197,289,193,231,291],"class_list":["post-3344","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence","category-clojure","category-cognonto","category-semantic-web","tag-cognonto","tag-data","tag-machinelearning","tag-nlp","tag-semanticweb","tag-word2vec"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3344","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=3344"}],"version-history":[{"count":7,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3344\/revisions"}],"predecessor-version":[{"id":3498,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3344\/revisions\/3498"}],"wp:attachment":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=3344"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=3344"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=3344"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":tr
ue}]}}