
Mapping Datasets, Schema and Ontologies Using the Cognonto Mapper

There are many situations where we want to link named entities across two different datasets, or to find and remove duplicate entities within a single dataset. The same is true for vocabulary terms or ontology classes that we want to integrate and map together. Sometimes we also want to use such a linkage system to save time when creating gold standards for named entity recognition tasks.

There exist multiple data linkage & deduplication frameworks developed in several different programming languages. At Cognonto, we have our own system called the Cognonto Mapper.

Most mapping frameworks work more or less the same way. They use one or two datasets as sources of entities (or classes or vocabulary terms) to compare. The datasets can be managed by a conventional relational database management system, a triple store, a spreadsheet, etc. Then they have complex configuration options that let the user define all kinds of comparators that will try to match the values of different properties that describe the entities in each dataset. (Comparator types may be simple string comparisons, the added use of alternative labels or definitions, attribute values, or various structural relationships and linkages within the dataset.) Then the comparison is made for all the entities (or classes or vocabulary terms) existing in each dataset. Finally, an entity similarity score is calculated, with some threshold conditions used to signal whether the two entities (or classes or vocabulary terms) are the same or not.
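
To make that general pattern concrete, here is a minimal, hypothetical sketch in Clojure of how per-property comparators can be combined into a single entity-similarity score with a threshold. The comparators, weights and threshold below are illustrative assumptions only, not the internals of any particular framework.

(require '[clojure.string :as string]
         '[clojure.set :as set])

;; Each comparator returns a similarity between 0.0 and 1.0 for one property.
(defn pref-label-comparator
  [a b]
  (if (= (string/lower-case (:pref-label a))
         (string/lower-case (:pref-label b)))
    1.0
    0.0))

(defn alt-label-comparator
  [a b]
  (if (seq (set/intersection (set (:alt-labels a)) (set (:alt-labels b))))
    1.0
    0.0))

;; Weighted combination of the comparator scores; the weights are illustrative.
(def comparators
  [[pref-label-comparator 0.7]
   [alt-label-comparator  0.3]])

(defn similarity
  [a b]
  (reduce + (map (fn [[comparator weight]] (* weight (comparator a b))) comparators)))

;; Two records are considered the same entity when the score crosses a threshold.
(defn same-entity?
  [a b]
  (>= (similarity a b) 0.85))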

The Cognonto Mapper works in this same general way. However, as you may suspect, it has a special trick in its toolbox: the SuperType Comparator. The SuperType Comparator leverages the KBpedia Knowledge Ontology to help disambiguate two given entities (or classes or vocabulary terms) based on the analysis of their types within the KBpedia Knowledge Ontology. When we perform a deduplication or a linkage task between two large datasets of entities, it is often the case that two entities will be considered a nearly perfect match based on properties like names, alternative names and other common properties, even if they are two completely different things. This happens because entities are often ambiguous when only these basic properties are considered. The SuperType Comparator’s role is to disambiguate the entities based on their type(s) by leveraging the disjointedness of the SuperType structure that governs the overall KBpedia structure. The SuperType Comparator greatly reduces the time needed to curate the deduplication or linkage tasks in order to determine the final mappings.

We first present a series of use cases for the Mapper below, followed by an explanation of how the Cognonto Mapper works, and then some conclusions.

Usages Of The Cognonto Mapper

When should the Cognonto Mapper, or other deduplication and mapping services, be used? While there are many tasks that warrant the usage of such a system, let’s focus for now on some use cases related to Cognonto and machine learning in general.

Mapping Across Schema

One of Cognonto’s most important use cases is to use the Mapper to link new vocabularies, schemas or ontologies to the KBpedia Knowledge Ontology (KKO). This is exactly what we did for the 24 external ontologies and schemas that we have integrated into KBpedia. Creating such a mapping can be a long and painstaking process. The Mapper greatly helps link similar concepts by narrowing the initial pool of candidate mappings, thereby increasing the efficiency of the analyst charged with selecting the final mappings between the two ontologies.

Creating ‘Gold Standards’

In my last article, I created a gold standard of 511 random web pages for which I determined the publisher of each page by hand. That gold standard was used to measure the performance of a named entity recognition task. However, to create the actual gold standard, I had to check in each dataset (5 of them, with millions of entities) whether that publisher already existed. Performing such a task by hand means sending at least 2,555 search queries to try to find a matching entity. Even if I am fast, and can write a query, send it, look at the results, and copy/paste the URI of the right entity into the gold standard within 30 seconds, completing the task would still take roughly 21 hours. Since no sane person can do that for 8 hours per day, 3 days straight, the task would probably take at least a week to complete.

This is why automating this mapping process is really important, and this is what the Cognonto Mapper does. The only thing needed is to configure 5 mapper sessions, one per dataset. Each session tries to map the entities I identified by hand from the 511 web pages to one of the other datasets. Then I only need to run the mapper for each dataset, review the matches, find the missing ones by hand, and merge the results into the final gold standard.
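
As a rough illustration of how those sessions could be driven programmatically, here is a hypothetical sketch; `run-mapper-session`, the dataset names and the data shapes are all assumptions, not the actual Cognonto Mapper API.

;; Hypothetical sketch: run one mapper session per target dataset and collect
;; the candidate mappings for manual review and merging.
(def target-datasets ["dataset-1" "dataset-2" "dataset-3" "dataset-4" "dataset-5"])

(defn run-all-sessions
  "Maps the hand-identified publisher entities against each dataset.
   `run-mapper-session` is supplied by the caller: (fn [publishers dataset] -> mappings)."
  [publishers run-mapper-session]
  (into {}
        (for [dataset target-datasets]
          [dataset (run-mapper-session publishers dataset)])))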

Curating Unknown Entities

In Cognonto, we have an unknown entities tagger that detects possible publisher organizations that do not currently exist in the KBpedia knowledge base. In some cases, we want to save these detected unknown entities in an unknown entities dataset. This dataset is then reviewed so that the detected entities can be added to the KBpedia knowledge base (such that they become known). In the review workflow, one of the steps should be to search for similar entities, to make sure that what the tagger detected is a totally new entity and not just a new surface form of an existing one (in which case it would become an alternative label for that entity rather than an entirely new entity). Such a checkup in the review workflow would be performed by the Cognonto Mapper.

How Does the SuperType Comparator Work?

As I mentioned in the introduction, the Cognonto Mapper is yet another linkage & deduplication framework. However, it has a special twist: its SuperType Comparator and its leveraging of the KBpedia Knowledge Ontology. Good, but how does it work? There is no better way to understand it than to study how two entities can be disambiguated based on their type. So, let’s do this.

Let’s consider this use case. We want to map two datasets together: Wikipedia and Musicbrainz. One of the Musicbrainz entities we want to map to Wikipedia is a music group called Attila, with Billy Joel and Jon Small. Attila also exists in Wikipedia, but it is highly ambiguous and may refer to multiple different things. If we set up our linkage task to only work on the preferred and possible alternative labels, then we would have a match between the name of that band and multiple other things in Wikipedia, with matching likelihoods that are probably nearly identical. How could we update the configuration to solve this issue? We have no choice: we will have to use the Cognonto Mapper’s SuperType Comparator.

Musicbrainz RDF dumps normally map a Musicbrainz group to a mo:MusicGroup. In the Wikipedia RDF dump the Attila rock band has a type dbo:Band. Both of these classes are linked to the KBpedia reference concept kbpedia:Band-MusicGroup. This means that the entities of both of these datasets are well connected into KBpedia.

Let’s say that the Cognonto Mapper detects that the Attila entity in the Musicbrainz dataset has 4 candidates in Wikipedia:

  1. Attila, the rock band
  2. Attila, the bird
  3. Attila, the film
  4. Attila, the album

If the comparison is only based on the preferred label, the likelihood will be the same for all these entities. However, what happens when we start using the SuperType Comparator and the KBpedia Knowledge Ontology?

First we have to understand the context of each type. Using KBpedia, we can determine that rock bands are disjoint from birds, albums and films according to their respective SuperTypes: kko:Organizations versus kko:Animals, kko:AudioInfo and kko:VisualInfo.
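
As an illustration of the kind of check involved, here is a hypothetical sketch of a disjointness test over SuperTypes; the `super-types-of` lookup and the `disjoint-pairs` set are stand-ins for what would actually be queried from the KBpedia Knowledge Ontology.

;; Stand-in data for this example: which SuperTypes are declared disjoint, and
;; the SuperTypes of the classes involved. Note that kko:AudioInfo and
;; kko:VisualInfo are not declared disjoint, which matters in the second example below.
(def disjoint-pairs
  #{#{:kko/Organizations :kko/Animals}
    #{:kko/Organizations :kko/AudioInfo}
    #{:kko/Organizations :kko/VisualInfo}
    #{:kko/Animals :kko/AudioInfo}
    #{:kko/Animals :kko/VisualInfo}})

(def super-types-of
  {:dbo/Band  #{:kko/Organizations}
   :dbo/Bird  #{:kko/Animals}
   :dbo/Album #{:kko/AudioInfo}
   :dbo/Film  #{:kko/VisualInfo}})

(defn disjoint-types?
  "True when at least one pair of SuperTypes of the two classes is declared disjoint."
  [class-a class-b]
  (boolean
    (some disjoint-pairs
          (for [a (super-types-of class-a)
                b (super-types-of class-b)]
            #{a b}))))

;; (disjoint-types? :dbo/Band :dbo/Bird)  ;; => true
;; (disjoint-types? :dbo/Band :dbo/Album) ;; => true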

Now that we understand each of the entities the system is trying to link together, and their context within the KBpedia Knowledge Ontology, let’s see how the Cognonto Mapper will score each of these entities based on their type to help disambiguate where labels are identical.

 "mo:MusicGroup -> dbo:Band"
 (.compare stc-ex-compare "" ""))
 "mo:MusicGroup -> dbo:Bird"
 (.compare stc-ex-compare "" ""))
 "mo:MusicGroup -> dbo:Film"
 (.compare stc-ex-compare "" ""))
 "mo:MusicGroup -> dbo:Album"
 (.compare stc-ex-compare "" ""))
| Classes                    | Similarity |
| mo:MusicGroup -> dbo:Band  | 1.0        |
| mo:MusicGroup -> dbo:Bird  | 0.2        |
| mo:MusicGroup -> dbo:Film  | 0.2        |
| mo:MusicGroup -> dbo:Album | 0.2        |

In these cases, the SuperType Comparator assigned a similarity of 1.0 to the mo:MusicGroup and dbo:Band classes since those two classes are equivalent. All the other checks return 0.20. When the comparator finds two entities that have disjoint SuperTypes, it assigns them a similarity value of 0.20. Why not 0.00 if they are disjoint? Well, there may be errors in the knowledge base, so by setting the comparator score to a very low level, the candidate is still available for evaluation, even though its score is much reduced.
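
Here is a small sketch of the scoring rules just described; `equivalent-classes?`, `disjoint-types?` and `structural-similarity` are hypothetical helpers standing in for the Mapper's actual internals.

(defn supertype-score
  "Scores two classes following the rules described above."
  [class-a class-b {:keys [equivalent-classes? disjoint-types? structural-similarity]}]
  (cond
    ;; equivalent classes, e.g. mo:MusicGroup and dbo:Band -> perfect match
    (equivalent-classes? class-a class-b) 1.0
    ;; disjoint SuperTypes -> 0.20 floor instead of 0.0, so the candidate can
    ;; still be evaluated in case the knowledge base contains errors
    (disjoint-types? class-a class-b) 0.2
    ;; otherwise fall back to comparing their sets of super classes (next example)
    :else (structural-similarity class-a class-b)))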

In this case the matching is unambiguous and the selection of the right linkage to perform is obvious. However, as you will see below, it is often not that simple to make such a clear selection.


Now let’s say that the next entity to match from the Musicbrainz dataset is another entity called Attila, but this time it refers to Attila, the album by Mina. Since the basis of the comparison has changed (we are comparing the Musicbrainz Attila album instead of the band), the entire process will yield different results. The main difference is that the album will be compared to a film and an album from the Wikipedia dataset. As the similarity scores below show, these two candidates belong to the SuperTypes kko:VisualInfo and kko:AudioInfo, which are not disjoint.

 "mo:MusicalWork -> dbo:Band"
 (.compare stc-ex-compare "" ""))
 "mo:MusicalWork -> dbo:Bird"
 (.compare stc-ex-compare "" ""))
 "mo:MusicalWork -> dbo:Film"
 (.compare stc-ex-compare "" ""))
 "mo:MusicalWork -> dbo:Album"
 (.compare stc-ex-compare "" ""))
| Classes                     | Similarity         |
| mo:MusicalWork -> dbo:Band  | 0.2                |
| mo:MusicalWork -> dbo:Bird  | 0.2                |
| mo:MusicalWork -> dbo:Film  | 0.8762886597938144 |
| mo:MusicalWork -> dbo:Album | 0.9555555555555556 |

As you can see, the main difference is that we don’t have a perfect match between the entities; we thus need to compare their types, and two of the candidates are ambiguous based on their SuperTypes (their SuperTypes are non-disjoint). In this case, the SuperType Comparator checks the set of super classes of both entities and calculates a similarity measure between the two sets of classes. This is why we get 0.8762 for one and 0.9555 for the other.
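
The exact similarity measure the Mapper uses over these sets is not shown here; as a hedged illustration, a simple Jaccard index over the two sets of super classes behaves the way described (more shared super classes yields a score closer to 1.0):

(require '[clojure.set :as set])

(defn superclass-set-similarity
  "Illustrative set-based similarity (Jaccard index) between the sets of
   super classes of two classes; not necessarily the measure Cognonto uses."
  [supers-a supers-b]
  (let [inter (count (set/intersection supers-a supers-b))
        union (count (set/union supers-a supers-b))]
    (if (zero? union)
      0.0
      (double (/ inter union)))))

;; e.g. two classes sharing 3 of 4 super classes overall:
;; (superclass-set-similarity #{:Work :AudioInfo :CreativeWork}
;;                            #{:Work :AudioInfo :CreativeWork :Album})
;; => 0.75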

A musical work and an album are two nearly identical concepts. In fact, a musical work is the conceptual work behind an album (a record). A musical work is also strongly related to films, since films include musical works. However, the relationship between a musical work and an album is stronger than with a film, and this is what the similarity measure shows.


In this case, even though we have two ambiguous candidates, an album and a film, whose SuperTypes are not disjoint, we are still able to determine which one to choose to create the mapping based on the calculated similarity measure.


As we saw, there are multiple reasons why we would want to leverage the KBpedia Knowledge Ontology to help mapping and deduplication frameworks such as the Cognonto Mapper disambiguate possible entity matches. KBpedia is not only good for mapping datasets together; it is also quite effective for helping with some machine learning tasks such as creating gold standards or curating detected unknown entities. In the context of Cognonto, it is quite effective for mapping external ontologies, schemas or vocabularies to the KBpedia Knowledge Ontology. It is an essential tool for extending KBpedia to domain- and enterprise-specific needs.

In this article I focused on the SuperType Comparator, which leverages the type structure of the KBpedia Knowledge Ontology. However, we can also use other structural features in KBpedia (such as an Aspects Comparator based on the aspects structure of KBpedia), singly or in combination, to achieve other mapping or disambiguation objectives.


Improving Machine Learning Tasks By Integrating Private Datasets

In the last decade, we have seen the emergence of two big families of datasets: the public and the private ones. Invaluable public datasets like Wikipedia, Wikidata, Open Corporates and others have been created and leveraged by organizations world-wide. However, as great as they are, most organizations still rely on private datasets of their own curated data.

In this article, I want to demonstrate how high-value private datasets may be integrated into Cognonto’s KBpedia knowledge base to produce a significant impact on the quality of the results of some machine learning tasks. To demonstrate this impact, I have created a demo supported by a “gold standard” of 511 web pages taken at random, for which we have tagged the organization that published each page. This demo is related to the publisher analysis portion of the Cognonto demo. We will use this gold standard to calculate the performance metrics of the publisher analyzer; more precisely, we will analyze the performance of the analyzer depending on the datasets it has access to when making its predictions.

Cognonto Publisher’s Analyzer

The Cognonto publisher’s analyzer is a portion of the overall Cognonto demo that tries to determine the publisher of a web page by analyzing the web page’s content. There are multiple moving parts to this analyzer, but its general internal workflow is as follows:

  1. It crawls a given webpage URL
  2. It extracts the page’s content and extracts its meta-data
  3. It tags all of the organizations (anything that is considered an organization in KBpedia) across the extracted content using the organization entities that exist in the knowledge base
  4. It tries to detect unknown entities that will eventually be added to the knowledge base after curation
  5. It performs an in-depth analysis of the organization entities (known or unknown) that got tagged in the content of the web page, and analyzes which of these is the most likely to be the publisher of the web page.

Such a machine learning system leverages existing algorithms to calculate the likelihood that an organization is the publisher of a web page and to detect unknown organizations. These are conventional uses of these algorithms. What differentiates the Cognonto analyzer is its knowledge base. We leverage Cognonto to detect known organization entities. We use the knowledge in the KB for each of these entities to improve the analysis process. We constrain the analysis to certain types (by inference) of named entities, etc. The special sauce of this entire process is the fully integrated set of datasets that compose the Cognonto knowledge base, and the KBpedia conceptual reference structure composed of roughly ~39,000 reference concepts.
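
To tie the five steps above together, here is a hypothetical sketch of the analyzer's workflow; every helper (`crawl-page`, `extract-content`, `tag-organizations`, `detect-unknown-entities`, `most-likely-publisher`) is an illustrative stand-in rather than the actual Cognonto API.

(defn analyze-publisher
  "Runs the five-step workflow described above against a single URL.
   The helper functions are supplied by the caller."
  [url {:keys [crawl-page extract-content tag-organizations
               detect-unknown-entities most-likely-publisher]}]
  (let [html       (crawl-page url)                        ;; 1. crawl the web page
        content    (extract-content html)                  ;; 2. extract content + meta-data
        known      (tag-organizations content)             ;; 3. tag known organizations
        unknown    (detect-unknown-entities content known) ;; 4. detect unknown entities
        candidates (concat known unknown)]
    ;; 5. analyze which tagged organization is most likely the publisher
    (most-likely-publisher content candidates)))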

Given the central role of the knowledge base in such an analysis process, we want to have a better idea of the impact of the datasets in the performance of such a system.

For this demo, I use three public datasets already in KBpedia and used by the Cognonto demo: Wikipedia (via DBpedia), Freebase and USPTO. Then I add two private datasets of high-quality, highly curated and domain-related information to augment the listing of potential organizations. What I will do is run the Cognonto publisher analyzer on each of these 511 web pages, check which pages get properly identified against the gold standard, and finally calculate different performance metrics to see the impact of including or excluding a certain dataset.

Gold Standard

The gold standard is composed of 511 randomly selected web pages that were crawled and cached. When we run the tests below, the cached version of the HTML pages is used to make sure that we get the same HTML for each page in each test. When the pages are crawled, we execute any JavaScript code that the pages may contain before caching the HTML code of the page. That way, if some information in the page was injected by JavaScript code, that additional information is cached as well.

The gold standard is really simple. For each of the URLs in the standard, we determine the publishing organization manually. Once the organization is determined, we search each dataset to see if the entity already exists. If it does, we add the URI (unique identifier) of the entity in the knowledge base to the gold standard. It is this URI reference that is used to determine whether the publisher analyzer properly detects the actual publisher of the web page.

We also manually add a set of 10 web pages for which we are sure that no publisher can be determined. These are the 10 True Negative (see below) instances of the gold standard.

The gold standard also includes the identifier of possible unknown entities that are the publishers of the web pages. These are used to calculate the metrics when considering the unknown entities detected by the system.


The goal of this analysis is to determine how well the analyzer performs the task (detecting the organization that published a web page on the Web). To do so, we use a set of metrics that help us understand the performance of the system. The metrics calculation is based on the confusion matrix.


The True Positive, False Positive, True Negative and False Negative counts (see Type I and type II errors for definitions) should be interpreted this way in the context of a named entity recognition task:

  1. True Positive (TP): test identifies the same entity as in the gold standard
  2. False Positive (FP): test identifies a different entity than what is in the gold standard
  3. True Negative (TN): test identifies no entity; gold standard has no entity
  4. False Negative (FN): test identifies no entity, but gold standard has one

Then we have a series of metrics that can be used to measure the performance of the system:

  1. Precision: the proportion of correct publisher predictions amongst all of the predictions that have been made (TP / (TP + FP))
  2. Recall: the proportion of the publishers that exist in the gold standard that were properly predicted (TP / (TP + FN))
  3. Accuracy: the proportion of correctly classified test instances, i.e., the publishers correctly identified by the system plus the web pages correctly identified as having no identifiable publisher ((TP + TN) / (TP + TN + FP + FN))
  4. f1: the test’s equally weighted combination of precision and recall
  5. f2: the test’s weighted combination of precision and recall, with a preference for recall
  6. f0.5: the test’s weighted combination of precision and recall, with a preference for precision.

The F-scores test the accuracy of the overall prediction system. An F-score combines precision and recall; f1 is their harmonic mean. The f2 measure weighs recall higher than precision (by placing more emphasis on false negatives), and the f0.5 measure weighs recall lower than precision (by attenuating the influence of false negatives). Cognonto includes all three F-measures in its standard reports to give a general overview of what happens when we put an emphasis on precision or recall.
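
As a minimal sketch of how these numbers are derived from the four confusion-matrix counts (using the standard formulas rather than the actual Cognonto reporting code):

(defn f-beta
  "Generic F-measure: beta > 1 favours recall, beta < 1 favours precision."
  [beta precision recall]
  (let [b2 (* beta beta)]
    (/ (* (+ 1 b2) precision recall)
       (+ (* b2 precision) recall))))

(defn metrics
  [{:keys [true-pos false-pos true-neg false-neg]}]
  (let [precision (/ true-pos (+ true-pos false-pos))
        recall    (/ true-pos (+ true-pos false-neg))
        accuracy  (/ (+ true-pos true-neg)
                     (+ true-pos true-neg false-pos false-neg))]
    {:precision (double precision)
     :recall    (double recall)
     :accuracy  (double accuracy)
     :f1        (double (f-beta 1   precision recall))
     :f2        (double (f-beta 2   precision recall))
     :f0.5      (double (f-beta 0.5 precision recall))}))

;; e.g. with the Wikipedia-only counts reported below:
;; (metrics {:true-pos 121 :false-pos 57 :true-neg 19 :false-neg 314})
;; => {:precision 0.679..., :recall 0.278..., :accuracy 0.273..., ...}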

In general, I think that the metric that best reflects the overall performance of this named entity recognition system is accuracy. I emphasize those test results below.

Running The Tests

The goal with these tests is to run the gold standard calculation procedure with different datasets that exist in the Cognonto knowledge base to see the impact of including/excluding these datasets on the gold standard metrics.

Baseline: No Dataset

The first step is to create the starting basis that includes no dataset. Then we will add different datasets, and try different combinations, when computing against the gold standard such that we know the impact of each on the metrics.

(table (generate-stats :js :execute :datasets []))
True positives:  2
False positives:  5
True negatives:  19
False negatives:  485

| key          | value        |
| :precision   | 0.2857143    |
| :recall      | 0.0041067763 |
| :accuracy    | 0.04109589   |
| :f1          | 0.008097166  |
| :f2          | 0.0051150895 |
| :f0.5        | 0.019417476  |

One Dataset Only

Now, let’s see the impact of each of the datasets that exist in the knowledge base we created to perform these tests. This will give us an indication of the inherent impact of each dataset on the prediction task.

Wikipedia (via DBpedia) Only

Let’s test the impact of adding a single general purpose, publicly available dataset: Wikipedia (via DBpedia):

(table (generate-stats :js :execute :datasets [""]))
True positives:  121
False positives:  57
True negatives:  19
False negatives:  314

| key          | value      |
| :precision   | 0.6797753  |
| :recall      | 0.27816093 |
| :accuracy    | 0.2739726  |
| :f1          | 0.39477977 |
| :f2          | 0.31543276 |
| :f0.5        | 0.52746296 |

Freebase Only

Now, let’s test the impact of adding another general purpose, publicly available dataset: Freebase:

(table (generate-stats :js :execute :datasets [""]))
True positives:  11
False positives:  14
True negatives:  19
False negatives:  467

| key          | value       |
| :precision   | 0.44        |
| :recall      | 0.023012552 |
| :accuracy    | 0.058708414 |
| :f1          | 0.043737575 |
| :f2          | 0.028394425 |
| :f0.5        | 0.09515571  |


USPTO Only

Now, let’s test the impact of adding a different, more specialized, publicly available dataset: USPTO:

(table (generate-stats :js :execute :datasets [""]))
True positives:  6
False positives:  13
True negatives:  19
False negatives:  473

| key          | value       |
| :precision   | 0.31578946  |
| :recall      | 0.012526096 |
| :accuracy    | 0.04892368  |
| :f1          | 0.024096385 |
| :f2          | 0.015503876 |
| :f0.5        | 0.054054055 |

Private Dataset #1

Now, let’s test the first private dataset:

(table (generate-stats :js :execute :datasets [""]))
True positives:  231
False positives:  109
True negatives:  19
False negatives:  152

| key          | value      |
| :precision   | 0.67941177 |
| :recall      | 0.60313314 |
| :accuracy    | 0.4892368  |
| :f1          | 0.6390042  |
| :f2          | 0.61698717 |
| :f0.5        | 0.6626506  |

Private Dataset #2

And, then, the second private dataset:

(table (generate-stats :js :execute :datasets [""]))
True positives:  24
False positives:  21
True negatives:  19
False negatives:  447

| key          | value       |
| :precision   | 0.53333336  |
| :recall      | 0.050955415 |
| :accuracy    | 0.08414873  |
| :f1          | 0.093023255 |
| :f2          | 0.0622084   |
| :f0.5        | 0.1843318   |

Combined Datasets – Public Only

A more realistic analysis is to use a combination of datasets. Let’s see what happens to the performance metrics if we start combining public datasets.

Wikipedia + Freebase

First, let’s start by combining Wikipedia and Freebase.

(table (generate-stats :js :execute :datasets ["" ""]))
True positives:  126
False positives:  60
True negatives:  19
False negatives:  306

| key          | value      |
| :precision   | 0.67741936 |
| :recall      | 0.29166666 |
| :accuracy    | 0.28375733 |
| :f1          | 0.407767   |
| :f2          | 0.3291536  |
| :f0.5        | 0.53571427 |

Adding the Freebase dataset to the DBpedia one had the following effects on the different metrics:

| metric    | Impact in % |
| precision | -0.03%      |
| recall    | +4.85%      |
| accuracy  | +3.57%      |
| f1        | +3.29%      |
| f2        | +4.34%      |
| f0.5      | +1.57%      |

As we can see, the impact of adding Freebase to the knowledge base is positive, even if not groundbreaking considering the size of the dataset.
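
For reference, the "Impact in %" rows in these tables are simply the relative change of each metric between the two runs, which can be computed with something like:

(defn impact-in-percent
  "Relative change, in percent, of each metric between a baseline run and a new run."
  [baseline-metrics new-metrics]
  (into {}
        (for [[metric baseline-value] baseline-metrics]
          [metric (* 100.0 (/ (- (get new-metrics metric) baseline-value)
                              baseline-value))])))

;; e.g. recall going from 0.27816093 (Wikipedia only) to 0.29166666
;; (Wikipedia + Freebase) gives roughly the +4.85% reported above.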

Wikipedia + USPTO

Let’s switch Freebase for the other specialized public dataset, USPTO.

(table (generate-stats :js :execute :datasets ["" ""]))
True positives:  122
False positives:  59
True negatives:  19
False negatives:  311

| key          | value      |
| :precision   | 0.67403316 |
| :recall      | 0.2817552  |
| :accuracy    | 0.27592954 |
| :f1          | 0.39739415 |
| :f2          | 0.31887087 |
| :f0.5        | 0.52722555 |

Adding the USPTO dataset to the DBpedia one, instead of Freebase, had the following effects on the different metrics:

| metric    | Impact in % |
| precision | -0.83%      |
| recall    | +1.29%      |
| accuracy  | +0.73%      |
| f1        | +0.65%      |
| f2        | +1.07%      |
| f0.5      | +0.03%      |

As we might have expected, the gains are smaller than with Freebase, perhaps partly because USPTO is smaller and more specialized. Because it is more specialized (enterprises that have patents registered in the US), the gold standard may not represent well the organizations covered by this dataset. In any case, these are still gains.

Wikipedia + Freebase + USPTO

Let’s continue and now include all three datasets.

(table (generate-stats :js :execute :datasets ["" "" ""]))
True positives:  127
False positives:  62
True negatives:  19
False negatives:  303

| key          | value      |
| :precision   | 0.6719577  |
| :recall      | 0.29534882 |
| :accuracy    | 0.2857143  |
| :f1          | 0.41033927 |
| :f2          | 0.3326349  |
| :f0.5        | 0.53541315 |

Now let’s see the impact of adding both Freebase and USPTO to the Wikipedia dataset:

| metric    | Impact in % |
| precision | +1.14%      |
| recall    | +6.18%      |
| accuracy  | +4.30%      |
| f1        | +3.95%      |
| f2        | +5.45%      |
| f0.5      | +1.51%      |

Now let’s see the impact of using highly curated, domain related, private datasets.

Combined Datasets – Public enhanced with private datasets

The next step is to add the private datasets of highly curated data that are specific to the domain of identifying web page publisher organizations. As the baseline, we will use the three public datasets: Wikipedia, Freebase and USPTO and then we will add the private datasets.

Wikipedia + Freebase + USPTO + PD #1

(table (generate-stats :js :execute :datasets ["" "" "" ""]))
True positives:  279
False positives:  102
True negatives:  19
False negatives:  111

| key          | value      |
| :precision   | 0.7322835  |
| :recall      | 0.7153846  |
| :accuracy    | 0.58317024 |
| :f1          | 0.7237354  |
| :f2          | 0.7187017  |
| :f0.5        | 0.7288401  |

Now, let’s see the impact of adding the private dataset #1 along with Wikipedia, Freebase and USPTO:

| metric    | Impact in % |
| precision | +8.97%      |
| recall    | +142.22%    |
| accuracy  | +104.09%    |
| f1        | +76.38%     |
| f2        | +116.08%    |
| f0.5      | +36.12%     |

Adding the highly curated and domain-specific private dataset #1 had a dramatic impact on all the metrics of the combined public datasets. Now let’s look at it the other way around: the impact of adding the public datasets, relative to using private dataset #1 alone:

| metric    | Impact in % |
| precision | +7.77%      |
| recall    | +18.60%     |
| accuracy  | +19.19%     |
| f1        | +13.25%     |
| f2        | +16.50%     |
| f0.5      | +9.99%      |

As we can see, the public datasets do significantly increase the performance of the highly curated and domain-specific private dataset #1.

Wikipedia + Freebase + USPTO + PD #2

(table (generate-stats :js :execute :datasets ["" "" "" ""]))
True positives:  138
False positives:  69
True negatives:  19
False negatives:  285

| key          | value      |
| :precision   | 0.6666667  |
| :recall      | 0.32624114 |
| :accuracy    | 0.3072407  |
| :f1          | 0.43809524 |
| :f2          | 0.36334914 |
| :f0.5        | 0.55155873 |

Not all of the private datasets have an equivalent impact. Let’s see the effect of adding private dataset #2 instead of #1:

| metric    | Impact in % |
| precision | -0.78%      |
| recall    | +10.46%     |
| accuracy  | +7.52%      |
| f1        | +6.75%      |
| f2        | +9.23%      |
| f0.5      | +3.00%      |

Wikipedia + Freebase + USPTO + PD #1 + PD #2

Now let’s see what happens when we use all the public and private datasets.

(table (generate-stats :js :execute :datasets ["" "" "" "" ""]))
True positives:  285
False positives:  102
True negatives:  19
False negatives:  105

| key          | value      |
| :precision   | 0.7364341  |
| :recall      | 0.7307692  |
| :accuracy    | 0.59491193 |
| :f1          | 0.7335907  |
| :f2          | 0.7318952  |
| :f0.5        | 0.7352941  |

Let’s see the impact of adding the private datasets #1 and #2 to the public datasets:

| metric    | Impact in % |
| precision | +9.60%      |
| recall    | +147.44%    |
| accuracy  | +108.22%    |
| f1        | +78.77%     |
| f2        | +120.02%    |
| f0.5      | +37.31%     |

Adding Unknown Entities Tagger

There is one last feature of the Cognonto publisher analyzer: it can identify unknown entities in the web page. (An “unknown entity” is identified as a likely organization entity, but one that does not already exist in the KB.) Sometimes, it is the unknown entity that is the publisher of the web page.

(table (generate-stats :js :execute :datasets :all))
True positives:  345
False positives:  104
True negatives:  19
False negatives:  43

| key          | value      |
| :precision   | 0.76837415 |
| :recall      | 0.88917524 |
| :accuracy    | 0.7123288  |
| :f1          | 0.82437277 |
| :f2          | 0.86206895 |
| :f0.5        | 0.78983516 |

As we can see, the overall accuracy improved by 19.73% when considering the unknown entities, compared to using the public and private datasets alone.

| metric    | Impact in % |
| precision | +4.33%      |
| recall    | +21.67%     |
| accuracy  | +19.73%     |
| f1        | +12.37%     |
| f2        | +17.79%     |
| f0.5      | +7.42%      |


When we first tested the system with single datasets, some of them scored better than others on most of the metrics. Does that mean we could use only those and be done with it? No. What this analysis tells us is that some datasets score better for this set of web pages because they cover more of the entities found in those pages. However, even if a dataset scores lower, it does not mean it is useless. That weaker dataset may cover a prediction area not covered by a better one, which means that by combining the two we can improve the general prediction power of the system. This is what we see when adding the private datasets to the public ones.

Even if the highly curated and domain-specific private datasets score much better than the more general public datasets, the system still greatly benefits from the contribution of the public datasets, which significantly improve the accuracy of the system. We got a gain of 19.19% in accuracy by adding the public datasets to the better-scoring private dataset #1. Nearly 20% of improvement in such a predictive system is highly significant.

Another thing that this series of tests tends to demonstrate is that the more knowledge we have, the more we can improve the accuracy of the system. Adding datasets doesn’t appear to lower the overall performance of the system (even if I am sure that some could), so generally the more the better (although more doesn’t necessarily produce significant accuracy increases).

Finally, adding a feature to the system can also greatly improve its overall accuracy. In this case, we added the feature of detecting unknown entities (organization entities that do not exist in the datasets that compose the knowledge base), which improved the overall accuracy by another 19.73%. How is that possible? To understand this we have to consider the domain: random web pages that exist on the Web. A web page can be published by anybody and any organization, which means that the long tail of web page publishers is probably quite long. Considering this fact, it is normal that existing knowledge bases do not contain all of the obscure organizations that publish web pages. This is most likely why having a system that can detect and predict unknown entities as the publishers of web pages has a significant impact on the overall accuracy of the system. The flagging of such “unknown” entities also tells us where to focus efforts to add to the known database of existing publishers.


As we saw in this analysis, adding high-quality and domain-specific private datasets can greatly improve the accuracy of such a prediction system. Some datasets may have a more significant impact than others but, overall, each dataset contributes to the overall improvement of the predictions.


Using Cognonto to Generate Domain Specific word2vec Models

word2vec is a two-layer artificial neural network that processes text to learn the relationships between the words within a text corpus, producing a model of those relationships. The text corpus that a word2vec process uses to learn the relationships between words is called the training corpus.

In this article I will show you how Cognonto‘s knowledge base can be used to automatically create highly accurate domain-specific training corpuses that word2vec can use to generate word relationship models. Note, however, that what is being discussed here is not only applicable to word2vec, but to any method that uses corpuses of text for training. For example, in another article, I will show how this can be done with another algorithm called ESA (Explicit Semantic Analysis).

It is said about word2vec that “given enough data, usage and contexts, word2vec can make highly accurate guesses about a word’s meaning based on past appearances.” What I will show in this article is how to determine the context and we will see how this impacts the results.

Training Corpus

A training corpus is really just a set of text used to train unsupervised machine learning algorithms. Any kind of text can be used by word2vec; the only thing it does is learn the relationships between the words that exist in the text. However, not all training corpuses are equal: they are often dirty, biased and ambiguous. Depending on the task at hand, that may be exactly what is required, but more often than not, these errors need to be fixed. Cognonto has the advantage of starting with clean text.

When we want to create a new training corpus, the first step is to find a source of text that could work to create that corpus. The second step is to select the text we want to add to it. The third step is to pre-process that corpus of text to perform different operations on the text, such as: removing HTML elements; removing punctuation; normalizing text; detecting named entities; etc. The final step is to train word2vec to generate the model.

word2vec is somewhat dumb. It only learns what exists in the training corpus; it does nothing other than “read” the text and analyze the relationships between the words (which are really just groups of characters separated by spaces). The word2vec process is highly subject to the Garbage In, Garbage Out principle, which means that if the training set is dirty, biased and ambiguous, then the learned relationships will end up being of little or no value.

Domain-specific Training Corpus

A domain-specific training corpus is a specialized training corpus where its text is related to a specific domain. Examples of domains are music, mathematics, cars, healthcare, etc. In contrast, a general training corpus is a corpus of text that may contain text that discusses totally different domains. By creating a corpus of text that covers a specific domain of interest, we limit the usage of words (that is, their co-occurrences) to texts that are meaningful to that domain.

As we will see in this article, a domain-specific training corpus can be quite useful, and much more powerful than a general one, if the task at hand relates to a specific domain of expertise. The major problem with domain-specific training corpuses is that they are really costly to create. We not only have to find the source of data to use, but we also have to select each document that we want to include in the training corpus. This can work if we want a corpus with 100 or 200 documents, but what if you want a training corpus of 100,000 or 200,000 documents? Then it becomes a problem.

This is the kind of problem that Cognonto helps to resolve. KBpedia, Cognonto’s knowledge base, is a set of ~39,000 reference concepts that have ~138,000 links to the schema of external data sources such as Wikipedia, Wikidata and USPTO. It is this structure and these links to external data sources that we use to create domain-specific training corpuses on the fly. We leverage the reference concept structure to select the concepts that should be part of the domain being defined. Then we use Cognonto’s inference capabilities to infer all the other hundreds or thousands of concepts that define the full scope of the domain. Next we analyze the hundreds or thousands of concepts we selected that way to get all of the links to external data sources. Finally, we use these references to create the training corpus. All of this is done automatically once the initial few concepts that define the domain are selected. The workflow looks like this: select the seed reference concepts, infer the full set of domain concepts, collect their links to external sources, and aggregate the linked text into the corpus.


The Process

To show you how this process works, I will create a domain-specific training set about musicians using Cognonto. Then I will use the Google News word2vec model created by Google, which was trained on about 100 billion words. The Google model contains 300-dimensional vectors for 3 million words and phrases. I will use the Google News model as the general model to compare the results/performance between a domain-specific and a general model.

Determining the Domain

The first step is to define the scope of the domain we want to create. For this article, I want a domain that is somewhat constrained to create a training corpus that is not too large for demo purposes. The domain I have chosen is musicians. This domain is related to people and bands that play music. It is also related to musical genres, instruments, music industry, etc.

To create my domain, I select a single KBpedia reference concept: Musician. If I wanted to broaden the scope of the domain, I could have included other concepts such as: Music, Musical Group, Musical Instrument, etc.

Aggregating the Domain-specific Training Corpus

Once we have determined the scope of the domain, the next step is to query the KBpedia knowledge base to aggregate all of the text that will belong to that training corpus. The end result of this operation is to create a training corpus with text that is only related to the scope of the domain we defined.

(defn create-domain-specific-training-set
  [target-kbpedia-class corpus-file]
  (let [step 1000
        entities-dataset ""
        kbpedia-dataset ""
        nb-entities (get-nb-entities-for-class-ws target-kbpedia-class entities-dataset kbpedia-dataset)]
    (loop [nb 0]
      (when (< nb nb-entities)
        ;; page through the entities of the domain, `step` entities at a time
        (doseq [entity (get-entities-slice target-kbpedia-class entities-dataset kbpedia-dataset :limit step :offset nb)]
          ;; append the text of each entity as one line of the corpus file
          (spit corpus-file (str (get-entity-content entity) "\n") :append true))
        (println (str (min (+ nb step) nb-entities) "/" nb-entities))
        (recur (+ nb step))))))

(create-domain-specific-training-set "" "resources/musicians-corpus.txt")

What this code does is to query the KBpedia knowledge base to get all the named entities that are linked to it, for the scope of the domain we defined. Then the text related to each entity is appended to a text file where each line is the text of a single entity.

Given the scope of the current demo, the musicians training corpus is composed of 47,263 documents. This is the crux of the demo. With a simple function, we are able to aggregate 47,263 text documents highly related to a conceptual domain we defined on the fly. All of the hard work has been delegated to the knowledge base and its conceptual structure (in fact, this simple function leverages 8 years of hard work).

Normalizing Text

The next step is natural to any NLP pipeline. Before learning from the training corpus, we should clean and normalize the raw text.

(require '[clojure.string :as string]
         '[clojure.java.io :as io])

(defn normalize-proper-name
  [name]
  (-> name
      (string/replace #" " "_")
      (string/lower-case)))

(defn pre-process-line
  [line]
  (-> (let [line (-> line
                     ;; 1. remove all underscores
                     (string/replace "_" " "))]
        ;; 2. detect named entities and replace them with their underscore form, like: Fred Giasson -> fred_giasson
        (loop [entities (into [] (re-seq #"[\p{Lu}]([\p{Ll}]+|\.)(?:\s+[\p{Lu}]([\p{Ll}]+|\.))*(?:\s+[\p{Ll}][\p{Ll}\-]{1,3}){0,1}\s+[\p{Lu}]([\p{Ll}]+|\.)" line))
               line line]
          (if (empty? entities)
            line
            (let [entity (first (first entities))]
              (recur (rest entities)
                     (string/replace line entity (normalize-proper-name entity)))))))
      ;; 3. remove the stop words (`stop-list` is a regex pattern string defined elsewhere)
      (string/replace (re-pattern stop-list) " ")
      ;; 4. remove everything between brackets like: [1] [edit] [show]
      (string/replace #"\[.*\]" " ")
      ;; 5. punctuation characters except the dot and the single quote, replace by nothing: (),[]-={}/\~!?%$@&*+:;<>
      (string/replace #"[\^\(\)\,\[\]\=\{\}\/\\\~\!\?\%\$\@\&\*\+:\;\<\>\"\p{Pd}]" " ")
      ;; 6. remove all numbers
      (string/replace #"[0-9]" " ")
      ;; 7. remove all words with 2 characters or less
      (string/replace #"\b[\p{L}]{0,2}\b" " ")
      ;; 10. normalize spaces
      (string/replace #"\s{2,}" " ")
      ;; 11. normalize dots with spaces
      (string/replace #"\s\." ".")
      ;; 12. normalize dots
      (string/replace #"\.{1,}" ".")
      ;; 13. normalize underscores
      (string/replace #"\_{1,}" "_")
      ;; 14. remove standalone single quotes
      (string/replace " ' " " ")
      ;; 15. re-normalize spaces
      (string/replace #"\s{2,}" " ")
      ;; 16. put everything lowercase
      (string/lower-case)
      (str "\n")))

(defn pre-process-corpus
  [in-file out-file]
  (spit out-file "" :append true)
  (with-open [file (io/reader in-file)]
    (doseq [line (line-seq file)]
      (spit out-file (pre-process-line line) :append true))))

(pre-process-corpus "resources/musicians-corpus.txt" "resources/musicians-corpus.clean.txt")

We remove all of the characters that may cause issues for the tokenizer used by the word2vec implementation. We also remove unnecessary words and other words that appear too often or that add nothing to the model we want to generate (like the listing of days and months). We also drop all numbers.

Training word2vec

The last step is to train word2vec on our clean domain-specific training corpus to generate the model we will use. For this demo, I will use the DL4J (Deep Learning for Java) library that is a Java implementation of the word2vec algorithm. Training word2vec is as simple as using the DL4J API like this:

;; assumes the relevant DL4J classes are imported: LineSentenceIterator,
;; DefaultTokenizerFactory, Word2Vec$Builder and SerializationUtils,
;; plus clojure.java.io required as io
(defn train
  [training-set-file model-file]
  (let [sentence-iterator (new LineSentenceIterator (io/file training-set-file))
        tokenizer (new DefaultTokenizerFactory)
        vec (.. (new Word2Vec$Builder)
                (minWordFrequency 1)
                (windowSize 5)
                (layerSize 100)
                (iterate sentence-iterator)
                (tokenizerFactory tokenizer)
                (build))]
    (.fit vec)
    (SerializationUtils/saveObject vec (io/file model-file))
    vec))

(def musicians-model (train "resources/musicians-corpus.clean.txt" "resources/musicians-corpus.model"))

What is important to notice here is the number of parameters that can be defined to train word2vec on a corpus. In fact, that algorithm can be sensitive to parametrization.
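
As an illustration only, here is a variant of the `train` function above that exposes the main hyper-parameters, making it easier to experiment with this parametrization; the default values mirror the ones used above and the imports are the same as for `train`.

(defn train-with-params
  "Same as `train`, but with the main word2vec hyper-parameters exposed."
  [training-set-file model-file {:keys [min-word-frequency window-size layer-size]
                                 :or   {min-word-frequency 1 window-size 5 layer-size 100}}]
  (let [sentence-iterator (new LineSentenceIterator (io/file training-set-file))
        tokenizer (new DefaultTokenizerFactory)
        model (.. (new Word2Vec$Builder)
                  (minWordFrequency min-word-frequency)
                  (windowSize window-size)
                  (layerSize layer-size)
                  (iterate sentence-iterator)
                  (tokenizerFactory tokenizer)
                  (build))]
    (.fit model)
    (SerializationUtils/saveObject model (io/file model-file))
    model))

;; e.g. a model with a wider context window and larger vectors:
;; (train-with-params "resources/musicians-corpus.clean.txt"
;;                    "resources/musicians-corpus-w8.model"
;;                    {:window-size 8 :layer-size 200})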

Importing the General Model

The goal of this demo is to demonstrate the difference between a domain-specific model and a general model. Remember that the general model we chose is the Google News model, which is composed of billions of words but is highly general. DL4J can import that model without us having to generate it ourselves (in fact, only the model is distributed by Google, not the training corpus):

(defn import-google-news-model
  []
  (org.deeplearning4j.models.embeddings.loader.WordVectorSerializer/loadGoogleModel
    (io/file "GoogleNews-vectors-negative300.bin.gz") true))

(def google-news-model (import-google-news-model))

Playing With Models

Now that we have a domain-specific model related to musicians and a general model related to news processed by Google, let’s start playing with both to see how they perform on different tasks. In the following examples, we will always compare the domain-specific training corpus with the general one.

Ambiguous Words

A characteristic of words is that their surface form can be ambiguous; they can have multiple meanings. An ambiguous word can co-occur with multiple other words that may not have any shared meaning. But all of this depends on the context. If we are in a general context, then this situation will happen more often than we think and will impact the similarity score of these ambiguous terms. However, as we will see, this phenomenon is greatly diminished when we use domain-specific models.

Similarity Between Piano, Organ and Violin

What we want to check is the relationship between 3 different musical instruments: piano, organ and violin.

(.similarity musicians-model "piano" "violin")
(.similarity musicians-model "piano" "organ")

As we can see, both tuples have a high likelihood of co-occurrence. This suggests that the terms of each tuple are probably highly related. In this case, it is probably because violins are often played along with a piano, and because an organ looks like a piano (at least it has a keyboard).

Now let’s take a look at what the general model has to say about that:

(.similarity google-news-model "piano" "violin")
(.similarity google-news-model "piano" "organ")

The surprising fact here is the apparent dissimilarity between piano and organ compared with the results we got with the musicians domain-specific model. If we think a bit about this use case, we will probably conclude that these results make sense. In fact, organ is an ambiguous word in a general context: an organ can be a musical instrument, but it can also be a part of an anatomy. This means that the word organ will co-occur beside piano, but also beside all kinds of other words related to human and animal biology. This is why the two words are less similar in the general model than in the domain-specific one.

Similarity Between Album and Track

Now let’s look at another similarity example between two other words, album and track, where track is an ambiguous word depending on the context.

(.similarity musicians-model "album" "track")
(.similarity google-news-model "album" "track")

As expected, because track is ambiguous, there is a big difference in terms of co-occurrence probabilities depending on the context (domain-specific or general).

Similarity Between Pianist and Violinist

However, do the domain-specific and general models always differ like this? Let’s take a look at two words that are domain-specific and unambiguous: pianist and violinist.

(.similarity musicians-model "pianist" "violinist")
(.similarity google-news-model "pianist" "violinist")

In this case, the similarity score between the two terms is almost the same. In both contexts (general and domain-specific), their co-occurrence is similar.

Nearest Words

Now let’s look at nearest words in these two distinct contexts. Let’s take a few words and see which other words occur most often with them.


(.wordsNearest musicians-model ["music"] [] 7)
music revol samoilovich bunin musical amalgamating assam. voice dance.
(.wordsNearest google-news-model ["music"] [] 8)
music classical music jazz Music Without Donny Kirshner songs musicians tunes

One observation we can make is that the terms from the musicians model are more general than the ones from the general model.


(.wordsNearest musicians-model ["track"] [] 10)
track released. album latest entitled released debut year. titled positive
(.wordsNearest google-news-model ["track"] [] 5)
track tracks Track racetrack horseshoe shaped section

As we know, track is ambiguous. The difference between these two sets of nearest related words is striking. There is a clear conceptual correlation in the musicians’ domain-specific model. But in the general model, it is really going in all directions.


Now let’s take a look at a really general word: year

(.wordsNearest musicians-model ["year"] [] 11)
year ghantous. he was grammy naacap grammy award for best luces del alma year. grammy award grammy for best sitorai sol nominated
(.wordsNearest google-news-model ["year"] [] 10)
year month week months decade years summer year.The September weeks

This one is quite interesting too. Both groups of words make sense, but only in their respective contexts. With the musicians’ model, year is mostly related to awards (like the Grammy Awards 2016), categories like “song of the year”, etc.

In the context of the general model, year is really related to time concepts: months, seasons, etc.

Playing With Co-Occurrences Vectors

Finally, we will play with the co-occurrence vectors by manipulating them directly. A really popular word2vec equation is king - man + woman = queen. What is happening under the hood with this equation is that we are adding and subtracting the co-occurrence vectors of these words, and then checking the nearest words of the resulting co-occurrence vector.
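
That classic equation can be expressed with the same `wordsNearest` call used in the examples below, where the first vector holds the words to add and the second the words to subtract; with the Google News model, queen is expected to show up among the results.

;; king - man + woman: positive words first, negative words second, then the
;; number of nearest words to return.
(.wordsNearest google-news-model ["king" "woman"] ["man"] 5)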

Now, let’s take a look at a few of these equations.

Pianist + Renowned = ?

(.wordsNearest musicians-model ["pianist" "renowned"] [] 9)
pianist renowned teacher. composer. prolific virtuoso teacher leading educator.
(.wordsNearest google-news-model ["pianist" "renowned"] [] 7)
renowned pianist pianist composer jazz pianist classical pianists composer pianist virtuoso pianist

These kinds of operations are interesting. If we add the two co-occurrence vectors for pianist and renowned, then we get that a teacher, an educator, a composer or a virtuoso is a renowned pianist.

For unambiguous surface forms like pianist, the two models both score quite well. The difference between the two examples comes from the way the general training corpus was created (pre-processed) compared to the musicians corpus.

Metal + Death = ?

(.wordsNearest musicians-model ["metal" "death"] [] 10)
metal death thrash deathcore melodic doom grindcore metalcore mathcore heavy
(.wordsNearest google-news-model ["metal" "death"] [] 5)
death metal Tunstallbled steel Death

This example uses two quite general words with no apparent relationship between them. The results with the musicians’ model are all highly similar genres of music like thrash metal, deathcore, etc.

However with the general model, it is a mix of multiple unrelated concepts.

Metal – Death + Smooth = ?

Let’s play some more with these equations. What if we want some kind of smooth metal?

(.wordsNearest musicians-model ["metal" "smooth"] ["death"] 5)
smooth fusion funk hard neo

This one is quite interesting. We subtracted the death co-occurrence vector from the metal one, and then we added the smooth vector. What we end up with is a bunch of music genres that are much smoother than death metal.

(.wordsNearest google-news-model ["metal" "smooth"] ["death"] 5)
smooth metal Brushed aluminum durable polycarbonate chromed steel

In the case of the general model, we end up with “smooth metal”. The removal of the death vector has no effect on the results, probably because these are three ambiguous and really general terms.

What Is Next

The demo I presented in this article uses public datasets currently linked to KBpedia. You may wonder what the other possibilities are. One possibility is to link your own private datasets to KBpedia. That way, these private datasets would become usable, in exactly the same way, to create domain-specific training corpuses on the fly. Another possibility would be to take totally unstructured text like local text documents, or semi-structured text like a set of HTML web pages, and tag each document with KBpedia reference concepts using the Cognonto topics analyzer. Then we could use the KBpedia structure in exactly the same way to choose which of these documents we want to include in a domain-specific training corpus.


As we saw, creating domain-specific training corpuses to use with word2vec can have a dramatic impact on the results, and on how much more meaningful the results can be within the scope of that domain. Another advantage of domain-specific training corpuses is that they create much smaller models. This is quite an interesting characteristic since smaller models are faster to generate, faster to download/upload, faster to query, consume less memory, etc.

Of the concepts in KBpedia, roughly 33,000 of them correspond to types (or classes) of various sorts. These pre-determined slices are available across all needs and domains to generate such domain-specific corpuses. Further, KBpedia is designed for rapid incorporation of your own domain information to add further to this discriminatory power.