There are many situations where we want to link named entities from two different datasets, or to find and remove duplicate entities within a single dataset. The same is true for vocabulary terms or ontology classes that we want to integrate and map together. Sometimes we also want to use such a linkage system to save time when creating gold standards for named entity recognition tasks.
There exist multiple data linkage & deduplication frameworks developed in several different programming languages. At Cognonto, we have our own system called the Cognonto Mapper.
Most mapping frameworks work more or less the same way. They use one or two datasets as sources of entities (or classes or vocabulary terms) to compare. The datasets can be managed by a conventional relational database management system, a triple store, a spreadsheet, etc. Then they have complex configuration options that let the user define all kinds of comparators that will try to match the values of different properties that describe the entities in each dataset. (Comparator types may be simple string comparisons, the added use of alternative labels or definitions, attribute values, or various structural relationships and linkages within the dataset.) Then the comparison is made for all the entities (or classes or vocabulary terms) existing in each dataset. Finally, an entity similarity score is calculated, with some threshold conditions used to signal whether the two entities (or classes or vocabulary terms) are the same or not.
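To make this general workflow concrete, here is a minimal Python sketch (Python rather than the Clojure used later in this article, purely for illustration). It is not the Cognonto Mapper's implementation: the entity dictionaries, the `best_label_similarity` comparator, the weights and the `0.9` threshold are all assumptions made up for the example.

```python
from difflib import SequenceMatcher


def label_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_label_similarity(entity_a: dict, entity_b: dict) -> float:
    """Hypothetical comparator: best score over preferred and alternative labels."""
    labels_a = [entity_a["label"], *entity_a.get("alt_labels", [])]
    labels_b = [entity_b["label"], *entity_b.get("alt_labels", [])]
    return max(label_similarity(x, y) for x in labels_a for y in labels_b)


def entity_similarity(entity_a: dict, entity_b: dict, comparators) -> float:
    """Weighted average of all comparator scores (the weights are assumptions)."""
    total_weight = sum(weight for _, weight in comparators)
    weighted = sum(weight * compare(entity_a, entity_b) for compare, weight in comparators)
    return weighted / total_weight


def link(source, target, comparators, threshold=0.9):
    """Compare every source entity with every target entity; keep pairs above the threshold."""
    matches = []
    for a in source:
        for b in target:
            score = entity_similarity(a, b, comparators)
            if score >= threshold:
                matches.append((a["id"], b["id"], score))
    return matches


# Toy datasets with made-up identifiers.
source = [{"id": "a:1", "label": "Attila"}]
target = [
    {"id": "b:1", "label": "Attila (band)", "alt_labels": ["Attila"]},
    {"id": "b:2", "label": "Mina"},
]
matches = link(source, target, [(best_label_similarity, 1.0)])
print(matches)
```

A real framework would add more comparators to the list (definitions, attribute values, structural relationships) and would avoid the exhaustive pairwise loop on large datasets, but the compare, score and threshold steps stay the same.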
The Cognonto Mapper works in this same general way. However, as you may suspect, it has a special trick in its toolbox: the SuperType Comparator. The SuperType Comparator leverages the KBpedia Knowledge Ontology to help disambiguate two given entities (or classes or vocabulary terms) based on their types and on the analysis of those types in the KBpedia Knowledge Ontology. When we perform a deduplication or a linkage task between two large datasets of entities, it is often the case that two entities will be considered a nearly perfect match based on common properties such as names and alternative names, even if they are two completely different things. This happens because entities are often ambiguous when only these basic properties are considered. The SuperType Comparator’s role is to disambiguate the entities based on their type(s) by leveraging the disjointness of the SuperType structure that governs the overall KBpedia structure. The SuperType Comparator greatly reduces the time needed to curate the deduplication or linkage tasks in order to determine the final mappings.
We first present a series of use cases for the Mapper below, followed by an explanation of how the Cognonto Mapper works, and then some conclusions.
Usages Of The Cognonto Mapper
When should the Cognonto Mapper, or other deduplication and mapping services, be used? While there are many tasks that warrant the usage of such a system, let’s focus for now on some use cases related to Cognonto and machine learning in general.
Mapping Across Schema
One of Cognonto’s most important use cases is to use the Mapper to link new vocabularies, schemas or ontologies to the KBpedia Knowledge Ontology (KKO). This is exactly what we did for the 24 external ontologies and schemas that we have integrated into KBpedia. Creating such a mapping can be a long and painstaking process. The Mapper greatly helps in linking similar concepts by narrowing the initial pool of candidate mappings, thereby increasing the efficiency of the analyst charged with selecting the final mappings between the two ontologies.
Creating ‘Gold Standards’
In my last article, I created a gold standard of 511 random web pages where I determined the publisher of the web page by hand. That gold standard was used to measure the performance of a named entity recognition task. However, to create the actual gold standard, I had to check in each dataset (5 of them, with millions of entities) whether that publisher existed in any of them. Performing such a task by hand means that I would have to send at least 2555 search queries to try to find a matching entity. Let’s say that I am fast, and that I can write the query, send it, look at the results, and copy/paste the URI of the right entity into the gold standard within 30 seconds. It still means that I would complete such a task in roughly 21 hours. It is also clearly impossible for a sane person to do that 8 hours per day for ~3 days, so this task would probably take at least 1 week to complete.
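The estimate above can be checked with a few lines of Python. The figures (511 pages, 5 datasets, 30 seconds per query) come from the paragraph above; only the variable names are mine.

```python
pages = 511              # web pages in the gold standard
datasets = 5             # datasets to search for each publisher
seconds_per_query = 30   # optimistic time to write, send, check and copy one result

queries = pages * datasets
hours = queries * seconds_per_query / 3600

print(queries, round(hours, 1))  # prints: 2555 21.3
```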
This is why automating this mapping process is really important, and this is what the Cognonto Mapper does. The only thing that is needed is to configure 5 mapper sessions. Each session tries to map the entities I identified by hand from the 511 web pages to one of the 5 datasets. Then I only need to run the mapper for each dataset, review the matches, find the missing ones by hand and merge the results into the final gold standard.
Curating Unknown Entities
In Cognonto, we have an unknown entities tagger that is used to detect possible publisher organizations that do not currently exist in the KBpedia knowledge base. In some cases, what we want to do is to save these detected unknown entities in an unknown entities dataset. This dataset is then used to review the detected entities so that they can be included back into the KBpedia knowledge base (such that they become new entities). In the review workflow, one of the steps should be to try to find similar entities, to make sure that what was detected by the entities tagger is a totally new entity and not a new surface form of an existing entity (which would become an alternative label for that entity and not an entirely new one). Such a check in the review workflow would be performed by the Cognonto Mapper.
How Does the SuperType Comparator Work?
As I mentioned in the introduction, the Cognonto Mapper is yet another linkage & deduplication framework. However, it has a special twist: its SuperType Comparator and the leveraging of the KBpedia Knowledge Ontology. Good, but how does it work? There is no better way to understand how it works than studying how two entities can be disambiguated based on their type. So, let’s do this.
Let’s consider this use case. We want to map two datasets together: Wikipedia and Musicbrainz. One of the Musicbrainz entities we want to map to Wikipedia is a music group called Attila, with Billy Joel and Jon Small. Attila also exists in Wikipedia, but the name is highly ambiguous and may refer to multiple different things. If we set up our linkage task to only work on the preferred and possible alternative labels, then we would have a match between the name of that band and multiple other things in Wikipedia, with matching likelihoods that are probably nearly identical. How could we update the configuration to try to solve this issue? We have no choice: we will have to use the Cognonto Mapper SuperType Comparator.
Musicbrainz RDF dumps normally map a Musicbrainz group to a mo:MusicGroup. In the Wikipedia RDF dump the Attila rock band has a type dbo:Band. Both of these classes are linked to the KBpedia reference concept kbpedia:Band-MusicGroup. This means that the entities of both of these datasets are well connected into KBpedia.
Let’s say that the Cognonto Mapper does detect that the Attila entity in the Musicbrainz dataset has 4 candidates in Wikipedia: a rock band (dbo:Band), a bird (dbo:Bird), a film (dbo:Film) and an album (dbo:Album).
If the comparison is only based on the preferred label, the likelihood will be the same for all these entities. However, what happens when we start using the SuperType Comparator and the KBpedia Knowledge Ontology?
First we have to understand the context of each type. Using KBpedia, we can determine that rock bands, birds, albums and films fall under the super types kko:Organizations, kko:Animals, kko:AudioInfo and kko:VisualInfo respectively, and that the super type of rock bands is disjoint from the other three.
Now that we understand each of the entities the system is trying to link together, and their context within the KBpedia Knowledge Ontology, let’s see how the Cognonto Mapper will score each of these entities based on their type to help disambiguate where labels are identical.
;; stc-ex-compare is assumed to be an already-configured SuperType Comparator instance.
(println "mo:MusicGroup -> dbo:Band"
         (.compare stc-ex-compare
                   "http://purl.org/ontology/mo/MusicGroup"
                   "http://dbpedia.org/ontology/Band"))
(println "mo:MusicGroup -> dbo:Bird"
         (.compare stc-ex-compare
                   "http://purl.org/ontology/mo/MusicGroup"
                   "http://dbpedia.org/ontology/Bird"))
(println "mo:MusicGroup -> dbo:Film"
         (.compare stc-ex-compare
                   "http://purl.org/ontology/mo/MusicGroup"
                   "http://dbpedia.org/ontology/Film"))
(println "mo:MusicGroup -> dbo:Album"
         (.compare stc-ex-compare
                   "http://purl.org/ontology/mo/MusicGroup"
                   "http://dbpedia.org/ontology/Album"))
| Comparison | Similarity |
|---|---|
| mo:MusicGroup -> dbo:Band | 1.0 |
| mo:MusicGroup -> dbo:Bird | 0.2 |
| mo:MusicGroup -> dbo:Film | 0.2 |
| mo:MusicGroup -> dbo:Album | 0.2 |
In these cases, the SuperType Comparator assigned a similarity of 1.0 to the mo:MusicGroup and dbo:Band classes, since those two classes are equivalent. All the other checks return 0.2. When the comparator finds two entities that have disjoint SuperTypes, it assigns them the similarity value 0.2. Why not 0.0 if they are disjoint? Because there may be errors in the knowledge base: with a very low but non-zero score, the candidate is still available for evaluation, even though its score is much reduced.
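The two rules described here (equivalent classes score 1.0, disjoint super types score 0.2) can be sketched in Python as follows. This is only an illustration of the rules, not the Mapper's code: the lookup tables are hand-transcribed from the Attila example instead of being read from KBpedia, and the name `compare_types` is hypothetical.

```python
EQUIVALENT_SCORE = 1.0
DISJOINT_SCORE = 0.2  # low but non-zero, in case the knowledge base contains errors

# Hypothetical lookup tables, filled in by hand from the example in this article.
REFERENCE_CONCEPT = {
    "mo:MusicGroup": "kbpedia:Band-MusicGroup",
    "dbo:Band": "kbpedia:Band-MusicGroup",
}
SUPER_TYPE = {
    "mo:MusicGroup": "kko:Organizations",
    "dbo:Band": "kko:Organizations",
    "dbo:Bird": "kko:Animals",
    "dbo:Film": "kko:VisualInfo",
    "dbo:Album": "kko:AudioInfo",
}
DISJOINT = {
    frozenset(pair)
    for pair in [
        ("kko:Organizations", "kko:Animals"),
        ("kko:Organizations", "kko:VisualInfo"),
        ("kko:Organizations", "kko:AudioInfo"),
    ]
}


def compare_types(type_a: str, type_b: str):
    """Return 1.0 for equivalent classes, 0.2 for disjoint super types, else None."""
    ref_a = REFERENCE_CONCEPT.get(type_a)
    ref_b = REFERENCE_CONCEPT.get(type_b)
    if ref_a is not None and ref_a == ref_b:
        return EQUIVALENT_SCORE
    if frozenset((SUPER_TYPE[type_a], SUPER_TYPE[type_b])) in DISJOINT:
        return DISJOINT_SCORE
    return None  # neither equivalent nor disjoint: another measure is needed


for candidate in ("dbo:Band", "dbo:Bird", "dbo:Film", "dbo:Album"):
    print("mo:MusicGroup ->", candidate, compare_types("mo:MusicGroup", candidate))
```

Returning `None` for the remaining case is a design choice of this sketch: it makes explicit that types which are neither equivalent nor disjoint cannot be settled by these two rules alone.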
In this case the matching is unambiguous and the right linkage to select is obvious. However, you will see below that it is not always (and often not) that simple to make such a clear selection.
Now let’s say that the next entity to match from the Musicbrainz dataset is another entity called Attila, but this time it refers to Attila, the album by Mina. Since the basis of the comparison has changed (we are now comparing the Musicbrainz Attila album instead of the band), the entire process will yield different results. The main difference is that the album will be compared to a film and an album from the Wikipedia dataset. As you can see in the graph below, these two entities belong to the super types kko:VisualInfo and kko:AudioInfo, which are not disjoint.
(println "mo:MusicalWork -> dbo:Band"
         (.compare stc-ex-compare
                   "http://purl.org/ontology/mo/MusicalWork"
                   "http://dbpedia.org/ontology/Band"))
(println "mo:MusicalWork -> dbo:Bird"
         (.compare stc-ex-compare
                   "http://purl.org/ontology/mo/MusicalWork"
                   "http://dbpedia.org/ontology/Bird"))
(println "mo:MusicalWork -> dbo:Film"
         (.compare stc-ex-compare
                   "http://purl.org/ontology/mo/MusicalWork"
                   "http://dbpedia.org/ontology/Film"))
(println "mo:MusicalWork -> dbo:Album"
         (.compare stc-ex-compare
                   "http://purl.org/ontology/mo/MusicalWork"
                   "http://dbpedia.org/ontology/Album"))
| Comparison | Similarity |
|---|---|
| mo:MusicalWork -> dbo:Band | 0.2 |
| mo:MusicalWork -> dbo:Bird | 0.2 |
| mo:MusicalWork -> dbo:Film | 0.8762886597938144 |
| mo:MusicalWork -> dbo:Album | 0.9555555555555556 |
As you can see, the main difference is that we no longer have a perfect match between the entities. We thus need to compare their types, and two of the candidates remain ambiguous based on their SuperTypes (their super types are non-disjoint). In this case, the SuperType Comparator takes the set of super classes of both entities and computes a similarity measure between the two sets. This is why we get roughly 0.876 for the film and 0.956 for the album.
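This article does not say which set similarity measure the SuperType Comparator uses, so the Python sketch below swaps in the Jaccard index purely to illustrate the idea of scoring two sets of super classes. The class names in the three sets are invented for the example, and the resulting numbers are not meant to reproduce the scores above.

```python
def jaccard(classes_a: set, classes_b: set) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    union = classes_a | classes_b
    if not union:
        return 0.0
    return len(classes_a & classes_b) / len(union)


# Hypothetical super class sets; real ones would come from the ontology.
musical_work = {"Thing", "Work", "CreativeWork", "AudioWork"}
album = {"Thing", "Work", "CreativeWork", "AudioWork", "Record"}
film = {"Thing", "Work", "CreativeWork", "VisualWork"}

album_score = jaccard(musical_work, album)  # 4 shared classes out of 5
film_score = jaccard(musical_work, film)    # 3 shared classes out of 5

print("musical work -> album", album_score)
print("musical work -> film", film_score)
```

With any measure of this kind, the more super classes two types share, the higher their score, which is how two non-disjoint candidates can still be ranked against each other.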
A musical work and an album are two nearly identical concepts. In fact, a musical work is the conceptual work behind an album (a record). A musical work is also strongly related to films, since films include musical works. However, the relationship between a musical work and an album is stronger than the one with a film, and this is what the similarity measure shows.
In this case, even if we have two ambiguous entities, an album and a film, for which we don’t have disjoint super types, we are still able to determine which one to choose to create the mapping based on the calculation of the similarity measure.
As we saw, there are multiple reasons why we would want to leverage the KBpedia Knowledge Ontology to help mapping and deduplication frameworks such as the Cognonto Mapper to disambiguate possible entity matches. KBpedia is not only good for mapping datasets together, it is also quite effective to help with some machine learning tasks such as creating gold standards or curating detected unknown entities. In the context of Cognonto, it is quite effective to map external ontologies, schemas or vocabularies to the KBpedia Knowledge Ontology. It is an essential tool for extending KBpedia to domain- and enterprise-specific needs.
In this article I focused on the SuperType Comparator, which leverages the type structure of the KBpedia Knowledge Ontology. However, we can also use other structural features in KBpedia (such as an Aspects Comparator based on the aspects structure of KBpedia), singly or in combination, to achieve other mapping or disambiguation objectives.