Mapping Datasets, Schema and Ontologies Using the Cognonto Mapper

There are many situations where we want to link named entities from two different datasets, or to find and remove duplicate entities within a single dataset. The same is true for vocabulary terms or ontology classes that we want to integrate and map together. Sometimes we also want to use such a linkage system to save time when creating gold standards for named entity recognition tasks.

Multiple data linkage and deduplication frameworks exist, developed in several different programming languages. At Cognonto, we have our own system called the Cognonto Mapper.

Most mapping frameworks work more or less the same way. They use one or two datasets as sources of entities (or classes or vocabulary terms) to compare. The datasets can be managed by a conventional relational database management system, a triple store, a spreadsheet, etc. The frameworks then provide configuration options that let the user define all kinds of comparators, which try to match the values of the different properties that describe the entities in each dataset. (Comparator types may be simple string comparisons, the added use of alternative labels or definitions, attribute values, or various structural relationships and linkages within the dataset.) The comparison is then made for all the entities (or classes or vocabulary terms) existing in each dataset. Finally, an entity similarity score is calculated, with some threshold conditions used to signal whether the two entities (or classes or vocabulary terms) are the same or not.
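To make that description more concrete, here is a minimal sketch in Python of how such comparator-based matching generally works. It is an illustration only, not the Cognonto Mapper's actual API; the property names, weights and threshold are arbitrary assumptions.

```python
from difflib import SequenceMatcher

WEIGHTS = {"label": 0.6, "altLabels": 0.4}   # relative importance of each comparator
THRESHOLD = 0.85                             # above this score, flag the pair as a match

def label_comparator(a, b):
    """Simple string-similarity comparator over preferred labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def alt_labels_comparator(alts_a, alts_b):
    """Best pairwise similarity across the alternative labels of both records."""
    scores = [label_comparator(x, y) for x in alts_a for y in alts_b]
    return max(scores, default=0.0)

def similarity(entity_a, entity_b):
    """Weighted combination of the individual comparator scores."""
    return (WEIGHTS["label"] * label_comparator(entity_a["label"], entity_b["label"])
            + WEIGHTS["altLabels"] * alt_labels_comparator(entity_a.get("altLabels", []),
                                                           entity_b.get("altLabels", [])))

a = {"label": "Winnipeg", "altLabels": ["City of Winnipeg"]}
b = {"label": "Winnipeg, Manitoba", "altLabels": ["Winnipeg"]}
print(similarity(a, b) >= THRESHOLD)   # True or False depending on the threshold
```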

The Cognonto Mapper works in this same general way. However, as you may suspect, it has a special trick in its toolbox: the SuperType Comparator. The SuperType Comparator leverages the KBpedia Knowledge Ontology to disambiguate two given entities (or classes or vocabulary terms) based on an analysis of their types within that ontology. When we perform a deduplication or linkage task between two large datasets of entities, it often happens that two entities are considered a nearly perfect match based on properties like names and alternative names, even though they are two completely different things. This happens because entities are often ambiguous when only these basic properties are considered. The SuperType Comparator's role is to disambiguate the entities based on their type(s) by leveraging the disjointness of the SuperType structure that governs the overall KBpedia structure. The SuperType Comparator greatly reduces the time needed to curate the deduplication or linkage task in order to determine the final mappings.
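To illustrate the idea behind such a type-based check (this is not the Cognonto Mapper's implementation; the SuperTypes, the type-to-SuperType mapping and the disjointness pairs below are invented for the example, not taken from KBpedia), a disjointness veto could look like:

```python
# Illustrative only: these SuperTypes and disjointness declarations are made up
# for the example and are not the actual KBpedia Knowledge Ontology data.
DISJOINT_SUPERTYPES = {
    frozenset({"Animals", "Organizations"}),
    frozenset({"Places", "Events"}),
}

TYPE_TO_SUPERTYPE = {          # hypothetical dataset types mapped to SuperTypes
    "schema:Corporation": "Organizations",
    "dbo:Animal": "Animals",
    "schema:Place": "Places",
}

def supertypes_compatible(types_a, types_b):
    """Veto a candidate match when any pair of its SuperTypes is declared disjoint."""
    sts_a = {TYPE_TO_SUPERTYPE[t] for t in types_a if t in TYPE_TO_SUPERTYPE}
    sts_b = {TYPE_TO_SUPERTYPE[t] for t in types_b if t in TYPE_TO_SUPERTYPE}
    return not any(frozenset({x, y}) in DISJOINT_SUPERTYPES
                   for x in sts_a for y in sts_b)

# Two records both labeled "Jaguar": one is an animal, the other a corporation.
# Their labels match perfectly, but the disjoint SuperTypes veto the match.
print(supertypes_compatible({"dbo:Animal"}, {"schema:Corporation"}))  # False
```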

We first present a series of use cases for the Mapper below, followed by an explanation of how the Cognonto Mapper works, and then some conclusions.

Continue reading “Mapping Datasets, Schema and Ontologies Using the Cognonto Mapper”

Improving Machine Learning Tasks By Integrating Private Datasets

In the last decade, we have seen the emergence of two big families of datasets: public and private. Invaluable public datasets like Wikipedia, Wikidata, OpenCorporates and others have been created and leveraged by organizations worldwide. However, as great as they are, most organizations still rely on private datasets of their own curated data.

In this article, I want to demonstrate how high-value private datasets may be integrated into Cognonto's KBpedia knowledge base to produce a significant impact on the quality of the results of some machine learning tasks. To demonstrate this impact, I have created a demo supported by a "gold standard" of 511 web pages taken at random, each tagged with the organization that published the page. This demo is related to the publisher analysis portion of the Cognonto demo. We will use this gold standard to calculate the performance metrics of the publisher analyzer and, more precisely, to analyze how the analyzer's performance varies depending on the datasets it has access to when making its predictions.
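As a rough sketch of how such a gold standard can be used to compute performance metrics (the record format and the analyzer call below are assumptions for illustration, not Cognonto's actual API):

```python
# Minimal sketch: score a publisher analyzer against a gold standard of
# {"url": ..., "publisher": ...} records. The analyzer is passed in as a
# hypothetical callable that returns a publisher name or None.

def evaluate(gold_standard, predict_publisher):
    """Compute precision, recall and F1 of predicted publishers vs. the gold standard."""
    tp = fp = fn = 0
    for record in gold_standard:
        predicted = predict_publisher(record["url"])   # hypothetical analyzer call
        if predicted is None:
            fn += 1                                    # analyzer found no publisher
        elif predicted == record["publisher"]:
            tp += 1                                    # correct publisher found
        else:
            fp += 1                                    # wrong publisher returned
            fn += 1                                    # and the right one was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running this evaluation once per dataset configuration is all that is needed to compare how the analyzer performs with and without access to a given private dataset.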


Continue reading “Improving Machine Learning Tasks By Integrating Private Datasets”

Using Cognonto to Generate Domain Specific word2vec Models

word2vec is a two-layer artificial neural network that processes text to learn the relationships between the words of a text corpus, producing a model of those relationships. The text corpus that a word2vec process uses to learn these relationships is called the training corpus.

In this article I will show you how Cognonto‘s knowledge base can be used to automatically create highly accurate, domain-specific training corpora that word2vec can use to generate word-relationship models. Note, however, that what is discussed here applies not only to word2vec, but to any method that uses text corpora for training. For example, in another article, I will show how this can be done with another algorithm called ESA (Explicit Semantic Analysis).
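For reference, here is what training on such a corpus looks like in practice, using the gensim library (4.x parameter names) as the word2vec implementation. This is a sketch under the assumption that a domain corpus has already been assembled into a plain-text file, one document per line; the file name and the query term are placeholders.

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Hypothetical domain corpus: one document per line, already filtered to the domain.
with open("domain_corpus.txt", encoding="utf-8") as f:
    sentences = [simple_preprocess(line) for line in f]

model = Word2Vec(
    sentences=sentences,
    vector_size=200,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=3,       # ignore rare words
    sg=1,              # skip-gram, often better for smaller domain corpora
    workers=4,
)

# Words most related to a domain term, according to the trained model.
print(model.wv.most_similar("insurance", topn=10))
```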

It is said of word2vec that "given enough data, usage and contexts, word2vec can make highly accurate guesses about a word's meaning based on past appearances." What I will show in this article is how to determine that context, and how it impacts the results.

Continue reading “Using Cognonto to Generate Domain Specific word2vec Models”

Cognonto

I am proud to announce the start of a new venture called Cognonto. I am particularly proud of it because, even though it is just starting, it is in fact more than eight years old. It is the embodiment of eight years of research, of experimentation, of a great deal of frustration and of great joy with my long-time partner Mike.

Eight years ago, we set a 5-to-10-year vision for our work as partners. We defined an initial series of technological goals and outlined yearly milestones for them. The goals were related to helping solve decades-old problems with data integration and interoperability using what was, at the time, a completely new research field: the Semantic Web.

And here we are eight years later, after working countless hours to create all kinds of projects and services to pay for the research and the pieces of technology we developed for these purposes. Cognonto is the embodiment of that effort, but the effort also produced a series of other purposeful projects such as Structured Dynamics, UMBEL, the Open Semantic Framework and a number of other open source collaterals.

We spent eight years creating, sanitizing, making coherent and consistent, and generating and regenerating a conceptual structure that now comprises 38,930 reference concepts with 138,868 mapping links to 27 external schemas, vocabularies and datasets. This led to the creation of KBpedia, the knowledge graph that drives Cognonto. The full statistics are available here.

I can’t thank Mike enough for this long and wonderful journey that led to the creation of Cognonto. I sent him an endless number of concept lists that he diligently screened, assessed and mapped. We spent hundreds of hours discussing the nuts and bolts of the structure, arguing about its core concepts and how they should be defined and used. It was not without pain, but I believe the result is truly astonishing.

I won’t copy and paste the Cognonto press release here; a link will suffice. It is just not possible for me to write a better introduction than the two-pager Mike wrote for the press release. I would also suggest that you read his Cognonto introduction blog post: Cognonto is on the Hunt for Big AI Game.

In the coming weeks, I will write a lot about Cognonto: what it is, how it can be used, what its use cases are, how the information presented in the demo and knowledge graph sections should be interpreted, and what these pages tell you.

Winnipeg City’s NOW [Data] Portal

The Winnipeg City’s NOW (Neighbourhoods Of Winnipeg) Portal is an initiative to create a complete neighbourhood web portal for the city’s citizens. At the core of the project is a set of about 47 fully linked, integrated and structured datasets of things of interest to Winnipeggers. The focal point of the portal is Winnipeg’s 236 neighbourhoods, which define its main structure. The portal has six main sections: topics of interest, maps, history, census, images and economic development. It is meant to be used by citizens to find things of interest in their neighbourhood, to learn about their history, to see images of those things of interest, to find tools that help economic development, etc.

The NOW portal is not new; Structured Dynamics was also its main technical contractor for its first release in 2013. However, we just finished helping Winnipeg City’s NOW team migrate the older portal from OSF 1.x to OSF 3.x and from Drupal 6 to Drupal 7, and we trained them on the new system. Major improvements accompany this upgrade, but the user interface design remains essentially the same.

I will first introduce each major section of the portal and explain its main features. Then I will discuss the portal’s new improvements.


Continue reading “Winnipeg City’s NOW [Data] Portal”