A common task required by systems that automatically analyze text is to classify an input text into one or multiple classes. A model needs to be created to scope the class (what belongs to it and what does not) and then a classification algorithm uses this model to classify an input text.
Multiple classification algorithms exists to perform such a task: Support Vector Machine (SVM), K-Nearest Neigbours (KNN), C4.5 and others. What is hard with any such text classification task is not so much how to use these algorithms: they are generally easy to configure and use once implemented in a programming language. The hard â€“ and time-consuming â€“ part is to create a sound training corpus that will properly define the class you want to predict. Further, the steps required to create such a training corpus must be duplicated for each class you want to predict.
Since creating the training corpus is what is time consuming, this is where Cognonto provides its advantages.
In this article, we will show you how Cognonto’s KBpedia Knowledge Graph can be used to automatically generate training corpuses that are used to generate classification models. First, we define (scope) a domain with one or multiple KBpedia reference concepts. Second, we aggregate the training corpus for that domain using the KBpedia Knowledge Graph and its linkages to external public datasets that are then used to populate the training corpus of the domain. Third, we use the Explicit Semantic Analysis (ESA) algorithm to create a vectorial representation of the training corpus. Fourth, we create a model using (in this use case) an SVM classifier. Finally, we predict if an input text belongs to the class (scoped domain) or not.
This use case can be used in any workflow that needs to pre-process any set of input texts where the objective is to classify relevant ones into a defined domain.
Unlike more traditional topic taggers where topics are tagged in an input text with weights provided for each of them, we will see how it is possible to use the semantic interpreter to tag main concepts related to an input text even if the surface form of the topic is not mentioned in the text. We accomplish this by leveraging ESA’s semantic interpreter.
Continue reading “Create a Domain Text Classifier Using Cognonto”
There are many situations were we want to link named entities from two different datasets or to find duplicate entities to remove in a single dataset. The same is true for vocabulary terms or ontology classes that we want to integrate and map together. Sometimes we want to use such a linkage system to help save time when creating gold standards for named entity recognition tasks.
There exist multiple data linkage & deduplication frameworks developed in several different programming languages. At Cognonto, we have our own system called the Cognonto Mapper.
Most mapping frameworks work more or less the same way. They use one or two datasets as sources of entities (or classes or vocabulary terms) to compare. The datasets can be managed by a conventional relational database management system, a triple store, a spreadsheet, etc. Then they have complex configuration options that let the user define all kinds of comparators that will try to match the values of different properties that describe the entities in each dataset. (Comparator types may be simple string comparisons, the added use of alternative labels or definitions, attribute values, or various structural relationships and linkages within the dataset.) Then the comparison is made for all the entities (or classes or vocabulary terms) existing in each dataset. Finally, an entity similarity score is calculated, with some threshold conditions used to signal whether the two entities (or classes or vocabulary terms) are the same or not.
The Cognonto Mapper works in this same general way. However, as you may suspect, it has a special trick in its toolbox: the SuperType Comparator. The SuperType Comparator leverages the KBpedia Knowledge Ontology to help disambiguate two given entities (or classes or vocabulary terms) based on their type and the analysis of their types in the KBPedia Knowledge Ontology. When we perform a deduplication or a linkage task between two large datasets of entities, it is often the case that two entities will be considered a nearly perfect match based on common properties like names, alternative names and other common properties even if they are two completely different things. This happens because entities are often ambiguous when only considering these basic properties. The SuperType Comparator’s role is to disambiguate the entities based on their type(s) by leveraging the disjointedness of the SuperType structure that governs the overall KBpedia structure. The SuperType Comparator greatly reduces the time needed to curate the deduplication or linkage tasks in order to determine the final mappings.
We first present a series of use cases for the Mapper below, followed by an explanation of how the Cognonto Mapper works, and then some conclusions.
Continue reading “Mapping Datasets, Schema and Ontologies Using the Cognonto Mapper”
In the last decade, we have seen the emergence of two big families of datasets: the public and the private ones. Invaluable public datasets like Wikipedia, Wikidata, Open Corporates and others have been created and leveraged by organizations world-wide. However, as great as they are, most organization still rely on private datasets of their own curated data.
In this article, I want to demonstrate how high-value private datasets may be integrated into the Cognonto’s KBpedia knowledge base to produce a significant impact on the quality of the results of some machine learning tasks. To demonstrate this impact, I have created a demo that is supported by a “gold standard” of 511 web pages taken at random, to which we have tagged the organization that published the web page. This demo is related to the
publisher analysis portion of the Cognonto demo. We will use this gold standard to calculate the performance metrics of the
publisher analyzer but more precisely, we will analyze the performance of the analyzer depending on the datasets it has access to perform its predictions.
Continue reading “Improving Machine Learning Tasks By Integrating Private Datasets”
word2vec is a two layer artificial neural network used to process text to learn relationships between words within a text corpus to create a model of all the relationships between the words of that corpus. The text corpus that a word2vec process uses to learn the relationships between words is called the training corpus.
In this article I will show you how Cognonto‘s knowledge base can be used to automatically create highly accurate domain specific training corpuses that can be used by word2vec to generate word relationship models. However you have to understand that what is being discussed here is not only applicable to word2vec, but to any method that uses corpuses of text for training. For example, in another article, I will show how this can be done with another algorithm called ESA (Explicit Semantic Analysis).
It is said about word2vec that “given enough data, usage and contexts, word2vec can make highly accurate guesses about a wordâ€™s meaning based on past appearances.” What I will show in this article is how to determine the context and we will see how this impacts the results.
Continue reading “Using Cognonto to Generate Domain Specific word2vec Models”
Extract Structured Content, Tag Concepts & Entities
Cognonto is brand new. At its core, it uses a structure of nearly 40 000 concepts. It has about 138,000 links to external classes and concepts that defines huge public datasets such as Wikipedia, DBpedia and USPTO. Cognonto is not a children’s toy. It is huge and complexâ€¦ but it is very usable. Before digging into the structure itself, before starting to write about all the use cases that Cognonto can support, I will first cover all of the tools that currently exist to help you understand Cognonto and its conceptual structure and linkages (called KBpedia).
The embodiment of Cognonto that people can see are the tools we created and that we made available on the cognonto.com web site. Their goal is to show the structure at work, what ties where, how the conceptual structure and its links to external schemas and datasets help discover new facts, how it can drive other services, etc.
This initial blog post will discuss the demo section of the web site. What we call the Cognonto demo is a web page crawler that analyzes web pages to tag concepts, to tag named entities, to extract structured data, to detect language, to identity topics, and so forth. The demo uses the KBpedia structure and its linkages to Wikipedia, Wikidata, Freebase and USPTO to tag content that appears in the analyzed web pages. But there is one thing to keep in mind: the purpose of Cognonto is to link public or private datasets to the structure to expand its knowledge and make these tools (like the demo) even more powerful. This means that a private organization could use Cognonto, add their own datasets and link their own schemas, to improve their own version of Cognonto or to tailor it for their own purpose.
Let’s see what the demo looks like, what is the information it extracts and analyzes from any web page, and how it ties into the KBpedia structure.
Continue reading “Web Page Analysis With Cognonto”