Measuring the Influence of Expanded Knowledge Graphs on Machine Learning

Mike Bergman and I will release version 1.40 of the KBpedia Knowledge Graph in the coming month. This new version will include roughly 15,000 new concepts, 150,000 new alternative labels, and 5,000 new definitions for existing KBpedia reference concepts, substantially increasing the size of the current KBpedia Knowledge Graph.

This extension is based on a new methodology that we began to cover in the Extending KBpedia With Wikipedia Categories Cognonto use case. The extension uses graph embeddings for each KBpedia reference concept and its linkage to the Wikipedia category structure to pre-select the Wikipedia categories most likely to be good candidates for filling current gaps in the KBpedia graph structure. The new reference concept candidates scored through this automated process were then reviewed for likely selection. These selections were then validated by re-generating the KBpedia Knowledge Graph using the KBpedia Generator, which includes routines for identifying, reporting and fixing consistency and coherency issues. Problematic assignments are either dropped or fixed. These steps reflect the general process Cognonto follows in mapping and incorporating new schemas and ontologies.

In the coming month or two, I will write a series of blog posts that analyze the impact of different versions of KBpedia on the machine learning models we have previously created for the Cognonto use cases. All of the current use cases were created using version 1.20 of KBpedia. We are about to finalize an intermediate version 1.30 (for internal analysis only). We are separately identifying thousands of reference concepts that will be temporarily removed, since they are more properly characterized as 'aspects' and not true sub-classes. This removal will allow us to define a third variant for machine learning comparisons. Some of these 'aspects' will be re-introduced into the graph where proper parent-child relationships can be established. The next public release of KBpedia, tentatively identified as version 1.40, will include all of these updates.

Each of these three variants (versions 1.20, 1.30 and 1.40) will enable us to analyze and report on the influence that different versions of the KBpedia knowledge graph can have on different machine learning tasks. The following tasks will be covered:

  1. Creating graph embeddings to disambiguate tagged concepts
  2. Creating domain specific training corpuses to train word embeddings
  3. Creating domain specific training sets to classify text, and
  4. Checking relatedness between Knowledge Graph concepts and Wikipedia categories based on their graph embeddings.

Our goal at Cognonto is to make the power of knowledge-based artificial intelligence (KBAI) available to any organization. Whether it is for help populating search or tagging indexes, for performing semantic query expansion, or for help with a broad series of machine learning tasks, knowledge graphs plus KBAI provide a nearly automated way of doing so. Our research and expertise are geared toward creating, linking, extending, and leveraging knowledge graphs and knowledge bases to empower new and existing systems. We will continue to report in specific detail how, and with what impact, knowledge graphs and knowledge bases lead to better machine learning results.

Disambiguating KBpedia Knowledge Graph Concepts

One of the most important natural language processing tasks is to “tag” concepts in text. Tagging a concept means determining whether words or phrases in a text document match any of the concepts that exist in some kind of knowledge structure (such as a knowledge graph, an ontology, a taxonomy, a vocabulary, etc.). (A similar definition and process applies to tagging an entity.) Typically, the input text is parsed and normalized in some manner. Then all of the surface forms of the concepts within the input knowledge structure (based on their preferred and alternative labels) are matched against the words within the text. “Tagging” is when a match occurs between a concept in the knowledge structure and one of its surface forms in the input text.
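To make the matching step concrete, here is a minimal sketch in Python. The surface-form index and the concept identifiers are invented stand-ins for what a real knowledge structure would provide, and the normalization here is deliberately crude:

```python
import re

# Toy surface-form index mapping labels (preferred and alternative) to
# candidate concept identifiers; these names are hypothetical.
surface_forms = {
    "bank": ["kbpedia:Bank-Financial", "kbpedia:RiverBank"],
    "river": ["kbpedia:River"],
}

def tag_concepts(text, index):
    """Normalize the input text, then report every token that matches
    a known surface form along with its candidate concepts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t: index[t] for t in tokens if t in index}

print(tag_concepts("She sat on the bank of the river.", surface_forms))
```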

But here is the problem. Given the ambiguous world we live in, a surface form, which after all is only a word or phrase, may be associated with multiple different concepts. When we identify the surface form “bank”, does that term refer to a financial institution, the shore of a river, a plane turning, or a pool shot? Identical surface forms may refer to totally different concepts. Further, sometimes a single concept will be identified but it won’t be the right one, possibly because the right concept is missing from the knowledge structure, or for other reasons.

A good way to view this problem of ambiguity is to analyze a random Web page using the Cognonto Demo online application. The demo uses the Cognonto Concepts Tagger service to tag all of the existing KBpedia knowledge graph concepts found in the target Web page. Often, when you analyze what has been tagged by the demo, you will see some of these ambiguities or wrongly tagged concepts yourself. For instance, check out this example. If you mouse over the tagged concepts, you will notice that many of the individual “tagged” terms refer to multiple KBpedia concepts. Clearly, in its basic form, this Cognonto demo is not disambiguating the concepts.

The purpose of this article is thus to explain how we can “disambiguate” (that is, suggest the proper concept from an ambiguous list) the concepts that have been tagged. We will show how we can leverage the KBpedia knowledge graph structure as-is to perform this disambiguation. First, we will create graph embeddings for each of the KBpedia concepts using the DeepWalk algorithm. Then we will apply simple linear algebra operations to the graph embeddings to determine whether a tagged concept is the right one given its context. We will test multiple different algorithms and strategies to analyze their impact on the overall disambiguation performance of the system.
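As a rough illustration of this approach, the sketch below builds DeepWalk-style embeddings over a toy graph (networkx and gensim assumed available), then picks the candidate whose embedding is closest to the mean of the context concepts' embeddings. The toy graph and the simple cosine scoring stand in for the full KBpedia structure and the strategies the article actually tests:

```python
import random
import numpy as np
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, num_walks=10, walk_length=40):
    """DeepWalk's sampling step: truncated random walks from every node."""
    walks = []
    nodes = list(graph.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for node in nodes:
            walk = [node]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])
    return walks

# Toy stand-in for the KBpedia graph structure.
graph = nx.karate_club_graph()
model = Word2Vec(random_walks(graph), vector_size=64, window=5, sg=1, min_count=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(candidates, context, model):
    """Pick the candidate closest (cosine similarity) to the mean
    embedding of the unambiguous context concepts."""
    context_vec = np.mean([model.wv[str(c)] for c in context], axis=0)
    return max(candidates, key=lambda c: cosine(model.wv[str(c)], context_vec))

print(disambiguate(candidates=[5, 30], context=[0, 1, 2], model=model))
```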

Continue reading “Disambiguating KBpedia Knowledge Graph Concepts”

Extending KBpedia With Wikipedia Categories

A knowledge graph is an ever-evolving structure. It needs to be extended to cope with new kinds of knowledge; it needs to be fixed and improved in all kinds of different ways. It also needs to be linked to other sources of data and to other knowledge representations such as schemas, ontologies and vocabularies. One of the core tasks related to a knowledge graph is to extend its scope. This idea seems simple enough, but how can we extend a general knowledge graph that has nearly 40,000 concepts with potentially thousands more? How can we do this while keeping it consistent, coherent and meaningful? How can we do this without spending undue effort on such a task? These are the questions we will try to answer with the methods we cover in this article.

In this article we present methods for extending Cognonto‘s KBpedia Knowledge Graph using an external source of knowledge, one that has a completely different structure than KBpedia and that has been built with a completely different purpose in mind. In this use case, the external resource is the Wikipedia category structure. We will show how we may automatically select the right Wikipedia categories that could lead to new KBpedia concepts. These selections are made using an SVM classifier trained over graph embedding vectors generated by a DeepWalk model based on the KBpedia Knowledge Graph structure linked to the Wikipedia categories. Once appropriate candidate categories are selected using this model, the results are inspected by a human, who makes the final selection decisions. This semi-automated process takes 5% of the time it would normally take to conduct this task by comparable manual means.
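As a hedged sketch of the selection step, the snippet below trains a linear SVM over pre-computed graph-embedding vectors with scikit-learn. The feature matrix and labels are randomly generated placeholders; in the actual use case they would come from the DeepWalk model over the KBpedia-Wikipedia linkage and from previously reviewed categories:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Placeholder data: rows stand in for DeepWalk vectors of Wikipedia categories;
# labels mark categories accepted (1) or rejected (0) as KBpedia candidates.
X = np.random.rand(1000, 64)
y = np.random.randint(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = LinearSVC(C=1.0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```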

Continue reading “Extending KBpedia With Wikipedia Categories”

Leveraging KBpedia Aspects To Generate Training Sets Automatically

In previous articles I have covered multiple ways to create training corpuses for unsupervised learning and positive and negative training sets for supervised learning [1, 2, 3] using Cognonto and KBpedia. Different structures inherent to a knowledge graph like KBpedia can lead to quite different corpuses and sets. Each of these corpuses or sets may yield different predictive power depending on the task at hand.

So far we have covered two ways to leverage the KBpedia Knowledge Graph to automatically create positive and negative training corpuses:

  1. Using the links that exist between each KBpedia reference concept and its related Wikipedia pages
  2. Using the linkages between KBpedia reference concepts and external vocabularies to create training corpuses out of named entities.

Now we will introduce a third way to create a different kind of training corpus:

  3. Using the KBpedia aspect linkages.

Aspects are aggregations of entities that are grouped according to characteristics other than their direct types. Aspects help to group related entities by situation, not by identity or definition. They provide another way to organize the knowledge graph and to leverage it. KBpedia has about 80 aspects that provide this secondary means for placing entities into related real-world contexts. Not all aspects relate to a given entity.
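For illustration only, here is how aspect linkages might be used to pull together the seed of a training set; the entity identifiers and aspect names below are invented for the example:

```python
# Hypothetical aspect linkage table: entity identifier -> KBpedia aspects.
aspect_links = {
    "ex:EiffelTower": ["aspect:Landmark", "aspect:TouristAttraction"],
    "ex:GoldenGateBridge": ["aspect:Landmark", "aspect:Infrastructure"],
    "ex:ParisCityHall": ["aspect:GovernmentBuilding"],
}

def entities_for_aspect(aspect, links):
    """Collect every entity grouped under a given aspect; such a grouping
    can seed a positive training set for that situation."""
    return [entity for entity, aspects in links.items() if aspect in aspects]

print(entities_for_aspect("aspect:Landmark", aspect_links))
```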


Continue reading “Leveraging KBpedia Aspects To Generate Training Sets Automatically”

Dynamic Machine Learning Using the KBpedia Knowledge Graph – Part 2

In the first part of this series we found good hyperparameters for a single linear SVM classifier. In part 2, we will try another technique to improve the performance of the system: ensemble learning.

So far, we have reached 95% accuracy with some tweaking of the hyperparameters and the training corpuses, but the F1 score is still around ~70% against the full gold standard, which leaves room for improvement. There are also situations where precision should be nearly perfect (because false positives are really not acceptable) or where recall should be optimized.
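This gap between accuracy and F1 is typical of unbalanced classes, where a classifier can score well on accuracy while still missing many positives. A tiny worked example, with made-up predictions, shows the effect:

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative only: 10 positive and 90 negative documents.
y_true = [1] * 10 + [0] * 90
# The classifier finds half of the positives and raises two false alarms.
y_pred = [1] * 5 + [0] * 5 + [0] * 88 + [1] * 2

print(accuracy_score(y_true, y_pred))  # 0.93
print(f1_score(y_true, y_pred))        # ~0.59
```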

Here we will try to improve this situation by using ensemble learning, which combines multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. In our examples, each model will have a vote, and the weight of the vote will be equal for each model. We will use five different strategies to create the models that will belong to the ensemble:

  1. Bootstrap aggregating (bagging)
  2. Asymmetric bagging [1]
  3. Random subspace method (feature bagging)
  4. Asymmetric bagging + random subspace method (ABRS) [1]
  5. Bootstrap aggregating + random subspace method (BRS)

The choice of strategy depends on factors such as whether the positive and negative training documents are unbalanced, how many features the model has, and so on. Let’s introduce each of these different strategies.

Note that in this article I am only creating ensembles with linear SVM learners. An ensemble can be composed of multiple different kinds of learners, such as SVMs with non-linear kernels, decision trees, etc. However, to simplify this article, we will stick to a single linear SVM learner with multiple different training corpuses and features.
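To ground the idea, here is a minimal sketch of one of these strategies, asymmetric bagging with equal-weight voting, using linear SVMs. The data shapes and counts are assumptions; the real pipeline trains on the document features described in part 1:

```python
import numpy as np
from sklearn.svm import LinearSVC

def asymmetric_bagging(X_pos, X_neg, n_models=11):
    """Train one linear SVM per bootstrap sample of the (larger) negative set,
    always keeping the full positive set (asymmetric bagging)."""
    models = []
    for _ in range(n_models):
        idx = np.random.choice(len(X_neg), size=len(X_pos), replace=True)
        X = np.vstack([X_pos, X_neg[idx]])
        y = np.array([1] * len(X_pos) + [0] * len(X_pos))
        models.append(LinearSVC().fit(X, y))
    return models

def majority_vote(models, X):
    """Equal-weight vote across the ensemble: predict 1 when most models do."""
    votes = np.array([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Placeholder features: 50 positive and 500 negative training documents.
X_pos, X_neg = np.random.rand(50, 64), np.random.rand(500, 64)
ensemble = asymmetric_bagging(X_pos, X_neg)
print(majority_vote(ensemble, np.random.rand(5, 64)))
```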


Continue reading “Dynamic Machine Learning Using the KBpedia Knowledge Graph – Part 2”