{"id":3354,"date":"2016-10-04T11:00:29","date_gmt":"2016-10-04T15:00:29","guid":{"rendered":"http:\/\/fgiasson.com\/blog\/?p=3354"},"modified":"2016-11-17T13:43:35","modified_gmt":"2016-11-17T18:43:35","slug":"improving-machine-learning-tasks-by-integrating-private-datasets","status":"publish","type":"post","link":"https:\/\/fgiasson.com\/blog\/index.php\/2016\/10\/04\/improving-machine-learning-tasks-by-integrating-private-datasets\/","title":{"rendered":"Improving Machine Learning Tasks By Integrating Private Datasets"},"content":{"rendered":"<p>In the last decade, we have seen the emergence of two big families of datasets: the public and the private ones. Invaluable public datasets like <a href=\"http:\/\/wikipedia.org\/\">Wikipedia<\/a>, <a href=\"http:\/\/wikidata.org\/\">Wikidata<\/a>, <a href=\"http:\/\/opencorporates.com\/\">Open Corporates<\/a> and others have been created and leveraged by organizations world-wide. However, as great as they are, most organization still rely on private datasets of their own curated data.<\/p>\n<p>In this article, I want to demonstrate how high-value private datasets may be integrated into the <a href=\"http:\/\/cognonto.com\/\">Cognonto&#8217;s<\/a> KBpedia knowledge base to produce a significant impact on the quality of the results of some machine learning tasks. To demonstrate this impact, I have created a demo that is supported by a &#8220;gold standard&#8221; of 511 web pages taken at random, to which we have tagged the organization that published the web page. This demo is related to the <code>publisher analysis<\/code> portion of the <a href=\"http:\/\/cognonto.com\/\">Cognonto demo<\/a>. 
We will use this gold standard to calculate the performance metrics of the <code>publisher analyzer<\/code>, but more precisely we will analyze the performance of the analyzer depending on the datasets it has access to when performing its predictions.<\/p>\n<p>[extoc]<\/p>\n<p><!--more--><\/p>\n<div id=\"outline-container-orgheadline1\" class=\"outline-2\">\n<h2 id=\"orgheadline1\">Cognonto Publisher&#8217;s Analyzer<\/h2>\n<div id=\"text-orgheadline1\" class=\"outline-text-2\">\n<p>The Cognonto publisher&#8217;s analyzer is a portion of the overall Cognonto demo that tries to determine the publisher of a web page by analyzing the web page&#8217;s content. There are multiple moving parts to this analyzer, but its general internal workflow is as follows:<\/p>\n<ol class=\"org-ol\">\n<li>It crawls a given web page URL<\/li>\n<li>It extracts the page&#8217;s content and meta-data<\/li>\n<li>It tags all of the organizations (anything that is considered an <a href=\"http:\/\/cognonto.com\/knowledge-graph\/reference-concept\/?uri=Organization\">organization in KBpedia<\/a>) across the extracted content using the organization entities that exist in the knowledge base<\/li>\n<li>It tries to detect unknown entities that will eventually be added to the knowledge base after curation<\/li>\n<li>It performs an in-depth analysis of the organization entities (known or unknown) that got tagged in the content of the web page, and determines which of these is the most likely to be the publisher of the web page.<\/li>\n<\/ol>\n<p>Such a machine learning system leverages existing algorithms to calculate the likelihood that an organization is the publisher of a web page and to detect unknown organizations. These are conventional uses of these algorithms. What differentiates the Cognonto analyzer is its knowledge base. We leverage Cognonto to detect known organization entities. We use the knowledge in the KB for each of these entities to improve the analysis process. 
We constrain the analysis to certain types (by inference) of named entities, etc. The special sauce of this entire process is the fully integrated set of datasets that compose the Cognonto knowledge base, and the KBpedia conceptual reference structure composed of roughly 39,000 reference concepts.<\/p>\n<p>Given the central role of the knowledge base in such an analysis process, we want to have a better idea of the impact of the datasets on the performance of such a system.<\/p>\n<p>For this demo, I use three public datasets that are already in KBpedia and used by the Cognonto demo: <a href=\"http:\/\/wikipedia.org\/\">Wikipedia<\/a> (via <a href=\"http:\/\/dbpedia.org\/\">DBpedia<\/a>), <a href=\"http:\/\/freebase.com\/\">Freebase<\/a> and <a href=\"http:\/\/www.uspto.gov\/\">USPTO<\/a>. Then I add two private datasets of high-quality, highly curated and domain-related information to augment the listing of potential organizations. What I will do is run the Cognonto publisher analyzer on each of these 511 web pages. Then I will check which ones got properly identified given the gold standard, and finally I will calculate different performance metrics to see the impact of including or excluding a certain dataset.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline2\" class=\"outline-2\">\n<h2 id=\"orgheadline2\">Gold Standard<\/h2>\n<div id=\"text-orgheadline2\" class=\"outline-text-2\">\n<p>The <a href=\"https:\/\/en.wikipedia.org\/wiki\/Gold_standard_(test)\">gold standard<\/a> is composed of 511 randomly selected web pages that got crawled and cached. When we run the tests below, the cached version of the HTML pages is used to make sure that we get the same HTML for each page for each test. When the pages are crawled, we execute any possible JavaScript code that the pages may contain before caching the HTML code of the page. 
That way, if some information in the page was injected by some JavaScript code, then that additional information will be cached as well.<\/p>\n<p>The gold standard is really simple. For each of the URLs we have in the standard, we determine the publishing organization manually. Then once the organization is determined, we search in each dataset to see if the entity already exists. If it does, we add the URI (unique identifier) of the entity in the knowledge base into the gold standard. It is this URI reference that is used to determine if the publisher analyzer properly detects the actual publisher of the web page.<\/p>\n<p>We also add a set of 10 web pages manually for which we are sure that <b>no<\/b> publisher can be determined for the web page. These are the 10 <code>True Negative<\/code> (see below) instances of the gold standard.<\/p>\n<p>The gold standard also includes the identifier of possible unknown entities that are the publishers of the web pages. These are used to calculate the metrics when considering the unknown entities detected by the system.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline3\" class=\"outline-2\">\n<h2 id=\"orgheadline3\">Metrics<\/h2>\n<div id=\"text-orgheadline3\" class=\"outline-text-2\">\n<p>The goal of this analysis is to determine how good the analyzer is at performing the task (detecting the organization that published a web page on the Web). To do so, we use a set of metrics that will help us understand the performance of the system. 
The metrics calculation is based on the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Confusion_matrix\">confusion matrix<\/a>.<\/p>\n<div class=\"figure\">\n<p><a href=\"https:\/\/fgiasson.com\/blog\/wp-content\/uploads\/2016\/10\/confusion-matrix-wikipedia.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-3356 size-medium_large\" src=\"https:\/\/fgiasson.com\/blog\/wp-content\/uploads\/2016\/10\/confusion-matrix-wikipedia-768x268.png\" alt=\"confusion-matrix-wikipedia\" width=\"768\" height=\"268\" srcset=\"https:\/\/fgiasson.com\/blog\/wp-content\/uploads\/2016\/10\/confusion-matrix-wikipedia-768x268.png 768w, https:\/\/fgiasson.com\/blog\/wp-content\/uploads\/2016\/10\/confusion-matrix-wikipedia-300x105.png 300w, https:\/\/fgiasson.com\/blog\/wp-content\/uploads\/2016\/10\/confusion-matrix-wikipedia-1024x358.png 1024w, https:\/\/fgiasson.com\/blog\/wp-content\/uploads\/2016\/10\/confusion-matrix-wikipedia.png 1532w\" sizes=\"auto, (max-width: 768px) 100vw, 768px\" \/><\/a><\/p>\n<\/div>\n<p>The <code>True Positive<\/code>, <code>False Positive<\/code>, <code>True Negative<\/code> and <code>False Negative<\/code> (see <a href=\"https:\/\/en.wikipedia.org\/wiki\/Type_I_and_type_II_errors\">Type I and type II<\/a> errors for definitions) should be interpreted as follows <a href=\"https:\/\/en.wikipedia.org\/wiki\/Named-entity_recognition#Formal_evaluation\">in the context of a named entity recognition task<\/a>:<\/p>\n<ol class=\"org-ol\">\n<li><code>True Positive (TP)<\/code>: test identifies the same entity as in the gold standard<\/li>\n<li><code>False Positive (FP)<\/code>: test identifies a different entity than what is in the gold standard<\/li>\n<li><code>True Negative (TN)<\/code>: test identifies no entity; gold standard has no entity<\/li>\n<li><code>False Negative (FN)<\/code>: test identifies no entity, but gold standard has one<\/li>\n<\/ol>\n<p>Then we have a series of metrics that can be used to measure the performance of 
the system:<\/p>\n<ol class=\"org-ol\">\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall#Precision\">Precision<\/a>: the proportion of properly predicted publishers amongst all the predictions that have been made (good and bad) <code>(TP \/ (TP + FP))<\/code><\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall#Recall\">Recall<\/a>: the proportion of properly predicted publishers amongst all of the publishers that exist in the gold standard <code>(TP \/ (TP + FN))<\/code><\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Accuracy_and_precision\">Accuracy<\/a>: the proportion of correctly classified test instances; the publishers that could be identified by the system, and the ones that couldn&#8217;t (the web pages for which no publisher could be identified) <code>((TP + TN) \/ (TP + TN + FP + FN))<\/code><\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/F1_score\">f1<\/a>: the test&#8217;s equally weighted combination of precision and recall<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall#F-measure\">f2<\/a>: the test&#8217;s weighted combination of precision and recall, with a preference for recall<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall#F-measure\">f0.5<\/a>: the test&#8217;s weighted combination of precision and recall, with a preference for precision.<\/li>\n<\/ol>\n<p>The <a href=\"https:\/\/en.wikipedia.org\/wiki\/F1_score\">F-score<\/a> measures the accuracy of the general prediction system: it combines precision and recall as their <a href=\"https:\/\/en.wikipedia.org\/wiki\/Harmonic_mean\">harmonic mean<\/a>. The <code>f2<\/code> measure weighs recall higher than precision (by placing more emphasis on false negatives), and the <code>f0.5<\/code> measure weighs recall lower than precision (by attenuating the influence of false negatives). 
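These outcome definitions can be mechanized. Below is a minimal Python sketch (hypothetical: the demo itself is written in Clojure, and `classify_outcome` is not part of it) that maps one test instance, given the predicted entity URI and the gold-standard URI, to its confusion-matrix cell:

```python
from collections import Counter

def classify_outcome(predicted_uri, gold_uri):
    """Map one prediction against the gold standard to a confusion-matrix cell.

    predicted_uri: entity URI identified by the analyzer, or None if none found.
    gold_uri: entity URI annotated in the gold standard, or None if no publisher.
    """
    if predicted_uri is None and gold_uri is None:
        return "TN"  # no entity identified, none expected
    if predicted_uri is None:
        return "FN"  # an entity was expected but none identified
    if gold_uri is None or predicted_uri != gold_uri:
        return "FP"  # identified a different entity than the gold standard
    return "TP"      # identified the same entity as the gold standard

# Tally outcomes over a few (hypothetical) test instances:
counts = Counter(
    classify_outcome(predicted, gold)
    for predicted, gold in [
        ("http://dbpedia.org/resource/Acme", "http://dbpedia.org/resource/Acme"),     # TP
        ("http://dbpedia.org/resource/Acme", "http://dbpedia.org/resource/Initech"),  # FP
        (None, "http://dbpedia.org/resource/Initech"),                                # FN
        (None, None),                                                                 # TN
    ]
)
```

Tallying these four outcomes over all 511 gold-standard pages yields the counts from which every metric is derived.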
Cognonto includes all three F-measures in its standard reports to give a general overview of what happens when we put an emphasis on precision or recall.<\/p>\n<p>In general, I think that the metric that best reflects the overall performance of this named entity recognition system is <code>accuracy<\/code>. I emphasize those test results below.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline4\" class=\"outline-2\">\n<h2 id=\"orgheadline4\">Running The Tests<\/h2>\n<div id=\"text-orgheadline4\" class=\"outline-text-2\">\n<p>The goal with these tests is to run the gold standard calculation procedure with different datasets that exist in the Cognonto knowledge base to see the impact of including\/excluding these datasets on the gold standard metrics.<\/p>\n<\/div>\n<div id=\"outline-container-orgheadline5\" class=\"outline-3\">\n<h3 id=\"orgheadline5\">Baseline: No Dataset<\/h3>\n<div id=\"text-orgheadline5\" class=\"outline-text-3\">\n<p>The first step is to create the starting basis that includes no dataset. 
Then we will add different datasets, and try different combinations, when computing against the gold standard such that we know the impact of each on the metrics.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  2\nFalse positives:  5\nTrue negatives:  19\nFalse negatives:  485\n\n+--------------+--------------+\n| key          | value        |\n+--------------+--------------+\n| :precision   | 0.2857143    |\n| :recall      | 0.0041067763 |\n| :accuracy    | 0.04109589   |\n| :f1          | 0.008097166  |\n| :f2          | 0.0051150895 |\n| :f0.5        | 0.019417476  |\n+--------------+--------------+\n<\/pre>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline6\" class=\"outline-3\">\n<h3 id=\"orgheadline6\">One Dataset Only<\/h3>\n<div id=\"text-orgheadline6\" class=\"outline-text-3\">\n<p>Now, let&#8217;s see the impact of each of the datasets that exist in the knowledge base we created to perform these tests. 
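As a sanity check, the six metrics reported by `table` can be recomputed directly from the four confusion counts. A Python sketch (hypothetical: the demo code itself is Clojure), using the baseline counts above (TP=2, FP=5, TN=19, FN=485):

```python
def metrics(tp, fp, tn, fn):
    """Compute the six performance metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)

    def f_beta(beta):
        # Weighted harmonic mean of precision and recall.
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
        "f0.5": f_beta(0.5),
    }

# Baseline run (no datasets): TP=2, FP=5, TN=19, FN=485
m = metrics(2, 5, 19, 485)
# m["precision"] ≈ 0.2857143 and m["accuracy"] ≈ 0.0410959, matching the table above
```

The same function reproduces the metric tables of every run below from its four reported counts.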
This will give us an indication of the inherent impact of each dataset on the prediction task.<\/p>\n<\/div>\n<div id=\"outline-container-orgheadline7\" class=\"outline-4\">\n<h4 id=\"orgheadline7\">Wikipedia (via DBpedia) Only<\/h4>\n<div id=\"text-orgheadline7\" class=\"outline-text-4\">\n<p>Let&#8217;s test the impact of adding a single general purpose dataset, the publicly available <a href=\"http:\/\/wikipedia.org\/\">Wikipedia<\/a> (via <a href=\"http:\/\/dbpedia.org\/\">DBpedia<\/a>):<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[<\/span><span style=\"color: #e6db74;\">\"http:\/\/dbpedia.org\/resource\/\"<\/span><span style=\"color: #a6e22e;\">]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  121\nFalse positives:  57\nTrue negatives:  19\nFalse negatives:  314\n\n+--------------+------------+\n| key          | value      |\n+--------------+------------+\n| :precision   | 0.6797753  |\n| :recall      | 0.27816093 |\n| :accuracy    | 0.2739726  |\n| :f1          | 0.39477977 |\n| :f2          | 0.31543276 |\n| :f0.5        | 0.52746296 |\n+--------------+------------+\n<\/pre>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline8\" class=\"outline-4\">\n<h4 id=\"orgheadline8\">Freebase Only<\/h4>\n<div id=\"text-orgheadline8\" class=\"outline-text-4\">\n<p>Now, let&#8217;s test the impact of adding another single general purpose dataset, this time the publicly available <a href=\"http:\/\/freebase.com\/\">Freebase<\/a>:<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span 
style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[<\/span><span style=\"color: #e6db74;\">\"http:\/\/rdf.freebase.com\/ns\/\"<\/span><span style=\"color: #a6e22e;\">]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  11\nFalse positives:  14\nTrue negatives:  19\nFalse negatives:  467\n\n+--------------+-------------+\n| key          | value       |\n+--------------+-------------+\n| :precision   | 0.44        |\n| :recall      | 0.023012552 |\n| :accuracy    | 0.058708414 |\n| :f1          | 0.043737575 |\n| :f2          | 0.028394425 |\n| :f0.5        | 0.09515571  |\n+--------------+-------------+\n<\/pre>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline9\" class=\"outline-4\">\n<h4 id=\"orgheadline9\">USPTO Only<\/h4>\n<div id=\"text-orgheadline9\" class=\"outline-text-4\">\n<p>Now, let&#8217;s test the impact of adding still a different publicly available specialized dataset: <a href=\"http:\/\/www.uspto.gov\/\">USPTO<\/a>:<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[<\/span><span style=\"color: #e6db74;\">\"http:\/\/www.uspto.gov\"<\/span><span style=\"color: #a6e22e;\">]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  6\nFalse positives:  13\nTrue negatives:  19\nFalse negatives:  473\n\n+--------------+-------------+\n| key          | value       
|\n+--------------+-------------+\n| :precision   | 0.31578946  |\n| :recall      | 0.012526096 |\n| :accuracy    | 0.04892368  |\n| :f1          | 0.024096385 |\n| :f2          | 0.015503876 |\n| :f0.5        | 0.054054055 |\n+--------------+-------------+\n<\/pre>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline10\" class=\"outline-4\">\n<h4 id=\"orgheadline10\">Private Dataset #1<\/h4>\n<div id=\"text-orgheadline10\" class=\"outline-text-4\">\n<p>Now, let&#8217;s test the first private dataset:<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[<\/span><span style=\"color: #e6db74;\">\"http:\/\/cognonto.com\/datasets\/private\/1\/\"<\/span><span style=\"color: #a6e22e;\">]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  231\nFalse positives:  109\nTrue negatives:  19\nFalse negatives:  152\n\n+--------------+------------+\n| key          | value      |\n+--------------+------------+\n| :precision   | 0.67941177 |\n| :recall      | 0.60313314 |\n| :accuracy    | 0.4892368  |\n| :f1          | 0.6390042  |\n| :f2          | 0.61698717 |\n| :f0.5        | 0.6626506  |\n+--------------+------------+\n<\/pre>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline11\" class=\"outline-4\">\n<h4 id=\"orgheadline11\">Private Dataset #2<\/h4>\n<div id=\"text-orgheadline11\" class=\"outline-text-4\">\n<p>And, then, the second private dataset:<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span 
style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[<\/span><span style=\"color: #e6db74;\">\"http:\/\/cognonto.com\/datasets\/private\/2\/\"<\/span><span style=\"color: #a6e22e;\">]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  24\nFalse positives:  21\nTrue negatives:  19\nFalse negatives:  447\n\n+--------------+-------------+\n| key          | value       |\n+--------------+-------------+\n| :precision   | 0.53333336  |\n| :recall      | 0.050955415 |\n| :accuracy    | 0.08414873  |\n| :f1          | 0.093023255 |\n| :f2          | 0.0622084   |\n| :f0.5        | 0.1843318   |\n+--------------+-------------+\n<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline12\" class=\"outline-3\">\n<h3 id=\"orgheadline12\">Combined Datasets &#8211; Public Only<\/h3>\n<div id=\"text-orgheadline12\" class=\"outline-text-3\">\n<p>A more realistic analysis is to use a combination of datasets. 
Let&#8217;s see what happens to the performance metrics if we start combining <b>public<\/b> datasets.<\/p>\n<\/div>\n<div id=\"outline-container-orgheadline13\" class=\"outline-4\">\n<h4 id=\"orgheadline13\">Wikipedia + Freebase<\/h4>\n<div id=\"text-orgheadline13\" class=\"outline-text-4\">\n<p>First, let&#8217;s start by combining Wikipedia and Freebase.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[<\/span><span style=\"color: #e6db74;\">\"http:\/\/dbpedia.org\/resource\/\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/rdf.freebase.com\/ns\/\"<\/span><span style=\"color: #a6e22e;\">]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  126\nFalse positives:  60\nTrue negatives:  19\nFalse negatives:  306\n\n+--------------+------------+\n| key          | value      |\n+--------------+------------+\n| :precision   | 0.67741936 |\n| :recall      | 0.29166666 |\n| :accuracy    | 0.28375733 |\n| :f1          | 0.407767   |\n| :f2          | 0.3291536  |\n| :f0.5        | 0.53571427 |\n+--------------+------------+\n<\/pre>\n<p>Adding the Freebase dataset to the DBpedia one had the following effects on the different metrics:<\/p>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-right\" \/> <\/colgroup>\n<thead>\n<tr>\n<th class=\"org-left\" scope=\"col\">metric<\/th>\n<th class=\"org-right\" scope=\"col\">Impact in %<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"org-left\">precision<\/td>\n<td 
class=\"org-right\">-0.03%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">recall<\/td>\n<td class=\"org-right\">+4.85%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">accuracy<\/td>\n<td class=\"org-right\">+3.57%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f1<\/td>\n<td class=\"org-right\">+3.29%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f2<\/td>\n<td class=\"org-right\">+4.34%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f0.5<\/td>\n<td class=\"org-right\">+1.57%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As we can see, the impact of adding Freebase to the knowledge base is positive even if not ground breaking considering the size of the dataset.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline14\" class=\"outline-4\">\n<h4 id=\"orgheadline14\">Wikipedia + USPTO<\/h4>\n<div id=\"text-orgheadline14\" class=\"outline-text-4\">\n<p>Let&#8217;s switch Freebase for the other specialized public dataset, USPTO.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[<\/span><span style=\"color: #e6db74;\">\"http:\/\/dbpedia.org\/resource\/\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/www.uspto.gov\"<\/span><span style=\"color: #a6e22e;\">]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  122\nFalse positives:  59\nTrue negatives:  19\nFalse negatives:  311\n\n+--------------+------------+\n| key          | value      |\n+--------------+------------+\n| :precision   | 0.67403316 |\n| :recall      | 0.2817552  |\n| :accuracy    | 0.27592954 |\n| :f1          | 0.39739415 |\n| :f2          | 0.31887087 |\n| :f0.5        | 
0.52722555 |\n+--------------+------------+\n<\/pre>\n<p>Adding the USPTO dataset to the DBpedia dataset instead of Freebase had the following effects on the different metrics:<\/p>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-right\" \/> <\/colgroup>\n<thead>\n<tr>\n<th class=\"org-left\" scope=\"col\">metric<\/th>\n<th class=\"org-right\" scope=\"col\">Impact in %<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"org-left\">precision<\/td>\n<td class=\"org-right\">-0.83%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">recall<\/td>\n<td class=\"org-right\">+1.29%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">accuracy<\/td>\n<td class=\"org-right\">+0.73%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f1<\/td>\n<td class=\"org-right\">+0.65%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f2<\/td>\n<td class=\"org-right\">+1.07%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f0.5<\/td>\n<td class=\"org-right\">+0.03%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As we might have expected, the gains are smaller than with Freebase, perhaps partly because the dataset is smaller and more specialized. Because it is more specialized (enterprises that have patents registered in the US), the gold standard may not represent well the organizations belonging to this dataset. 
But in any case, these are still gains.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline15\" class=\"outline-4\">\n<h4 id=\"orgheadline15\">Wikipedia + Freebase + USPTO<\/h4>\n<div id=\"text-orgheadline15\" class=\"outline-text-4\">\n<p>Let&#8217;s continue and now include all three datasets.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[<\/span><span style=\"color: #e6db74;\">\"http:\/\/dbpedia.org\/resource\/\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/www.uspto.gov\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/rdf.freebase.com\/ns\/\"<\/span><span style=\"color: #a6e22e;\">]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  127\nFalse positives:  62\nTrue negatives:  19\nFalse negatives:  303\n\n+--------------+------------+\n| key          | value      |\n+--------------+------------+\n| :precision   | 0.6719577  |\n| :recall      | 0.29534882 |\n| :accuracy    | 0.2857143  |\n| :f1          | 0.41033927 |\n| :f2          | 0.3326349  |\n| :f0.5        | 0.53541315 |\n+--------------+------------+\n<\/pre>\n<p>Now let&#8217;s see the impact of adding both Freebase and USPTO to the Wikipedia dataset:<\/p>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-right\" \/> <\/colgroup>\n<thead>\n<tr>\n<th class=\"org-left\" scope=\"col\">metric<\/th>\n<th class=\"org-right\" scope=\"col\">Impact in %<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td 
class=\"org-left\">precision<\/td>\n<td class=\"org-right\">+1.14%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">recall<\/td>\n<td class=\"org-right\">+6.18%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">accuracy<\/td>\n<td class=\"org-right\">+4.30%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f1<\/td>\n<td class=\"org-right\">+3.95%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f2<\/td>\n<td class=\"org-right\">+5.45%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f0.5<\/td>\n<td class=\"org-right\">+1.51%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Now let&#8217;s see the impact of using highly curated, domain related, private datasets.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline16\" class=\"outline-3\">\n<h3 id=\"orgheadline16\">Combined Datasets &#8211; Public enhanced with private datasets<\/h3>\n<div id=\"text-orgheadline16\" class=\"outline-text-3\">\n<p>The next step is to add the private datasets of highly curated data that are specific to the domain of identifying web page publisher organizations. 
As the baseline, we will use the three public datasets: Wikipedia, Freebase and USPTO and then we will add the private datasets.<\/p>\n<\/div>\n<div id=\"outline-container-orgheadline17\" class=\"outline-4\">\n<h4 id=\"orgheadline17\">Wikipedia + Freebase + USPTO + PD #1<\/h4>\n<div id=\"text-orgheadline17\" class=\"outline-text-4\">\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[<\/span><span style=\"color: #e6db74;\">\"http:\/\/dbpedia.org\/resource\/\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/www.uspto.gov\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/rdf.freebase.com\/ns\/\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/cognonto.com\/datasets\/private\/1\/\"<\/span><span style=\"color: #a6e22e;\">]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  279\nFalse positives:  102\nTrue negatives:  19\nFalse negatives:  111\n\n+--------------+------------+\n| key          | value      |\n+--------------+------------+\n| :precision   | 0.7322835  |\n| :recall      | 0.7153846  |\n| :accuracy    | 0.58317024 |\n| :f1          | 0.7237354  |\n| :f2          | 0.7187017  |\n| :f0.5        | 0.7288401  |\n+--------------+------------+\n<\/pre>\n<p>Now, let&#8217;s see the impact of adding the private dataset #1 along with Wikipedia, Freebase and USPTO:<\/p>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-right\" \/> 
<\/colgroup>\n<thead>\n<tr>\n<th class=\"org-left\" scope=\"col\">metric<\/th>\n<th class=\"org-right\" scope=\"col\">Impact in %<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"org-left\">precision<\/td>\n<td class=\"org-right\">+8.97%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">recall<\/td>\n<td class=\"org-right\">+142.22%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">accuracy<\/td>\n<td class=\"org-right\">+104.09%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f1<\/td>\n<td class=\"org-right\">+76.38%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f2<\/td>\n<td class=\"org-right\">+116.08%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f0.5<\/td>\n<td class=\"org-right\">+36.12%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Adding the highly curated and domain-specific private dataset #1 had a dramatic impact on all the metrics of the combined public datasets. Now let&#8217;s see the impact of the public datasets on the private dataset #1 metrics compared to when it is used alone:<\/p>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-right\" \/> <\/colgroup>\n<thead>\n<tr>\n<th class=\"org-left\" scope=\"col\">metric<\/th>\n<th class=\"org-right\" scope=\"col\">Impact in %<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"org-left\">precision<\/td>\n<td class=\"org-right\">+7.77%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">recall<\/td>\n<td class=\"org-right\">+18.60%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">accuracy<\/td>\n<td class=\"org-right\">+19.19%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f1<\/td>\n<td class=\"org-right\">+13.25%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f2<\/td>\n<td class=\"org-right\">+16.50%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f0.5<\/td>\n<td class=\"org-right\">+9.99%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As we can see, the public datasets do significantly increase the performance of the highly curated and domain-specific 
private dataset #1.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline18\" class=\"outline-4\">\n<h4 id=\"orgheadline18\">Wikipedia + Freebase + USPTO + PD #2<\/h4>\n<div id=\"text-orgheadline18\" class=\"outline-text-4\">\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[<\/span><span style=\"color: #e6db74;\">\"http:\/\/dbpedia.org\/resource\/\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/www.uspto.gov\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/rdf.freebase.com\/ns\/\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/cognonto.com\/datasets\/private\/2\/\"<\/span><span style=\"color: #a6e22e;\">]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  138\nFalse positives:  69\nTrue negatives:  19\nFalse negatives:  285\n\n+--------------+------------+\n| key          | value      |\n+--------------+------------+\n| :precision   | 0.6666667  |\n| :recall      | 0.32624114 |\n| :accuracy    | 0.3072407  |\n| :f1          | 0.43809524 |\n| :f2          | 0.36334914 |\n| :f0.5        | 0.55155873 |\n+--------------+------------+\n<\/pre>\n<p>Not all of the private datasets have equivalent impact. 
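<\/p>\n<p>As a sanity check, the metrics reported in these result blocks can be recomputed directly from the raw confusion-matrix counts. Here is a minimal sketch in Python (the demo itself is written in Clojure; the <code>metrics<\/code> helper below is ours, not part of the demo), using the counts reported just above for Wikipedia + Freebase + USPTO + PD #2:<\/p>

```python
def metrics(tp, fp, tn, fn):
    """Standard retrieval metrics computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)

    def fbeta(beta):
        # F-beta weighs recall beta times as much as precision.
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return {"precision": precision, "recall": recall, "accuracy": accuracy,
            "f1": fbeta(1), "f2": fbeta(2), "f0.5": fbeta(0.5)}

# Counts reported above for Wikipedia + Freebase + USPTO + PD #2
m = metrics(tp=138, fp=69, tn=19, fn=285)
```

<p>Substituting the counts of any other dataset combination reproduces its table as well.<\/p>\n<p>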
Let&#8217;s see the impact of adding private dataset #2 instead of dataset #1:<\/p>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-right\" \/> <\/colgroup>\n<thead>\n<tr>\n<th class=\"org-left\" scope=\"col\">metric<\/th>\n<th class=\"org-right\" scope=\"col\">Impact in %<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"org-left\">precision<\/td>\n<td class=\"org-right\">-0.78%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">recall<\/td>\n<td class=\"org-right\">+10.46%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">accuracy<\/td>\n<td class=\"org-right\">+7.52%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f1<\/td>\n<td class=\"org-right\">+6.75%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f2<\/td>\n<td class=\"org-right\">+9.23%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f0.5<\/td>\n<td class=\"org-right\">+3.00%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline19\" class=\"outline-4\">\n<h4 id=\"orgheadline19\">Wikipedia + Freebase + USPTO + PD #1 + PD #2<\/h4>\n<div id=\"text-orgheadline19\" class=\"outline-text-4\">\n<p>Now let&#8217;s see what happens when we use all the public and private datasets.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #a6e22e;\">[<\/span><span style=\"color: #e6db74;\">\"http:\/\/dbpedia.org\/resource\/\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/www.uspto.gov\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/rdf.freebase.com\/ns\/\"<\/span>\n                                               <span 
style=\"color: #e6db74;\">\"http:\/\/cognonto.com\/datasets\/private\/1\/\"<\/span>\n                                               <span style=\"color: #e6db74;\">\"http:\/\/cognonto.com\/datasets\/private\/2\/\"<\/span><span style=\"color: #a6e22e;\">]<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  285\nFalse positives:  102\nTrue negatives:  19\nFalse negatives:  105\n\n+--------------+------------+\n| key          | value      |\n+--------------+------------+\n| :precision   | 0.7364341  |\n| :recall      | 0.7307692  |\n| :accuracy    | 0.59491193 |\n| :f1          | 0.7335907  |\n| :f2          | 0.7318952  |\n| :f0.5        | 0.7352941  |\n+--------------+------------+\n<\/pre>\n<p>Let&#8217;s see the impact of adding the private datasets #1 and #2 to the public datasets:<\/p>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-right\" \/> <\/colgroup>\n<thead>\n<tr>\n<th class=\"org-left\" scope=\"col\">metric<\/th>\n<th class=\"org-right\" scope=\"col\">Impact in %<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"org-left\">precision<\/td>\n<td class=\"org-right\">+9.60%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">recall<\/td>\n<td class=\"org-right\">+147.44%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">accuracy<\/td>\n<td class=\"org-right\">+108.22%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f1<\/td>\n<td class=\"org-right\">+78.77%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f2<\/td>\n<td class=\"org-right\">+120.02%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f0.5<\/td>\n<td class=\"org-right\">+37.31%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline20\" class=\"outline-3\">\n<h3 id=\"orgheadline20\">Adding Unknown Entities Tagger<\/h3>\n<div id=\"text-orgheadline20\" 
class=\"outline-text-3\">\n<p>There is one last feature with the Cognonto publisher analyzer: it is possible for it to identify unknown entities from the web page. (An &#8220;unknown entity&#8221; is identified as a likely organization entity, but which does not already exist in the KB.) Sometimes, it is the unknown entity that is the publisher of the web page.<\/p>\n<div class=\"org-src-container\">\n<pre class=\"src src-clojure\"><span style=\"color: #ae81ff;\">(<\/span>table <span style=\"color: #66d9ef;\">(<\/span>generate-stats <span style=\"color: #ae81ff;\">:js<\/span> <span style=\"color: #ae81ff;\">:execute<\/span> <span style=\"color: #ae81ff;\">:datasets<\/span> <span style=\"color: #ae81ff;\">:all<\/span><span style=\"color: #66d9ef;\">)<\/span><span style=\"color: #ae81ff;\">)<\/span>\n<\/pre>\n<\/div>\n<pre class=\"example\">True positives:  345\nFalse positives:  104\nTrue negatives:  19\nFalse negatives:  43\n\n+--------------+------------+\n| key          | value      |\n+--------------+------------+\n| :precision   | 0.76837415 |\n| :recall      | 0.88917524 |\n| :accuracy    | 0.7123288  |\n| :f1          | 0.82437277 |\n| :f2          | 0.86206895 |\n| :f0.5        | 0.78983516 |\n+--------------+------------+\n<\/pre>\n<p>As we can see, the overall accuracy improved by <code>19.73%<\/code> when considering the unknown entities compared to the public and private datasets.<\/p>\n<table border=\"2\" frame=\"hsides\" rules=\"groups\" cellspacing=\"0\" cellpadding=\"6\">\n<colgroup>\n<col class=\"org-left\" \/>\n<col class=\"org-right\" \/> <\/colgroup>\n<thead>\n<tr>\n<th class=\"org-left\" scope=\"col\">metric<\/th>\n<th class=\"org-right\" scope=\"col\">Impact in %<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td class=\"org-left\">precision<\/td>\n<td class=\"org-right\">+4.33%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">recall<\/td>\n<td class=\"org-right\">+21.67%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">accuracy<\/td>\n<td 
class=\"org-right\">+19.73%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f1<\/td>\n<td class=\"org-right\">+12.37%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f2<\/td>\n<td class=\"org-right\">+17.79%<\/td>\n<\/tr>\n<tr>\n<td class=\"org-left\">f0.5<\/td>\n<td class=\"org-right\">+7.42%<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline21\" class=\"outline-2\">\n<h2 id=\"orgheadline21\">Analysis<\/h2>\n<div id=\"text-orgheadline21\" class=\"outline-text-2\">\n<p>When we first tested the system with single datasets, some of them were scoring better than others for most of the metrics. However, does that mean that we could only use them and be done with it? No, what this analysis is telling us is that some datasets score better for this set of web pages. They cover more entities found in those web pages. However, even if a dataset was scoring lower it does not mean it is useless. In fact, that <i>worse<\/i> dataset may in fact cover one prediction area not covered in a <i>better<\/i> one, which means that by combining the two, we could improve the general prediction power of the system. This is what we can see by adding the private datasets to the public ones.<\/p>\n<p>Even if the highly curated and domain-specific private datasets score much better than the more general public datasets, the system still greatly benefits from the contribution of the public dataset by significantly improving the accuracy of the system. We got a gain <code>19.19%<\/code> in accuracy by adding the public datasets to the better scoring private dataset #1. Nearly <code>20%<\/code> of improvement in such a predictive system is highly significant.<\/p>\n<p>Another thing that this series of tests tends to demonstrate is that the more knowledge we have, the more we can improve the accuracy of the system. 
Adding datasets doesn&#8217;t appear to lower the overall performance of the system (though I am sure that some could), and generally the more the better (although more doesn&#8217;t necessarily produce significant accuracy increases).<\/p>\n<p>Finally, adding a feature to the system can also greatly improve its overall accuracy. In this case, we added the feature of detecting unknown entities (organization entities that do not exist in the datasets that compose the knowledge base), which improved the overall accuracy by another <code>19.73%<\/code>. How is that possible? To understand this, we have to consider the domain: random web pages that exist on the Web. A web page can be published by anybody and any organization, which means that the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Long_tail\">long tail<\/a> of web page publishers is probably quite long. Considering this fact, it is normal that existing knowledge bases may not contain all of the obscure organizations that publish web pages. This is most likely why a system that can detect and predict unknown entities as the publishers of web pages has a significant impact on the overall accuracy. The flagging of such &#8220;unknown&#8221; entities also tells us where to focus efforts to add to the known database of existing publishers.<\/p>\n<\/div>\n<\/div>\n<div id=\"outline-container-orgheadline22\" class=\"outline-2\">\n<h2 id=\"orgheadline22\">Conclusion<\/h2>\n<div id=\"text-orgheadline22\" class=\"outline-text-2\">\n<p>As we saw in this analysis, adding high-quality, domain-specific private datasets can greatly improve the accuracy of such a prediction system. Some datasets may have a more significant impact than others, but overall, each dataset contributes to the overall improvement of the predictions.<\/p>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>In the last decade, we have seen the emergence of two big families of datasets: the public and the private ones. 
Invaluable public datasets like Wikipedia, Wikidata, Open Corporates and others have been created and leveraged by organizations world-wide. However, as great as they are, most organization still rely on private datasets of their own [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[293,287,84],"tags":[263,288,292,289,231],"class_list":["post-3354","post","type-post","status-publish","format-standard","hentry","category-artificial-intelligence","category-cognonto","category-semantic-web","tag-ai","tag-cognonto","tag-datasets","tag-machinelearning","tag-semanticweb"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3354","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=3354"}],"version-history":[{"count":5,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3354\/revisions"}],"predecessor-version":[{"id":3500,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/3354\/revisions\/3500"}],"wp:attachment":[{"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=3354"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=3354"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/fgiasson.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=3354"}
],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}