In my previous blog post, Create a Domain Text Classifier Using Cognonto, I explained how one can use the KBpedia Knowledge Graph to automatically create positive and negative training corpuses for different machine learning tasks. I explained how SVM classifiers could be trained and used to check if an input text belongs to the defined domain or not.

This article is the first of two parts. In this first part, I will extend that idea to explain how the KBpedia Knowledge Graph can be used, along with other machine learning techniques, to cope with different situations and use cases. I will cover the concepts of feature selection, hyperparameter optimization, and ensemble learning (the latter in part 2 of this series). The emphasis here is on the testing and refining of machine learners, versus the set up and configuration times that dominate other approaches.

Depending on the domain of interest, and depending on the required precision or recall, different strategies and techniques can lead to better predictions. More often than not, multiple training corpuses, learners, and hyperparameters need to be tested before arriving at the best possible prediction model. This is why I will strongly emphasize the fact that the KBpedia Knowledge Graph and Cognonto can be used to fully automate the creation of a wide range of different training corpuses, to create models, to optimize their hyperparameters, and to evaluate those models.

[extoc]

New Knowledge Graph and Reasoning

For this article, I will use the latest version of the KBpedia Knowledge Graph, version 1.10, that we just released. A knowledge graph such as KBpedia is not static. It constantly evolves, gets fixed, and improves. New concepts are created, deprecated concepts are removed, new linkages to external data sources are added, etc. This growth means that any of these changes can have a [positive] impact on the creation of the positive and negative training sets. Applications based on KBpedia should be tested against any new knowledge graph that is released to see if their models will improve. Better concepts, better structure, and more linkages will often lead to better training sets as well.

Such growth in KBpedia is also why automating, and more importantly testing, this process is crucial. Upon the release of major new versions we are able to automate all of these steps to see the final impacts of upgrading the knowledge graph:

  1. Aggregate all the reference concepts that scope the specified domain (by inference)
  2. Create the positive and negative training corpuses
  3. Prune the training corpuses
  4. Configure the classifier (in this case, create the semantic vectors for ESA)
  5. Train the model (in this case, the SVM model)
  6. Optimize the hyperparameters of the algorithm (in this case, the linear SVM hyperparameters), and
  7. Evaluate the model on multiple gold standards.

Because each of these steps belongs to an automated workflow, we can easily check the impact of updating the KBpedia Knowledge Graph on our models.

Reasoning Over The Knowledge Graph

A new step I am adding in this use case is to use a reasoner over the KBpedia Knowledge Graph. The reasoner is used when we define the scope of the domain to classify. We first browse the knowledge graph to see which seed reference concepts we should add to the scope. Then we use a reasoner to extend that initial set to include any sub-classes relevant to the scope of the domain. This means that we may add more specific features to the final model.

Update Domain Training Corpus Using KBpedia 1.10 and a Reasoner

Recall our prior use case used Music as its domain scope. The first step is to use the new KBpedia version 1.10 along with a reasoner to create the full scope of this updated Music domain.

The result of using this new version and a reasoner is that we now end up with 196 features (reference documents) instead of 64. This also means that we will have 196 documents in our positive training set if we only use the Wikipedia pages linked to these reference concepts (and not their related named entities).

(use 'cognonto-esa.core)
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])

(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology "resources/kbpedia_reference_concepts_linkage.n3"
                                :manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))

(define-domain-corpus ["http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Musician"
                       "http://kbpedia.org/kko/rc/MusicPerformanceOrganization"
                       "http://kbpedia.org/kko/rc/MusicalInstrument"
                       "http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Album-IBO"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MusicalText"
                       "http://kbpedia.org/kko/rc/PropositionalConceptualWork-MusicalGenre"
                       "http://kbpedia.org/kko/rc/MusicalPerformer"]
  kbpedia
  "resources/domain-corpus-dictionary.csv"
  :reasoner kbpedia-reasoner)
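
To get a feel for what the reasoner contributes, here is a minimal sketch (not part of the Cognonto code base) that lists the inferred sub-classes of the rc/Music reference concept. It assumes the kbpedia-reasoner object returned by cognonto-owl implements the standard OWL API OWLReasoner interface; the actual interop details of the library may differ.

(import '[org.semanticweb.owlapi.apibinding OWLManager]
        '[org.semanticweb.owlapi.model IRI])

;; List every inferred sub-class of rc/Music (assumption: kbpedia-reasoner
;; is a standard OWL API OWLReasoner instance)
(let [data-factory (.getOWLDataFactory (OWLManager/createOWLOntologyManager))
      music-class  (.getOWLClass data-factory
                                 (IRI/create "http://kbpedia.org/kko/rc/Music"))]
  ;; false = all inferred descendants, not only the direct sub-classes
  (->> (.getSubClasses kbpedia-reasoner music-class false)
       .getFlattened
       (map #(str (.getIRI %)))
       (remove #(= % "http://www.w3.org/2002/07/owl#Nothing"))
       sort))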

Create Training Corpuses

The next step is to create the actual training corpuses: the general and domain ones. We have to load the dictionaries we created in the previous step, and then locally cache and normalize the corpuses. Remember that the normalization steps are as follows (a rough sketch of these rules is shown right after the list):

  1. Defluff the raw HTML page. We convert the HTML into text, and we only keep the body of the page
  2. Normalize the text with the following rules:
    1. remove diacritics characters
    2. remove everything between brackets like: [edit] [show]
    3. remove punctuation
    4. remove all numbers
    5. remove all invisible control characters
    6. remove all [math] symbols
    7. remove all words with 2 characters or fewer
    8. remove line and paragraph separators
    9. remove anything that is not an alpha character
    10. normalize spaces
    11. put everything in lower case, and
    12. remove stop words.
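
Purely for illustration, here is a minimal sketch of what such a normalization function could look like. This is not the actual normalize-cached-corpus implementation used by Cognonto; the function name, the regular expressions, and the (tiny) stop-words set are all illustrative only.

(require '[clojure.string :as string])

(def stop-words
  ;; illustrative subset only; a real list would be much larger
  #{"the" "and" "for" "that" "with" "from" "this"})

(defn normalize-text
  "Rough sketch of the normalization rules listed above."
  [text]
  (let [cleaned (-> text
                    string/lower-case
                    (java.text.Normalizer/normalize java.text.Normalizer$Form/NFD)
                    (string/replace #"\p{InCombiningDiacriticalMarks}+" "") ; diacritics
                    (string/replace #"\[[^\]]*\]" " ")     ; [edit], [show], [math] blocks
                    (string/replace #"[^a-z\s]" " ")       ; punctuation, numbers, control chars
                    (string/replace #"\b[a-z]{1,2}\b" " ") ; words with 2 characters or fewer
                    (string/replace #"\s+" " ")            ; line/paragraph separators, extra spaces
                    string/trim)]
    (->> (string/split cleaned #" ")
         (remove stop-words)
         (string/join " "))))
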
(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary.csv")

(cache-corpus)

(normalize-cached-corpus "resources/corpus/" "resources/corpus-normalized/")

Create New Gold Standard

Because we never have enough instances in our gold standards to test against, let’s create a third one, this time adding a music-related news feed that will add more positive examples to the gold standard.

(defn create-gold-standard-from-feeds
  [name]
  (let [feeds ["http://www.music-news.com/rss/UK/news"
               "http://rss.cbc.ca/lineup/topstories.xml"
               "http://rss.cbc.ca/lineup/world.xml"
               "http://rss.cbc.ca/lineup/canada.xml"
               "http://rss.cbc.ca/lineup/politics.xml"
               "http://rss.cbc.ca/lineup/business.xml"
               "http://rss.cbc.ca/lineup/health.xml"
               "http://rss.cbc.ca/lineup/arts.xml"
               "http://rss.cbc.ca/lineup/technology.xml"
               "http://rss.cbc.ca/lineup/offbeat.xml"
               "http://www.cbc.ca/cmlink/rss-cbcaboriginal"
               "http://rss.cbc.ca/lineup/sports.xml"
               "http://rss.cbc.ca/lineup/canada-britishcolumbia.xml"
               "http://rss.cbc.ca/lineup/canada-calgary.xml"
               "http://rss.cbc.ca/lineup/canada-montreal.xml"
               "http://rss.cbc.ca/lineup/canada-pei.xml"
               "http://rss.cbc.ca/lineup/canada-ottawa.xml"
               "http://rss.cbc.ca/lineup/canada-toronto.xml"
               "http://rss.cbc.ca/lineup/canada-north.xml"
               "http://rss.cbc.ca/lineup/canada-manitoba.xml"
               "http://feeds.reuters.com/news/artsculture"
               "http://feeds.reuters.com/reuters/businessNews"
               "http://feeds.reuters.com/reuters/entertainment"
               "http://feeds.reuters.com/reuters/companyNews"
               "http://feeds.reuters.com/reuters/lifestyle"
               "http://feeds.reuters.com/reuters/healthNews"
               "http://feeds.reuters.com/reuters/MostRead"
               "http://feeds.reuters.com/reuters/peopleNews"
               "http://feeds.reuters.com/reuters/scienceNews"
               "http://feeds.reuters.com/reuters/technologyNews"
               "http://feeds.reuters.com/Reuters/domesticNews"
               "http://feeds.reuters.com/Reuters/worldNews"
               "http://feeds.reuters.com/reuters/USmediaDiversifiedNews"]]

    (with-open [out-file (io/writer (str "resources/" name ".csv"))]
      (csv/write-csv out-file [["class" "title" "url"]])
      (doseq [feed-url feeds]
        (doseq [item (:entries (feed/parse-feed feed-url))]
          ;; the "class" column is left empty; it gets filled in manually afterwards
          (csv/write-csv out-file [["" (:title item) (:link item)]]))))))

This routine creates this third gold standard. Remember, we use the gold standard to evaluate different methods and models to classify an input text to see if it belongs to the domain or not.
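
For example, the third gold standard used below can be generated with a call like this one (the name simply follows the file naming convention used throughout this article):

(create-gold-standard-from-feeds "gold-standard-3")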

For each piece of news aggregated that way, I manually determined whether the candidate document belongs to the domain or not. This task can be tricky, and requires a clear understanding of the proper scope of the domain. In this example, I consider an article to belong to the music domain if it mentions music concepts such as musical albums, songs, multiple music-related topics, etc. If a singer is mentioned in an article only because he broke up with his girlfriend, without further mention of anything related to music, I won’t tag it as being part of the domain.

[However, under a different interpretation of what should be in the domain wherein any mention of a singer qualifies, then we could extend the classification process to include named entities (the singer) extraction to help properly classify those articles. This revised scope is not used in this article, but it does indicate how your exact domain needs should inform such scoping decisions.]

You can download this new third gold standard from here.

Evaluate Initial Domain Model

Now that we have updated the training corpuses using the updated scope of the domain, let’s analyze the impact of using a new version of KBpedia along with a reasoner to increase the number of features in our model. Let’s run our automatic process to evaluate the new models. The remaining steps that need to be run are:

  1. Configure the classifier (in this case, create the semantic vectors for ESA)
  2. Train the model (in this case, the SVM model), and
  3. Evaluate the model on multiple gold standards.

Note: to see the full explanation of how ESA and the SVM classifier work, please refer to the Create a Domain Text Classifier Using Cognonto article for more background information.

;; Load positive and negative training corpuses
(load-dictionaries "resources/general-corpus-dictionary.csv" "resources/domain-corpus-dictionary.csv")

;; Build the ESA semantic interpreter 
(build-semantic-interpreter "base" "resources/semantic-interpreters/base/" (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the vectors to feed to a SVM classifier using ESA
(build-svm-model-vectors "resources/svm/base/" :corpus-folder-normalized "resources/corpus-normalized/")

;; Train the SVM using the best parameters discovered in the previous tests
(train-svm-model "svm.w50" "resources/svm/base/"
                 :weights {1 50.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Let’s evaluate this model using our three gold standards:

(evaluate-model "svm.goldstandard.1.w50" "resources/gold-standard-1.csv")
True positive:  21
False positive:  3
True negative:  306
False negative:  6

Precision:  0.875
Recall:  0.7777778
Accuracy:  0.97321427
F1:  0.8235294

The performance changes relative to the previous results (using KBpedia 1.02) are:

  • Precision: +10.33%
  • Recall: -12.16%
  • Accuracy: +0.31%
  • F1: +0.26%

The results for the second gold standard are:

(evaluate-model "svm.goldstandard.2.w50" "resources/gold-standard-2.csv")
True positive:  16
False positive:  3
True negative:  317
False negative:  9

Precision:  0.84210527
Recall:  0.64
Accuracy:  0.9652174
F1:  0.72727275

The performance changes relative to the previous results (using KBpedia 1.02) are:

  • Precision: +6.18%
  • Recall: -29.35%
  • Accuracy: -1.19%
  • F1: -14.63%

What we can say is that the new scope for the domain greatly improved the precision of the model. This happens because the new model is probably more complex and better scoped, which leads it to be more selective. However, because of this the recall of the model suffers, since some of the positive cases in our gold standard are now classified as negative, which creates new false negatives. As you can see, there is almost always a tradeoff between precision and recall. You could reach 100% precision by returning a single correct positive prediction, but then the recall would suffer greatly. This is why the F1 score is important: it is the harmonic mean of the precision and the recall.
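
As a quick sanity check of these metrics, here is how precision, recall, and F1 are computed from the confusion matrix of the first gold standard above. This is plain arithmetic, not Cognonto-specific code:

;; precision = TP / (TP + FP), recall = TP / (TP + FN),
;; F1 = harmonic mean of precision and recall
(let [tp 21 fp 3 fneg 6
      precision (/ tp (+ tp fp))
      recall    (/ tp (+ tp fneg))
      f1        (/ (* 2 precision recall) (+ precision recall))]
  {:precision (float precision)  ; => 0.875
   :recall    (float recall)     ; => 0.7777778
   :f1        (float f1)})       ; => 0.8235294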

Now let’s look at the results of our new gold standard:

(evaluate-model "svm.goldstandard.3.w50" "resources/gold-standard-3.csv")
True positive:  28
False positive:  3
True negative:  355
False negative:  22

Precision:  0.9032258
Recall:  0.56
Accuracy:  0.9387255
F1:  0.69135803

Again, with this new gold standard, we can see the same pattern: the precision is quite good, but the recall is not, since nearly half of the actual positive cases were missed by the model.

Now, what could we do to try to improve this situation? The next thing we will investigate is to use feature selection and pruning.

Feature Selection Using Pruning and Training Corpus Pruning

A new method that we will investigate to try to improve the performance of the models is called feature selection. As its name says, feature selection means selecting the specific features used to create the prediction model. The idea is that not all features are born equal, and different features may have different (positive or negative) impacts on the model.

In our specific use case, we want to do feature selection using a pruning technique. What we will do is count the number of tokens in each of the Wikipedia pages related to our features. If the number of tokens in a page is too small (below 100), then we will drop that feature.

[Note: feature selection is a complex topic; other options and nuances are not further discussed here.]

The idea here is not to give undue importance to a feature for which we lack proper positive documents in the training corpus. Depending on the feature, this may or may not have an impact on the overall model’s performance.

Pruning the general and domain-specific dictionaries is really simple. We only have to read the current dictionaries, read each of the documents mentioned in a dictionary from the cache, calculate the number of tokens in each, and then keep or drop them according to a given threshold. Finally, we write a new dictionary with the pruned features and documents:

(defn create-pruned-pages-dictionary-csv
  [dictionary-file pruned-file normalized-corpus-folder & {:keys [min-tokens]
                                                           :or {min-tokens 100}}]
  (let [dictionary (rest
                    (with-open [in-file (io/reader dictionary-file)]
                      (doall
                       (csv/read-csv in-file))))]
    (with-open [out-file (io/writer pruned-file)]
      (csv/write-csv out-file (->> dictionary
                                   (mapv (fn [[title rc]]
                                           ;; keep the feature only if its normalized page exists
                                           ;; and has more than min-tokens tokens
                                           (when (.exists (io/as-file (str normalized-corpus-folder title ".txt")))
                                             (when (> (->> (slurp (str normalized-corpus-folder title ".txt"))
                                                           tokenize
                                                           count) min-tokens)
                                               [[title rc]]))))
                                   (apply concat)
                                   (into []))))))

Then we can prune the general and domain specific dictionaries using this simple function:

(create-pruned-pages-dictionary-csv "resources/general-corpus-dictionary.csv"
                                    "resources/general-corpus-dictionary.pruned.csv"
                                    "resources/corpus-normalized/"
                                    :min-tokens 100)

(create-pruned-pages-dictionary-csv "resources/domain-corpus-dictionary.csv"
                                    "resources/domain-corpus-dictionary.pruned.csv"
                                    "resources/corpus-normalized/"
                                    :min-tokens 100)

As a result of this specific pruning approach, the number of features drops from 197 to 175.

Evaluating Pruned Training Corpuses and Selected Features

Now that the training corpuses have been pruned, let’s load them and then evaluate their performance on the gold standards.

;; Load positive and negative pruned training corpuses
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/domain-corpus-dictionary.pruned.csv")

;; Build the ESA semantic interpreter 
(build-semantic-interpreter "base" "resources/semantic-interpreters/base-pruned/" (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the vectors to feed to a SVM classifier using ESA
(build-svm-model-vectors "resources/svm/base-pruned/" :corpus-folder-normalized "resources/corpus-normalized/")

;; Train the SVM using the best parameters discovered in the previous tests
(train-svm-model "svm.w50" "resources/svm/base-pruned/"
                 :weights {1 50.0}
                 :v nil
                 :c 1
                 :algorithm :l2l2)

Let’s evaluate this model using our three gold standards:

(evaluate-model "svm.pruned.goldstandard.1.w50" "resources/gold-standard-1.csv")
True positive:  21
False positive:  2
True negative:  307
False negative:  6

Precision:  0.9130435
Recall:  0.7777778
Accuracy:  0.97619045
F1:  0.84000003

The performance changes relative to the initial results (using KBpedia 1.02) are:

  • Precision: +18.75%
  • Recall: -12.08%
  • Accuracy: +0.61%
  • F1: +2.26%

In this case, compared with the previous results (non-pruned with KBpedia 1.10), we improved the precision without decreasing the recall, which is the ultimate goal. This means that the F1 score increased by 2.26% on this gold standard just by pruning.

The results for the second gold standard are:

(evaluate-model "svm.goldstandard.2.w50" "resources/gold-standard-2.csv")
True positive:  16
False positive:  3
True negative:  317
False negative:  9

Precision:  0.84210527
Recall:  0.64
Accuracy:  0.9652174
F1:  0.72727275

The performance changes relative to the previous results (using KBpedia 1.02) are:

  • Precision: +6.18%
  • Recall: -29.35%
  • Accuracy: -1.19%
  • F1: -14.63%

In this case, the results are identical to the non-pruned results with KBpedia 1.10; pruning did not change anything. Considering the relatively small size of the gold standard, this is to be expected since the model also did not change drastically.

Now let’s look at the results of our new gold standard:

(evaluate-model "svm.goldstandard.3.w50" "resources/gold-standard-3.csv")
True positive:  27
False positive:  7
True negative:  351
False negative:  23

Precision:  0.7941176
Recall:  0.54
Accuracy:  0.9264706
F1:  0.64285713

Now let’s check how these compare to the non-pruned version of the training corpus:

  • Precision: -12.08%
  • Recall: -3.7%
  • Accuracy: -1.31%
  • F1: -7.02%

Both false positives and false negatives increased with this change, which also led to a decrease in the overall metrics. What happened?

Different things may have happened, in fact. Maybe the new set of features is not optimal, or maybe the hyperparameters of the SVM classifier are now off. This is what we will try to figure out using two new methods to continue improving our model: hyperparameter optimization using grid search, and ensemble learning.

Hyperparameters Optimization Using Grid Search

Hyperparameters are parameters that are not learned by the estimators. They are a kind of configuration option for an algorithm. In the case of a linear SVM, hyperparameters are C, epsilon, weight and the algorithm used. Hyperparameter optimization is the task of trying to find the right parameter values in order to optimize the performance of the model.

There are multiple different strategies that we can use to try to find the best values for these hyperparameters, but the one we will use is called the grid search, which exhaustively searches across a manually defined subset of possible hyperparameter values.

The grid search function we want to define will enable us to specify the algorithm(s), the weight(s), C, and the stopping tolerance. Then we will want the grid search to keep the hyperparameters that optimize the score of the metric we want to focus on. We also have to specify the gold standard we want to use to evaluate the performance of the different models.

Here is the function that implements that grid search algorithm:

(defn svm-grid-search
  [name model-path gold-standard & {:keys [grid-parameters selection-metric]
                                    :or {grid-parameters [{:c [1 2 4 16 256]
                                                           :e [0.001 0.01 0.1]
                                                           :algorithm [:l2l2]
                                                           :weight [1 15 30]}]
                                         selection-metric :f1}}]
  (let [best (atom {:gold-standard gold-standard
                    :selection-metric selection-metric
                    :score 0.0
                    :c nil
                    :e nil
                    :algorithm nil
                    :weight nil})
        model-vectors (read-string (slurp (str model-path "model.vectors")))]
    (doseq [parameters grid-parameters]
      (doseq [algo (:algorithm parameters)]
        (doseq [weight (:weight parameters)]
          (doseq [e (:e parameters)]
            (doseq [c (:c parameters)]
              (train-svm-model name model-path
                               :weights {1 (double weight)}
                               :v nil
                               :c c
                               :e e
                               :algorithm algo
                               :model-vectors model-vectors)
              (let [results (evaluate-model name gold-standard :output false)]              
                (println "Algorithm:" algo)
                (println "C:" c)
                (println "Epsilon:" e)
                (println "Weight:" weight)
                (println selection-metric ":" (get results selection-metric))
                (println)

                (when (> (get results selection-metric) (:score @best))
                  (reset! best {:gold-standard gold-standard
                                :selection-metric selection-metric
                                :score (get results selection-metric)
                                :c c
                                :e e
                                :algorithm algo
                                :weight weight}))))))))
    @best))

The possible algorithms are:

  1. :l2lr_primal
  2. :l2l2
  3. :l2l2_primal
  4. :l2l1
  5. :multi
  6. :l1l2_primal
  7. :l1lr
  8. :l2lr

To simplify things a little bit for this task, we will merge the three gold standards we have into one. We will use that gold standard moving forward. The merged gold standard can be downloaded from here. We now have a single gold standard with 1017 manually vetted web pages.
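
For illustration, here is one way the three gold standard CSV files could be merged into a single file. This is a hedged sketch; the merged file linked above may have been assembled differently:

(defn merge-gold-standards
  "Concatenate gold standard CSV files into one, keeping a single header row."
  [csv-files out-file]
  (let [rows (mapcat (fn [file]
                       (with-open [in-file (io/reader file)]
                         ;; drop each file's header row
                         (doall (rest (csv/read-csv in-file)))))
                     csv-files)]
    (with-open [out (io/writer out-file)]
      (csv/write-csv out (into [["class" "title" "url"]] rows)))))

(merge-gold-standards ["resources/gold-standard-1.csv"
                       "resources/gold-standard-2.csv"
                       "resources/gold-standard-3.csv"]
                      "resources/gold-standard-full.csv")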

Now that we have a new consolidated gold standard, let’s calculate the performance of the models with and without pruning of the training corpuses. This will become the new basis for comparing the subsequent results in this article. The metrics when the training corpuses are pruned are:

True positive: 56
False positive: 10
True negative: 913
False negative: 38

Precision: 0.8484849
Recall: 0.59574467
Accuracy: 0.95280236
F1: 0.7

Now, let’s run the grid search that will try to optimize the F1 score of the model using the pruned training corpuses and using the full gold standard:

(svm-grid-search "grid-search-base-pruned-tests" 
                 "resources/svm/base-pruned/" 
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1 2 4 16 256]
                                    :e [0.001 0.01 0.1]
                                    :algorithm [:l2l2]
                                    :weight [1 15 30]}])
{:gold-standard "resources/gold-standard-full.csv"
 :selection-metric :f1
 :score 0.7096774
 :c 2
 :e 0.001
 :algorithm :l2l2
 :weight 30}

With a simple subset of the possible hyperparameter space, we found that by increasing the c parameter to 2 we could improve the F1 score on the gold standard by 1.37%. It is not a huge gain, but it is still an appreciable gain given the minimal effort invested so far (basically: waiting for the grid search to finish). Subsequently we could tweak the subset of parameters to try to improve a little further. Let’s try with c = [1.5, 2, 2.5] and weight = [30, 40]. Let’s also check other algorithms, like L2-regularized L1-loss support vector regression (dual); a sketch of such a follow-up call is shown below.
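
As an illustration, such a refined search could be launched with a call like the following. It assumes that :l2l1 is the keyword corresponding to the L2-regularized L1-loss algorithm in the list above, and the run name is arbitrary; the results will of course depend on your own corpuses:

(svm-grid-search "grid-search-base-pruned-tests-refined"
                 "resources/svm/base-pruned/"
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1.5 2 2.5]
                                    :e [0.001 0.01 0.1]
                                    :algorithm [:l2l2 :l2l1]
                                    :weight [30 40]}])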

The goal here is to configure the initial grid search with general parameters covering a wide range of possible values, and then to use the same tool to fine-tune the parameters that returned good results. In any case, the more computing power and time you have, the more tests you will be able to perform.

Part 2
