In the first part of this series we found good hyperparameters for a single linear SVM classifier. In part 2, we will try another technique to improve the performance of the system: ensemble learning.

So far, we have reached 95% accuracy by tweaking the hyperparameters and the training corpuses, but the F1 score is still around 70% with the full gold standard, which leaves room for improvement. There are also situations where precision should be nearly perfect (because false positives are not acceptable) or where recall should be optimized.

Here we will try to improve this situation by using ensemble learning, which uses multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. In our examples, each model will have one vote, and every vote will carry the same weight. We will use five different strategies to create the models that will belong to the ensemble:

  1. Bootstrap aggregating (bagging)
  2. Asymmetric bagging [1]
  3. Random subspace method (feature bagging)
  4. Asymmetric bagging + random subspace method (ABRS) [1]
  5. Bootstrap aggregating + random subspace method (BRS)

Which strategy to use depends on questions such as: are the positive and negative training documents unbalanced? How many features does the model have? Let’s introduce each of these strategies.

Note that in this article I am only creating ensembles with linear SVM learners. An ensemble can be composed of multiple different kinds of learners, such as SVMs with non-linear kernels, decision trees, etc. To keep this article simple, however, we will stick to linear SVMs trained on different training corpuses and feature sets.


Ensemble Learning Strategies

Bootstrap Aggregating (bagging)

The idea behind bagging is to draw a subset of positive and negative training samples at random and with replacement. Each model of the ensemble will have a different training set, but some of the training samples may appear in several training sets.
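As a rough illustration of the sampling step only, here is a minimal sketch in Clojure. The function names `bootstrap-sample` and `bagging-training-sets` are hypothetical and are not part of the cognonto-esa library; the sketch assumes the training documents are held in plain, non-empty collections.

```clojure
;; Hypothetical helper (not from cognonto-esa): draw n items from coll
;; uniformly at random, with replacement. Assumes coll is not empty.
(defn bootstrap-sample
  [coll n]
  (let [v (vec coll)]
    (vec (repeatedly n #(rand-nth v)))))

;; Build one training set per model. Because the draws are made with
;; replacement, a document may repeat within a set and across sets.
(defn bagging-training-sets
  [positives negatives nb-models nb-positives nb-negatives]
  (vec (repeatedly nb-models
                   (fn []
                     {:positives (bootstrap-sample positives nb-positives)
                      :negatives (bootstrap-sample negatives nb-negatives)}))))
```

Each map returned by `bagging-training-sets` would then be used to train one model of the ensemble.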

Asymmetric Bagging

Asymmetric bagging was proposed by Tao, Tang, Li and Wu [1]. It is meant for cases where the number of positive training samples is heavily unbalanced relative to the number of negative training samples. The idea is to draw a random subset (with replacement) of the negative training samples for each model while always keeping the full set of positive training samples.
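The difference with plain bagging can be sketched as follows. This is only an illustration under the assumption that documents are held in plain, non-empty collections; `asymmetric-bagging-training-sets` is a hypothetical name, not a function of the cognonto-esa library.

```clojure
;; Hypothetical sketch (not from cognonto-esa): every model keeps the full
;; positive set, while its negatives are resampled with replacement.
(defn asymmetric-bagging-training-sets
  [positives negatives nb-models nb-negatives]
  (let [pos  (vec positives)
        negs (vec negatives)]
    (vec (repeatedly nb-models
                     (fn []
                       {:positives pos
                        :negatives (vec (repeatedly nb-negatives
                                                    #(rand-nth negs)))})))))
```

Only the `:negatives` entry changes from one model to the next; the `:positives` entry is identical for all of them.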

Random Subspace method (feature bagging)

The idea behind feature bagging is the same as bagging, but it works on the features of the model instead of the training sets. It attempts to reduce the correlation between the estimators (the individual models) in an ensemble by training them on random samples of features instead of the entire feature set.
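A minimal sketch of the idea, assuming each document is represented as a vector with one value per feature. The names `random-feature-indices` and `project` are hypothetical, and this sketch picks distinct feature indices (shuffle, then take), which is an assumption rather than something the library is known to do.

```clojure
;; Hypothetical helper: pick nb-features distinct feature positions out of
;; nb-total-features, returned in ascending order.
(defn random-feature-indices
  [nb-total-features nb-features]
  (vec (sort (take nb-features (shuffle (range nb-total-features))))))

;; Keep only the selected features of a document vector.
(defn project
  [document-vector feature-indices]
  (mapv #(nth document-vector %) feature-indices))
```

For example, `(project [10.0 20.0 30.0] [0 2])` keeps the first and third features. The same indices must be reused at classification time so that an input text is projected onto the feature subset its model was trained on.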

Asymmetric Bagging + Random Subspace method (ABRS)

Asymmetric bagging combined with the random subspace method was also proposed by Tao, Tang, Li and Wu [1]. The problems they had with their content-based image retrieval system are the same ones we have with this kind of training corpus generated automatically from a knowledge graph:

  1. SVM is unstable on small training sets,
  2. SVM’s optimal hyperplane may be biased when there are far fewer positive training samples than negative samples (this is why we used weights in this case), and
  3. The training set is smaller than the number of features in the SVM model.

The third point is not immediately an issue for us (except if you have a domain with many more features than we had in our example), but becomes one when we start using asymmetric bagging.

What we want to do here is to combine asymmetric bagging and the random subspace method to create $T$ individual models. This method is called ABRS-SVM, which stands for Asymmetric Bagging Random Subspace Support Vector Machines.

The algorithm we will use is:

  1. Let the number of positive training documents be $N^{+}$, the number of negative training documents be $N^{-}$ and the number of features in the training data be $F$.
  2. Choose $T$ to be the number of individual models in the ensemble.
  3. For each individual model $S_i$, choose $N_i^{-}$, where $N_i^{-} < N^{-}$, to be the number of negative training documents for $S_i$.
  4. For each individual model $S_i$, choose $F_i$, where $F_i < F$, to be the number of input variables for $S_i$.
  5. For each individual model $S_i$, create a training set by choosing $F_i$ features from $F$ with replacement, $N_i^{-}$ negative training documents from $N^{-}$ with replacement, and all $N^{+}$ positive training documents, and then train the model.
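The sampling part of these steps can be sketched as below. This is a hedged illustration, not the library's implementation: `sample-with-replacement` and `abrs-training-sets` are hypothetical names, documents are assumed to be vectors with one value per feature, every model is assumed to use the same number of negatives and features, and the training of each SVM is left out.

```clojure
;; Hypothetical helper: n uniform draws from coll, with replacement.
(defn sample-with-replacement
  [coll n]
  (let [v (vec coll)]
    (vec (repeatedly n #(rand-nth v)))))

;; For each model: sample feature indices and negatives with replacement,
;; keep every positive, and project all documents onto the sampled features.
;; Sampling features with replacement follows the steps above, so an index
;; may appear more than once.
(defn abrs-training-sets
  [positives negatives nb-models nb-negatives nb-features]
  (let [nb-total-features (count (first positives))]
    (vec (repeatedly
          nb-models
          (fn []
            (let [idx     (sample-with-replacement (range nb-total-features)
                                                   nb-features)
                  project (fn [doc] (mapv #(nth doc %) idx))]
              {:feature-indices idx
               :positives (mapv project positives)
               :negatives (mapv project
                                (sample-with-replacement negatives
                                                         nb-negatives))}))))))
```

Keeping `:feature-indices` alongside each training set matters because the same projection has to be applied to any text the corresponding model later classifies.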

Bootstrap Aggregating + Random Subspace method (BRS)

Bagging combined with feature bagging is the same as asymmetric bagging with the random subspace method, except that we use bagging instead of asymmetric bagging. ABRS should be used if your positive training sample is severely unbalanced compared to your negative training sample; otherwise BRS should be used.

SVM Learner

We use the linear Support Vector Machine (SVM) as the learner for the ensemble. What we will be creating is a series of SVM models that differ depending on the ensemble method(s) used to create the ensemble.

Build Training Document Vectors

The first step is to create a structure that holds the vector representation of all the positive and negative training documents. Since this is the task that takes the most time in the whole process, we will calculate the vectors using the (build-svm-model-vectors) function and serialize the structure on the file system. That way, to create the ensemble’s models, we only have to load it from the file system without having to re-calculate it each time.
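The compute-once, reload-later pattern can be sketched like this. How (build-svm-model-vectors) actually serializes its structure is not shown in this article, so the EDN text format and the names `save-vectors!` and `load-vectors` below are assumptions made purely for illustration.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical helper: write the vectors structure to disk as EDN text.
(defn save-vectors!
  [path vectors]
  (spit path (pr-str vectors)))

;; Hypothetical helper: read the structure back without recomputing it.
(defn load-vectors
  [path]
  (edn/read-string (slurp path)))
```

With something like this in place, the expensive vector calculation runs once and every ensemble configuration tested afterwards starts from the saved file.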

Train, Classify and Evaluate Ensembles

The goal is to create a set of X SVM classifiers where each of them uses a different model. The models can differ in their features or their training corpus. Each classifier then tries to classify an input text according to its own model. Finally, each classifier votes to determine whether or not that input text belongs to the domain.

There are four hyperparameters related to ensemble learning:

  1. The mode to use
  2. The number of models we want to create in the ensemble
  3. The number of training documents we want in the training corpus, and
  4. The number of features.

Other hyperparameters could include those of the linear SVM classifier, but in this example we will simply reuse the best parameters we found in the first part of this series. We now train the ensemble using the (train-ensemble-svm) function.

Once the ensemble is created and trained, we use the (classify-ensemble-text) function to classify an input text using the ensemble we created. That function takes two parameters: :mode, which is the ensemble’s mode, and :vote-acceptance-ratio, which defines the proportion of positive votes required for the ensemble to positively classify the input text. By default, the ratio is 50%, but if you want to optimize the precision of the ensemble, then you may want to increase that ratio to 70% or even 95%, as we will see below.
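To make the role of the vote acceptance ratio concrete, here is a minimal sketch of an equal-weight vote. The name `ensemble-positive?` is hypothetical, and whether the library treats a tie at exactly the ratio as a positive (`>=`) or a negative (`>`) is an assumption here.

```clojure
;; Hypothetical sketch: votes is a non-empty sequence of booleans, one per
;; model in the ensemble. The text is accepted when the share of positive
;; votes reaches the ratio (>= is an assumption about the tie-breaking rule).
(defn ensemble-positive?
  [votes vote-acceptance-ratio]
  (let [nb-positive (count (filter true? votes))]
    (>= (/ nb-positive (count votes)) vote-acceptance-ratio)))
```

With three positive votes out of four, `(ensemble-positive? [true true true false] 0.50)` accepts the text while the same votes with a ratio of `0.90` reject it, which is how a higher ratio trades recall for precision.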

Finally the ensemble, configured with all its hyperparameters, will be evaluated using the (evaluate-ensemble) function, which is the same as the (evaluate-model) function, but which uses the ensemble instead of a single SVM model to classify all of the articles. As before, we will characterize the assignments in relation to the gold standard.

Let’s now train different ensembles to try to improve the performance of the system.

Asymmetric Bagging

The current training corpus is highly unbalanced. This is why the first test we will do is to apply the asymmetric bagging strategy. With this strategy, each of the SVM classifiers uses the same full set of positive training documents. However, each of them draws its own random sample of negative training documents (with replacement).

(use 'cognonto-esa.core)
(use 'cognonto-esa.ensemble-svm)

(load-dictionaries "resources/general-corpus-dictionary.pruned.csv" "resources/domain-corpus-dictionary.pruned.csv")
(load-semantic-interpreter "base-pruned" "resources/semantic-interpreters/base-pruned/")

(reset! ensemble [])

(train-ensemble-svm "ensemble.base.pruned.ab.c2.w30" "resources/ensemble-svm/base-pruned/" 
                    :mode :ab 
                    :weight {1 30.0}
                    :c 2
                    :e 0.001
                    :nb-models 100
                    :nb-training-documents 3500)

Now let’s evaluate this ensemble with a vote acceptance ratio of 50%:

(evaluate-ensemble "ensemble.base.pruned.ab.c2.w30" 
                   "resources/gold-standard-full.csv" 
                   :mode :ab 
                   :vote-acceptance-ratio 0.50)
True positive:  48
False positive:  6
True negative:  917
False negative:  46

Precision:  0.8888889
Recall:  0.5106383
Accuracy:  0.9488692
F1:  0.6486486
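The four figures above follow directly from the confusion counts. The sketch below uses the standard definitions of precision, recall, accuracy and F1; `evaluation-metrics` is a hypothetical helper and not the library's (evaluate-ensemble) function, and it assumes none of the denominators is zero.

```clojure
;; Hypothetical helper: standard metrics computed from the four counts.
;; Exact ratios are used internally and converted to doubles at the end.
(defn evaluation-metrics
  [true-pos false-pos true-neg false-neg]
  (let [precision (/ true-pos (+ true-pos false-pos))
        recall    (/ true-pos (+ true-pos false-neg))
        accuracy  (/ (+ true-pos true-neg)
                     (+ true-pos false-pos true-neg false-neg))
        f1        (/ (* 2 precision recall) (+ precision recall))]
    {:precision (double precision)
     :recall    (double recall)
     :accuracy  (double accuracy)
     :f1        (double f1)}))

;; Counts reported for the 50% vote acceptance ratio.
(evaluation-metrics 48 6 917 46)
```

The values returned for these counts agree with the reported precision, recall, accuracy and F1 up to rounding.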

Let’s increase the vote acceptance ratio to 90%:

(evaluate-ensemble "ensemble.base.pruned.ab.c2.w30" 
                   "resources/gold-standard-full.csv" 
                   :mode :ab 
                   :vote-acceptance-ratio 0.90)
True positive:  37
False positive:  2
True negative:  921
False negative:  57

Precision:  0.94871795
Recall:  0.39361703
Accuracy:  0.94198626
F1:  0.556391

In both cases, the precision increases considerably compared to the non-ensemble learning results. However, the recall dropped at the same time, which lowered the F1 score as well. Let’s now try the ABRS method.

Asymmetric Bagging + Random Subspace method (ABRS)

The goal of the random subspace method is to select a random set of features. This means that each model will have its own feature set and will make predictions according to it. With the ABRS strategy, we will end up with highly different models, since each has its own negative training set and its own features.

Here we test defining each classifier with 65 randomly chosen features out of 174 and restricting the negative training corpus to 3500 randomly selected documents. We also choose to create 300 models to try to get a really heterogeneous population of models.

(reset! ensemble [])
(train-ensemble-svm "ensemble.base.pruned.abrs.c2.w30" "resources/ensemble-svm/base-pruned/" 
                    :mode :abrs 
                    :weight {1 30.0}
                    :c 2
                    :e 0.001
                    :nb-models 300
                    :nb-features 65
                    :nb-training-documents 3500)
(evaluate-ensemble "ensemble.base.pruned.abrs.c2.w30" 
                   "resources/gold-standard-full.csv" 
                   :mode :abrs
                   :vote-acceptance-ratio 0.50)
True positive:  41
False positive:  3
True negative:  920
False negative:  53

Precision:  0.9318182
Recall:  0.43617022
Accuracy:  0.9449361
F1:  0.59420294

For these features and training sets, using the ABRS method did not improve on the AB method we tried above.

Conclusion

This use case shows three totally different ways to use the KBpedia Knowledge Graph to automatically create positive and negative training sets. We demonstrated how the full process can be automated where the only requirement is to get a list of seed KBpedia reference concepts.

We also quantified the impact of using new versions of KBpedia, and how different strategies, techniques or algorithms can have different impacts on the prediction models.

Creating prediction models using supervised machine learning algorithms (which make up the bulk of the learners currently in use) has two global steps:

  1. Label training sets and generate gold standards, and
  2. Test, compare, and optimize different learners, ensembles and hyperparameters.

Unfortunately, today, given the manual effort required by the first step, the overwhelming portion of the time and budget needed to create a prediction model is spent there. By automating much of this process, Cognonto and KBpedia substantially reduce this effort. Time and budget can now be redirected to the second step of “dialing in” the learners, where the real payoff occurs.

Further, as we also demonstrated, once the labeling and the creation of reference standards are automated, we can also automate the testing and optimization of multiple different kinds of prediction algorithms, hyperparameter configurations, etc. In short, for both steps, KBpedia provides significant reductions in the time and effort needed to get to the desired results.

Footnotes

[1] Tao, Tang, Li and Wu, Asymmetric Bagging and Random Subspace for Support Vector Machines-Based Relevance Feedback in Image Retrieval
