In previous articles I have covered multiple ways to create training corpuses for unsupervised learning, and positive and negative training sets for supervised learning1, 2, 3, using Cognonto and KBpedia. Different structures inherent to a knowledge graph like KBpedia can lead to quite different corpuses and sets, and each of these corpuses or sets may yield different predictive powers depending on the task at hand.
So far we have covered two ways to leverage the KBpedia Knowledge Graph to automatically create positive and negative training corpuses:
- Using the links that exist between each KBpedia reference concept and its related Wikipedia pages
- Using the linkages between KBpedia reference concepts and external vocabularies to create training corpuses out of named entities.
Now we will introduce a third way to create a different kind of training corpus:
- Using the KBpedia aspects linkages.
Aspects are aggregations of entities that are grouped according to shared characteristics other than their direct types. Aspects help to group related entities by situation rather than by identity or definition. They offer another way to organize the knowledge graph and to leverage it. KBpedia has about 80 aspects that provide this secondary means for placing entities into related real-world contexts. Note that not all aspects relate to a given entity.
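To illustrate the idea (with purely hypothetical entities and aspect names), here is a small Python sketch that groups the same entities by aspect rather than by their direct type:

```python
from collections import defaultdict

# Hypothetical entities: each has a direct type (identity) and one or
# more aspects (situation).
entities = [
    {"name": "Abbey Road",          "type": "Album",     "aspects": ["Music"]},
    {"name": "Bohemian Rhapsody",   "type": "Song",      "aspects": ["Music", "Genres"]},
    {"name": "Berlin Philharmonic", "type": "Orchestra", "aspects": ["Music"]},
]

# Grouping by aspect brings together entities that have different direct
# types (an album, a song, an orchestra) but share a real-world context.
by_aspect = defaultdict(list)
for entity in entities:
    for aspect in entity["aspects"]:
        by_aspect[aspect].append(entity["name"])

print(dict(by_aspect))
```

This is only a conceptual sketch; in KBpedia the aspect memberships are asserted in the knowledge graph itself, not computed from records like these.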
Creating New Domain Using KBpedia Aspects
To continue with the musical domain, there exist two aspects of interest:
- Music
- Genres
What we will do first is to query the KBpedia Knowledge Graph using the SPARQL query language to get the list of all of the KBpedia reference concepts that are related to the Music or the Genre aspects. Then, for each of these reference concepts, we will count the number of named entities that can be reached in the complete KBpedia structure.
```sparql
prefix kko: <http://kbpedia.org/ontologies/kko#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix dcterms: <http://purl.org/dc/terms/>
prefix schema: <http://schema.org/>

select distinct ?class count(distinct ?entity) as ?nb
from <http://dbpedia.org>
from <http://www.uspto.gov>
from <http://wikidata.org>
from <http://kbpedia.org/1.10/>
where
{
  ?entity dcterms:subject ?category .

  graph <http://kbpedia.org/1.10/>
  {
    {?category <http://kbpedia.org/ontologies/kko#hasMusicAspect> ?class .}
    union
    {?category <http://kbpedia.org/ontologies/kko#hasGenre> ?class .}
  }
}
order by desc(?nb)
```
reference concept | nb |
---|---|
http://kbpedia.org/kko/rc/Album-CW | 128772 |
http://kbpedia.org/kko/rc/Song-CW | 74886 |
http://kbpedia.org/kko/rc/Music | 51006 |
http://kbpedia.org/kko/rc/Single | 50661 |
http://kbpedia.org/kko/rc/RecordCompany | 5695 |
http://kbpedia.org/kko/rc/MusicalComposition | 5272 |
http://kbpedia.org/kko/rc/MovieSoundtrack | 2919 |
http://kbpedia.org/kko/rc/Lyric-WordsToSong | 2374 |
http://kbpedia.org/kko/rc/Band-MusicGroup | 2185 |
http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup | 2078 |
http://kbpedia.org/kko/rc/Ensemble | 1438 |
http://kbpedia.org/kko/rc/Orchestra | 1380 |
http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup | 1335 |
http://kbpedia.org/kko/rc/Choir | 754 |
http://kbpedia.org/kko/rc/Concerto | 424 |
http://kbpedia.org/kko/rc/Symphony | 299 |
http://kbpedia.org/kko/rc/Singing | 154 |
Seventeen KBpedia reference concepts are related to the two aspects we want to focus on. The next step is to take these 17 reference concepts and to create a new domain corpus with them. We will use the new version of KBpedia to create the full set of reference concepts that will scope our domain by inference.
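The inference step can be pictured as a simple traversal: starting from the seed reference concepts, a reasoner pulls in every sub-concept reachable through subsumption links. A Python sketch over a toy, hypothetical subclass hierarchy (the real work is done by the OWL reasoner over KBpedia):

```python
from collections import deque

# Toy, hypothetical subclass hierarchy: child concept -> parent concept
subclass_of = {
    "RockBand": "Band-MusicGroup",
    "StringQuartet-MusicGroup": "Quartet-MusicalPerformanceGroup",
    "Choir": "Ensemble",
}

def expand_domain(seeds, subclass_of):
    """Return the seed concepts plus every sub-concept reachable from them."""
    # invert the child->parent map into parent->children for traversal
    children = {}
    for child, parent in subclass_of.items():
        children.setdefault(parent, []).append(child)

    domain = set(seeds)
    queue = deque(seeds)
    while queue:
        concept = queue.popleft()
        for sub in children.get(concept, []):
            if sub not in domain:
                domain.add(sub)
                queue.append(sub)
    return domain

print(sorted(expand_domain({"Band-MusicGroup", "Ensemble"}, subclass_of)))
```

A real reasoner also handles equivalences and property-based entailments, but the scoping effect on the domain corpus is the same: seeds in, inferred sub-concepts out.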
Next we will try to use this information to create two totally different kinds of training corpuses:
- One that will rely on the links between the reference concepts and Wikipedia pages
- One that will rely on the linkages to external vocabularies to create a list of named entities that will be used as the training corpus
Creating Model With Reference Concepts
The first training corpus we want to test is one that uses the linkage between KBpedia reference concepts and Wikipedia pages. The first step is to generate the domain training corpus with the 17 seed reference concepts and then to infer other related reference concepts.
```clojure
(use 'cognonto-esa.core)
(require '[cognonto-owl.core :as owl])
(require '[cognonto-owl.reasoner :as reasoner])

(def kbpedia-manager (owl/make-ontology-manager))
(def kbpedia (owl/load-ontology "resources/kbpedia_reference_concepts_linkage.n3"
                                :manager kbpedia-manager))
(def kbpedia-reasoner (reasoner/make-reasoner kbpedia))

(define-domain-corpus ["http://kbpedia.org/kko/rc/Album-CW"
                       "http://kbpedia.org/kko/rc/Song-CW"
                       "http://kbpedia.org/kko/rc/Music"
                       "http://kbpedia.org/kko/rc/Single"
                       "http://kbpedia.org/kko/rc/RecordCompany"
                       "http://kbpedia.org/kko/rc/MusicalComposition"
                       "http://kbpedia.org/kko/rc/MovieSoundtrack"
                       "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                       "http://kbpedia.org/kko/rc/Band-MusicGroup"
                       "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Ensemble"
                       "http://kbpedia.org/kko/rc/Orchestra"
                       "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup"
                       "http://kbpedia.org/kko/rc/Choir"
                       "http://kbpedia.org/kko/rc/Symphony"
                       "http://kbpedia.org/kko/rc/Singing"
                       "http://kbpedia.org/kko/rc/Concerto"]
  kbpedia
  "resources/aspects-concept-corpus-dictionary.csv"
  :reasoner kbpedia-reasoner)

(create-pruned-pages-dictionary-csv "resources/aspects-concept-corpus-dictionary.csv"
                                    "resources/aspects-concept-corpus-dictionary.pruned.csv"
                                    "resources/aspects-corpus-normalized/")
```
Once pruned, we end up with a domain that has 108 reference concepts, which will enable us to create models with 108 features. The next step is to create the actual semantic interpreter and the SVM models:
```clojure
;; Load dictionaries
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv"
                   "resources/aspects-concept-corpus-dictionary.pruned.csv")

;; Create the semantic interpreter
(build-semantic-interpreter "aspects-concept-pruned"
                            "resources/semantic-interpreters/aspects-concept-pruned/"
                            (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the SVM model vectors
(build-svm-model-vectors "resources/svm/aspects-concept-pruned/"
                         :corpus-folder-normalized "resources/aspects-corpus-normalized/")

;; Train the linear SVM classifier
(train-svm-model "svm.aspects.concept.pruned"
                 "resources/svm/aspects-concept-pruned/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)
```
Then we have to evaluate this new model using the gold standard:
```clojure
(evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
```
```
True positive: 28
False positive: 0
True negative: 923
False negative: 66

Precision: 1.0
Recall: 0.29787233
Accuracy: 0.93510324
F1: 0.45901638
```
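As a sanity check, these metrics can be recomputed directly from the reported confusion matrix; the formulas are the standard ones:

```python
# Confusion matrix as reported by evaluate-model
tp, fp, tn, fn = 28, 0, 923, 66

precision = tp / (tp + fp)                     # fraction of positive predictions that are right
recall = tp / (tp + fn)                        # fraction of actual positives that were found
accuracy = (tp + tn) / (tp + fp + tn + fn)     # fraction of all predictions that are right
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

# matches the reported values (up to float rounding in the pipeline's output)
print(precision, recall, accuracy, f1)
```

The perfect precision with very low recall tells the story here: the classifier is conservative, rarely predicting the domain, but never wrong when it does.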
Now let’s try to find better hyperparameters using grid search:
```clojure
(svm-grid-search "grid-search-aspects-concept-pruned-tests"
                 "resources/svm/aspects-concept-pruned/"
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1 2 4 16 256]
                                    :e [0.001 0.01 0.1]
                                    :algorithm [:l2l2]
                                    :weight [1 15 30]}])
```
```clojure
{:gold-standard "resources/gold-standard-full.csv"
 :selection-metric :f1
 :score 0.84444445
 :c 1
 :e 0.001
 :algorithm :l2l2
 :weight 30}
```
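Conceptually, the grid search is just an exhaustive loop over the hyperparameter combinations, keeping the configuration that maximizes the selection metric. A minimal Python sketch, with a hypothetical stand-in scoring function in place of the real train-and-evaluate step:

```python
from itertools import product

# Same grid as the svm-grid-search call above
grid = {"c": [1, 2, 4, 16, 256], "e": [0.001, 0.01, 0.1], "weight": [1, 15, 30]}

def evaluate_f1(c, e, weight):
    # Hypothetical stand-in: the real pipeline trains an SVM with these
    # parameters and computes F1 against the gold standard.
    return 0.8 - 0.001 * c - e + 0.001 * weight

def grid_search(grid, score_fn):
    """Exhaustively try every parameter combination; keep the best score."""
    best_params, best_score = None, float("-inf")
    for c, e, weight in product(grid["c"], grid["e"], grid["weight"]):
        score = score_fn(c, e, weight)
        if score > best_score:
            best_params, best_score = {"c": c, "e": e, "weight": weight}, score
    return best_params, best_score

params, score = grid_search(grid, evaluate_f1)
print(params, round(score, 3))
```

With 5 × 3 × 3 = 45 combinations this brute-force loop is cheap relative to the cost of each training run, which is why grid search remains a sensible default for small grids like this one.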
After running the grid search with these initial broad range values, we found a configuration that gives us 0.8444 for the F1 score. So far, this score is the best we have obtained to date for the full gold standard2, 3. Let’s see all of the metrics for this configuration:
```clojure
(train-svm-model "svm.aspects.concept.pruned"
                 "resources/svm/aspects-concept-pruned/"
                 :weights {1 30.0}
                 :v nil
                 :c 1
                 :e 0.001
                 :algorithm :l2l2)

(evaluate-model "svm.aspects.concept.pruned" "resources/gold-standard-full.csv")
```
```
True positive: 76
False positive: 10
True negative: 913
False negative: 18

Precision: 0.88372093
Recall: 0.80851066
Accuracy: 0.972468
F1: 0.84444445
```
These results are also the best balance between precision and recall that we have obtained so far2, 3. Better precision can be achieved if necessary, but only at the expense of lower recall.
Let’s take a look at the improvements we got compared to the previous training corpuses we had:
- Precision: +4.16%
- Recall: +35.72%
- Accuracy: +2.06%
- F1: +20.63%
This new training corpus based on the KBpedia aspects, after hyperparameter optimization, increased all of the metrics we calculate. The most striking improvement is the recall, which improved by more than 35%.
Creating Model With Entities
The next training corpus we want to test is one that uses the linkage between KBpedia reference concepts and linked external vocabularies to get a series of linked named entities as the positive training set for each of the features of the model.
The first thing to do is to create the positive training set populated with named entities related to the reference concepts. We will get a random sample of ~50 named entities per reference concept:
```clojure
(require '[cognonto-rdf.query :as query])
(require '[clojure.java.io :as io])
(require '[clojure.data.csv :as csv])
(require '[clojure.string :as string])

(defn generate-domain-by-rc
  [rc domain-file nb]
  (with-open [out-file (io/writer domain-file :append true)]
    (doall
     (->> (query/select
           (str "prefix kko: <http://kbpedia.org/ontologies/kko#>
                 prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
                 prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                 prefix dcterms: <http://purl.org/dc/terms/>

                 select distinct ?entity
                 from <http://dbpedia.org>
                 from <http://www.uspto.gov>
                 from <http://wikidata.org>
                 from <http://kbpedia.org/1.10/>
                 where
                 {
                   ?entity dcterms:subject ?category .

                   graph <http://kbpedia.org/1.10/>
                   {
                     ?category ?aspectProperty <" rc "> .
                   }
                 }
                 ORDER BY RAND()
                 LIMIT " nb)
           kb-connection)
          (map (fn [entity]
                 (csv/write-csv out-file
                                [[(string/replace (:value (:entity entity)) "http://dbpedia.org/resource/" "")
                                  (string/replace rc "http://kbpedia.org/kko/rc/" "")]])))))))

(defn generate-domain-by-rcs
  [rcs domain-file nb-per-rc]
  (with-open [out-file (io/writer domain-file)]
    (csv/write-csv out-file [["wikipedia-page" "kbpedia-rc"]])
    (doseq [rc rcs]
      (generate-domain-by-rc rc domain-file nb-per-rc))))

(generate-domain-by-rcs ["http://kbpedia.org/kko/rc/"
                         "http://kbpedia.org/kko/rc/Concerto" "http://kbpedia.org/kko/rc/DoubleAlbum-CW"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Psychedelic" "http://kbpedia.org/kko/rc/MusicalComposition-Religious"
                         "http://kbpedia.org/kko/rc/PunkMusic" "http://kbpedia.org/kko/rc/BluesMusic"
                         "http://kbpedia.org/kko/rc/HeavyMetalMusic" "http://kbpedia.org/kko/rc/PostPunkMusic"
                         "http://kbpedia.org/kko/rc/CountryRockMusic" "http://kbpedia.org/kko/rc/BarbershopQuartet-MusicGroup"
                         "http://kbpedia.org/kko/rc/FolkMusic" "http://kbpedia.org/kko/rc/Verse"
                         "http://kbpedia.org/kko/rc/RockBand" "http://kbpedia.org/kko/rc/Lyric-WordsToSong"
                         "http://kbpedia.org/kko/rc/Refrain" "http://kbpedia.org/kko/rc/MusicalComposition-GangstaRap"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Klezmer" "http://kbpedia.org/kko/rc/HouseMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-AlternativeCountry" "http://kbpedia.org/kko/rc/PsychedelicMusic"
                         "http://kbpedia.org/kko/rc/ReggaeMusic" "http://kbpedia.org/kko/rc/AlternativeRockBand"
                         "http://kbpedia.org/kko/rc/AlternativeRockMusic" "http://kbpedia.org/kko/rc/MusicalComposition-Trance"
                         "http://kbpedia.org/kko/rc/Ensemble" "http://kbpedia.org/kko/rc/RhythmAndBluesMusic"
                         "http://kbpedia.org/kko/rc/NewAgeMusic" "http://kbpedia.org/kko/rc/RockabillyMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Blues" "http://kbpedia.org/kko/rc/MusicalComposition-Opera"
                         "http://kbpedia.org/kko/rc/Choir" "http://kbpedia.org/kko/rc/SurfMusic"
                         "http://kbpedia.org/kko/rc/Quintet-MusicalPerformanceGroup" "http://kbpedia.org/kko/rc/MusicalComposition-JazzRock"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Country" "http://kbpedia.org/kko/rc/CountryMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-PopRock" "http://kbpedia.org/kko/rc/MusicalComposition-Romantic"
                         "http://kbpedia.org/kko/rc/Recitative" "http://kbpedia.org/kko/rc/Chorus"
                         "http://kbpedia.org/kko/rc/FusionMusic" "http://kbpedia.org/kko/rc/MovieSoundtrack"
                         "http://kbpedia.org/kko/rc/GreatestHitsAlbum-CW" "http://kbpedia.org/kko/rc/MusicalComposition-Christian"
                         "http://kbpedia.org/kko/rc/ClassicalMusic-Baroque" "http://kbpedia.org/kko/rc/MusicalComposition-NewAge"
                         "http://kbpedia.org/kko/rc/MusicalComposition-TraditionalPop" "http://kbpedia.org/kko/rc/TranceMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Celtic" "http://kbpedia.org/kko/rc/LoungeMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Reggae" "http://kbpedia.org/kko/rc/MusicalComposition-Baroque"
                         "http://kbpedia.org/kko/rc/Trio-MusicalPerformanceGroup" "http://kbpedia.org/kko/rc/Symphony"
                         "http://kbpedia.org/kko/rc/MusicalComposition-RockAndRoll" "http://kbpedia.org/kko/rc/PopRockMusic"
                         "http://kbpedia.org/kko/rc/IndustrialMusic" "http://kbpedia.org/kko/rc/JazzMusic"
                         "http://kbpedia.org/kko/rc/MusicalChord" "http://kbpedia.org/kko/rc/ProgressiveRockMusic"
                         "http://kbpedia.org/kko/rc/GothicMusic" "http://kbpedia.org/kko/rc/LiveAlbum-CW"
                         "http://kbpedia.org/kko/rc/NewWaveMusic" "http://kbpedia.org/kko/rc/NationalAnthem"
                         "http://kbpedia.org/kko/rc/OldieSong" "http://kbpedia.org/kko/rc/Song-Sung"
                         "http://kbpedia.org/kko/rc/RockMusic" "http://kbpedia.org/kko/rc/Aria"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Disco" "http://kbpedia.org/kko/rc/GospelMusic"
                         "http://kbpedia.org/kko/rc/BluegrassMusic" "http://kbpedia.org/kko/rc/FolkRockMusic"
                         "http://kbpedia.org/kko/rc/RockAndRollMusic" "http://kbpedia.org/kko/rc/Opera-CW"
                         "http://kbpedia.org/kko/rc/HitSong-CW" "http://kbpedia.org/kko/rc/Tune"
                         "http://kbpedia.org/kko/rc/Quartet-MusicalPerformanceGroup" "http://kbpedia.org/kko/rc/RapMusic"
                         "http://kbpedia.org/kko/rc/RecordCompany" "http://kbpedia.org/kko/rc/MusicalComposition-ACappella"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Electronica" "http://kbpedia.org/kko/rc/Music"
                         "http://kbpedia.org/kko/rc/GlamRockMusic" "http://kbpedia.org/kko/rc/LoveSong"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Gothic" "http://kbpedia.org/kko/rc/MarchingBand"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Punk" "http://kbpedia.org/kko/rc/BluesRockMusic"
                         "http://kbpedia.org/kko/rc/TechnoMusic" "http://kbpedia.org/kko/rc/SoulMusic"
                         "http://kbpedia.org/kko/rc/ChamberMusicComposition" "http://kbpedia.org/kko/rc/Requiem"
                         "http://kbpedia.org/kko/rc/MusicalComposition" "http://kbpedia.org/kko/rc/ElectronicMusic"
                         "http://kbpedia.org/kko/rc/CompositionMovement" "http://kbpedia.org/kko/rc/StringQuartet-MusicGroup"
                         "http://kbpedia.org/kko/rc/Riff" "http://kbpedia.org/kko/rc/Anthem"
                         "http://kbpedia.org/kko/rc/HardRockMusic" "http://kbpedia.org/kko/rc/MusicalComposition-BluesRock"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Cyberpunk" "http://kbpedia.org/kko/rc/MusicalComposition-Industrial"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Funk" "http://kbpedia.org/kko/rc/Album-CW"
                         "http://kbpedia.org/kko/rc/HipHopMusic" "http://kbpedia.org/kko/rc/Single"
                         "http://kbpedia.org/kko/rc/Singing" "http://kbpedia.org/kko/rc/SwingMusic"
                         "http://kbpedia.org/kko/rc/Song-CW" "http://kbpedia.org/kko/rc/SalsaMusic"
                         "http://kbpedia.org/kko/rc/MusicalComposition-Jazz" "http://kbpedia.org/kko/rc/ClassicalMusic"
                         "http://kbpedia.org/kko/rc/MilitaryBand" "http://kbpedia.org/kko/rc/SkaMusic"
                         "http://kbpedia.org/kko/rc/Orchestra" "http://kbpedia.org/kko/rc/GrungeRockMusic"
                         "http://kbpedia.org/kko/rc/SouthernRockMusic" "http://kbpedia.org/kko/rc/MusicalComposition-Ambient"
                         "http://kbpedia.org/kko/rc/DiscoMusic"]
                        "resources/aspects-domain-corpus.csv"
                        50) ;; ~50 named entities per reference concept
```
Next let’s create the actual positive training corpus and let’s normalize it:
```clojure
(cache-aspects-corpus "resources/aspects-entities-corpus.csv" "resources/aspects-corpus/")
(normalize-cached-corpus "resources/aspects-corpus/" "resources/aspects-corpus-normalized/")
```
We end up with 22 features for which we can get named entities from the KBpedia Knowledge Base. These will be the 22 features of our model. The complete positive training set has 799 documents in it.
```clojure
;; Load dictionaries
(load-dictionaries "resources/general-corpus-dictionary.pruned.csv"
                   "resources/aspects-entities-corpus-dictionary.pruned.csv")

;; Create the semantic interpreter
(build-semantic-interpreter "aspects-entities-pruned"
                            "resources/semantic-interpreters/aspects-entities-pruned/"
                            (distinct (concat (get-domain-pages) (get-general-pages))))

;; Build the SVM model vectors
(build-svm-model-vectors "resources/svm/aspects-entities-pruned/"
                         :corpus-folder-normalized "resources/aspects-corpus-normalized/")

;; Train the linear SVM classifier
(train-svm-model "svm.aspects.entities.pruned"
                 "resources/svm/aspects-entities-pruned/"
                 :weights nil
                 :v nil
                 :c 1
                 :algorithm :l2l2)
```
Now let’s evaluate the model with default hyperparameters:
```clojure
(evaluate-model "svm.aspects.entities.pruned" "resources/gold-standard-full.csv")
```
```
True positive: 9
False positive: 10
True negative: 913
False negative: 85

Precision: 0.47368422
Recall: 0.095744684
Accuracy: 0.906588
F1: 0.15929204
```
Now let’s try to improve this F1 score using grid search:
```clojure
(svm-grid-search "grid-search-aspects-entities-pruned-tests"
                 "resources/svm/aspects-entities-pruned/"
                 "resources/gold-standard-full.csv"
                 :selection-metric :f1
                 :grid-parameters [{:c [1 2 4 16 256]
                                    :e [0.001 0.01 0.1]
                                    :algorithm [:l2l2]
                                    :weight [1 15 30]}])
```
```clojure
{:gold-standard "resources/gold-standard-full.csv"
 :selection-metric :f1
 :score 0.44052863
 :c 4
 :e 0.001
 :algorithm :l2l2
 :weight 15}
```
We have been able to greatly improve the F1 score by tweaking the hyperparameters, but the results are still disappointing. There are multiple ways to automatically generate training corpuses, but not all of them are created equal. This is why having a pipeline that can automatically create the training corpuses, optimize the hyperparameters and evaluate the models is more than welcome, since this is where a data scientist spends the bulk of the time when creating models.
Conclusion
After automatically creating multiple different positive and negative training sets, testing multiple learning methods and optimizing hyperparameters, we found the best training set, learning method and hyperparameters to create an initial, optimal model that has an accuracy of 97.2%, a precision of 88.4%, a recall of 80.9% and an overall F1 measure of 84.4% on a gold standard created from real, random pieces of news from different general and specialized news sites.
The thing that is really interesting and innovative in this method is how a knowledge base of concepts and entities can be used to label positive and negative training sets to feed supervised learners, and how the learners can perform well on totally different input text data (in this case, news articles). The same is true when creating training corpuses for unsupervised learning4.
The most wonderful thing from an operational standpoint is that all of this searching, testing and optimizing can be performed by a computer automatically. The only tasks required of a human are to define the scope of a domain and to manually label a gold standard for performance evaluation and hyperparameter optimization.