In the past month or two I have become more and more interested in the Yago project. First, it gave me the opportunity to meet a really interesting person: Fabian Suchanek, the main author of Yago. I have been impressed by the simplicity (and creating simple things out of such complex material is certainly one of the hardest tasks out there) and the coverage of Yago. It is well built and based on solid foundations. It was after downloading it, converting it into RDF, indexing it into a triple store, and fixing serialization glitches and semantic relation issues that I really started to appreciate all the work that has been put into this project.
I am now pleased to write about the next version of Yago, which has recently been released by Fabian & Co. The paper describing this new version (written by Fabian, Gjergji Kasneci and Gerhard Weikum) was published about a week ago, and the new dataset was released a couple of days ago. After fixing one last RDF issue with the conversion of the Yago dataset into RDF, I am now ready to write something about it.
First of all, what is Yago? Yago is a kind of ontology: a dataset composed of entities and facts about those entities. It describes things such as: Abraham Lincoln (entity) is the successor (fact) of James Buchanan (entity). All these entities and facts come from two data sources: Wikipedia and WordNet. Please read Fabian's paper to know exactly what comes from where.
Yago has its own representation and logic framework. However, converters exist to turn the Yago dataset into RDF serialized in XML, or into other formats. Just to show how self-contained Yago is, a query language has been created explicitly to query it. However, one can also convert the Yago dataset into RDF, index it in a triple store, and query the same information using SPARQL (which is what I have done myself). To read about these frameworks, and about how Yago works internally, you should read the presentation paper written by Fabian.
So, what is new with this second version of Yago?
There are about 500 000 additional entities (the Yago dataset now counts about 1 500 000 entities).
Also, many new predicates have been added in this new version. Here is the list of the 99 predicates available to build queries:
actedIn, bornIn, bornOnDate, created, createdOnDate, dealsWith, describes, diedIn, diedOnDate, directed, discovered, discoveredOnDate, domain, during, establishedOnDate, exports, familyNameOf, foundIn, givenNameOf, graduatedFrom, happenedIn, hasAcademicAdvisor, hasArea, hasBudget, hasCallingCode, hasCapital, hasChild, hasCurrency, hasDuration, hasEconomicGrowth, hasExpenses, hasExport, hasGDPPPP, hasGini, hasHDI, hasHeight, hasImdb, hasImport, hasInflation, hasISBN, hasLabor, hasMotto, hasNominalGDP, hasOfficialLanguage, hasPages, hasPopulation, hasPopulationDensity, hasPoverty, hasPredecessor, hasProduct, hasProductionLanguage, hasRevenue, hasSuccessor, hasTLD, hasUnemployment, hasUTCOffset, hasValue, hasWaterPart, hasWebsite, hasWeight, hasWonPrize, imports, influences, inLanguage, interestedIn, inTimeZone, inUnit, isAffiliatedTo, isCalled, isCitizenOf, isLeaderOf, isMarriedTo, isMemberOf, isNativeNameOf, isNumber, isOfGenre, isPartOf, isSubstanceOf, livesIn, locatedIn, madeCoverFor, means, musicalRole, originatesFrom, participatedIn, politicianOf, produced, publishedOnDate, range, since, subClassOf, subPropertyOf, type, until, using, worksAt, writtenInYear, wrote
Also, the converted RDF dump is much, much bigger than the previous one: the generated RDF dump is about 15 gigabytes.
Trying to slim the RDF dump using N3 serialization
It was after noticing the size of the XML-serialized RDF dump that I checked whether we could slim this data dump down a bit by serializing all the RDF in N3/Turtle instead of XML.
However, the experiment was not conclusive. Apart from the friendliness of the N3 code compared to the XML one, there is no real gain in terms of space. The reason is that Yago extensively uses reification to assert a statement about a triple (a fact). Since there is no reification syntax in N3/Turtle, we have to describe each reification statement at length, like this:
An RDF/XML Yago fact:
<!DOCTYPE rdf:RDF [<!ENTITY d "http://www.w3.org/2001/XMLSchema#">
<!ENTITY y "http://www.mpii.de/yago#">]>
<rdf:Description rdf:about="&y;Abraham_Lincoln">
  <y:hasSuccessor rdf:ID="f200876173" rdf:resource="&y;Thomas_L._Harris"/>
</rdf:Description>

<rdf:Description rdf:about="#f200876173">
  <y:confidence rdf:datatype="&d;double">0.9486150988008782</y:confidence>
</rdf:Description>
And its RDF/N3 counterpart:
@base <http://www.mpii.de/yago> .
@prefix y: <http://www.mpii.de/yago#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<#Abraham_Lincoln> y:hasSuccessor <#Thomas_L._Harris> .

<#f200876173> rdf:type rdf:Statement ;
    rdf:subject <#Abraham_Lincoln> ;
    rdf:predicate y:hasSuccessor ;
    rdf:object <#Thomas_L._Harris> ;
    y:confidence "0.9486150988008782"^^xsd:double .
Since Yago is a special case that extensively uses reification for all of its facts, you cannot gain significant disk space by serializing in N3: the gain is marginal at best.
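To make the overhead concrete, here is a minimal sketch where plain Python tuples stand in for RDF triples (no triple store required); the fact and the statement identifier mirror the Abraham Lincoln example above:

```python
# Plain tuples stand in for RDF triples to count the reification overhead.
Y = "http://www.mpii.de/yago#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

fact = (Y + "Abraham_Lincoln", Y + "hasSuccessor", Y + "Thomas_L._Harris")
stmt = Y + "f200876173"  # identifier of the reified statement

triples = [
    fact,                                     # the fact itself: 1 triple
    (stmt, RDF + "type", RDF + "Statement"),  # the reification: 4 triples
    (stmt, RDF + "subject", fact[0]),
    (stmt, RDF + "predicate", fact[1]),
    (stmt, RDF + "object", fact[2]),
    (stmt, Y + "confidence", "0.9486150988008782"),  # the confidence: 1 more
]
print(len(triples))  # 6 triples to state one fact and its confidence
```

Whatever the serialization, those six triples have to be written out, which is why switching from XML to N3 barely changes the dump size.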
What would be the usefulness of Yago if we could not query it? None; so let's test it with some SPARQL queries.
Question 1: What is the name of the place where Andre Agassi lives?
SELECT ?place_name
WHERE
{
  <http://www.mpii.de/yago#Andre_Agassi> <http://www.mpii.de/yago/livesIn> ?place .
  ?place <http://www.mpii.de/yago/isCalled> ?place_name .
}
Result: “Las Vegas”
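What the SPARQL engine does here is a join on the shared variable ?place. A rough sketch of that join, with toy triples mimicking the Yago namespaces used in the query (plain Python standing in for the triple store):

```python
# Toy triples mimicking the Yago layout; set comprehensions play the
# role of the SPARQL engine's pattern matching and join.
E = "http://www.mpii.de/yago#"   # entity namespace
P = "http://www.mpii.de/yago/"   # predicate namespace, as in the query

triples = {
    (E + "Andre_Agassi", P + "livesIn", E + "Las_Vegas"),
    (E + "Las_Vegas", P + "isCalled", "Las Vegas"),
}

# Pattern 1: <Andre_Agassi> livesIn ?place
places = {o for s, p, o in triples
          if s == E + "Andre_Agassi" and p == P + "livesIn"}
# Pattern 2: ?place isCalled ?place_name, joined on ?place
names = {o for s, p, o in triples if p == P + "isCalled" and s in places}
print(names)  # {'Las Vegas'}
```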
Question 2: What are the other films produced by the producer of the movie Blade Runner?
SELECT ?other_movies
WHERE
{
  ?producer <http://www.mpii.de/yago/produced> <http://www.mpii.de/yago#Blade_Runner> .
  ?producer <http://www.mpii.de/yago/produced> ?other_movies .
}
Result: “The Italian Job”, “Murphy’s War”, “Robbery”
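This second query is a self-join: the same predicate appears in both patterns, bound through the shared variable ?producer. A sketch of that mechanic over a few illustrative triples (the data below is a toy sample, not the real dataset):

```python
# Toy triples; the self-join on ?producer is expressed as two passes.
E = "http://www.mpii.de/yago#"
P = "http://www.mpii.de/yago/"

triples = {
    (E + "Michael_Deeley", P + "produced", E + "Blade_Runner"),
    (E + "Michael_Deeley", P + "produced", E + "The_Italian_Job"),
    (E + "Someone_Else", P + "produced", E + "Another_Movie"),
}

# Pattern 1: ?producer produced <Blade_Runner>
producers = {s for s, p, o in triples
             if p == P + "produced" and o == E + "Blade_Runner"}
# Pattern 2: ?producer produced ?other_movies (joined on ?producer);
# here we also filter out the seed movie itself to keep only the others
others = {o for s, p, o in triples
          if p == P + "produced" and s in producers
          and o != E + "Blade_Runner"}
print(others)  # {'http://www.mpii.de/yago#The_Italian_Job'}
```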
And so on. It is that simple. If you do not know the URI of an entity, you only have to refer to its label using the property isCalled.
Considering that we know which properties are used to describe things within Yago, and that all these properties are used consistently, it becomes quite easy to get interesting results by querying the dataset.
This new version is a clear leap ahead. It remains as simple as the first version; it is enhanced with more entities and more predicates, but stays consistent, with a really good accuracy level.
I would like to see one more thing with Yago: being able to dereference its URIs on the Web. I will check with Fabian about making all these URIs dereferenceable on the Web, so expect another blog post announcing this in the following days or weeks.