Over the past month or two I have become more and more interested in the Yago project. First, it gave me the opportunity to discover a really interesting person, the main author of Yago, Fabian Suchanek. I have been impressed by the simplicity (and creating simple things out of such complex material is certainly one of the hardest tasks out there) and the coverage of Yago. It is well built and based on solid foundations. It was after downloading it, converting it into RDF, indexing it into a triple store, and fixing serialization glitches and semantic relation issues that I really started to appreciate all the work that has been put into this project.

I am now pleased to write about the next version of Yago that has recently been released by Fabian & Co. The paper describing this new version was published about a week ago (written by Fabian, Gjergji Kasneci and Gerhard Weikum), and the new data set was released a couple of days ago. After fixing one last RDF issue with the conversion of the Yago data set into RDF, I am now ready to write something about it.

First of all, what is Yago? Yago is a kind of ontology. It is a dataset composed of entities and facts about these entities. It describes things such as: Abraham Lincoln (entity) is the successor (fact) of James Buchanan (entity). All these entities and facts come from two data sources: Wikipedia and WordNet. Please read Fabian’s paper to know exactly what comes from where.

Yago has its own representation and logic framework. However, converters exist to turn the Yago dataset into RDF serialized in XML, or into other formats. Just to demonstrate how self-contained Yago is, a query language has been created explicitly to query it. However, one can also convert the Yago dataset into RDF, index it in a triple store, and query the same information using SPARQL (which is what I have done myself). To read about these frameworks, and about how Yago works internally, you should read the presentation paper written by Fabian.
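
Just as an illustrative sketch of that workflow: with a triple store that supports SPARQL Update, the converted dump can be loaded into a named graph with a single statement (the file location below is a placeholder assumption; the graph URI <http://www.mpii.de/yago/> is the one I use in all the queries of this post):

load <file:///data/yago.rdf> into graph <http://www.mpii.de/yago/>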

So, what is new with this second version of Yago?

There are about 500 000 additional entities (the Yago dataset now counts about 1 500 000 entities).

Also, many new predicates have been added in this new version. Here is the list of predicates available to build queries:

actedIn, bornIn, bornOnDate, created, createdOnDate, dealsWith, describes, diedIn, diedOnDate, directed, discovered, discoveredOnDate, domain, during, establishedOnDate, exports, familyNameOf, foundIn, givenNameOf, graduatedFrom, happenedIn, hasAcademicAdvisor, hasArea, hasBudget, hasCallingCode, hasCapital, hasChild, hasCurrency, hasDuration, hasEconomicGrowth, hasExpenses, hasExport, hasGDPPPP, hasGini, hasHDI, hasHeight, hasImdb, hasImport, hasInflation, hasISBN, hasLabor, hasMotto, hasNominalGDP, hasOfficialLanguage, hasPages, hasPopulation, hasPopulationDensity, hasPoverty, hasPredecessor, hasProduct, hasProductionLanguage, hasRevenue, hasSuccessor, hasTLD, hasUnemployment, hasUTCOffset, hasValue, hasWaterPart, hasWebsite, hasWeight, hasWonPrize, imports, influences, inLanguage, interestedIn, inTimeZone, inUnit, isAffiliatedTo, isCalled, isCitizenOf, isLeaderOf, isMarriedTo, isMemberOf, isNativeNameOf, isNumber, isOfGenre, isPartOf, isSubstanceOf, livesIn, locatedIn, madeCoverFor, means, musicalRole, originatesFrom, participatedIn, politicianOf, produced, publishedOnDate, range, since, subClassOf, subPropertyOf, type, until, using, worksAt, writtenInYear, wrote

Also, the converted RDF dump is much, much bigger than the previous one: the generated RDF dump is about 15 gigabytes.

Trying to slim the RDF dump using N3 serialization

It was after noticing the size of the RDF dump serialized in XML that I checked whether we could slim this data dump a bit by serializing all the RDF in N3/Turtle instead of XML.

However, the experiment was not conclusive. Except for the friendliness of the N3 code compared to the XML one, there is no real gain in terms of space. The reason is that Yago extensively uses reification to assert statements about triples (facts). Since there is no reification shorthand in N3/Turtle, we have to spell out each reification statement at length, like this:

An RDF/XML Yago fact:

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [<!ENTITY d "http://www.w3.org/2001/XMLSchema#">
<!ENTITY y "http://www.mpii.de/yago#">]>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xml:base="http://www.mpii.de/yago"
  xmlns:y="http://www.mpii.de/yago/">
  <rdf:Description rdf:about="&y;Abraham_Lincoln"><y:hasSuccessor rdf:ID="f200876173" rdf:resource="&y;Thomas_L._Harris"/></rdf:Description>
  <rdf:Description rdf:about="#f200876173"><y:confidence rdf:datatype="&d;double">0.9486150988008782</y:confidence></rdf:Description>
</rdf:RDF>

And a fact serialized in RDF/N3 (a different fact here, but the reification structure is the same):

@base <http://www.mpii.de/yago> .
@prefix y: <http://www.mpii.de/yago/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<#Abraham_Lincoln> y:politicianOf <#United_States> .
<#f201920397> rdf:type rdf:Statement ;
  rdf:subject <#Abraham_Lincoln> ;
  rdf:predicate y:politicianOf ;
  rdf:object <#United_States> ;
  y:confidence "0.967356428105286"^^xsd:decimal .

Since Yago is a special case that uses reification extensively for all of its facts, you can’t gain significant disk space by serializing in N3: the gain is at best marginal.
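
A side benefit of this design is that the reification quads can be queried directly, confidence scores included. Here is a minimal SPARQL sketch (assuming the same graph URI as in the queries below) that lists the facts about Abraham Lincoln having a confidence above 0.9:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?predicate ?object ?confidence
from <http://www.mpii.de/yago/>
where
{
  ?fact rdf:type rdf:Statement ;
        rdf:subject <http://www.mpii.de/yago#Abraham_Lincoln> ;
        rdf:predicate ?predicate ;
        rdf:object ?object ;
        <http://www.mpii.de/yago/confidence> ?confidence .
  filter(?confidence > 0.9)
}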

Some queries

What would be the usefulness of Yago without being able to query it? There wouldn’t be any, so let’s test it with some SPARQL queries.

Question 1: What is the name of the place where Andre Agassi lives?

SPARQL query:

select *
from <http://www.mpii.de/yago/>
where
{
  <http://www.mpii.de/yago#Andre_Agassi> <http://www.mpii.de/yago/livesIn> ?place .
  ?place <http://www.mpii.de/yago/isCalled> ?place_name .
}

Result: “Las Vegas”

Question 2: What are the other films produced by the producer of the movie Blade Runner?

SPARQL query:

select *
from <http://www.mpii.de/yago/>
where
{
  ?producer <http://www.mpii.de/yago/produced> <http://www.mpii.de/yago#Blade_Runner> .
  ?producer <http://www.mpii.de/yago/produced> ?other_movies .
}

Result: “The Italian Job”, “Murphy’s War”, “Robbery”

And so on. It is that simple. If you do not know the URI of an entity, you only have to refer to its label using the property isCalled.
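
For example, here is a minimal sketch of such a label lookup (the label "Blade Runner" is just an assumption for illustration; any label indexed with isCalled would work the same way):

select ?entity
from <http://www.mpii.de/yago/>
where
{
  ?entity <http://www.mpii.de/yago/isCalled> "Blade Runner" .
}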

Considering the fact that we know the properties that are described within Yago, and considering that all properties are used consistently within Yago, it becomes quite easy to get interesting results by querying the dataset.
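
As one last sketch, the predicates in the list above can be combined freely. Something like this query (same graph URI as above) would list prize winners together with the place where they were born:

select ?person ?prize ?place
from <http://www.mpii.de/yago/>
where
{
  ?person <http://www.mpii.de/yago/hasWonPrize> ?prize .
  ?person <http://www.mpii.de/yago/bornIn> ?place .
}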

Conclusion

This new version is a clear leap ahead. It continues to be as simple as the first version. It is enhanced with more entities and more predicates, but remains consistent, with a really good accuracy level.

I would like to see one more thing from Yago: the ability to dereference its URIs on the Web. I will check with Fabian about making all these URIs dereferenceable, so expect another blog post announcing this in the coming days or weeks.

11 thoughts on “Second version of Yago: more facts and entities”

  1. Excellent post, Fred!

    I agree that YAGO is an impressive body of work and evidence again of the usefulness of Wikipedia as a source of named entities, categories and relations. WordNet is also an improvement over Wikipedia alone for setting the baseline ontology.

    According to my initial review, here are the major changes between the first and second (the current) versions of YAGO:

    — Facts: 5 M → 15 M
    — Entities: 1.05 M → 1.7 M
    — Relations (predicates): 14 → 92
    — now exploits infoboxes (a la DBpedia)
    — expands concept ‘type checking’ for QA/QC.

    But what really impresses me about YAGO is its clean, simple design and the use of “confidence” scores on its facts. This kind of approach will become even more important as additional, “dirtier” data gets incorporated into linked data on the Web.

    Fabian & Co. deserve the community’s thanks, as do you for this nice write-up!

  2. Hi Frederick!

    I just noticed that your feed
    http://fgiasson.com/blog/index.php/feed/
    is infested with spam.

  3. Hi Thomas!

    Thanks a lot for reporting this! It is the first time I have noticed that; it really seems to be a crappy spamming bot exploiting some WordPress bug.

    It seems that WordPress is aware of the issue, but it is not fixed so far (WordPress 2.1 to 2.3.1 seem to be affected by the bug).

    I installed something that will notify me if this happens again, so that I can delete it.

    Will monitor this, thanks!

    Take care,

    Fred

  4. Hi Frederick,

    Very good post. I am investigating the YAGO ontology these days and was wondering about the possibility of accessing YAGO via SPARQL queries. What I am trying to do is find resources that correspond to the entities (like names, companies) that appear in web pages and mark them up with RDFa as resources, to enable finding more information about them.

    Do you or anyone else know of any public SPARQL endpoints for the YAGO ontology? If there is one, that would help me minimise the overhead of having to maintain a server to provide the SPARQL endpoint.

    Regards

    Rohana

  5. Hi Rohana!

    Unfortunately I don’t maintain one anymore. So I think the only thing you will be able to do is to download their RDF dump and index it in some triple store yourself.

    Thanks!

    Take care,

    Fred

  6. Hi,
    Where can I download Yago’s RDF data dumps so that I can load them into a PostgreSQL database? Also, can you suggest some good methods for loading these triples into a relational database?

    1. Hi!

      Here is the homepage of the Yago project. You can download it from there. Then I would strongly suggest that you learn how to use a triple store such as Virtuoso in order to load and work with RDF triple data.

      Loading RDF data into a traditional RDB is not the problem; the problem is using that data. You really want a system that lets you query your data using the SPARQL query language.

  7. Thank you very much for the response. Yes, I want to load the RDF dump into some RDB. I used Jena SDB since it allows persistent data storage. I am using Jena through Eclipse; in this case, is there any better way to query the database? Also, can you suggest some other methods for loading RDF into Postgres?

    1. I am not sure why you want to have RDF in Postgres. It is true that most of the triple stores out there use conventional relational database systems under the hood. However, when the time comes to query RDF data using a triple store, you should be querying it using SPARQL. If that is the case, then you don’t really care whether it is in an RDB or not; you simply choose the triple store that fits your needs (and budget?).

  8. Thank you for the response. Basically, I want to try building indexes in the database to improve my query performance. So, I just need a platform where I can fire SPARQL queries and also perform the SPARQL-to-SQL conversion. Are there any good existing methods for that? I tried Python RDFLib, but due to some version incompatibility between libraries, I couldn’t see how it works there.

  9. Well, most (if not all by now) triple stores are in fact quad stores. They have four columns: GSPO (G for graph). They then have different indexing strategies over these four columns. However, all this indexing is mostly handled by the SPARQL engine, which will use different strategies to optimize SPARQL queries (this is in fact one strategy amongst others).

    What I am saying here is that this kind of consideration should be handled by the quad store, and not by the developers (i.e. you). However, this may be the reason why you choose one quad store over another: because it performs better for the kind of task you require.
