Second version of Yago: more facts and entities

In the past month or two I have become more and more interested in the Yago project. It first gave me the opportunity to meet a really interesting person, the main author of Yago, Fabian Suchanek. I have been impressed by the simplicity of Yago (and creating simple things out of such complex material is certainly one of the hardest tasks out there) and by its coverage. It is well built and based on solid foundations. It was after downloading it, converting it into RDF, indexing it into a triple store, and fixing serialization glitches and semantic relation issues that I really started to appreciate all the work that has been put into this project.

I am now pleased to write about the next version of Yago, which has recently been released by Fabian & Co. The paper describing this new version (written by Fabian, Gjergji Kasneci and Gerhard Weikum) was published about a week ago, and the new dataset was released a couple of days ago. After fixing one last RDF issue with the conversion of the Yago dataset into RDF, I am now ready to write something about it.

First of all, what is Yago? Yago is a kind of ontology: a dataset composed of entities and facts about these entities. It describes things such as Abraham Lincoln (entity) being the successor (fact) of James Buchanan (entity). All these entities and facts come from two data sources: Wikipedia and WordNet. Please read Fabian's paper to know exactly what comes from where.

Yago has its own representation and logic framework. However, converters exist to turn the Yago dataset into RDF serialized in XML, or into other formats. To demonstrate how complete Yago is by itself, a query language has even been created explicitly to query it. However, one can also convert the Yago dataset into RDF, index it in a triple store, and query the same information using SPARQL (which is what I have done). To learn about these frameworks, and about how Yago works internally, read Fabian's presentation paper.

So, what is new with this second version of Yago?

There are about 500,000 additional entities (the Yago dataset now counts about 1,500,000 entities).

Also, many new predicates have been added in this new version. Here is the list of the 99 predicates available to build queries (a sketch using a few of them follows the list):

actedIn, bornIn, bornOnDate, created, createdOnDate, dealsWith, describes, diedIn, diedOnDate, directed, discovered, discoveredOnDate, domain, during, establishedOnDate, exports, familyNameOf, foundIn, givenNameOf, graduatedFrom, happenedIn, hasAcademicAdvisor, hasArea, hasBudget, hasCallingCode, hasCapital, hasChild, hasCurrency, hasDuration, hasEconomicGrowth, hasExpenses, hasExport, hasGDPPPP, hasGini, hasHDI, hasHeight, hasImdb, hasImport, hasInflation, hasISBN, hasLabor, hasMotto, hasNominalGDP, hasOfficialLanguage, hasPages, hasPopulation, hasPopulationDensity, hasPoverty, hasPredecessor, hasProduct, hasProductionLanguage, hasRevenue, hasSuccessor, hasTLD, hasUnemployment, hasUTCOffset, hasValue, hasWaterPart, hasWebsite, hasWeight, hasWonPrize, imports, influences, inLanguage, interestedIn, inTimeZone, inUnit, isAffiliatedTo, isCalled, isCitizenOf, isLeaderOf, isMarriedTo, isMemberOf, isNativeNameOf, isNumber, isOfGenre, isPartOf, isSubstanceOf, livesIn, locatedIn, madeCoverFor, means, musicalRole, originatesFrom, participatedIn, politicianOf, produced, publishedOnDate, range, since, subClassOf, subPropertyOf, type, until, using, worksAt, writtenInYear, wrote
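As a quick sketch of what these new predicates enable, a single query can now mix geographic, demographic and linguistic facts about a country. (The entity URI for Germany is my assumption; I have not checked it against the dump.)

select ?capital ?population ?language
from <http://www.mpii.de/yago/>
where
{
  # <http://www.mpii.de/yago#Germany> is an assumed entity URI
  <http://www.mpii.de/yago#Germany> <http://www.mpii.de/yago/hasCapital> ?capital .
  <http://www.mpii.de/yago#Germany> <http://www.mpii.de/yago/hasPopulation> ?population .
  <http://www.mpii.de/yago#Germany> <http://www.mpii.de/yago/hasOfficialLanguage> ?language .
}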

The converted RDF dump is also much, much bigger than the previous one: the generated RDF/XML file is about 15 gigabytes.

Trying to slim the RDF dump using N3 serialization

After noticing the size of the RDF dump serialized in XML, I checked whether the dump could be slimmed down a bit by serializing all the RDF using N3/Turtle instead of XML.

However, the results were not conclusive. Except for the friendliness of the N3 code compared to the XML one, there is no real gain in terms of space. The reason is that Yago makes extensive use of reification to assert statements about triples (facts). Since there is no shorthand reification syntax in N3/Turtle, we have to spell out each reification statement at length, like this:

An RDF/XML Yago fact:

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [<!ENTITY d "http://www.w3.org/2001/XMLSchema#">
<!ENTITY y "http://www.mpii.de/yago#">]>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:base="http://www.mpii.de/yago"
         xmlns:y="http://www.mpii.de/yago/">
  <rdf:Description rdf:about="&y;Abraham_Lincoln">
    <y:hasSuccessor rdf:ID="f200876173" rdf:resource="&y;Thomas_L._Harris"/>
  </rdf:Description>
  <rdf:Description rdf:about="#f200876173">
    <y:confidence rdf:datatype="&d;double">0.9486150988008782</y:confidence>
  </rdf:Description>
</rdf:RDF>

And a similar fact serialized in N3:

@base <http://www.mpii.de/yago> .
@prefix y: <http://www.mpii.de/yago/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<#Abraham_Lincoln> y:politicianOf <#United_States> .

<#f201920397> rdf:type rdf:Statement ;
    rdf:subject <#Abraham_Lincoln> ;
    rdf:predicate y:politicianOf ;
    rdf:object <#United_States> ;
    y:confidence "0.967356428105286"^^xsd:decimal .

Since Yago is a special case that makes extensive use of reification for all of its facts, you can't gain significant disk space by serializing in N3: the gain is marginal at best.
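The upside of this verbosity is that the confidence value attached to each fact remains queryable like any other triple. Here is a sketch of how one could retrieve the facts about Abraham Lincoln together with their confidences, keeping only the most reliable ones (I am assuming the reified statements are indexed in the same named graph used by the queries shown below):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?predicate ?object ?confidence
from <http://www.mpii.de/yago/>
where
{
  # each Yago fact is a reified statement, as in the N3 example above
  ?fact rdf:type rdf:Statement .
  ?fact rdf:subject <http://www.mpii.de/yago#Abraham_Lincoln> .
  ?fact rdf:predicate ?predicate .
  ?fact rdf:object ?object .
  ?fact <http://www.mpii.de/yago/confidence> ?confidence .
  # keep only the facts Yago is most confident about
  FILTER (?confidence > 0.9)
}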

Some queries

What would be the usefulness of Yago without being able to query it? There would be none; so let's test it with some SPARQL queries.

Question 1: What is the name of the place where Andre Agassi lives?

SPARQL query:

select *
from <http://www.mpii.de/yago/>
where
{
  <http://www.mpii.de/yago#Andre_Agassi> <http://www.mpii.de/yago/livesIn> ?place .
  ?place <http://www.mpii.de/yago/isCalled> ?place_name .
}

Result: “Las Vegas”

Question 2: What are the other films produced by the person who produced the movie Blade Runner?

SPARQL query:

select *
from <http://www.mpii.de/yago/>
where
{
  ?producer <http://www.mpii.de/yago/produced> <http://www.mpii.de/yago#Blade_Runner> .
  ?producer <http://www.mpii.de/yago/produced> ?other_movies .
}

Result: “The Italian Job”, “Murphy’s War”, “Robbery”

And so on. It is that simple. If you do not know the URI of an entity, you only have to refer to its label using the property isCalled, as in the following sketch.
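Here is how one could look up the URI of Andre Agassi from his label alone (a sketch: I am assuming the label is stored as a plain literal, without a language tag or datatype):

select ?entity
from <http://www.mpii.de/yago/>
where
{
  # find every entity whose label is "Andre Agassi"
  ?entity <http://www.mpii.de/yago/isCalled> "Andre Agassi" .
}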

Considering that we know which properties describe entities within Yago, and that all these properties are used consistently, it becomes quite easy to get interesting results by querying the dataset.

Conclusion

This new version is a clear leap ahead. It continues to be as simple as the first version. It is enhanced with more entities and more predicates, but remains consistent, with a really good accuracy level.

I would like to see one more thing with Yago: being able to dereference its URIs on the Web. I will check with Fabian about making all these URIs dereferenceable. So expect another blog post announcing this in the following days or weeks.

Why read DataViewer pages instead of conventional web pages?

Yesterday Georgi Kobilarov asked this question after reading my blog post about the DataViewer:

“could you elaborate on an example in which the Zitgist Browser / DataViewer enables me (the end-user) to do something I couldn’t do by reading web documents or even do something faster than by reading web documents?”

This is a good and legitimate question. However, the first question should be: what are Semantic Web documents (RDF documents) used for? The idea, presumably, is to give the Web back to machines, so that they can access and process its data more easily.

In such a vision, a tool like the DataViewer can indeed raise some questions, as yours does.

So, what is the usefulness of such a tool to end-users?

If you tell me that your home page contains the same information as the RDF document describing the entities available from it (and all the information about them), then yes, a human will read your home page more easily (since HTML documents are built for humans).

But one characteristic of RDF, as we can see with the emergence of the Linking Open Data community, is that data (entities) can be linked together, just as web pages are.

Entities, and thus the data describing these entities, are meshed together. This means that my personal profile can be linked to the semantic web document describing the city where I live; it can also be linked to the articles I wrote; to my friends and co-workers; to the company I currently work for; to the projects I worked on and the ones I am currently working on; and so on.

All this information is accessible from one place: my personal profile's semantic web document. All the linked information is meshed together and displayed in the DataViewer. So, instead of browsing all these web pages, you get all the information displayed in a single DataViewer page.
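To make this more concrete, here is a sketch of the kind of query a DataViewer-like tool could run behind the scenes to mesh the data linked from a profile into one page. The profile URI is hypothetical and I am assuming a FOAF-style profile; in practice the DataViewer follows the links by dereferencing each document rather than hitting a single endpoint.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
select ?friend ?interest ?workplace
where
{
  # <http://example.org/people/fred#me> is a hypothetical profile URI
  <http://example.org/people/fred#me> foaf:knows ?friend .
  OPTIONAL { <http://example.org/people/fred#me> foaf:interest ?interest . }
  OPTIONAL { <http://example.org/people/fred#me> foaf:workplaceHomepage ?workplace . }
}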

Given this way of seeing things, I envision that in the future people will craft data, describing things and linking them together, instead of crafting web pages. The data itself will drive the generation (and optimization) of the user interfaces that display it.

The idea behind the current DataViewer is simple: shapeshifting the user interface to maximize the display of meshed data. The same principles can apply to other systems such as email clients, RSS readers, web widgets, etc. The DataViewer is a kind of semantic web data management system. What it produces, for the moment, is an HTML document. But I can assure you that you will see it incarnated in other services and in other environments.

The DataViewer is not the answer to all the problems of the world; however, it tries to do one thing well: managing the presentation of semantic web data to end-users. At the moment it takes the incarnation of a web page, but in the future it will be extended to other things. What is important are the principles, concepts and vision supporting the development of such presentation and management tools.

But we shouldn't fool ourselves here: web pages (HTML pages) have always been created, developed and crafted for humans. They are optimized for human understanding. An interface such as the DataViewer can't compete with a well-crafted web page. However, given access to well-crafted data sources, it can do things that a web page can't: it can present, in a meaningful way, information that comes from a myriad of data sources, all in the same page, ready to give the user the information they are looking for immediately.