Zitgist Got its Orchestrator

I am pleased to finally be able to say that Mike Bergman is the new Chief Executive Officer of Zitgist LLC. After months of discussions, hard work, planning and development, Mike has officially become the new CEO, and Zitgist has taken a giant leap forward.

The first contact

My first collaboration with Mike was related to the UMBEL project. Mike had an idea and I wanted to help him make it real. At that time I didn’t know that my participation in UMBEL and my collaboration with Mike would change Zitgist forever.

Months later I released a new prototype project called zLinks. This project was the tipping point of my collaboration with Mike. However, even at that time, I didn’t know how these two projects would change Zitgist forever.

Those first months were a warm-up session for Mike and me. Everything started from there; we were ready to work together.

Working together

Since then we have worked together to forge Zitgist, to shape it to Kingsley’s, Mike’s and my vision. The process hasn’t always been easy. Each day brings its challenges, opportunities and work. We spent months talking about Zitgist’s vision, voice, goals and direction.

Considering Zitgist’s business, people might think that everything revolves around technologies and high-tech research and development. But today I would say that those things are nearly secondary. Certainly, activities, services and products are at the center of our discussions; however, we found that the center of everything was: communication.

Communication

Mike lives in Iowa, Kingsley in Boston, and I in Quebec City. The three of us have different cultures, different native languages, and live in different places.

On the other hand, Zitgist is a company that provides services and creates products to help people and businesses interlink their data: to realize the value of global data assets. We try to make data easier to communicate, publish and share.

We belong to the semantic web community. We talk and collaborate with people from around the world, with different cultures and languages. We talk about a domain (the semantic web) that is not yet fully defined and that is still highly academic. We are still juggling concepts and terminology that we try to share with the community and with people outside it.

Given that, all challenges can be captured in one word: communication.

We have to communicate our ideas and vision; we have to sell our services and products; we have to make data richer and easier to use and understand; we have to create a vision, a voice and a language. So yes, this is all about communication. But even more: it is all about human communication; communicating with people and companies in different languages and with different cultures.

We understand one aspect of the semantic web vision as machines talking to machines. But Zitgist’s challenge is to talk with people.

Mike is now the new orchestrator of Zitgist; it is time for us to communicate our voice to the world.

A new Zitgist

This process forged Zitgist. All the discussions we had, all the ideas we challenged and all the ways we experimented with speaking to the outside world forged Zitgist’s vision and voice. The time we put into making Mike the new CEO completely changed Zitgist’s dynamic. We were not just talking about hiring someone; we were talking about growing a business and achieving a shared vision and voice. Once more, it was about communicating ideas, concepts and vision.

It is all about communication.

Thanks for joining us, Mike.

More references about this news

The official press release
Mike’s personal perspective

Networks are everywhere

Never forget that networks are everywhere. In fact, I have the feeling that anything that has relations with other things can be seen as part of a network: the so-called social networks, phone networks, DNA networks, protein networks, subject networks, web page networks, airport networks, street networks, and so on.

In an article about the upcoming Twine, Marshall Kirkpatrick said one particular thing that made my eyebrows rise:

“I would use Twine for recommendation alone, but the value of that feature is minimal until the service finds a large number of users. As it stands, that’s not likely to occur. When it comes to collective organization and discovery of content – nothing is as important as network effect.”

The problem I have with this sentence is that it makes me think Marshall is saying that: network effect == people collaborating in the same, closed system (à la Del.icio.us).

The key thing here is that network effects can take place in many kinds of networks, and in many places. So, does Twine, or any other so-called semantic web application, need millions of users to leverage (create value from) the network effects of different kinds of networks? I don’t think so.

Network effects will emerge from the interaction of different services, the linkage of different data sources, and the work of millions of people. Who will own all these things? The Web. Then businesses will leverage that Web, as they currently do, to create value for users.

So, is Twine, or any other so-called semantic web application, doomed because of its lack of a user base? I would guess not. It all depends on what network you’re talking about…

Trusting people on the Web

An interesting post appeared in my feed reader this morning. This post, published on Slashdot, says:

“[…] a Newsweek piece suggests that the era of user-generated content is going to change in favor of fact-checking and more rigorous standards. […] “User-generated sites like Wikipedia, for all the stuff they get right, still find themselves in frequent dust-ups over inaccuracies, while community-posting boards like Craigslist have never been able to keep out scammers and frauds. Beyond performance, a series of miniscandals has called the whole “bring your own content” ethic into question. Last summer researchers in Palo Alto, Calif., uncovered secret elitism at Wikipedia when they found that 1 percent of the reference site’s users make more than 50 percent of its edits. Perhaps more notoriously, four years ago a computer glitch revealed that Amazon.com’s customer-written book reviews are often written by the book’s author or a shill for the publisher. ‘The wisdom of the crowds has peaked,’ says Calacanis. ‘Web 3.0 is taking what we’ve built in Web 2.0–the wisdom of the crowds–and putting an editorial layer on it of truly talented, compensated people to make the product more trusted and refined.’”

What is probably the best way to sell something to someone? Having someone they trust recommend it, for X, Y and Z reasons. That is possibly why blogs are so powerful at selling things. You have people who write about their lives and their passions. From time to time they write about things they bought and really liked. They are not paid for it; they just share their experience with other people. What if someone you have learned to trust over time, by reading their blog, tells you that one of the things you wanted to buy, but were unsure about for some reason, is an awesome thing to have? You will more than likely be willing to buy it right away, online or in a local store. This is only possible because of the trust you have in this blogger, a trust built over time while reading their blog.

At least, that is what happens with me, and I hope I am not alone.

The problem outlined in this article is that the trust link between web readers and content creators has been broken. In systems such as Amazon.com and eBay.com, your user identity lives on its own, only within those systems. So you, as a reader and consumer on these web sites, only have access to the things these content creators said on these specific web sites. You don’t have access to the other things they have written elsewhere on the Web. This means you only have partial and incomplete information with which to trust a person who said something about what you are reading, or about what you are about to buy. This is more a question of faith than a question of “trusting the crowd”.

Calacanis said: ‘Web 3.0 is taking what we’ve built in Web 2.0–the wisdom of the crowds–and putting an editorial layer on it of truly talented, compensated people to make the product more trusted and refined’. First of all, please stop using the term Web 3.0 for anything; just stop using it at all… Beyond that, I don’t think the benefits would be enough to justify the costs of such a system powered by a crowd of “experts”. In that case, is the whole thing doomed?

The main force in action here is trust. The idea is to strengthen the level of trust between people across all web sites. What if, from a comment published by a user on Amazon.com, I could find the URL of their blog, see the ratings they got from eBay.com users, and read other comments they wrote on other web sites and blogs? What if I could learn more about a person from any location on the Web, starting from a comment they wrote?

Then I could start building a better trust relationship with that person, and put more weight on what they say.
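As a purely illustrative sketch of how the semantic web could express this kind of cross-site identity, a person could publish a FOAF description (in Turtle) that ties together their blog and the accounts they hold on different services; all the URIs below are made up for the example:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Hypothetical FOAF file in which one person links together
# their blog and the accounts they hold on other services.
<http://example.org/people/alice#me> a foaf:Person ;
    foaf:name "Alice" ;
    foaf:weblog <http://blog.example.org/alice/> ;                # the blog where trust was built
    foaf:holdsAccount <http://amazon.example/users/alice> ,       # the account writing the reviews
                      <http://ebay.example/users/alice> .         # the account earning the ratings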

Welcome to the Semantic Web.

Data Referencing, Data Mobility and the Semantic Web

I recently started to follow the discussions revolving around the Data Portability project. It is an emerging community of people trying to define principles and push technologies that encourage the “portability” of data between people and systems. Other such initiatives exist, such as the Linking Open Data community (which emerged from the semantic web community more than a year ago) and the Open Knowledge Definition, and there are probably many others. However, Data Portability is the one that has recently received the most media coverage, in terms of “support” and coverage from various people and groups.

An interesting thread emerged on the mailing list, trying to arrive at a better definition of what “Data Portability” means.

Henry Story opened the door to “linked data” (as opposed to moving data), and Kingsley nailed down the two important points of distinction:

  1. Data Referencing
  2. Data Mobility (moving data from distinct locations via Import and Export using agreed data formats)

What does the Semantic Web mean in this context?

What do these two critical points mean in terms of semantic web concepts and technologies?

Defining the context

This discussion is articulated within one context: the Web. It assumes that all data is available on the Web, which means using Web technologies, protocols, standards and concepts. The ideas could be extended to other networks, with other protocols and technologies, but we will focus the discussion on the Web.

Data Referencing

How is data referencing handled on the semantic web? Well, much information is available on that question on the Linked Data Wikipedia page. Basically, it is about referencing data (resources) using URIs (Uniform Resource Identifiers), and these URIs should ideally be “dereferencable” on the Web. What does “dereferencable on the Web” mean? It means that if I have a user account on a certain web service, and I have a URI that identifies that account, and that URI is in fact a URL, then I can get data (normally an RDF document; in this example, an RDF document describing that user account) by looking up this URL on the Web. In that case, we say that the URI is dereferencable on the Web.

This means one wonderful thing: if I get a reference (a URI) to something, then in the best case I can also get data describing that thing by looking up its description on the Web. So, instead of getting an HTML page describing that thing (this can be the case, but it is not limited to that), I can also get the RDF description of that thing (via web server content negotiation). This RDF description can be used by any web service, software agent, or anything else to help me perform specific tasks using this data (importing/exporting my personal data, merging two agendas into the same calendar, planning my next trips, and so on).
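As an illustration, here is a minimal sketch of the kind of RDF document (serialized in N3/Turtle) that such a lookup could return for a user account. The service URL is made up, and the use of FOAF’s OnlineAccount class is just one possible modelling choice, not a prescribed format.

# Hypothetical Turtle returned when the made-up account URI
# <http://service.example.com/users/fred> is dereferenced and RDF is
# requested through content negotiation (instead of the HTML page).
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://service.example.com/users/fred>
    a foaf:OnlineAccount ;                                       # the resource is a user account
    foaf:accountName "fred" ;                                    # the account's login name
    foaf:accountServiceHomepage <http://service.example.com/> .  # the service hosting the account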

Now that I have a way to easily reference and access any data on the Web, how can that accessible data become “mobile”?

RDF and Ontologies to make data “mobile”

RDF is a way to describe things called “resources”. These resources can be anything: people, books, places, events, etc. There exists a mechanism that lets anybody describe things according to their properties (predicates). The result of this mechanism is a graph of relationships describing a thing (a resource). This mechanism does not only describe the properties of a thing; it also describes relationships between different things. For example, a person (a resource) can be described by its physical properties, but also by its relations with other people (other resources). Think of a social graph.

What is this mechanism? RDF.

Ontologies as vocabulary standards

However, RDF can’t be used alone. To make it effective, one needs to use “vocabularies”, called ontologies, to describe a resource and its properties. These ontologies can be seen as controlled vocabularies defined by a community of experts to describe some domain of things (books, music, people, networks, calendars, etc.). An ontology is much more than a controlled vocabulary, but it is easier to understand it that way.

FOAF is one of these vocabularies. You can use this ontology to describe a person, and their relations with other people, in RDF. So you would say: this resource is named Fred; Fred lives near Quebec City; and Fred knows Kingsley. And so on.
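As a rough sketch of what those three statements could look like (the example.org URIs are made up for illustration), here they are written in Turtle using FOAF:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.org/people#> .

# "This resource is named Fred"
ex:fred a foaf:Person ;
    foaf:name "Fred" ;
    # "Fred lives near Quebec City"
    foaf:based_near ex:Quebec_City ;
    # "Fred knows Kingsley"
    foaf:knows ex:kingsley .

ex:kingsley a foaf:Person ;
    foaf:name "Kingsley" .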

By using RDF + ontologies, data is easily made mobile. By using standards that communities, people and enterprises agree to use, systems become able to read, understand and manage data coming from multiple different data sources.

Ontologies are standards ensuring that all the people and systems that understand them can understand the data that is described and made accessible. This is where data becomes movable (mobility is not only about availability for download; it is also about understanding the transmitted data).

Data description robustness

But you know what the beauty of RDF is? If a system doesn’t know an ontology, or doesn’t understand all the classes and properties of an ontology used to describe a resource, it will simply ignore that data and concentrate its effort on understanding the thing as described with the ontologies it does know. It is as if I spoke to you, in the same conversation, in French, English, Italian and Chinese. You would only understand what I said in the languages you know, act on the parts of the conversation you understood, and simply discard the things you don’t understand.
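To illustrate, here is a small made-up description mixing FOAF with a deliberately unknown ex: vocabulary; a FOAF-aware system could use the foaf: triples and silently skip the rest:

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.org/some-unknown-ontology#> .    # hypothetical vocabulary

<http://example.org/people#fred>
    foaf:name "Fred" ;             # understood by any system that knows FOAF
    ex:shoeSize 11 ;               # unknown property: ignored, not an error
    ex:favoriteCheese "Oka" .      # unknown property: ignored, not an error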

Conclusion

Well, it is hard to fit all these things into a single blog post, but I would encourage people who are not familiar with these concepts, terminologies and technologies, and who are interested in the question, to start reading what the semantic web community has written about them, what standards are supported and developed by the W3C, and so on. There are so many things here that can change the way people use the Web today. It is just a question of time, in fact!

Second version of Yago: more facts and entities

In the past month or two I have become more and more interested in the Yago project. It first gave me the opportunity to meet a really interesting person, the main author of Yago, Fabian Suchanek. I have been impressed by the simplicity (and creating simple things from such complex material is certainly one of the hardest tasks out there) and the coverage of Yago. It is well built and based on solid foundations. It was after downloading it, converting it into RDF, indexing it into a triple store, and fixing serialization glitches and semantic relation issues that I really started to appreciate all the work that has been put into the project.

I am now pleased to write about the next version of Yago, which has recently been released by Fabian & Co. The paper describing this new version (written by Fabian, Gjergji Kasneci and Gerhard Weikum) was published about a week ago, and the new dataset was released a couple of days ago. After fixing one last issue with the conversion of the Yago dataset into RDF, I am now ready to write something about it.

First of all, what is Yago? Yago is a kind of ontology. It is a dataset composed of entities and facts about these entities. It describes things such as: Abraham Lincoln (entity) is the successor (fact) of James Buchanan (entity). All these entities and facts come from two data sources: Wikipedia and WordNet. Please read Fabian’s paper to know exactly what comes from where.

Yago has its own representation and logic framework. However, converters exist to turn the Yago dataset into RDF serialized in XML, or into other formats. Just to demonstrate how complete Yago is by itself, a query language has been created specifically to query it. However, one can also convert the Yago dataset into RDF, index it in a triple store, and query the same information using SPARQL (which is what I have done). To read about these frameworks, and about how Yago works internally, read the presentation paper written by Fabian.

So, what is new with this second version of Yago?

There are about 500,000 additional entities (the Yago dataset now counts about 1,500,000 entities).

Also, many new predicates have been added in this new version; here is the list of 99 predicates available for building queries:

actedIn, bornIn, bornOnDate, created, createdOnDate, dealsWith, describes, diedIn, diedOnDate, directed, discovered, discoveredOnDate, domain, during, during, establishedOnDate, exports, familyNameOf, foundIn, givenNameOf, graduatedFrom, happenedIn, hasAcademicAdvisor, hasArea, hasBudget, hasCallingCode, hasCapital, hasChild, hasCurrency, hasDuration, hasEconomicGrowth, hasExpenses, hasExport, hasGDPPPP, hasGini, hasHDI, hasHeight, hasImdb, hasImport, hasInflation, hasISBN, hasLabor, hasMotto, hasNominalGDP, hasOfficialLanguage, hasPages, hasPopulation, hasPopulationDensity, hasPoverty, hasPredecessor, hasProduct, hasProductionLanguage, hasRevenue, hasSuccessor, hasTLD, hasUnemployment, hasUTCOffset, hasValue, hasWaterPart, hasWebsite, hasWeight, hasWonPrize, imports, influences, inLanguage, interestedIn, inTimeZone, inUnit, isAffiliatedTo, isCalled, isCitizenOf, isLeaderOf, isMarriedTo, isMemberOf, isNativeNameOf, isNumber, isOfGenre, isPartOf, isSubstanceOf, livesIn, locatedIn, madeCoverFor, means, musicalRole, originatesFrom, participatedIn, politicianOf, produced, publishedOnDate, range, since, subClassOf, subPropertyOf, type, until, using, worksAt, writtenInYear, wrote

Also, the converted RDF dump is much, much bigger than the previous one. In fact, the generated RDF dump is about 15 gigabytes.

Trying to slim the RDF dump using N3 serialization

After noticing the size of the RDF dump serialized in XML, I checked whether we could slim the dump down a bit by serializing all the RDF in N3/Turtle instead of XML.

However, it was not conclusive. Apart from the friendlier look of the N3 code compared to the XML, there is no real gain in terms of space. The reason is that Yago makes extensive use of reification to assert statements about triples (facts). Since there is no shorthand reification syntax in N3 (or Turtle), we have to spell out the reification statement at length, like this:

An RDF/XML Yago fact:

<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [<!ENTITY d "http://www.w3.org/2001/XMLSchema#">
<!ENTITY y "http://www.mpii.de/yago#">]>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:base="http://www.mpii.de/yago"
xmlns:y="http://www.mpii.de/yago/">
<rdf:Description rdf:about="&y;Abraham_Lincoln"><y:hasSuccessor rdf:ID="f200876173" rdf:resource="&y;Thomas_L._Harris"/></rdf:Description>
<rdf:Description rdf:about="#f200876173"><y:confidence rdf:datatype="&d;double">0.9486150988008782</y:confidence></rdf:Description>
</rdf:RDF>

And its RDF/N3 counterpart (shown here for a different fact):

@base <http://www.mpii.de/yago> .
@prefix y: <http://www.mpii.de/yago/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<#Abraham_Lincoln> y:politicianOf <#United_States> .
<#f201920397> rdf:type rdf:Statement ;
rdf:subject <#Abraham_Lincoln> ;
rdf:predicate y:politicianOf ;
rdf:object <#United_States> ;
y:confidence "0.967356428105286"^^xsd:decimal .

Since Yago is a special case that uses reification extensively for all of its facts, you can’t gain significant disk space by serializing it in N3: the gain is marginal at best.

Some queries

What would be the usefulness of Yago if we couldn’t query it? There wouldn’t be any; so let’s test it with some SPARQL queries.

Question 1: What is the name of the place where Andre Agassi lives?

SPARQL query:

sparql
select *
from <http://www.mpii.de/yago/>
where
{
<http://www.mpii.de/yago#Andre_Agassi> <http://www.mpii.de/yago/livesIn> ?place.
?place <http://www.mpii.de/yago/isCalled> ?place_name.
}

Result: “Las Vegas”

Question 2: What are the other films produced by the person who produced the movie Blade Runner?

SPARQL query:

sparql
select *
from <http://www.mpii.de/yago/>
where
{
?producer <http://www.mpii.de/yago/produced> <http://www.mpii.de/yago#Blade_Runner>.
?producer <http://www.mpii.de/yago/produced> ?other_movies.
}

Result: “The Italian Job”, “Murphy’s War”, “Robbery”

And so on. It is that simple. If you do not know the URI of an entity, you only have to refer to it by its label, using the property isCalled.
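For example, here is a sketch of the first query rewritten to start from a label instead of a URI. It assumes the plain literal “Andre Agassi” is exactly how the label is stored in the dump (it could just as well be typed or language-tagged):

sparql
select ?place_name
from <http://www.mpii.de/yago/>
where
{
?person <http://www.mpii.de/yago/isCalled> "Andre Agassi".
?person <http://www.mpii.de/yago/livesIn> ?place.
?place <http://www.mpii.de/yago/isCalled> ?place_name.
}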

Considering that we know the properties used for descriptions within Yago, and that all properties are used consistently, it becomes quite easy to get interesting results by querying the dataset.

Conclusion

This new version is a clear leap forward. It remains as simple as the first version. It is enhanced with more entities and more predicates, while remaining consistent, with a really good level of accuracy.

I would like to see one more thing with Yago: being able to dereference its URIs on the Web. I will check with Fabian about making all these URIs dereferencable on the Web. So expect another blog post announcing this in the coming days or weeks.