One of Semantic Web’s Core Added Value

If I ask the question: “What added value(s) does the Semantic Web brings on the table?”. So, what are the benefits that companies and organizations would get from using the Semantic Web? I am pretty sure that after asking this question, I would get answers such as:
  • You will instantly be able to traverse graphs of relationships
  • You will be able to infer facts (so create/persist new knowledge) from other existing facts
  • You will be able to check to make sure that your knowledge base is consistent and satisfiable
  • You will be able to modify your ontologies/vocabularies/schemas without impacting the description of your instance records or the usability of any software that use it (unlike relation databases)
  • And so on…

All these answers would be accurate. However, what if these answers would only be a part of the real added value that the Semantic Web brings on the table?

Note: when I refer to the “Semantic Web” on this blog post (and across all my writings), I refer to a set of technologies, techniques and concepts referred as the Semantic Web. So it is not a single thing, but a complete set of things that creates new ways of working with, and manipulating, information.

Strong of about 7 years of research and development of Semantic Web technologies that includes about 3 years of developing the Open Semantic Framework, that the biggest added value that I found from utilizing Semantic Web technologies is only partially related to these answers. In fact the biggest added value for me, as a developer can be resumed in one word:


As simple as this. The biggest added value I gained from using and applying Semantic Web related technologies, techniques and concepts is an important increase in development, and data integration productivity.

Such productivity gain as to do with one of Semantic Web’s core attribute:


This is what I was suggesting in my latest blog post about Volkswagen’s use of the Open Semantic Framework: how Volkswagen uses the Open Semantic Framework to get flexibility that will lead to a gain in productivity to integrate, publish, and re-contextualize their data assets. The few gains that I listed above are part of the reason why the Semantic Web gives you flexibility that leads to an increase in productivity.

This same point as been re-affirmed today by Lee Feigenbaum in its latest blog post Saving Months, Not Milliseconds: Do More Faster with the Semantic Web:

Why is this? Ultimately, it’s because of the inherent flexibility of the Semantic Web data model (RDF). This flexibility has been described in many different ways. RDF relies on an adaptive, resilient schema (from Mike Bergman); it enables cooperation without coordination (from David Wood via Kendall Clark); it can be incrementally evolved; changes to one part of a system don’t require re-designs to the rest of the system. These are all dimensions of the same core flexibility of Semantic Web technologies, and it is this flexibility that lets you do things fast with the Semantic Web.

Warning: Productivity is not synonymous with simplicity

However, I would warn people that think that productivity gains are possible because semantic web technologies are simpler to use, manage and implement than other existing technologies.

It is certainly not the case, and I don’t think it will ever be. Semantic Web technologies, techniques and concepts are not easy to understand, and have a big learning curve. This is partly true because these techniques, technologies and concepts are relatively new in the field of the computer sciences, and because they are not fully understood, defined, implemented and used.

When Linked Data Rules Fail

Image Source:

High Visibility Problems with NYT, Show Need for Better

When I say, “shot”, what do you think of? A flu shot? A shot of whisky? A moon shot? A gun shot? What if I add the term “bank”? Do you now think of someone being shot in an armed robbery of a local bank or similar?

And, now, what if I add a reference to say, The Hustler, or Minnesota Fats, or “Fast Eddie” Felson? Do you now see the connection to a pressure-packed banked pool shot in some smoky bar room?

As humans we need context to make connections and remove ambiguity. For machines, with their limited reasoning and inference engines, context and accurate connections are even more important.

Over the past few weeks we have seen announcements of two large and high-visibility linked data

projects:  One, a first release of references for articles concerning about 5,000 people from the New York Times at; and Two, a massive exposure of 5 billion triples from datasets provided by the Tetherless World Constellation (TWC) at Rennselaer Polytechnic Institute (RPI).

On various grounds from licensing to data characterization and to creating linked data for its own sake, some prominent commentators have weighed in on what is good and what is not so good with these datasets. One of us, Mike, commented about a week ago that “we have now moved beyond ‘proof of concept’ to
the need for actual useful data of trustworthy provenance and proper mapping and characterization. Recent efforts are a disappointment that no enterprise would or could rely upon.”

Reactions to that posting and continued discussion on various mailing lists warrant a more precise dissection of what is wrong and still needs to be done with these datasets [1].

Berners-Lee’s Four Linked Data “Rules”

It is useful, then, to return to first principles, namely the original four “rules” posed by Tim Berners-Lee in his design note on linked data [2]:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names
  3. When someone looks up a URI, provide useful information, using thestandards (RDF, SPARQL)
  4. Include links to other URIs so that they can discover more things.

The first two rules are definitional to the idea of linked data. They cement the basis of linked data in the Web, and are not at issue with either of the two linked data projects that are the subject of this posting.

However, it is the lack of specifics and guidance in the last two rules where the breakdowns occur. Both the NYT and the RPI datasets suffer from a lack of “providing useful information” (Rule #3). And, the nature of the links in Rule #4 is a real problem for the NYT dataset.

What Constitutes “Useful Information”?

The Wikipedia entry on linked data expands on “useful information” by augmenting the original rule with the parenthetical clause, ” (i.e., a structured description — metadata).” But even that expansion is insufficient.

Fundamentally, what are we talking about with linked data? Well, we are talking about instances that are characterized by one or more attributes. Those instances exist within contexts of various natures. And, those contexts may relate to other existing contexts.

We can break this problem description down into three parts:

  • A vocabulary that defines the nature of the instances and their descriptive attributes
  • A schema of some nature that describes the structural relationships amongst instances and their characteristics, and, optimally,
  • A mapping to existing external schema or constructs that help place the data into context.

At minimum, ANY dataset exposed as linked data needs to be described by a vocabulary. Both the NYT and RPI datasets fail on this score, as we elaborate below. Better practice is to also provide a schema of relationships in which to embed each instance record. And, best practice is to also map those structures to external schema.

Lacking this “useful information”, especially a defining vocabulary, we cannot begin to understand whether our instances deal with drinks, bank robberies or pool shots. This lack, in essence, makes the information worthless, even though available via URL.

The (RPI) Case

With the support of NSF and various grant funding, RPI has set up the
Data-Gov Wiki [3], which is in the process of converting the datasets on to RDF,placing them into a semantic wiki to enable comment and annotation, and providing that data as RSS feeds. Other demos are also being placed on the site.

As of the date of this posting, the site had a catalog of 116 datasets from the 800 or so available on, leading to these statistics:

  • 459,412,419 table entries
  • 5,074,932,510 triples, and
  • 7,564 properties (or attributes).

We’ll take one of these datasets, #319, and look a bit closer at it:

Wiki Title Agency Name Link No Properties No Triples RDF File
Dataset 319 Consumer Expenditure Survey Department of Labor LABOR-STAT 22 1,583,236

This report was picked solely because it had a small number of attributes (properties), and is thus easier to screen capture. The summary report on the wiki is shown by this page:

Data-gov-Wiki Dataset #319

(click to expand)

So, we see that this specific dataset contains about 22 of the nearly 8,000 attributes across all datasets.

When we click on one of these attribute names, we are then taken to a specific wiki page that only reiterates its label. There is no definition or explanation.

When we inspect this page further we see that, other than the broad characterization of the dataset itself (the bulk of the page), we see at the bottom 22 undefined attributes with labels such as item code, periodicity code, seasonal, and the like. These attributes are the real structural basis for the data in this dataset.

But, what does all of this mean???

To gain a clue, now let’s go to the source site for this dataset (#319). Here is how that report looks: Dataset #319

(click to expand)

Contained within this report we see a listing for additional metadata. This link tells us about the various data fields contained in this dataset; we see many of these attributes are “codes” to various data categories.

Probing further into the dataset’s technical documentation, we see that there is indeed a rich structure underneath this report, again provided
via various code lookups. There are codes for geography, seasonality (adjusted or not), consumer demographic profiles and a variety of consumption categories. (See, for example, the link to this glossary page.) These are the keys to understanding the actual values within this dataset.

For example, one major dimension of the data is captured by the attribute item_code. The survey breaks down consumption expenditures within the broad categories of  Food, Housing, Apparel and Services, Transportation, Health Care, Entertainment, and Other. Within a category, there is also a rich structural breakdown. For xample, expenditures for Bakery Products within Food is given a code of FHC2.

But, nowhere are these codes defined or unlocked in the RDF datasets. This absence is true for virtually all of the datasets exposed on this wiki.

So, for literally billions of triples, and 8,000 attributes, we have ABSOLUTELY NO INFORMATION ABOUT WHAT THE DATA CONTAINS OTHER THAN A PROPERTY LABEL. There is much,much rich value here in, but all of it remains locked up and hidden.

The sad truth about this data release is that it provides absolutely no value in its current form. We lack the keys to unlock the value.

To be sure, early essential spade work has been done here to begin putting in place the conversion infrastructure for moving text files, spreadsheets and the like to an RDF form. This is yeoman work important to ultimate access. But, until a vocabulary is published that defines the attributes and their codes so we can unlock this value, it will remain hidden. And only when its further value (by connecting attributes and relations across datasets) through a schema of some nature is also published, the real value from connecting the dots will also remain hidden.The Hustler

These datasets may meet the partial conditions of providing clickable URLs, but the crucial “useful information” as to what any of this data means is absent.

Every single dataset on has supporting references to text files, PDFs, Web pages or the like that describe the nature of the data within each dataset. Until that information is exposed and made usable, we have no linked data.

Until ontologies get created from these technical documents, the value of these data instances remain locked up, and no value can be created from having these datasets expressed in RDF.

The devil lies in the details. The essential hard work has not yet begun.

The NYT Case

Though at a much smaller scale with many fewer attributes, the NYT dataset suffers from the same failing: it too lacks a vocabulary.

So, let’s take the case of one of the lead actors in The Hustler, Paul Newman, who played the role of “Fast Eddie” Felson. Here is the NYT record for the “person” Paul
(which they also refer to as Note the header title of Newman, Paul:

NYT 'Paul Newman Articles' Record

(click to expand)

Click on any of the internal labels used by the NYT for its own attributes (such as nyt:first_use), and you will be given this message:

“An RDFS description and English language documentation for the NYT namespace will be provided soon. Thanks for your patience.”

We again have no idea what is meant by all of this data except for the labels used for its attributes. In this case for nyt:first_use we have a value of “2001-03-18”.

Hello? What? What is a “first use” for a “Paul Newman” of “2001-03-18”???

The NYT put the cart before the horse: even if minimal, they should have released their ontology first — or at least at the same time — as they released their data instances. (See further this discussion about how an ontology creation workflow can be incremental by starting simple and then upgrading as needed.)

Links to Other Things

Since there really are no links to other things on the Data-Gov Wiki, our focus in this section continues with the NYT dataset using our same example.

We now are in the territory of the fourth “rule” of linked data: 4. Include links to other URIs so that they can discover more things.

This will seem a bit basic at first, but before we can talk about linking to other things, we first need to understand and define the starting “thing” to which we are linking.

What is a “Newman, Paul” Thing?

Of course, without its own vocabulary, we are left to deduce what this thing “Newman, Paul” is that is shown in the previous screen shot. Our first clue comes from the statement that it is of rdf:type SKOS concept. By looking to the SKOS vocabulary, we see that concept is a class and is defined as:

A SKOS concept can be viewed as an idea or notion; a unit of thought. However, what constitutes a unit of thought is subjective, and this
definition is meant to be suggestive, rather than restrictive. The notion of a SKOS concept is useful when describing the conceptual or intellectual structure of a knowledge organization system, and when referring to specific ideas or meanings established within a KOS.

We also see that this instance is given a foaf:primaryTopic of Paul Newman.

So, we can deduce so far that this instance is about the concept or idea of Paul Newman. Now, looking to the attributes of this instance — that is the defining properties provided by the NYT — we see the properties of nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage. Completing our deductions, and in the absence of its own vocabulary, we can now define this concept instance somewhat as follows:

New York Times articles in the period 2001 to 2009 having as their primary topic the actor Paul Newman

(BTW, across all records in this dataset, we could see what the earliest first use was to better deduce the time period over which these articles have been assembled, but that has not been done.)

We also would re-title this instance more akin to “2001-2009 NYT Articles with a Primary Topic of Paul Newman” or some such and use URIs more akin to this usage.

sameAs Woes

Thus, in order to make links or connections with other data, it is essential to understand what the nature is of the subject “thing” at hand. There is much confusion about actual “things” and the references to “things” and what is the nature of a “thing” within the literature and on mailing lists.

Our belief and usage in matters of the semantic Web is that all “things” we deal with are a reference to whatever the “true”, actual thing is. The question then becomes:  What is the nature (or scope) of this referent?

There are actually quite easy ways to determine this nature. First, look to one or more instance examples of the “thing” being referred to. In our case above, we have the “Newman, Paul” instance record. Then, look to the properties (or attributes) the publisher of that record has used to describe that thing. Again, in the case above, we have nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage.

Clearly, this instance record — that is, its nature — deals with articles or groups of articles. The relation to Paul Newman occurs as a basis of
the primary topic of these articles, and not a person basis for which to describe the instance. If the nature of the instance was indeed the person Paul Newman, then the attributes of the record would more properly be related to “person” properties such as age, sex, birthdate, death date, marital status, etc.

This confusion by NYT as to the nature of the “things” they are describing then leads to some very serious errors. By confusing the topic (Paul Newman) of a record with the nature of that record (articles about topics), NYT next misuses one of the most powerful semantic Web predicates available, owl:sameAs.

By asserting in the “Newman, Paul” record that the instance has a sameAs relationship with external records in Freebase and DBpedia, the NYT both entails that properties from any of the associated records are shared and infers a chain of other types to describe the record. More precisely, the NYT is asserting that the “thing” referred to by these instances are identical resources.

Thus, by the sameAs statements in the “Newman, Paul” record, the NYT is also asserting that that record is an instance of all these classes:

Furthermore, because of its strong, reciprocal entailments, the owl:sameAs assertion would also now entail that the person Paul Newman has the nyt:first_use and nyt:last_use attributes, clearly illogical for a “person” thing.

This connection is clearly wrong in both directions. Articles are not persons and don’t have marital status; and persons do not have first_uses. By misapplying this sameAs linkage relationship, we have screwed things up in every which way. And the error began with misunderstanding what kinds of “things” our data is about.

Some Options

However, there are solutions. First, the sameAs assertions, at least involving these external resources, should be dropped.

Second, if linkages are still desired, a vocabulary such as UMBEL [4] could be used to make an assertion between such a concept, and these other related resources. So, even though these resources are not the same, they are closely related. The UMBEL ontology helps us to define this kind of relation between related, but non-identical, resources.

Instead of using the owl:sameAs

property, we would suggest the usage of the umbel:linksEntity, which links a skos:Concept to related named entities resources. Additionally, Freebase, which also currently asserts a sameAs relationship to the NYT resource, could use the umbel:isAbout relationship to assert that their resource “is about” a certain concept, which is the one defined by the NYT.

Alternatively, still other external vocabularies that more precisely capture the intent of the NYT publishers could be found, or the NYT editors could define their own properties specifically addressing their unique linkage interests.

Other Minor Issues

As a couple of additional, minor suggestions for the NYT dataset, we would suggest:

  • Create a foaf:Organization description of the NYT organization, then use it with dc:creator and dcterms:rightsHolder rather than using a literal, and
  • The dual URIs such as “” and “” are not wrong in themselves, but the purpose is hard to understand. Why does a single organization need to create multiple resources for the identical resource, when it comes from the same system and has the same purpose?

Re-visiting the Linkage “Rule”

There are very valuable benefits from entailment, inference and logic to be gained from linking resources. However, if the nature of the “things” being linked — or the properties that define these linkages — are incorrect, then very wrong logical implications result. Great care and understanding should be applied to linkage assertions.

In the End, the Challenge is Not Linked Data, but Connected Data

Our critical comments are not meant to be disrespectful and are not being picky. The NYT and TWC are prominent institutions for which we should expect leadership on these issues. Our criticisms (and we believe those of others) are also not an expression of a “trough of disillusionment” as some have been pointing out.

This posting is about poor practices, pure and simple. The time to correct them is now. If asked, we would be pleased to help either institution establish exemplar practices. This is not automatic, and it is not always easy. The datasets, in particular, will require much time and effort to get right. There is much documentation that needs to be transitioned and expressed in semantic Web formats.

In a broader sense, we also seem to lack a definition of best practices related to vocabularies, schema and mappings. The Berners-Lee rules are imprecise and insufficient as is. Prior best guidance documents tend to
be more how to publish and make URIs linkable, than to properly characterize, describe and connect the data.

Perhaps, in part, this is a bit of a semantics issue. The challenge is not the mechanics of linking data, but the meaning and basis for connecting that data. Connections require logic and rationality sufficient to reliably inform inference and rule-based engines. It also needs to pass the sniff test as we “follow our nose” by clicking the links exposed by the data.

It is exciting to see high-quality content such as from national governments and major publishers like the New York Times begin to be exposed as linked data. When this content finally gets embedded into usable contexts, we should see manifest uses and benefits emerge. We hope both institutions take our criticisms in that spirit.

This posting has been jointly authored by Mike Bergman and Fred Giasson and simultaneously published on both of their blogs, hoping to draw more attention to the need for better practices in publishing linked data.

[1] The NYT has been updated with improvements and they fixed multiple issues from the first release. The
problems listed herein, however, still pertain after these improvements.
[2] Tim Berners-Lee, 2006. Linked Data (Design Issues), first posted on 2006-07-27; last updated on
2009-06-18. See Berners-Lee refers to the steps above as “rules,” but he elaborates they are expectations of behavior. Most later citations refer to these as “principles.”
[3] Li Ding, Dominic DiFranzo, Sarah Magidson, Deborah L. McGuinness and Jim Hendler, 2009. Data-GovWiki: Towards Linked Government Data. See
[4] UMBEL (Upper Mapping and Binding Exchange Layer) is a lightweight ontology structure in development for relating Web content and data to a standard set of subject concepts. It purpose has resulted in its creation of an associated vocabulary geared to both class-instance and reciprocal relationships, as well as partial or likelihood relationships. See

Zitgist’s definition of Linked Data

Mike Bergman just published a really good blog post that describes Zitgist’s definition of Linked Data. Zitgist define Linked Data has:

Linked Data is a set of best practices for publishing and deploying instance and class data using the RDF data model, naming the data objects using uniform resource identifiers (URIs), and exposing the data for access via the HTTP protocol, while emphasizing data interconnections, interrelationships and context useful to both humans and machine agents.

Mike explains this definition in 15 steps. One thing he stressed, and that I want to emphasis too is: Linked Data != Linked Open Data. Linked Data is not necessarily “open” in the sense of Open Source software and the freeware movement. Linked Data is about what we defined above. Enterprises can privately exchange data with business partners and clients. Enterprises can even do linked data between divisions of the company. Linked Data can be open, but is not limited to. Linked Data can be freely published on the Web; but Linked Data can also be published over private networks for limited use.

The emergence of UMBEL and Linked Data

Since Mike and I first released UMBEL in 2007, we have not stopped working on it: we have done much research, we defined its concepts and principles, we designed and created it:

the ontology and the instantiation of its subject concepts, abstract concepts, semsets and named ontologies. We intensified our efforts in the last six months so that we nearly worked full time on this project.

We are now starting to release more documentation about the outcome of our work so far. Mike starts to release a really good series of blog posts describing the grounding of this effort. The first blog post that has been published is called A re-Introduction of UMBEL – Part 1 of 4 on foundations of UMBEL. This blog post explains the foundation concepts of UMBEL.

Later this week he will publish three other blog posts that explains what UMBEL adds to Linked Data, how named entities are integrated in this framework and finally how UMBEL relates to its older brother: Cyc and OpenCyc.

So stay tuned on Mike’s blog to read the series of four blog posts that put the basis to future releases and discussions about UMBEL and Linked Data.

Next development of UMBEL

In mean time, we continue our hard work to release the first draft of the UMBEL ontology and a first version of the instantiation of its subject concepts, its abstract concepts, its named entities and their related semsets. Also we will release a first mapping between UMBEL’s subject concepts and related external ontologies classes along with the proper grounding documentation that explains all the things evolved with these instantiations, these linkages and the UMBEL ontology itself.

Data Referencing, Data Mobility and the Semantic Web

I recently started to follow discussions evolving around the Data Portability project. It is an emerging community of people that tries to define the principles and push technologies to encourage the “portability” of data between people and systems. Other such initiative exists, such the Linking Open Data Community (that emerged from the semantic web community more than one year ago), The Open Knowledge Definition, and there are probably many others too. However DP is the one that recently got the biggest media coverage considering “support” and covering from some people and groups.

An interesting thread emerged from the mailing list that was trying to get a better definition of what “Data Portability” means.

Henry Story opened the door of the “linked data” (instead of moving data) and Kingsley nailed the two important distinction points:

  1. Data Referencing
  2. Data Mobility (moving data from distinct locations via Import and Export using agreed data formats)

What the Semantic Web means in this context?

What these two critical points mean in terms of semantic web concepts and technologies?

Defining the context

This discussion will be articulated in one context: the Web. The current discussion will take into consideration that all data is available on the Web. This means the use of Web technologies, protocols, standards and concepts. This could be extended to other networks, with other protocols and technologies, but we will focus the discussion on the Web.

Data Referencing

How data referencing is handled on the semantic web? Well, much information is available about that question on the Linked Data Wikipedia page. Basically it is about referencing data (resources) using URIs (unique resources identifiers), and these URIs should ideally be “dereferencable” on the Web. What “dereferencable on the Web” means? It means that if I have a user account on a certain web service, and that I have one URI that define that account, and that this URI is in fact a URL, so that I can get data (normally a RDF document; in this example it would be a RDF document describing that user account) by looking at this URL on the Web (in this case we say that the URI is dereferencable on the Web).

This means one wonderful thing: if I get a reference (URI) to something, this means that in the best of the cases, I can also get data describing this thing by looking on the Web for its description. So, instead of getting a HTML page describing that thing (this can be the case, but is not limited to) I can get the RDF description of that thing too (via web server content negotiation). This RDF description can be use by any web service, any
software agent, or whatever, to helps me to perform specific tasks using this data (Importing/Exporting my personal data? Merging two agendas in the same calendar? Planning my next trips? And so on).

Now that I have a way to easily reference and access any data on the Web, how that accessible data can become “mobile”?

RDF and Ontologies to makes data “mobile”

RDF is a way to describe things called “resources”. These resources can be anything: people, books, places, events, etc. There exists a mechanism that let anybody describing things according to their properties (predicates). The result of this mechanism is a graph of relationships describing a thing (a resource). This mechanism do not only describes properties of a Thing, but also describe relationship between different things. For example, a person (a resource) can be described by its physical properties, but it can also be described with its relation with other people (other resources). Think about a social graph.

What is this mechanism? RDF.

Ontologies as vocabularies standards

However, RDF can’t be used alone. In order to make this thing effective, one need to use “vocabularies”, called ontologies, to describe a resource and its properties. These ontologies can be seen as a controlled vocabulary defined by a community of experts to describe some domains of things (books, music, people, networks, calendar, etc). It is much more than a controlled vocabulary, but it is easier to understand what it is that way.

FOAF is one of these vocabularies. You can use this ontology to describe a person, and its relation with other people, in RDF. So, you will say: this resource is named Fred; Fred lives near Quebec City; and Fred knows Kingsley. And so on.

By using RDF + Ontologies, the data is easily made Mobile. By using such standards that communities, people and enterprises agree to uses; systems will be able to read, understand and manage data coming from multiple different data sources.

Ontologies are standards ensuring that all the people and systems that understand these ontologies can understand the data that is described, and then accessible. It is where data becomes movable (mobility is not only about accessibility for download, it is also about understanding the transmitted data).
Data description robustness

But you know what is the beauty with RDF? It is that if one of the system doesn’t know one ontology, or do not understand all classes and properties of an ontology used to describe a resource, it will only ignore that data and concentrate its effort to understand the thing being described with the ontologies it knows. It is like if I would speak to you, in the same conversation, in French, English, Italian and Chinese. You would only understand what I say in the languages you know, and you will act considering the things you understood of the conversation. You will only discard the things you don’t understand.


Well, it is hard to put all these things in one single blog post, but I would encourage people that are not familiar with these concepts, terminologies and technologies, and that are interested in the question, to start reading what the semantic web community wrote about these things, what are the standards supported and developed by the W3C, etc. There are so many things that can change the way people use the Web today. It is just a question of time in fact!