Archive for the 'UMBEL' Category

Schema.org: Forcing the Emergence of a New Web Paradigm

schema-org1Sometime this week I was reading a blog post that was giving some statistics related to Schema.org‘s usage on the Web. It states:

36.6 percent of Google's search results include "at least one snippet with information derived from Schema.org."

 

only about 0.3 percent of domains are using the markup code on their websites.

Someone may be surprise to see how that little number of domains produces that much snippet uses in Google searches. But this is not what interest me in this blog post. What I am interested in is that considering that 36.6% of the Google search appears to be returning structured information that uses Schema.org microdata, why is there only 0.3 percent of the domains that are using the markup?

Introduction of a new paradigm on the Web

I think that what is happening at the moment is the emergence of a new paradigm on the Web: publication of structured data. Some may say that this is happening for a long time1 and I agree with them. However, what is happening is that this structured data starts to emerge to the end users. This is not something that happened until recently (the last year or so).

What the major search engines, which participate to Schema.org, are doing is to push (to force?) this new paradigm to emerge. The thing is that to my experience the management of structured data to be published on the Web needs a different set of concepts, minding, terminology, specifications and more importantly tools.

It is true that current tools and techniques can be used to publish Schema.org markup in HTML Web pages, but to me, they are sub-optimal for the task at hands. This is probably one of the reasons why the authors of this blog posts stated:

Not surprisingly, the study also found that "larger sites" are more likely to use Schema markup. There's no definition given in the study on what makes a site big or small, but this has long been one of the concerns about Schema.org – whether small businesses/websites would have the technical chops to take advantage of the rich snippet opportunity, or if that would be left to bigger companies with more skilled webmasters and more organized online marketing efforts.

I tend to agree with that. However, this shouldn’t be the case. I think that the reason for that is that people doesn’t tend to use the proper frameworks (CMS, programming API, etc.) and data management systems that are optimized for that task. Another reason is that there is no widespread understanding and adoption of the new underlying concepts, technologies and techniques that are emerging with this new paradigm.

Coping with the evolution of Schema.org

One of the core concept introduced by this new paradigm is the Open World Assumption. This assumption basically means that we don’t know if something exists or not, if something is true or not, until it is explicitly stated. This means that it is not because we (our systems) doesn’t have some information, that this information doesn’t exists.

This is really important to understand, and this assumption has a dramatic impact on how we develop the systems that will publish this structured information on the Web. On the Web, there is no one system that has complete control over the information that may exists. Major search engines such as Google have this Open World Assumption at the core of their system. It is why they are pushing initiatives such as Schema.org, of their Knowledge Graph. Because this is how they can try to cope with the constantly evolving Web.

How does this relate to Schema.org?

Right now, there is 585 types and 807 properties in schema.org2. and there are even ways to extend the vocabulary.What that means is that this vocabulary is constantly evolving, changing, improving and increasing. If the vocabularies (ontologies) changes that often, it means that the data may should as well. However, the way most of the data management systems are currently used to publish content on the Web (mostly relational databases) can hardly cope with these kind of changes in the data, and its structure.

This is the reason why I am stating that new concepts, techniques, technologies, methods and tools needs to be used in order to be able to cope with these constant changes.

With traditional (relational) systems, every time someone would want to add new micro-data in their webpage, they would have to do an analysis of their relational data, and then to map it to different Schema.org types and properties, and then to create all the code to perform this linkage, and generate the enhanced HTML code which includes the Schema.org micro data.

Then once this is done, what happens if the vocabulary changed? If the data to publish changed? Well, all this analysis and work will need to be done again to reflect the changes in the vocabulary and the data.

However, what if a different set of concepts, techniques and tools are used to publish structured content on the Web?

What I am proposing here is a system, a framework, that manipulate entities as its core: things that are described with attributes and values. Then, these entities descriptions are carried around within your code. The logic required to handle the use case I outlined above is embedded into the ontologies, the system, the framework, the API… The only thing a developer should need to do is to care about its code and the functionalities of the system.

In such an information system, all the entities are described using internal and external ontologies. All these ontologies concepts (types and properties) need to be linked to the ones of Schema.org (or any other sources of information). Every time something change, the changes should be reflected, accounted for, into these ontologies, not into the code, the templates, or whatever. It need be transparent to the developers.

In the next section, I will show you how this can be done using the Open Semantic Framework (OSF). However keep in mind that what I am discussing in this blog post is much more general than that, and can be implemented using different tools. I used the Schema.org example, but the same minding can be applied to lot of different use cases.

Care about the code, not the data

To make my point, I will demonstrate how publishing Schema.org microdata in a Web portal can be done using a new set of techniques, concepts and tools.

The initial goal is to split the concerns: ontologists should care about the ontologies and their linkage, and developers should care about the code and the functionalities of the system. The best way to make sure that a developer cares about the code, is to abstract this complexity of the Open World Assumption behind a programming API. In this example, we will demonstrate that using the OSF PHP API.

Such a API should use the resources provided by the framework to determine if the properties and types that are used to describe a given entity can be expressed/serialized in Schema.org microdata. All this mechanic should be hidden the the developer, and should be driven by the ontologies.

This is the crux of the matter. We want to manage this complexity where it is much easier to manage: at the level of the ontologies 3. These Ontologies Driven Applications (in this case, the Ontologies Driven Frameworks or Systems) will abstract this complexity to the developers.

Let’s take this PHP [nearly pseudo] code as an example. It uses the OSF PHP API to retrieve information about an entity from a OSF Web Services instance by querying the CRUD: Read web service endpoint. Then it uses the Subject class to determine if the property(ies) and type(s) of the entity can be serialized in Schema.org microdata. In this example, the Subject class is using non-existing function calls. The goal is to show how such a basic programming API can abstract all the complexity of an evolving Schema.org vocabulary.

Let’s take that pseudo PHP code:

<?php

  // Specify the unique identifier (URI) of the entity
  $entityIdentifier = 'http://foo.com/datasets/movies/Avatar';

  // Use the CRUD: Read web service endpoint to get the
  // description of the record from OSF
  $crudRead = CrudRead();

  $crudRead->uri($entityIdentifier)
           ->send();

  $resultset = $crudRead->getResultset();

  // Get the entity (instance of the class Subject) from the resultset
  $entity = $resultset->getSubject($entityIdentifier);

  // Get the first type of the entity
  $type = current($entity->getTypes());

  // Get the name of the entity
  $name = $entity->getPrefLabel();

  // Get the genre of the entity
  $genre = $entity->getDataAttribute('http://purl.org/ontology/movies#genre');

  // Get the director of the entity
  $director = $entity->getDataAttribute('http://purl.org/ontology/movies#director');
 
  // Then run this template to generate the HTML which will embed,
  // or not, some Schema.org microdata
?>

<div <? print $type->serializeMicroformat(); ?>>
  <h1 <? print $name->serializeMicroformat(); ?>><? $name->getValue() ?></h1>
  <span>Director: <? print $director->serializeMicroformat(); ?>><? $director->getValue() ?></span> (born August 16, 1954)</span>
  <span <? print $genre->serializeMicroformat(); ?>><? $genre->getValue() ?></span>
</div>

What this template does, is to generate the HTML code, enhanced with Schema.org microdata. The serializeMicroformat($format) function does:

  1. Get the URI reference of the type/property
  2. Query the ontology to check if the type/propertyis linked to a Schema.org concept
    1. If it is not, then an empty string is returned
    2. If it is, then it serializes the micro data to add to the HTML and return it

It is as simple as that. All the “complexity”, all the work, is done at the level of the reference structure (the ontology). The result would be something like:

<div itemscope itemtype ="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
  <span itemprop="genre">Science fiction</span>
</div>

Here is another example that does exactly the same, but that produces RDFa Lite markup:

<div vocab="http://schema.org/" <? print $type->serializeRDFaLite(); ?>>
  <h1 <? print $name->serializeRDFaLite(); ?>><? $name->getValue() ?></h1>
  <span>Director: <? print $director->serializeRDFaLite(); ?>><? $director->getValue() ?></span> (born August 16, 1954)</span>
  <span <? print $genre->serializeRDFaLite(); ?>><? $genre->getValue() ?></span>
</div>

This would produces that HTML code with RDFa Lite embedded:

<div itemscope typeof="Movie">
  <h1 property="name">Avatar</h1>
  <span>Director: <span property="director">James Cameron</span> (born August 16, 1954)</span>
  <span property="genre">Science fiction</span>
</div>

What happens there is that the API uses the Ontology (which is linked to Schema.org concepts) to determine if the entity can be rendered in Schema.org microdata. What it does is to check if the type used to describe the entity we retrieved from OSF is linked to a Schema.org concept. If it is, then the API get that reference to Schema.org, and properly serialize the Schema.org microdata snippet. The only thing the developer need to do, is to properly use the API functions. Nothing else need to be determined by him, the system will take care of the rest.

The beauty of this is that you don’t have to worry about any kind of mapping between the vocabulary (ontology) you use for describing your entities, and the Schema.org types and properties. The only thing you have to do is to re-use such a mapping with your ontologies. The PHP API will take care to produce the proper Schema.org microdata, only if the linkage exists between the content you are publishing and the schema.org vocabulary. The only thing you have to worry about is to use the API when you create your code to publish your content on the Web.

Is this vision possible? The platform that manipulates entities that way is already existing: it is the Open Semantic Framework. Everything you manipulate are such entities descriptions. Then you have the PHP API available to query the web services to get the descriptions of your entities. The only missing piece is the glue that map your entities’ types and properties to the schema.org vocabulary.

The good news is that this glue already exists, but will greatly improve in the coming months. We are currently working on a completely new version of UMBEL (Upper Mapping and Binding Exchange Layer) that will include, amongst other things, a fully updated Schema.org mapping between UMBEL and Schema.org. Note that UMBEL and its linkages is meant to be that reference structure to be used in the code I outlined above.

Conclusion

A new Web paradigm is being pushed, is being forced, by the major search engines. However, the issue that is emerging is that the current systems that are used by 98%4 of the people are not geared toward that kind of data management and these new development concepts and techniques. If this paradigm shift continues, then it will force developers to adopt completely new platforms, which rely on new technologies, concepts and specifications such as the Open Semantic Framework. The way people of that field are working will change quite significantly.

This blog post focused on the Web, and companies that are publishing content on the Web. However, to my experience with multiple different kind of organizations (municipalities, governments, Fortune 500 companies, etc.) are now experiencing the influences of the Open World Assumption in what they thought to be a Closed World. The data they are using to give different kind of services, changes, evolves. New acquisitions and new projects challenge their Close World Assumption. These changes have dramatic impacts on their infrastructure, their data and their ability to evolve and adapt to the constantly (fast) changing World .

The leap is big since the minding is quite different, investments will be required in terms of software and data migration and training of the staff. But in my view, it is essential, things are changing and organizations will need to adapt.

To conclude, how many time in a day can I read blog posts, tweets and forums where naysayers state that the Semantic Web never worked, never existed and it is doomed to be used by academicians only? Multiple… but who cares?

This is the Semantic Web. This is Linked Data. And it is changing the way people works.

  1. I am professionally working in the Semantic Web field for more than seven years now
  2. according to their RDFa schema
  3. Ontologies Driven Applications is a concept Structured Dynamics introduced a few years ago. Keep in mind that some of the systems referenced in this blog post are not existing anymore, and have been superseded by the Open Semantic Framework (OSF)
  4. this is a completely random number coming out of my intuition

Volkswagen’s Use of structWSF in their Semantic Web Platform

TribalDDB London, Volkswagen UK‘s partner, mentioned earlier this week that Volkswagen are using some parts of the Open Semantic Framework to develop the next generation of their online platform.

This story has been published by Jennifer Zaino’s in her article: Volkswagen: Das Auto Company is Das Semantic Web Company!

I can now talk about this project that uses some pieces of the framework that we have been developing for more than 3 years now.

The Objective

Volkswagen’s main objective behind the development of the next version of their Web platform started by improving their online search engine, but as William Greenly mentioned, it quickly became a strategic decision:

"So the objectives were about site search and improving it, but in the long-run it was always the idea to contextualize content, to facet content, to promote it in different contexts."

The objective is to create a platform that gives them the flexibility to leverage all the data assets they own. This flexibility will help them to leverage the data assests they have to improve not only their search engine, but also to contextualize it in different parts of their websites, partner’s websites or to promote, and publish that same information on different communication channels or devices.

The Flexibility


What is a flexible platform in that context? A flexible platform is one that can integrate any kind of information sources. Such information sources in the context of Volkswagen can be a series of relational dataset schemas spread around the World, Excel spreadsheets, CSV files, old plain text technical documents about past model of cars, semi-structured documents such as webpages, etc.

A flexible platform is also one that minimally impact (if at all) the data consumers if the data structure changes in the system. This is really important since the World we live in constantly changes. This means that things constantly change and we have to reflect these changes in the data we own and maintain. This is why this point is so important, because we want to minimize the impact of the data structure changes that will happen all the time.

Having the flexibility to constantly adapt your data, while minimally impacting the data consumers of the system, enables you to make quick decision to adapt your strategy in a highly competitive World. This flexibility gives you a clear business advantage.

A flexible platform is also one that let you publish your data the way you want, in the format that is needed. Such a flexible platform has to give you access to an interface that give you access to all the functionalities of the platform without having to care about what happens under the hood.

A flexible system is one that can communicate your information on any kind of communication channels, and to any devices that have access to the Web.

Under the Hood

That next generation platform that Volkswagen is currently developing is partly based on a few of the main pieces of the Open Semantic Framework. These pieces help them to reach their goal by helping them giving the flexibility their platform needs.

The first step they gone thru was to create their Volkswagen Vehicles Ontology that is used to describe all the entities they want to index into their platform. The Web Ontology Language (OWL), along with the Resource Description Framework (RDF) is what gives them the complete flexibility on how they can integrate all the pieces of information they want, in a canonical format.

Then they choose to use structWSF (the structured data web services framework). This piece gives them the flexibility to get a series of web interfaces (web service endpoints) to create, update, manage and query their data. This web service layer enables them to do anything they want with their data, from anywhere on the Web. This is possible because all the functionalities of the framework are exposed as web service endpoints. StructWSF also gives them the possibility to communicate their data in multiple different formats. This makes it the perfect flexible system to feed their information in different contexts, in different communication channels or on different devices.

At Volkswagen, structWSF is used to populate, and keep in sync, their Solr and Triple Store instances. It gives them the time to care about the more important aspects of their platform, and to care about how the data should be synced between the various specialized data management systems.

By using structWSF to manage their data, they are able to reach some objectives to make their platform as flexible as possible:

  • To be able to minimize the impact of data changes to the data consumers
    • Because structWSF uses OWL & RDF to describe all the data it index
  • To be able to manipulate their data from anywhere
    • Because all the functionalities of structWSF are exposed as web service endpoints
  • To be able to communicate the information in different contexts, communication channels and devices
    • Because structWSF has, in its core, is designed to transform all the data it indexes in any other kind of format

The Next Step

One of their longer term goal and objective is to analyze their unstructured and semi-structured textual documents to extract some structure out of them, and to index them into their semantic platform. To do this, they are looking at using Scones, which is the structWSF semantic tagger web service endpoint. Scones will use some subject reference structures such as UMBEL to semantically tag the textual document. Once the document as been processed by Scones, and indexed in structWSF, it can now be re-published in different contexts based on the reference concepts that have been tagged to it. This gives them the flexibility to leverage non-structured sources of data and to re-purpose it in different ways by publishing it in different context and in different systems.

This second system will enable them to leverage the investment they made in the past, by writing all these textual documents, and to re-purpose, and re-contextualizing, them in all kind of different contexts.

Conclusion

I think that TribalDDB and Volkswagen make the good decision for their future. Taking the business decision to develop and maintain a completely new kind of information system is not an easy decision to take. I am not saying that they made the good choice to use our pieces of the stack. The decision goes far beyond this. Such a Semantic Platform challenges everything in an organization: the people that takes the decisions, the people that create and manage the data, the people that develop the system, the people that maintain that system, the consumers of the system, the customers, the partners, etc. This is a big decision; whatever the technology stack you plan to use. I congratulate them for the decision they took.

I strongly believe that this was the right decision to take considering the future opportunities they are creating to themselves.

 

 

UMBEL Blooms with New Colors

We are happy to announce the new, intermediary, UMBEL version 0.80. This is a major upgrade of the UMBEL ontology: both its vocabulary and its reference structure have been greatly enhanced, an upper structure called the SuperTypes has been added and everything got updated to OWL 2. You can read more about the overall changes on Mike’s blog post.

In this blog post I will focus on two topics: using some existing tools and frameworks to view and manage the reference concepts structure, and how one can use and leverage the coherency of the reference structure.

Navigating and Updating the Reference Structure

One thing that was lacking with the previous version of UMBEL was to have access to a user interface tool that would let you navigate and update the reference structure as you want. Because of the way the conceptual structure was created, it was hard for tools such as Protégé to load it because of all the individuals that were created (such as the SemSet individuals, etc.).

As stated in Mike’s blog post, we made significant changes to the UMBEL vocabulary, and how we instantiate the reference structure. Along with the OWL 2 upgrade, we made sure that the Protégé version 4.1 and the latest version of the OWLAPI could easily load both the UMBEL vocabulary and the reference structure.

Reasoning

One of the major additions to UMBEL v080 is the SuperTypes upper structure, an organizational layer above the UMBEL reference structure. We created these SuperTypes because we found that we could effectively cluster most UMBEL reference concepts into a small set of mostly distinct upper concepts (33 in fact, 29 of which are designed as disjoint).

This new SuperTypes structure helps us mine external sources of information by leveraging related concepts in the reference structure. Moreover, SuperTypes also help us perform easier, simpler, better and faster reasoning over the entire 21 K reference concepts structure.

Thus, SuperTypes provide a new tool to help determine if the UMBEL reference structure is consistent and coherent within itself. This is important, of course, to ensure that linkages between UMBEL and external ontologies is consistent and coherent as well.

So far, the entire reference concepts structure has been tested for its coherency according to the restrictions we defined at the level of the SuperTypes upper structure. Using different reasoners such as Pellet, Fact++ and Hermit (available by default with Protégé 4.1), we made sure that all the statements made between all the RefConcept classes and individuals, and all the statements made between these and the SuperTypes upper structure, are consistent within themselves. This method enabled us to find and fix some early assignment issues.

This new upper structure, along with its now consistent reference structure, helps provide confidence that statements based on UMBEL reference concepts are also consistent. And, all of this is made more testable by virtue of being able to use the OWL API and Protégé with its embedded reasoners.


How is Coherency Tested?

This is the core question. In fact, the more informative answer to this question will be part of a forthcoming blog post. But let’s start here.

The current way to check if the structure is coherent is by making sure that we don’t have an individual that belongs to two different SuperTypes that are stated to be disjoint. What we did with the SuperType upper structure is really simple: we categorized each and every RefConcept (using rdfs:subClassOf) under a SuperType. Most of the SuperTypes are disjoint: this means that if an individual is of rdf:type for two SuperTypes that are stated to be disjoint, then you will end-up with an incoherent structure because you are making a statement that is not permitted by the reference structure.

So, the way to check if your statements are coherent according to this structure, is to make your statements (right now, in terms of individual instantiation), and then to check using a reasoner such as Pellet. There is now a general testing structure to see if any ontology is coherent with respect to the UMBEL reference structure.

In the next blog post in this series, I will tell you how to use exactly the same method for coherency testing, but now for testing if linkages between external ontologies and the UMBEL reference structure are consistent. In that case, you will make the class-to-class assertions you want, and then you will instantiate individuals of these classes, then run the reasoner. Then, the reasoner will tell you if your ontology is still consistent according to the structure and the new statements you created.

Next Step

In parallel with these tutorials, we are also working hard on the next version of UMBEL. As outlined in the Next Changes section of the new UMBEL website, the next step is to release UMBEL v1.0, with a set of new features, before Christmas.

A New Home for UMBEL Web Services

umbel_wsEight months ago we announced the dissolution of Zitgist LLC. This event led to the creation of a sandbox to keep alive all the online assets of the company. Since this sandbox server was not owned by Structured Dynamics, it was becoming hard for us to update UMBEL and its online services. It is why we took the time to move the services back on to our new servers.

A New Home

sd_logo_260Structured Dynamics LLC now hosts a new version for the UMBEL Web services. From the main menu at the SD Web site you can access these services under the “umbel ws” menu option (you can also bookmark the Web services site at umbel.structureddynamics.com or ws.umbel.org.)

This move of UMBEL’s Web services to a new home will make the future upgrade of UMBEL easier, and this will make the maintenance of the Web services endpoints easier as well. With this move, I am pleased to announce the release of five initial Web services and one visualization tool:

Lookup Web Services:

Inference Engine Web Services:

SPARQL endpoint Web Service:

Visual Tool:

Note that the visual tool is using Moritz Stefaner’s Relation Browser.


Ping the Semantic Web

ptswlogo160.gifAdditionally, the Ping the Semantic Web RDF pinging service is now the property of OpenLink Software Inc. OpenLink is now hosting, maintaining and developing the service.

New release of UMBEL: v072

umbel_medium.pngI am pleased to announce that we resumed our work with UMBEL. We just released the version v0.72, which is based on the OpenCyc version 2009-01-31. This new version is intermediary and has been created mostly to check the evolution of OpenCyc vis-à-vis UMBEL. Within the next month or so, we will release a new version (v.080), which will introduce a major new concept that should help systems and users manipulating the entire UMBEL Subject Concepts structure.

For them who want to know what changed between versions v071 and v072, here is CVS file that list all the changes between the versions. There are four columns: (1) source node, (2) attribute, (3) target node and (4) version number. This file list all triples that are present in a version, but not in the other. So, you have all changes (nodes & arcs) between the two versions. Mostly all the changes come from internal changes to OpenCyc. We did fix a couple of things such as removing cycles in the graph, etc. But 99% of the changes come from changes within OpenCyc.

Finally note that the web services endpoints will be updated with this new version of UMBEL subject concepts in the coming week along with the dereferencing of their URIs. Stay tuned!




This blog is a regularly updated collection of my thoughts, tips, tricks and ideas about my semantic Web researches and related software development.


RSS Twitter LinkedIN


Follow

Get every new post on this blog delivered to your Inbox.

Join 62 other followers:

Or subscribe to the RSS feed by clicking on the counter:




RSS Twitter LinkedIN