Archive for the 'UMBEL' Category

New UMBEL Concept Noun Tagger Web Service & Other Improvements

Last week, we released the UMBEL Concept Plain Tagger web service endpoint. Today we are releasing the UMBEL Concept Noun Tagger.

This noun tagger uses UMBEL reference concepts to tag an input text, and is based on the plain tagger, except as noted below.

The noun tagger matches the plain labels of the reference concepts against the nouns of the input text. With this tagger, no manipulations are performed on the reference concept labels or on the input text unless you enable the stemmer. Also, NO disambiguation is performed by the tagger if multiple concepts are tagged for a given keyword.

Intended Users

This tool is intended for those who want to focus on UMBEL and do not care about more complicated matches. The output of the tagger can be used as-is, but it is intended to be the input to more sophisticated reference concept matching and disambiguation methods. Expect additional tagging methods to follow.

Stemming Option

This web service endpoint does have a stemming option. If the option is specified, then the input text will be stemmed and the matches will be made against an index where all the preferred and alternative labels have been stemmed as well. Then, once the matches occur, the tagger will recompose the text such that unstemmed versions of the input text and the tagged reference concepts are presented to the user.
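The stemming flow described above can be sketched roughly as follows. This is only an illustration, not the tagger's actual implementation: the `toyStem` function is a deliberately naive stand-in for a real stemmer, the concept names and labels are made up, and the sketch skips the noun detection step entirely.

```php
<?php
// Toy stemmer (an assumption for illustration): strips a few English
// suffixes. A real deployment would use a proper stemming algorithm.
function toyStem(string $word): string
{
    $word = strtolower($word);
    foreach (['ing', 'es', 's'] as $suffix) {
        $len = strlen($suffix);
        if (strlen($word) > $len + 2 && substr($word, -$len) === $suffix) {
            return substr($word, 0, -$len);
        }
    }
    return $word;
}

// Build an index: stemmed label => list of [concept, original label].
function buildStemmedIndex(array $conceptLabels): array
{
    $index = [];
    foreach ($conceptLabels as $concept => $labels) {
        foreach ($labels as $label) {
            $index[toyStem($label)][] = ['concept' => $concept, 'label' => $label];
        }
    }
    return $index;
}

// Match tokens on their stems, but report the unstemmed token and label.
function tagTokens(array $tokens, array $index): array
{
    $tags = [];
    foreach ($tokens as $position => $token) {
        foreach ($index[toyStem($token)] ?? [] as $match) {
            $tags[] = ['token' => $token, 'position' => $position] + $match;
        }
    }
    return $tags;
}

// Hypothetical concepts and labels
$index = buildStemmedIndex([
    'Dog'  => ['dog', 'hound'],
    'Bird' => ['bird'],
]);
$tags = tagTokens(['Two', 'birds', 'chased', 'dogs'], $index);
```

Here the plural tokens "birds" and "dogs" only match because both sides were stemmed, yet the returned tags still carry the original token and the original label.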

Depending on the use case, users may prefer to turn the stemming option on or off for this web service endpoint.

The Web Service Endpoint

The web service endpoint is freely available. It can return its resultset in JSON, Clojure code or EDN (Extensible Data Notation).

This endpoint will return a list of matches on the preferred and alternative labels of the UMBEL reference concepts that match the noun tokens of an input text. It will also return the number of matches and the position of the tokens that match the concepts.

The Online Tool

We also provide an online tagging tool that people can use to experience interacting with the web service.

The results are presented in two sections depending on whether the preferred or alternative label(s) were matched. Multiple matches, either by concept or label type, are coded by color. Source words with matches and multiple source occurrences are ranked first; thereafter, all source words are presented alphabetically.

The tagged concepts can be clicked to have access to their full description.


Other UMBEL Website Improvements

We also made some other improvements to the UMBEL website.

Search Autocompletion Mode

First, we created a new autocomplete option on the UMBEL Search web service endpoint. Often people know the concept they want to look at, but they don’t want to go to a search results page to select that concept. What they want is to get concept suggestions instantly based on the letters they are typing in a search box.

Such a feature requires a special kind of search which we call an “autocompletion search”. We added that special mode to the existing UMBEL search web service endpoint. Such a search query takes about 30ms to process. Most of that time is due to network latency, since the actual search function takes about 0.5 millisecond to complete.

To use that new mode, you only have to append /autocomplete to the base search web service endpoint URL.

Search Autocompletion Widget

Now that we have this new autocomplete mode for the Search endpoint, we also leveraged it to add autocompletion behavior on the top navigation search box on the UMBEL website.

Now, when you start typing characters in the top search box, you will get a list of possible reference concept matches based on the preferred labels of the concepts. If you select one of them, you will be redirected to its description page.
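To give an idea of what such a suggestion lookup involves, here is a minimal sketch of prefix matching on preferred labels. It is an assumption about the general technique, not the endpoint's code (which relies on a search index); the `suggestConcepts` function and the sample labels are hypothetical.

```php
<?php
// Return up to $limit concepts whose preferred label starts with the typed
// characters (case-insensitive). $prefLabels maps concept => preferred label.
function suggestConcepts(array $prefLabels, string $typed, int $limit = 5): array
{
    $typed = strtolower(trim($typed));
    if ($typed === '') {
        return [];                       // nothing typed yet, nothing to suggest
    }
    $suggestions = [];
    foreach ($prefLabels as $concept => $label) {
        if (strncmp(strtolower($label), $typed, strlen($typed)) === 0) {
            $suggestions[$concept] = $label;
        }
    }
    asort($suggestions);                 // alphabetical by label, keys preserved
    return array_slice($suggestions, 0, $limit, true);
}

// Hypothetical preferred labels
$prefLabels = ['Mammal' => 'mammal', 'Magazine' => 'magazine', 'Bird' => 'bird'];
$result = suggestConcepts($prefLabels, 'Ma');
```

A linear scan like this is fine for a handful of labels; for tens of thousands of concepts, a dedicated index is what keeps the lookup in the sub-millisecond range.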


Tagged Concepts Within Concept Descriptions

Finally, we improved the quality of the concept description reading experience by linking concepts that were mentioned in the descriptions to their respective concept pages. You will now see hyperlinks in the concept descriptions that link to other concepts.


New UMBEL Web Services

I am happy to announce the immediate availability of a brand new UMBEL website and a new set of eight UMBEL web services.

UMBEL (Upper Mapping and Binding Exchange Layer) is a general reference structure of 28,000 concepts, which provides a scaffolding to link and interoperate other datasets and domain vocabularies. This project is now six years old.

I would recommend that you read Mike’s blog post about this new release if you want more background information about UMBEL and a better understanding of how it can help you integrate, manage, publish and reason over your data.

In this blog post, I will focus on the technical aspects of this new web site and the new set of web service endpoints.

Toward a Better Web Experience

The Web is changing fast. Techniques for developing web sites are constantly and quickly evolving. People use all kinds of devices, with different screen sizes, to consume Web content. Websites are becoming more and more responsive thanks to clever architecture design and simpler user interfaces. This is the kind of website we wanted to create for the new UMBEL website.

Clojure Web Service Endpoints at the Core

The core of the new UMBEL website is the new set of web services. Whenever you perform a search, or look at the description of a reference concept or a super type, your browser makes a series of asynchronous queries to the UMBEL web service endpoints.

The average query time is about 60 milliseconds for any web service query. This means that a web page is fully loaded within 300 to 500 milliseconds, with most of that time spent downloading the web files (the JavaScript, CSS, HTML and image files) rather than querying the web service endpoints. Bearing in mind that the website currently runs on a small server with a single core and 1.8 GB of RAM, these are really good performance figures.

We are initially releasing 8 web service endpoints (with more to follow). They have been created to help developers quickly start using the reference structure without having to download and deploy the entire structure on their own infrastructure. The 8 web services are:

  1. Search concept
  2. Get concept
  3. Get super type
  4. Get narrower concepts
  5. Get broader concepts
  6. Get sub-classes
  7. Get super-classes
  8. Degree

All these web services calculate their results at runtime. For example, if you want to find the degree between two reference concepts, then the degree is calculated at runtime. The same is true for all the web services that do inferencing, like the Get narrower concepts or Get broader concepts web service endpoints.

We obtained these excellent performance figures by using Clojure as the programming language and framework to develop the new web service endpoints, and by defining the UMBEL structure itself as Clojure code.

Each web service endpoint is composed of simple pure functions that perform calculations on the UMBEL graph of 28,000 nodes. None of the functions are more than 30 lines of code (per endpoint), which greatly simplifies their creation, debugging, maintenance and optimization. We then use contributed libraries such as Ring and Compojure to manage the creation of the web service endpoints, and Clucy/Lucene for the search engine.

The web services can easily be scaled horizontally since everything is self-contained in a single WAR file that can be deployed on new servers in a few clicks. The new servers can then join a cluster of UMBEL web service servers.

Another advantage of using this technology stack for creating the UMBEL web service endpoints is that UMBEL is not just a reference structure or a set of web service endpoints. It is also a programming API that could be used in any Clojure or Java application. The UMBEL reference structure, along with all the functions that use it, will be available as a JAR file. That way, UMBEL becomes portable. It could be used as a library in any JVM application without requiring it to send queries to external web services, or to create complex stacks to deploy and use the UMBEL reference structure in different applications.

Bootstrap as the HTML/CSS/JavaScript Framework

The previous UMBEL website used Drupal 6. For those who used it, it was sometimes clunky, slow to respond and heavyweight. The problem is that we did not need a full CMS to develop a simple UMBEL website that is only informational.

We wanted a responsive experience for the UMBEL user. We wanted to have the fastest experience possible and we wanted to have this experience on any kind of device: desktop computers, tablets, mobile phones, etc.

This is why we chose to develop the new UMBEL website using Twitter’s Bootstrap HTML, CSS and JavaScript framework. This is a framework that anybody can use to quickly create simple, beautiful and modern websites. It uses a grid system to create responsive user interfaces on any kind of device (screen size). That way, UMBEL users have the same kind of experience whether they are using a normal desktop screen, a tablet or their mobile phone.

This choice enabled us to create a simple, modern, nice looking and responsive website for UMBEL.

Introduction to the UMBEL Web Services

Now let’s take the time to introduce each of the UMBEL web service endpoints. The first thing to know is that the UMBEL web service endpoints are free to use, have no usage limits, and are not throttled.

Search Concept Web Service

The Search Web service is used to find UMBEL reference concepts that match a search string. This is the primary tool for finding available concepts in the reference structure. It supports the Lucene query syntax and search queries can be constrained on different fields like the preferred label, alternative labels, descriptions and URI.

Get Concept Web Service

The Get Concept Web service is used to get the full description of a UMBEL Reference Concept. By querying this Web service endpoint, you will get the preferred label, all the alternative labels (namely, the items in the semset), the sub/super classes of the concept, the broader/narrower concepts and the description of that concept.

This is the Web service endpoint that should be used to get the direct relationships with any other reference concept.

Reference concept descriptions are available as N-Triples, RDF+XML, structJSON or Clojure code.

Get Super Type Web Service

The Get Super Type Web service is used to get the full description of a UMBEL Super Type. By querying this Web service endpoint, you will get the preferred label, all of the alternative labels, the description, and the disjoint super types of a target super type.

Get Narrower Concept Web Service

The Get Narrower Concept Web service is used to get the list of all the narrower concepts of a given reference concept. This processing is done by inference, which means that if A -> B -> C are narrower concepts, then the narrower concepts of A are both B and C, which is what will be returned by the endpoint.
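The inference described above amounts to following the direct relation transitively. Here is a minimal sketch of that idea, assuming the direct narrower relations are available as a simple adjacency list; the `inferAll` function and the sample data are hypothetical and are not the endpoint's code.

```php
<?php
// Collect every concept reachable from $start through the direct relation.
// $direct maps a concept to the list of its directly related concepts.
function inferAll(array $direct, string $start): array
{
    $found = [];
    $queue = $direct[$start] ?? [];
    while ($queue) {
        $concept = array_shift($queue);
        if (isset($found[$concept]) || $concept === $start) {
            continue;            // already visited, or a cycle back to the start
        }
        $found[$concept] = true;
        foreach ($direct[$concept] ?? [] as $next) {
            $queue[] = $next;
        }
    }
    return array_keys($found);
}

// A -> B -> C, as in the example above
$narrower = ['A' => ['B'], 'B' => ['C'], 'C' => []];
$result = inferAll($narrower, 'A');
```

The same traversal covers the broader concepts, sub classes and super classes endpoints: only the relation passed in as `$direct` changes.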

Get Broader Concept Web Service

The Get Broader Concept Web service is used to get the list of all the broader concepts for a given reference concept. This processing is done by inference, which means that if A -> B -> C are broader concepts, then the broader concepts of C are both A and B, which is thus what will be returned by the endpoint.

The broader reference concepts do not include the super type as their top concept (use the Get Super-Class-Of web service endpoint for that).

Get Sub Classes Web Service

The Get Sub Classes Web service is used to get the list of all the sub classes of a given reference concept. This processing is done by inference, which means that if A -> B -> C are sub classes, then the sub classes of A are both B and C, which is what will be returned by the endpoint.

Get Super Classes Web Service

The Get Super Classes Web service is used to get the list of all the super classes of a given reference concept. This processing is done by inference, which means that if A -> B -> C are super classes, then the super classes of C are both A and B, which is what will be returned by the endpoint.

The super classes do include the super types as their top concept (use the Get Super-Class-Of web service endpoint for that).

Degree Web Service

The Degree Web service is used to get the degree (measure of distance) between two UMBEL reference concepts by following the path of a transitive property.
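One plausible reading of this measure is the number of hops on the shortest path between the two concepts along the chosen property. The sketch below works under that assumption; the `degree` function and the sample hierarchy are hypothetical and may differ from how the endpoint actually computes its result.

```php
<?php
// Breadth-first search: number of hops from $from to $to along the property,
// or null when $to cannot be reached.
// $edges maps a concept to the concepts it points to through the property.
function degree(array $edges, string $from, string $to): ?int
{
    $distance = [$from => 0];
    $queue = [$from];
    while ($queue) {
        $current = array_shift($queue);
        if ($current === $to) {
            return $distance[$current];
        }
        foreach ($edges[$current] ?? [] as $next) {
            if (!isset($distance[$next])) {
                $distance[$next] = $distance[$current] + 1;
                $queue[] = $next;
            }
        }
    }
    return null;
}

// Hypothetical "broader concept" relations
$broader = ['Dog' => ['Mammal'], 'Mammal' => ['Animal'], 'Animal' => []];
$hops = degree($broader, 'Dog', 'Animal');
```

Returning null for unreachable concepts keeps "no path" distinct from a degree of zero, which here means the two concepts are the same.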

Conclusion

This new website and these new web service endpoints still use the UMBEL reference structure version 1.05. However, in the coming month or two, a new version of the reference structure should be released. The structure itself won’t change much, except for the introduction of a few new reference concepts. But new mechanisms (mostly related to attributes) will be introduced. It will also come with a brand new mapping to external data schemas and data sources such as Schema.org, Wikipedia, etc.

On my side, I will start writing more about UMBEL. New web service endpoints will be released over time. The API available to use, manage and leverage the structure will constantly expand.

I will also write about how the UMBEL reference structure can be used, and how it can be leveraged to integrate data sources, to expand search queries, etc.

Schema.org: Forcing the Emergence of a New Web Paradigm

Sometime this week I was reading a blog post that gave some statistics about Schema.org's usage on the Web. It states:

36.6 percent of Google's search results include "at least one snippet with information derived from Schema.org."

 

only about 0.3 percent of domains are using the markup code on their websites.

Some may be surprised to see how such a small number of domains produces so many snippets in Google searches. But that is not what interests me in this blog post. What interests me is this: if 36.6% of Google search results appear to return structured information that uses Schema.org microdata, why are only 0.3 percent of domains using the markup?

Introduction of a new paradigm on the Web

I think that what is happening at the moment is the emergence of a new paradigm on the Web: the publication of structured data. Some may say that this has been happening for a long time [1], and I agree with them. However, what is new is that this structured data is starting to surface to end users. This did not happen until recently (the last year or so).

What the major search engines that participate in Schema.org are doing is pushing (forcing?) this new paradigm to emerge. The thing is that, in my experience, managing structured data to be published on the Web requires a different set of concepts, mindset, terminology, specifications and, more importantly, tools.

It is true that current tools and techniques can be used to publish Schema.org markup in HTML Web pages, but to me, they are sub-optimal for the task at hand. This is probably one of the reasons why the authors of that blog post stated:

Not surprisingly, the study also found that "larger sites" are more likely to use Schema markup. There's no definition given in the study on what makes a site big or small, but this has long been one of the concerns about Schema.org – whether small businesses/websites would have the technical chops to take advantage of the rich snippet opportunity, or if that would be left to bigger companies with more skilled webmasters and more organized online marketing efforts.

I tend to agree with that. However, this shouldn’t be the case. I think the reason is that people don’t tend to use the proper frameworks (CMS, programming API, etc.) and data management systems that are optimized for that task. Another reason is that there is no widespread understanding and adoption of the new underlying concepts, technologies and techniques that are emerging with this new paradigm.

Coping with the evolution of Schema.org

One of the core concepts introduced by this new paradigm is the Open World Assumption. This assumption basically means that we don’t know if something exists or not, or if something is true or not, until it is explicitly stated. In other words, the fact that we (our systems) don’t have some information does not mean that this information doesn’t exist.
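The difference between the two assumptions can be sketched with a tiny lookup. This is only an illustration: the statements, the `closedWorld` and `openWorld` functions, and the use of `null` for "unknown" are all choices made for this example.

```php
<?php
// Statements we happen to hold: [subject, property, value, truth].
// Hypothetical data for illustration.
$statements = [
    ['Avatar', 'director', 'James Cameron', true],
    ['Avatar', 'director', 'Ridley Scott', false],   // explicitly stated as false
];

// Closed world: anything that is not stated is treated as false.
function closedWorld(array $statements, string $s, string $p, string $v): bool
{
    foreach ($statements as [$subject, $property, $value, $truth]) {
        if ([$subject, $property, $value] === [$s, $p, $v]) {
            return $truth;
        }
    }
    return false;
}

// Open world: anything that is not stated is unknown (null), not false.
function openWorld(array $statements, string $s, string $p, string $v): ?bool
{
    foreach ($statements as [$subject, $property, $value, $truth]) {
        if ([$subject, $property, $value] === [$s, $p, $v]) {
            return $truth;
        }
    }
    return null;
}

$closed = closedWorld($statements, 'Avatar', 'genre', 'Science fiction');
$open   = openWorld($statements, 'Avatar', 'genre', 'Science fiction');
```

Both functions agree on what has been stated. They only differ on the missing statement about the genre: the closed world answers "false", while the open world answers "unknown".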

This is really important to understand, and this assumption has a dramatic impact on how we develop the systems that will publish this structured information on the Web. On the Web, there is no one system that has complete control over the information that may exist. Major search engines such as Google have this Open World Assumption at the core of their systems. That is why they are pushing initiatives such as Schema.org or their Knowledge Graph: this is how they can try to cope with the constantly evolving Web.

How does this relate to Schema.org?

Right now, there are 585 types and 807 properties in Schema.org [2], and there are even ways to extend the vocabulary. This means that the vocabulary is constantly evolving, changing, improving and growing. If the vocabularies (ontologies) change that often, the data probably should as well. However, most of the data management systems currently used to publish content on the Web (mostly relational databases) can hardly cope with these kinds of changes in the data and its structure.

This is the reason why I am stating that new concepts, techniques, technologies, methods and tools need to be used in order to cope with these constant changes.

With traditional (relational) systems, every time someone wants to add new microdata to their web pages, they have to analyze their relational data, map it to different Schema.org types and properties, create all the code to perform this linkage, and then generate the enhanced HTML code which includes the Schema.org microdata.

Then once this is done, what happens if the vocabulary changed? If the data to publish changed? Well, all this analysis and work will need to be done again to reflect the changes in the vocabulary and the data.

However, what if a different set of concepts, techniques and tools is used to publish structured content on the Web?

What I am proposing here is a system, a framework, that manipulates entities at its core: things that are described with attributes and values. Then, these entity descriptions are carried around within your code. The logic required to handle the use case I outlined above is embedded into the ontologies, the system, the framework, the API… The only thing developers should need to care about is their code and the functionalities of the system.

In such an information system, all the entities are described using internal and external ontologies. All the concepts of these ontologies (types and properties) need to be linked to the ones of Schema.org (or any other source of information). Every time something changes, the changes should be reflected in, and accounted for by, these ontologies, not the code, the templates, or anything else. It needs to be transparent to the developers.

In the next section, I will show you how this can be done using the Open Semantic Framework (OSF). However, keep in mind that what I am discussing in this blog post is much more general than that, and can be implemented using different tools. I used the Schema.org example, but the same mindset can be applied to a lot of different use cases.

Care about the code, not the data

To make my point, I will demonstrate how publishing Schema.org microdata in a Web portal can be done using a new set of techniques, concepts and tools.

The initial goal is to split the concerns: ontologists should care about the ontologies and their linkage, and developers should care about the code and the functionalities of the system. The best way to make sure that a developer cares about the code, is to abstract this complexity of the Open World Assumption behind a programming API. In this example, we will demonstrate that using the OSF PHP API.

Such an API should use the resources provided by the framework to determine if the properties and types that are used to describe a given entity can be expressed/serialized in Schema.org microdata. All these mechanics should be hidden from the developer, and should be driven by the ontologies.

This is the crux of the matter. We want to manage this complexity where it is much easier to manage: at the level of the ontologies [3]. These Ontologies Driven Applications (in this case, Ontologies Driven Frameworks or Systems) will abstract this complexity away from the developers.

Let’s take this PHP [nearly pseudo] code as an example. It uses the OSF PHP API to retrieve information about an entity from an OSF Web Services instance by querying the CRUD: Read web service endpoint. Then it uses the Subject class to determine if the property(ies) and type(s) of the entity can be serialized in Schema.org microdata. In this example, the Subject class uses function calls that do not exist. The goal is to show how such a basic programming API can abstract all the complexity of an evolving Schema.org vocabulary.

Here is that pseudo PHP code:

<?php

  // Specify the unique identifier (URI) of the entity
  $entityIdentifier = 'http://foo.com/datasets/movies/Avatar';

  // Use the CRUD: Read web service endpoint to get the
  // description of the record from OSF
  $crudRead = new CrudRead();

  $crudRead->uri($entityIdentifier)
           ->send();

  $resultset = $crudRead->getResultset();

  // Get the entity (instance of the class Subject) from the resultset
  $entity = $resultset->getSubject($entityIdentifier);

  // Get the first type of the entity
  $type = current($entity->getTypes());

  // Get the name of the entity
  $name = $entity->getPrefLabel();

  // Get the genre of the entity
  $genre = $entity->getDataAttribute('http://purl.org/ontology/movies#genre');

  // Get the director of the entity
  $director = $entity->getDataAttribute('http://purl.org/ontology/movies#director');
 
  // Then run this template to generate the HTML which will embed,
  // or not, some Schema.org microdata
?>

<div <?php echo $type->serializeMicroformat(); ?>>
  <h1 <?php echo $name->serializeMicroformat(); ?>><?php echo $name->getValue(); ?></h1>
  <span>Director: <span <?php echo $director->serializeMicroformat(); ?>><?php echo $director->getValue(); ?></span> (born August 16, 1954)</span>
  <span <?php echo $genre->serializeMicroformat(); ?>><?php echo $genre->getValue(); ?></span>
</div>

What this template does is generate the HTML code, enhanced with Schema.org microdata. The serializeMicroformat($format) function does the following:

  1. Get the URI reference of the type/property
  2. Query the ontology to check if the type/property is linked to a Schema.org concept
    1. If it is not, then an empty string is returned
    2. If it is, then it serializes the microdata to add to the HTML and returns it
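The steps above can be sketched as a small standalone function. This is a hedged illustration only: `serializeMicrodataFor` and the `SCHEMA_ORG_LINKS` array are hypothetical, the `#Movie` type URI is made up for the example, and in the approach described here the linkage would be looked up in the ontology rather than in a PHP array.

```php
<?php
// Hypothetical linkage between local ontology URIs and Schema.org terms.
// In a real system this information would be queried from the ontology.
const SCHEMA_ORG_LINKS = [
    'http://purl.org/ontology/movies#Movie'    => ['kind' => 'type', 'term' => 'Movie'],
    'http://purl.org/ontology/movies#genre'    => ['kind' => 'property', 'term' => 'genre'],
    'http://purl.org/ontology/movies#director' => ['kind' => 'property', 'term' => 'director'],
];

// Step 1: $uri is the URI reference of the type/property.
function serializeMicrodataFor(string $uri, array $links = SCHEMA_ORG_LINKS): string
{
    // Step 2: check if the type/property is linked to a Schema.org concept
    if (!isset($links[$uri])) {
        return '';                       // Step 2.1: no linkage, no markup
    }
    $term = htmlspecialchars($links[$uri]['term'], ENT_QUOTES);

    // Step 2.2: serialize the attributes to add to the HTML element
    return $links[$uri]['kind'] === 'type'
        ? 'itemscope itemtype="http://schema.org/' . $term . '"'
        : 'itemprop="' . $term . '"';
}

$typeAttributes  = serializeMicrodataFor('http://purl.org/ontology/movies#Movie');
$genreAttributes = serializeMicrodataFor('http://purl.org/ontology/movies#genre');
$unlinked        = serializeMicrodataFor('http://purl.org/ontology/movies#budget');
```

Because an unlinked property yields an empty string, the template still renders valid HTML when the vocabulary has no matching concept; adding a new linkage later changes the output without touching the template.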

It is as simple as that. All the “complexity”, all the work, is done at the level of the reference structure (the ontology). The result would be something like:

<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
  <span itemprop="genre">Science fiction</span>
</div>

Here is another example that does exactly the same, but that produces RDFa Lite markup:

<div vocab="http://schema.org/" <?php echo $type->serializeRDFaLite(); ?>>
  <h1 <?php echo $name->serializeRDFaLite(); ?>><?php echo $name->getValue(); ?></h1>
  <span>Director: <span <?php echo $director->serializeRDFaLite(); ?>><?php echo $director->getValue(); ?></span> (born August 16, 1954)</span>
  <span <?php echo $genre->serializeRDFaLite(); ?>><?php echo $genre->getValue(); ?></span>
</div>

This would produce the following HTML code with RDFa Lite embedded:

<div vocab="http://schema.org/" typeof="Movie">
  <h1 property="name">Avatar</h1>
  <span>Director: <span property="director">James Cameron</span> (born August 16, 1954)</span>
  <span property="genre">Science fiction</span>
</div>

What happens there is that the API uses the ontology (which is linked to Schema.org concepts) to determine if the entity can be rendered in Schema.org microdata. It checks if the type used to describe the entity we retrieved from OSF is linked to a Schema.org concept. If it is, then the API gets that reference to Schema.org, and properly serializes the Schema.org microdata snippet. The only thing developers need to do is to properly use the API functions. Nothing else needs to be determined by them; the system will take care of the rest.

The beauty of this is that you don’t have to worry about any kind of mapping between the vocabulary (ontology) you use for describing your entities, and the Schema.org types and properties. The only thing you have to do is to re-use such a mapping with your ontologies. The PHP API will take care of producing the proper Schema.org microdata, but only if the linkage exists between the content you are publishing and the Schema.org vocabulary. The only thing you have to worry about is to use the API when you create your code to publish your content on the Web.

Is this vision possible? The platform that manipulates entities that way already exists: it is the Open Semantic Framework. Everything you manipulate is such an entity description. Then you have the PHP API available to query the web services to get the descriptions of your entities. The only missing piece is the glue that maps your entities’ types and properties to the Schema.org vocabulary.

The good news is that this glue already exists, and it will greatly improve in the coming months. We are currently working on a completely new version of UMBEL (Upper Mapping and Binding Exchange Layer) that will include, amongst other things, a fully updated mapping between UMBEL and Schema.org. Note that UMBEL and its linkages are meant to be the reference structure used in the code I outlined above.

Conclusion

A new Web paradigm is being pushed, is being forced, by the major search engines. However, the issue that is emerging is that the current systems used by 98% [4] of people are not geared toward that kind of data management and these new development concepts and techniques. If this paradigm shift continues, then it will force developers to adopt completely new platforms, such as the Open Semantic Framework, which rely on new technologies, concepts and specifications. The way people in that field work will change quite significantly.

This blog post focused on the Web, and on companies that publish content on the Web. However, in my experience, many different kinds of organizations (municipalities, governments, Fortune 500 companies, etc.) are now experiencing the influence of the Open World Assumption in what they thought was a Closed World. The data they use to provide their services changes and evolves. New acquisitions and new projects challenge their Closed World Assumption. These changes have dramatic impacts on their infrastructure, their data and their ability to evolve and adapt to a constantly (and quickly) changing world.

The leap is big since the mindset is quite different, and investments will be required in software, data migration and staff training. But in my view, it is essential: things are changing and organizations will need to adapt.

To conclude, how many times a day can I read blog posts, tweets and forums where naysayers state that the Semantic Web never worked, never existed and is doomed to be used by academics only? Multiple… but who cares?

This is the Semantic Web. This is Linked Data. And it is changing the way people work.

  1. I have been working professionally in the Semantic Web field for more than seven years now
  2. According to their RDFa schema
  3. Ontologies Driven Applications is a concept Structured Dynamics introduced a few years ago. Keep in mind that some of the systems referenced in this blog post no longer exist, and have been superseded by the Open Semantic Framework (OSF)
  4. This is a completely random number coming out of my intuition




