August 2009 – Frederick Giasson

New release of UMBEL: v072

August 21, 2009 Frederick Giasson

I am pleased to announce that we resumed our work with UMBEL. We just released the version v0.72, which is based on the OpenCyc version 2009-01-31. This new version is intermediary and has been created mostly to check the evolution of OpenCyc vis-à-vis UMBEL. Within the next month or so, we will release a new version (v.080), which will introduce a major new concept that should help systems and users manipulating the entire UMBEL Subject Concepts structure.

For them who want to know what changed between versions v071 and v072, here is CVS file that list all the changes between the versions. There are four columns: (1) source node, (2) attribute, (3) target node and (4) version number. This file list all triples that are present in a version, but not in the other. So, you have all changes (nodes & arcs) between the two versions. Mostly all the changes come from internal changes to OpenCyc. We did fix a couple of things such as removing cycles in the graph, etc. But 99% of the changes come from changes within OpenCyc.

Finally note that the web services endpoints will be updated with this new version of UMBEL subject concepts in the coming week along with the dereferencing of their URIs. Stay tuned!

structWSF Early Querying Metrics

August 18, 2009 Frederick Giasson

We have been running different structWSF instances for about two months now. Each instance is hosting different dataset(s) that are queried for different purposes. I think that it worth taking some time starting to analyze the querying stats of two of these instances of the early Alpha version of structWSF.

The goal is to create some kind of checkpoints that we will be able to use in the future to check how the system improved or deteriorated. It is also to check what kind of metrics we could derive from the current logging system, and to check if we could find any bottle neck or issues with any of the endpoints.

The data used to analyze the instance A span from the 2009-06-08 at 7:16:38 to the 2009-08-18 at 12:28:37.

The data used to analyze the instance B span from the 2009-05-20 at 1:46:31to the 2009-08-18 at 12:40:28.

structWSF Instance A

The instance A only has 1 dataset with about 1000 instance records in it. As we can notice bellow, the average time of a query to that instance for all web service endpoints is about 210 milliseconds.

Number of queries	Average time for each query in seconds
27956	0.218252857656909

The table bellow give us the total number of queries sent to each web service endpoint with an average time for each web service.


Web Service	Number of queries	Average time for each query in seconds
dataset_create	265	0.126993534699919
converter/tsv	48	0.128808428843714
dataset_update	17	0.140141641392576
dataset_read	11780	0.144073766884864
auth_registrar_access	883	0.145781793788779
converter/bibtex	49	0.149710825511323
auth_lister	1970	0.159979685066925
search	1397	0.180938945980523
browse	8949	0.199636802392004
crud_read	638	0.241032384406063
dataset_delete	263	0.420157149717388
crud_delete	3	0.637878338496
converter/irv	792	0.661979901670313
sparql	715	1.123084135322358
crud_create	187	1.486844727060763

This table gives the number of queries for each returned HTTP response status code by the endpoint. This kind of metrics is useful to debug potential issues


Web Service	Number of queries	HTTP Response Status
auth_lister	1968	200
auth_lister	2	400
auth_registrar_access	883	200
browse	8949	200
converter/bibtex	45	200
converter/bibtex	2	400
converter/bibtex	2	406
converter/irv	740	200
converter/irv	51	400
converter/irv	1	406
converter/tsv	43	200
converter/tsv	2	400
converter/tsv	3	406
crud_create	66	200
crud_create	116	400
crud_create	5	500
crud_delete	3	200
crud_read	480	200
crud_read	158	400
dataset_create	265	200
dataset_delete	261	200
dataset_delete	2	500
dataset_read	11767	200
dataset_read	9	400
dataset_read	4	500
dataset_update	17	200
search	1393	200
search	4	400
sparql	693	200
sparql	19	400
sparql	3	406

structWSF Instance B

The instance B has 25 datasets with about 2 312 000 instance records in it. As we can notice bellow, the average time of a query to that instance for all web service endpoints is about 550 milliseconds.

Why the average query time per query double with the size of that instance? It is what we will check.


Number of queries	Average time for each query in seconds
37575	0.556303637714566

The table bellow give us the total number of queries sent to each web service endpoint with an average time for each web service. What we can notice is that the time it takes to create, delete and update records in the database management systems is related to the size of the dataset. So, what happened and is there anything we can do?

Most of the queries used for this analysis come from queries sent to structWSF v.1.0a1 and v1.0a2. However, something that has a major impact on these results changed in v1.0a3 that has been released last week. The big problem with these numbers is Solr’s commit time. In version v1.0a1 and v1.0a2, a Solr commit was issued each time something was updated in the index. Commit could take up to minutes sometimes with the size of its index. Since v1.0a3, we give that choice to the system administrator: he can issue commit each time something change in the index, or setup Solr’s AutoCommit setting properly. That means that we increased the performance of these CUD endpoints by about 95%.

For the SPARQL endpoint, the reason is that it is mostly exclusively used to export data from a structWSF instance. This means that big dump of RDF triples are incurred for each query, which justify the average time per query of 2.1 seconds.


Web Service	Number of queries	Average time for each query in seconds
dataset_create	173	0.09835156953404
auth_registrar_access	1135	0.114255581658327
dataset_update	121	0.119028852005636
dataset_read	12683	0.159165935205064
crud_read	8546	0.23457546435556
converter/bibtex	109	0.405608450600873
auth_lister	2315	0.471687612780759
search	2313	0.533951056245796
browse	9103	0.758227908033767
converter/tsv	8	0.863690733909698
sparql	650	2.115058046487879
converter/irv	166	2.681712512510398
crud_update	13	4.649851157114154
crud_create	75	11.306954870223277
dataset_delete	140	27.511527856750207
crud_delete	25	34.33350466727492

This table gives the number of queries for each returned HTTP response status code by the endpoint.


Web Service	Number of queries	HTTP Response Status
auth_lister	2275	200
auth_lister	11	400
auth_lister	2	406
auth_lister	27	500
auth_registrar_access	1110	200
auth_registrar_access	25	400
browse	9084	200
browse	18	400
browse	1	406
converter/bibtex	108	200
converter/bibtex	1	400
converter/irv	154	200
converter/irv	12	400
converter/tsv	8	200
crud_create	41	200
crud_create	33	400
crud_create	1	500
crud_delete	24	200
crud_delete	1	400
crud_read	8268	200
crud_read	273	400
crud_read	5	406
crud_update	4	200
crud_update	9	400
dataset_create	171	200
dataset_create	2	400
dataset_delete	79	200
dataset_delete	61	500
dataset_read	12647	200
dataset_read	11	400
dataset_read	25	500
dataset_update	113	200
dataset_update	8	500
search	2286	200
search	24	400
search	3	406
sparql	618	200
sparql	22	400
sparql	6	406
sparql	4	500

Generating the Stats

Here is the list of SQL query used to create these stat tables. You can run them locally on your structWSF instance to generate the same kind of statistics.

Timespan of the queries

select min(request_datetime) as startdate, max(request_datetime) as enddate from SD.WSF.ws_queries_log;

Get the average number of milliseconds per query sent to the syste

select count(request_processing_time) as nb_queries, avg(request_processing_time) as average_query_time from SD.WSF.ws_queries_log order by ID desc;

Get the average query time for each web service of a structWSF instance.

select requested_web_service, count(request_processing_time) as nb_queries, avg(request_processing_time) as average_query_time from SD.WSF.ws_queries_log GROUP BY requested_web_service ORDER BY average_query_time ASC;

Status messages counts per web service endpoint

select requested_web_service, count(request_http_response_status) as nb_queries, request_http_response_status from SD.WSF.ws_queries_log GROUP BY requested_web_service, request_http_response_status ORDER BY requested_web_service, request_http_response_status;

conStruct: a skin for structWSF

August 12, 2009 Frederick Giasson

As I said in my previous blog post, a conStruct instance is nothing more than a skin for one or multiple structWSF instances. conStruct is a user of a structWSF network.

But… what that means?

That means that each conStruct tools communicate with one or multiple structWSF instances. Each each feature of conStruct comes from structWSF. The only thing it does is presenting information to users, and give them some tool to manipulate the data.

A structWSF instances network

A structWSF instance is a set of web service endpoints. Each endpoint gets registered in a network. Each query sent to any of the web service endpoint of the network gets authenticated (and possibly rejected) by the network.

All structWSF instances share the same basic web services endpoints, however some specialized structWSF instance can add new functionality to the framework by developing new endpoints that does special things. Others can un-register services that has nothing to do with the mission of the instance, etc.

Not all structWSF instances are the same, but all of them share the same interface.

Individual people or organizations can choose to create structWSF nodes. The purposes can be quite different. Some organizations could choose to create structWSF nodes for internal purposes only: to help their departments to share different kind of data for example. Some people could want to setup a structWSF node where they can archive and share all data specific to their hobbies. Whatever the use-case is: they want a platform to ingest, manage, interact with and publish data; publicly or privately.

In the schema above, we can notice that different structWSF instances have been created and are maintained by different organizations, for different purposes. Some of the clients will communicate with these structWSF instances as a public user of the datasets published on the node(s), and other users will access to datasets that only them have access to.

As you can see, some users communicate with multiple structWSF instances. This means that these user cares about data of different datasets, maintained by different organizations. Why and what for? We don’t know. It can be for any reasons. It can be as a web portal that aggregates all the information about a specific domain that is shared amongst multiple nodes or it can be because the user get information from his client’s networks to get things done.

What is important to keep in mind with the schema above is that any kind of people, any kind of organizations and any kind of systems can leverage the structured data they have access to that is hosted by different organizations that make available different datasets and different web services endpoints (maybe some organizations can even create a web service endpoint that works with their dataset and to expose some special algorithms they use to disambiguate/tag entities, etc.)

A network in action

You are probably telling yourself: well, the grand vision is good… but where is the meat around the bone?

Lets take a look at the conStructSCS sandbox demo. You have two datasets in there: (1) the Sweet Tools and (2) RePEc. There is one thing that you probably don’t notice: both datasets live on two different structWSF instances (each structWSF instance is hosted on a different web server). This means that if you perform a search, or a browse query, all results you get in the conStruct user interface come from two totally different servers, with different data maintainers, hosted by different organizations, etc. Still, all results are displayed in the same user interface, which is the conStructSCS demo sandbox.

Under the curtain

Lets take a look at what is happening. First, run this search query for “rdf”. You see what appears in the yellow box? This is a list of the queries exchanged between conStruct and two structWSF instances. You want more? Try this other search query for “rdf”. Now you also have access to the body of the messages.

For this demo sandbox, we enabled the “wsf_debug” parameter so that users of the sandbox can see how a conStruct node can interact with structWSF instances. If the value of this URL parameter is “1”, then the header + body of the query is displayed to the users. If the value is “2”, only the header is displayed.

This means that you can happen the “&wsf_debug=1” parameter to any URL of the demo sandbox and you will be able to see the messages exchanged between the systems. Why? Because all conStruct tools communicate with one or multiple web service endpoint(s) and one or multiple structWSF instances.

Now, lets take a look at the output of the search query above.

Web service query: [[url: http://localhost/ws/search/] [method: post] [mime: text/xml] [parameters: ] [execution time: 0.279745101929]] (status: 200) OK – .
Web service query: [[url: http://bknetwork.org/ws/search/] [method: post] [mime: text/xml] [parameters: query=rdf&types=all&datasets=http%3A%2F%2Fbknetwork.org%2Fwsf%2Fdatasets%2F283%2F%3Bhttp%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F160%2F&items=10&page=0&inference=on&include_aggregates=true&registered_ip=self%3A%3A0] [execution time: 0.289397001266]] (status: 200) OK – .
Web service query: [[url: http://localhost/ws/dataset/read/] [method: get] [mime: text/xml] [parameters: uri=all&registered_ip=self%3A%3A0] [execution time: 0.123399972916]] (status: 200) OK – .
Web service query: [[url: /ws/dataset/read/] [method: get] [mime: text/xml] [parameters: uri=all&registered_ip=self%3A%3A0] [execution time: 0.18315911293]] (status: 200) OK – .

Each dot is a query sent to a specific structWSF instance. For each query, you have this information:

URL of the web service endpoint where the query has been sent.
HTTP method used to send the query
MIME type (Accept HTTP header parameters) requested
Parameters of the query
Time it took to execute the query (including network latency & query processing)
Status of the query from the web service endpoint

Since this conStruct instance is linked to two different structWSF instances, the search tool will send a search query to two different search web service endpoints. Additionally, it will query these structWSF instances to get the description of the searched dataset (to display the proper name of the datasets in the user interface).

Each query is validated by the structWSF instances to make sure that they are legitimate queries. If they are, then results are returned. Once these queries are sent and answers received, the structSearch tool can then generate the page and display it to the user.

Do you want more? Here is a list of queries sent by different conStruct tools to different web services endpoints:

(Note: this debug info tabs has been added so that people can see what is happening under the hood. However this information is only accessible to the registered conStruct instance and the administrator of that instance).

Do it by yourself, from your desktop computer

I said that people or organizations that managed to create content data on these structWSF instances were able to manage/manipulate their data from anywhere: not only from within conStruct. Lets test this.

I changed the permissions on the Sweet Tools List dataset so that it is publicly available for reading. That way, any anyone will be able to send Curl queries against the dataset, to that structWSF instance.

Now, lets try a couple of queries to different web services endpoints. Let start with a query for the keyword “rdf” on the Sweet Tools dataset:

curl -H “Accept: text/xml” “http://constructscs.com/ws/search/” -d “query=rdf&types=all&datasets=http%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F122%2F&items=10&inference=on”

What you will get for this query is a list of 10 instance records that match this query. You don’t like the internal XML representation of the system? Then try the internal JSON representation by running this query:

Maybe this is not good enough for you? Then lets try in RDF+XML:

curl -H “Accept: application/rdf+xml” “http://constructscs.com/ws/search/” -d “query=rdf&types=all&datasets=http%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F122%2F&items=10&inference=on”

I think you understood the point here, so I won’t continue.

Now, lets send a query to get all the datasets accessible by you:

curl -H “Accept: application/rdf+xml” “http://constructscs.com/ws/auth/lister/” -d “mode=adataset”

If you can query all these things with Curl, this mean that anything can query these services. Standalone softwares can be developed to leverage these content nodes as well as other online applications.

Conclusion

As you probably learned with this blog post, one of the powers of structWSF is that it creates networks of structured content nodes that can be accessed by any thing, from anywhere, publicly or privately.

As you noticed, all this stuff is not only about integrating any kind of data, but also to publish it in a flexible way.

Re-Introduction

August 10, 2009 Frederick Giasson

I haven’t been active on this blog for more than half a year now. I was telling myself that I was too busy coding to write anything meaningful to my readers. I did write a couple of things, but nothing of importance related to all the things I was working on. I did publish announcements and such, but didn’t really take the time to write about these things. A lot of things have been done and published recently, but little has been said. So, lets try to rectify the shot so that I share more about what I am currently working on, the concepts I am playing with, the systems I am releasing, etc. So, lets restart to write about these things that I really do believe in, and that I put all my time, efforts and energy in. Lets restart writing about things that I do believe in and that are valuable to me.

As you probably know, my company Structured Dynamics released a series of products: structWSF and conStruct. I spent the last six months developing these two products. However, what are they? Why did I spend all my time working on these products? Why does they matter? Why do I think that they are valuable?

Let me outline what they are, what they do and what they are useful at. Then think if they could be of any value to you, your organizations, your enterprises, etc.

StructWSF

StructWSF is a web services framework (WSF) that basically does four things: it ingest, manage, interact with and publish data. What kind data? Any kind of data

Ingesting: the aim is to be able to ingest data from any data source (so data formatted using any language, or described using any vocabularies/schemas techniques). The framework has to be able to ingest any data that come from any data sources with a single conversion step.

Managing: the aim is to be able to manage the data. Managing the data means being able to collectively (with permissions and authentication) manage datasets available in a framework instance. Being about the create, modify, delete or update data. It also means being able to browse and search the data. It means making it publicly available, or to restrict its access to a user or group of users. This means merging datasets together too.

Interacting: but there is another facet to data management. We don’t only want to be able to manage data in a locked system. What we want is to be able to manage its data from anywhere. It can be from my browse, from my website, from some other applications on my desktop, from my home, from my office: from anywhere. All functions of a structWSF instance are accessible as web services endpoints. This means that you can perform any action, on your data, from anywhere you want: from a conStruct node or from a local Curl query. This is I think how people / organizations want to be able to manage the data they create and curate data.

Publishing: like ingesting, we want to be able to publish, to communicate the data we create to other people, other organizations or other entities. We want to do this in such a way that these external entities doesn’t have to recreate/reinvent themselves. We want to be able to communicate data the way they understand it: using any format and any vocabulary/schema.

The mindset behind structWSF is the following: we can ingest any kind of data, we can manage that data in multiple ways, we can interact with that data from anywhere and we can publish-back this data in any ways. structWSF is friction less in the sense of data communication between systems, users and entities.

conStruct

conStruct is just a skin over one, or multiple, structWSF instances. The conStruct software is an example of how a system can interact with a structWSF data provider. conStruct is a suite of generic tools that can be used to search, browse, visualize (template), import, export, create, delete and update data. All these tools interact with one or multiple structWSF functions by using their web service endpoints.

Since conStruct can interact with a single structWSF instance, it can also interact with multiple structWSF instances. That means that conStruct can be a user interface that communicates with multiple data providers (structWSF instances) and display all the results, from all these providers, in a canonical user interface.

But as I said, conStruct is one skin over structWSF instances. We could think about the integration of structWSF into other CMS systems. We could even think about having different CMS systems integrating with the same structWSF instance(s) so that if one user update/create/delete some data, it appears in other CMS systems as well.

The Magic Twist

However, all this is done with a twist: everything is structured. This means that everything that is in the system has a structure: is described using some vocabularies (full blow ontologies; or naive vocabularies). This enable all kind of valuable functionalities: inferencing capabilities in search and browse activities, filtering on types and attributes, helps integrating different datasets from different systems and organizations.

This is the magic twist that make this system different: everything in there is structured in such a way that everything can be ingested and published in any format; in such a way that basic inferencing or more complex reasoning is possible. It integrates data and let users use it the way they want from where they are. The capabilities are there; use it if you need them.

Next steps

The next steps for me will be to describe the features of the system: how the data is managed, how permissions work, what is the granularity of permissions available, etc. These will be more technical blog posts, but they will give you the full potential of the systems and concepts I have been talking in this blog post.

Frederick Giasson

Machine Learning, Engineering & Data

Month: August 2009