New versions of structWSF and conStruct


We just released a new (major) version of both structWSF and conStruct. Some months have passed since we last released this software, but we finally got the time and opportunity to make these important upgrades. Many things have changed in both packages. I don’t want to list every change in this blog post, so I suggest reading the change log files instead.

These new versions were largely shaped by the needs of our clients. We also started to introduce some of the new concepts we have written about over the last few months.

A really good addition to this release is a brand new Installation Manual. Hopefully, people will now be able to “easily” and properly install and set up a Web server to host these two packages.

All documentation files have been updated, and both software packages are available for download.

An Amazon EC2/EBS Architecture

Some of the changes in these new versions have been made to help create, set up and maintain Web servers that host structWSF and conStruct instances.

At Structured Dynamics, we have developed and use a server architecture that leverages Amazon’s cloud computing services such as EC2, EBS and Elastic IP. This architecture gives us the flexibility to easily maintain and upgrade server instances, to create new structWSF instances in one click (without performing all the installation steps every time), etc.

You can contact us for more information about these EC2 AMIs and EBS Volumes that we developed for this purpose. Here is an overview of the architecture that is now in place:

[Figure: structWSF architecture on Amazon EC2 and EBS]

There is a clear separation of concerns between three major things:

  • Software & libraries
  • Configuration files
  • Data files.

We chose to put all the software and libraries needed to create a stand-alone structWSF instance in an EC2 AMI. This means that all the software needed to run a structWSF instance is present on the Virtuoso server running Ubuntu server.

Then we chose to put all configuration and data files on an EBS volume that we attach to, and mount on, the EC2 instance. You can think of an EBS volume as a physical hard drive: it can be mounted on a server instance, but it can’t be shared between multiple instances.

By splitting the software & libraries, configuration and data files, we make sure that we can easily upgrade a structWSF server in production to the latest version of structWSF (its code base and all related software such as Virtuoso, Solr, etc). Since the configuration and data files are not on the EC2 instance, we can easily create a new EC2 instance from the latest structWSF AMI we produced, and then mount the configuration and data files EBS volume on the new (and upgraded) structWSF instance. That way, in a few clicks, we can fully upgrade a server in production without fear of disturbing the configuration or data files.

Additionally, we can easily create backups of configuration and data files at different intervals by using Amazon’s Snapshot technology.
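The upgrade sequence described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it expects an object that behaves like a boto3 EC2 client (boto3 is today's AWS SDK for Python, not the tooling we used at the time), the function name, identifiers and device name are hypothetical, and it leaves out waiting for state changes and mounting the file system. A small recording stand-in is used so the sketch runs without an AWS account.

```python
def upgrade_structwsf_server(ec2, new_ami_id, data_volume_id,
                             instance_type, device="/dev/sdf"):
    """Move the configuration/data volume onto a server built from a new AMI.

    `ec2` is assumed to behave like a boto3 EC2 client. Waiting for the
    volume and instance to change state is deliberately left out.
    """
    # 1. Back up the configuration and data files with a snapshot.
    snapshot = ec2.create_snapshot(VolumeId=data_volume_id,
                                   Description="structWSF pre-upgrade backup")
    # 2. Release the volume from the old server (one instance at a time).
    ec2.detach_volume(VolumeId=data_volume_id)
    # 3. Start a fresh server from the latest structWSF AMI.
    reservation = ec2.run_instances(ImageId=new_ami_id,
                                    InstanceType=instance_type,
                                    MinCount=1, MaxCount=1)
    new_instance_id = reservation["Instances"][0]["InstanceId"]
    # 4. Attach the untouched volume to the new server.
    ec2.attach_volume(VolumeId=data_volume_id,
                      InstanceId=new_instance_id, Device=device)
    return new_instance_id, snapshot["SnapshotId"]


class DryRunClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def create_snapshot(self, **kwargs):
        self.calls.append(("create_snapshot", kwargs))
        return {"SnapshotId": "snap-example"}

    def detach_volume(self, **kwargs):
        self.calls.append(("detach_volume", kwargs))
        return {}

    def run_instances(self, **kwargs):
        self.calls.append(("run_instances", kwargs))
        return {"Instances": [{"InstanceId": "i-example"}]}

    def attach_volume(self, **kwargs):
        self.calls.append(("attach_volume", kwargs))
        return {}


client = DryRunClient()
result = upgrade_structwsf_server(client, "ami-example", "vol-example", "m1.large")
```

The point of the design is visible in the call order: the software comes from the AMI, while the configuration and data only ever move as a whole volume.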

Finally, we chose to put all related software and configuration files needed to run a conStruct instance on another, separate, EBS volume. That way, we have a clean structWSF AMI instance that can be upgraded at any time, and we can plug (mount) a conStruct instance (an EBS volume) into a structWSF server at any time. This means that we can easily have structWSF instances with or without a conStruct instance. The same strategy can easily be used to create plugin packages that can be mounted on, and unmounted from, any structWSF instance at any time, depending on the needs.

All this makes maintaining structWSF server instances easier, simpler and faster.

structWSF Early Querying Metrics

We have been running different structWSF instances for about two months now. Each instance hosts different datasets that are queried for different purposes. I think it is worth taking some time to start analyzing the querying stats of two of these instances, which run the early Alpha version of structWSF.

The goal is to create checkpoints that we will be able to use in the future to check whether the system has improved or deteriorated. It is also to see what kind of metrics we can derive from the current logging system, and whether we can find bottlenecks or other issues with any of the endpoints.

The data used to analyze instance A spans from 2009-06-08 at 7:16:38 to 2009-08-18 at 12:28:37.

The data used to analyze instance B spans from 2009-05-20 at 1:46:31 to 2009-08-18 at 12:40:28.

structWSF Instance A

Instance A has only 1 dataset, with about 1000 instance records in it. As we can see below, the average time of a query to that instance, across all web service endpoints, is about 220 milliseconds.

Number of queries Average time for each query in seconds
27956 0.218252857656909

The table below gives the total number of queries sent to each web service endpoint, with the average time for each web service.

Web Service Number of queries Average time for each query in seconds
dataset_create 265 0.126993534699919
converter/tsv 48 0.128808428843714
dataset_update 17 0.140141641392576
dataset_read 11780 0.144073766884864
auth_registrar_access 883 0.145781793788779
converter/bibtex 49 0.149710825511323
auth_lister 1970 0.159979685066925
search 1397 0.180938945980523
browse 8949 0.199636802392004
crud_read 638 0.241032384406063
dataset_delete 263 0.420157149717388
crud_delete 3 0.637878338496
converter/irv 792 0.661979901670313
sparql 715 1.123084135322358
crud_create 187 1.486844727060763
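As a sanity check, the overall average reported earlier should be the query-weighted mean of the per-endpoint averages in this table. The sketch below recomputes it and also shows which endpoint accounts for the largest share of total processing time. The function names are mine; the numbers are copied from the table above.

```python
# (endpoint, number of queries, average seconds per query) for instance A
INSTANCE_A = [
    ("dataset_create", 265, 0.126993534699919),
    ("converter/tsv", 48, 0.128808428843714),
    ("dataset_update", 17, 0.140141641392576),
    ("dataset_read", 11780, 0.144073766884864),
    ("auth_registrar_access", 883, 0.145781793788779),
    ("converter/bibtex", 49, 0.149710825511323),
    ("auth_lister", 1970, 0.159979685066925),
    ("search", 1397, 0.180938945980523),
    ("browse", 8949, 0.199636802392004),
    ("crud_read", 638, 0.241032384406063),
    ("dataset_delete", 263, 0.420157149717388),
    ("crud_delete", 3, 0.637878338496),
    ("converter/irv", 792, 0.661979901670313),
    ("sparql", 715, 1.123084135322358),
    ("crud_create", 187, 1.486844727060763),
]


def overall_average(rows):
    """Return (total queries, query-weighted average time in seconds)."""
    total_queries = sum(count for _, count, _ in rows)
    total_seconds = sum(count * average for _, count, average in rows)
    return total_queries, total_seconds / total_queries


def share_of_total_time(rows):
    """Return the fraction of total processing time spent in each endpoint."""
    total_seconds = sum(count * average for _, count, average in rows)
    return {name: count * average / total_seconds for name, count, average in rows}


total, average = overall_average(INSTANCE_A)
shares = share_of_total_time(INSTANCE_A)
```

The per-endpoint counts add up to the 27956 queries reported for the instance. Looking at shares of total time rather than averages also shows that a fast but heavily used endpoint (browse) costs more overall than the slow but rarely used ones.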

This table gives the number of queries for each HTTP response status code returned by each endpoint. This kind of metric is useful to debug potential issues.

Web Service Number of queries HTTP Response Status
auth_lister 1968 200
auth_lister 2 400
auth_registrar_access 883 200
browse 8949 200
converter/bibtex 45 200
converter/bibtex 2 400
converter/bibtex 2 406
converter/irv 740 200
converter/irv 51 400
converter/irv 1 406
converter/tsv 43 200
converter/tsv 2 400
converter/tsv 3 406
crud_create 66 200
crud_create 116 400
crud_create 5 500
crud_delete 3 200
crud_read 480 200
crud_read 158 400
dataset_create 265 200
dataset_delete 261 200
dataset_delete 2 500
dataset_read 11767 200
dataset_read 9 400
dataset_read 4 500
dataset_update 17 200
search 1393 200
search 4 400
sparql 693 200
sparql 19 400
sparql 3 406
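One way to read a table like this is to turn it into an error rate per endpoint. The sketch below does that for a few of the rows above; treating every status of 400 or higher as an error is my own simplification, and the function name is hypothetical.

```python
from collections import defaultdict

# (endpoint, number of queries, HTTP status), a subset of the instance A table
STATUS_ROWS = [
    ("crud_create", 66, 200),
    ("crud_create", 116, 400),
    ("crud_create", 5, 500),
    ("crud_read", 480, 200),
    ("crud_read", 158, 400),
    ("dataset_read", 11767, 200),
    ("dataset_read", 9, 400),
    ("dataset_read", 4, 500),
]


def error_rates(rows):
    """Return, per endpoint, the fraction of queries answered with status >= 400."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for service, count, status in rows:
        totals[service] += count
        if status >= 400:
            errors[service] += count
    return {service: errors[service] / totals[service] for service in totals}


rates = error_rates(STATUS_ROWS)
```

On these rows, crud_create stands out: 121 of its 187 queries were rejected or failed, which is the kind of signal worth investigating first.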

structWSF Instance B

Instance B has 25 datasets, with about 2 312 000 instance records in them. As we can see below, the average time of a query to that instance, across all web service endpoints, is about 560 milliseconds.

Why does the average time per query more than double with the size of that instance? That is what we will check.

Number of queries Average time for each query in seconds
37575 0.556303637714566

The table below gives the total number of queries sent to each web service endpoint, with the average time for each web service. What we can notice is that the time it takes to create, delete and update records in the database management systems is related to the size of the dataset. So, what happened, and is there anything we can do?

Most of the queries used for this analysis were sent to structWSF v1.0a1 and v1.0a2. However, something that has a major impact on these results changed in v1.0a3, which was released last week. The big problem with these numbers is Solr’s commit time. In v1.0a1 and v1.0a2, a Solr commit was issued each time something was updated in the index, and with an index of this size a commit could sometimes take minutes. Since v1.0a3, we leave that choice to the system administrator, who can either issue a commit each time something changes in the index or properly set up Solr’s AutoCommit setting. That means that we increased the performance of these CUD (create, update, delete) endpoints by about 95%.

For the SPARQL endpoint, the reason is that it is almost exclusively used to export data from a structWSF instance. This means that a big dump of RDF triples is returned for each query, which explains the average time per query of 2.1 seconds.

Web Service Number of queries Average time for each query in seconds
dataset_create 173 0.09835156953404
auth_registrar_access 1135 0.114255581658327
dataset_update 121 0.119028852005636
dataset_read 12683 0.159165935205064
crud_read 8546 0.23457546435556
converter/bibtex 109 0.405608450600873
auth_lister 2315 0.471687612780759
search 2313 0.533951056245796
browse 9103 0.758227908033767
converter/tsv 8 0.863690733909698
sparql 650 2.115058046487879
converter/irv 166 2.681712512510398
crud_update 13 4.649851157114154
crud_create 75 11.306954870223277
dataset_delete 140 27.511527856750207
crud_delete 25 34.33350466727492

This table gives the number of queries for each HTTP response status code returned by each endpoint.

Web Service Number of queries HTTP Response Status
auth_lister 2275 200
auth_lister 11 400
auth_lister 2 406
auth_lister 27 500
auth_registrar_access 1110 200
auth_registrar_access 25 400
browse 9084 200
browse 18 400
browse 1 406
converter/bibtex 108 200
converter/bibtex 1 400
converter/irv 154 200
converter/irv 12 400
converter/tsv 8 200
crud_create 41 200
crud_create 33 400
crud_create 1 500
crud_delete 24 200
crud_delete 1 400
crud_read 8268 200
crud_read 273 400
crud_read 5 406
crud_update 4 200
crud_update 9 400
dataset_create 171 200
dataset_create 2 400
dataset_delete 79 200
dataset_delete 61 500
dataset_read 12647 200
dataset_read 11 400
dataset_read 25 500
dataset_update 113 200
dataset_update 8 500
search 2286 200
search 24 400
search 3 406
sparql 618 200
sparql 22 400
sparql 6 406
sparql 4 500

Generating the Stats

Here is the list of SQL queries used to create these stat tables. You can run them locally on your structWSF instance to generate the same kind of statistics.

Timespan of the queries

select min(request_datetime) as startdate, max(request_datetime) as enddate from SD.WSF.ws_queries_log;

Get the number of queries sent to the system and the average time per query (in seconds)

select count(request_processing_time) as nb_queries, avg(request_processing_time) as average_query_time from SD.WSF.ws_queries_log;

Get the average query time for each web service of a structWSF instance.

select requested_web_service, count(request_processing_time) as nb_queries, avg(request_processing_time) as average_query_time from SD.WSF.ws_queries_log GROUP BY requested_web_service ORDER BY average_query_time ASC;

Status messages counts per web service endpoint

select requested_web_service, count(request_http_response_status) as nb_queries, request_http_response_status from SD.WSF.ws_queries_log GROUP BY requested_web_service, request_http_response_status ORDER BY requested_web_service, request_http_response_status;
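If you want to see what these queries return without touching a live instance, the per-service query can be tried against an in-memory SQLite table. This is only a sketch: the column names come from the queries above, but the column types, the unqualified table name (SQLite has no SD.WSF prefix) and the sample rows are assumptions for illustration.

```python
import sqlite3

# Made-up sample rows: (web service, processing time in seconds, HTTP status)
SAMPLE_LOG = [
    ("search", 0.25, 200),
    ("search", 0.75, 200),
    ("sparql", 1.0, 200),
    ("browse", 0.125, 400),
]


def per_service_stats(rows):
    """Run the per-service average query on an in-memory copy of the log."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ws_queries_log ("
        "requested_web_service TEXT, "
        "request_processing_time REAL, "
        "request_http_response_status INTEGER)"
    )
    conn.executemany("INSERT INTO ws_queries_log VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT requested_web_service, "
        "COUNT(request_processing_time) AS nb_queries, "
        "AVG(request_processing_time) AS average_query_time "
        "FROM ws_queries_log "
        "GROUP BY requested_web_service "
        "ORDER BY average_query_time ASC"
    ).fetchall()
    conn.close()
    return result


stats = per_service_stats(SAMPLE_LOG)
```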

conStruct: a skin for structWSF

As I said in my previous blog post, a conStruct instance is nothing more than a skin for one or multiple structWSF instances. conStruct is a user of a structWSF network.

But what does that mean?

It means that each conStruct tool communicates with one or multiple structWSF instances. Every feature of conStruct comes from structWSF. The only thing conStruct does is present information to users and give them some tools to manipulate the data.

A structWSF instances network

A structWSF instance is a set of web service endpoints. Each endpoint gets registered in a network. Each query sent to any of the web service endpoints of the network gets authenticated (and possibly rejected) by the network.

All structWSF instances share the same basic web service endpoints. However, some specialized structWSF instances can add new functionality to the framework by developing new endpoints that do special things. Others can un-register services that have nothing to do with the mission of the instance, etc.

Not all structWSF instances are the same, but all of them share the same interface.

Individual people or organizations can choose to create structWSF nodes. The purposes can be quite different. Some organizations could choose to create structWSF nodes for internal purposes only: to help their departments share different kinds of data, for example. Some people could want to set up a structWSF node where they can archive and share all data specific to their hobbies. Whatever the use case is, they want a platform to ingest, manage, interact with and publish data, publicly or privately.

In the schema above, we can see that different structWSF instances have been created and are maintained by different organizations, for different purposes. Some clients will communicate with these structWSF instances as public users of the datasets published on the node(s), while other users will access datasets that only they have access to.

As you can see, some users communicate with multiple structWSF instances. This means that these users care about data from different datasets, maintained by different organizations. Why and what for? We don’t know. It can be for any reason. It can be a web portal that aggregates all the information about a specific domain that is shared amongst multiple nodes, or it can be a user who gets information from a client’s networks to get things done.

What is important to keep in mind with the schema above is that any kind of person, organization or system can leverage the structured data they have access to. That data is hosted by different organizations that make available different datasets and different web service endpoints. (Some organizations might even create a web service endpoint that works with their dataset to expose special algorithms they use to disambiguate or tag entities, etc.)

A network in action

You are probably telling yourself: well, the grand vision is good… but where is the meat around the bone?

Let’s take a look at the conStructSCS sandbox demo. You have two datasets in there: (1) the Sweet Tools and (2) RePEc. There is one thing that you probably don’t notice: the two datasets live on two different structWSF instances (each structWSF instance is hosted on a different web server). This means that if you perform a search, or a browse query, the results you get in the conStruct user interface come from two totally different servers, with different data maintainers, hosted by different organizations, etc. Still, all results are displayed in the same user interface, which is the conStructSCS demo sandbox.

Under the curtain

Let’s take a look at what is happening. First, run this search query for “rdf”. Do you see what appears in the yellow box? This is a list of the queries exchanged between conStruct and two structWSF instances. You want more? Try this other search query for “rdf”. Now you also have access to the body of the messages.

For this demo sandbox, we enabled the “wsf_debug” parameter so that users of the sandbox can see how a conStruct node can interact with structWSF instances. If the value of this URL parameter is “1”, then the header + body of the query is displayed to the users. If the value is “2”, only the header is displayed.

This means that you can append the “&wsf_debug=1” parameter to any URL of the demo sandbox and you will be able to see the messages exchanged between the systems. Why? Because all conStruct tools communicate with one or multiple web service endpoint(s) and one or multiple structWSF instances.

Now, let’s take a look at the output of the search query above.
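If you would rather script this than edit URLs by hand, here is a minimal sketch using Python's standard library. The helper name is hypothetical and the example.org URL is a placeholder, not an actual sandbox address.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_wsf_debug(url, level=1):
    """Return `url` with the wsf_debug query parameter set to `level`."""
    parts = urlsplit(url)
    # Keep the existing parameters, dropping any previous wsf_debug value.
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "wsf_debug"]
    query.append(("wsf_debug", str(level)))
    return urlunsplit(parts._replace(query=urlencode(query)))


debug_url = with_wsf_debug("http://example.org/search?query=rdf")
```

Parsing the query string instead of concatenating text means the helper also works on URLs that have no parameters yet, or that already carry a wsf_debug value.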

  • Web service query: [[url: http://localhost/ws/search/] [method: post] [mime: text/xml] [parameters: ] [execution time: 0.279745101929]] (status: 200) OK – .
  • Web service query: [[url: http://bknetwork.org/ws/search/] [method: post] [mime: text/xml] [parameters: query=rdf&types=all&datasets=http%3A%2F%2Fbknetwork.org%2Fwsf%2Fdatasets%2F283%2F%3Bhttp%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F160%2F&items=10&page=0&inference=on&include_aggregates=true&registered_ip=self%3A%3A0] [execution time: 0.289397001266]] (status: 200) OK – .
  • Web service query: [[url: http://localhost/ws/dataset/read/] [method: get] [mime: text/xml] [parameters: uri=all&registered_ip=self%3A%3A0] [execution time: 0.123399972916]] (status: 200) OK – .
  • Web service query: [[url: /ws/dataset/read/] [method: get] [mime: text/xml] [parameters: uri=all&registered_ip=self%3A%3A0] [execution time: 0.18315911293]] (status: 200) OK – .

Each bullet above is a query sent to a specific structWSF instance. For each query, you have this information:

  • URL of the web service endpoint where the query has been sent.
  • HTTP method used to send the query
  • MIME type (Accept HTTP header parameters) requested
  • Parameters of the query
  • Time it took to execute the query (including network latency & query processing)
  • Status of the query from the web service endpoint
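Because each debug line follows the same bracketed layout, it is easy to turn into a structured record, for example to collect timings. The sketch below assumes the layout shown in the sample output above; the regular expressions and the function name are mine and not part of conStruct.

```python
import re

# One "[name: value]" pair per field, plus the trailing "(status: NNN)".
FIELD = re.compile(r"\[(url|method|mime|parameters|execution time): ?([^\]]*)\]")
STATUS = re.compile(r"\(status: (\d+)\)")


def parse_debug_line(line):
    """Turn one 'Web service query: ...' debug line into a dictionary."""
    record = {name: value.strip() for name, value in FIELD.findall(line)}
    if "execution time" in record:
        record["execution time"] = float(record["execution time"])
    status = STATUS.search(line)
    record["status"] = int(status.group(1)) if status else None
    return record


sample = ("Web service query: [[url: http://localhost/ws/search/] "
          "[method: post] [mime: text/xml] [parameters: ] "
          "[execution time: 0.279745101929]] (status: 200) OK")
record = parse_debug_line(sample)
```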

Since this conStruct instance is linked to two different structWSF instances, the search tool sends a search query to two different search web service endpoints. Additionally, it queries these structWSF instances to get the descriptions of the searched datasets (to display the proper names of the datasets in the user interface).

Each query is validated by the structWSF instances to make sure that they are legitimate queries. If they are, then results are returned. Once these queries are sent and answers received, the structSearch tool can then generate the page and display it to the user.
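The fan-out pattern described above can be sketched as follows. This is not the actual structSearch code: the function names are hypothetical, the HTTP call is injected as a plain callable so the sketch runs offline, and the record values returned by the fake sender are invented placeholders.

```python
def federated_search(endpoints, params, send):
    """Send the same query to every endpoint and merge the accepted results.

    `send(url, params)` must return a (status, records) pair.
    """
    merged = []
    trace = []
    for url in endpoints:
        status, records = send(url, params)
        trace.append((url, status))
        # Only keep results from instances that accepted the query.
        if status == 200:
            merged.extend(records)
    return merged, trace


def fake_send(url, params):
    """Offline stand-in for the HTTP call (illustration only)."""
    if "bknetwork.org" in url:
        return 200, ["record from instance 1"]
    if "constructscs.com" in url:
        return 200, ["record from instance 2"]
    return 400, []


results, trace = federated_search(
    ["http://bknetwork.org/ws/search/", "http://constructscs.com/ws/search/"],
    {"query": "rdf", "items": 10},
    fake_send,
)
```

The trace plays the same role as the debug list shown earlier: one entry per endpoint contacted, with the status it returned.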

Do you want more? Here is a list of queries sent by different conStruct tools to different web services endpoints:

(Note: this debug info tab has been added so that people can see what is happening under the hood. However, this information is only accessible to the registered conStruct instance and the administrator of that instance.)

Do it by yourself, from your desktop computer

I said that people or organizations that created content on these structWSF instances are able to manage and manipulate their data from anywhere, not only from within conStruct. Let’s test this.

I changed the permissions on the Sweet Tools List dataset so that it is publicly available for reading. That way, anyone will be able to send curl queries against the dataset on that structWSF instance.

Now, let’s try a couple of queries to different web service endpoints. Let’s start with a query for the keyword “rdf” on the Sweet Tools dataset:

curl -H "Accept: text/xml" "http://constructscs.com/ws/search/" -d "query=rdf&types=all&datasets=http%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F122%2F&items=10&inference=on"

What you will get for this query is a list of 10 instance records that match this query. You don’t like the internal XML representation of the system? Then try the internal JSON representation by running this query:

Maybe this is not good enough for you? Then let’s try RDF+XML:

curl -H "Accept: application/rdf+xml" "http://constructscs.com/ws/search/" -d "query=rdf&types=all&datasets=http%3A%2F%2Fconstructscs.com%2Fwsf%2Fdatasets%2F122%2F&items=10&inference=on"

I think you understood the point here, so I won’t continue.

Now, let’s send a query to get all the datasets accessible by you:

curl -H "Accept: application/rdf+xml" "http://constructscs.com/ws/auth/lister/" -d "mode=adataset"

If you can query all these things with curl, this means that anything can query these services. Standalone software can be developed to leverage these content nodes, as can other online applications.
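To illustrate that last point, here is a minimal sketch of the same search request built with Python's standard library instead of curl. The function name is hypothetical; the parameters are the ones used in the curl commands above, and joining several dataset URIs with ";" follows the debug output shown earlier. The request is only built, not sent, since the sandbox may not be reachable when you read this.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(endpoint, keyword, dataset_uris,
                         accept="text/xml", items=10):
    """Build (but do not send) a POST request for the search web service."""
    body = urlencode({
        "query": keyword,
        "types": "all",
        "datasets": ";".join(dataset_uris),  # several datasets are ';'-separated
        "items": items,
        "inference": "on",
    })
    # Supplying `data` makes this a POST, like `curl -d`.
    return Request(endpoint, data=body.encode("ascii"),
                   headers={"Accept": accept})


request = build_search_request(
    "http://constructscs.com/ws/search/",
    "rdf",
    ["http://constructscs.com/wsf/datasets/122/"],
)
# urllib.request.urlopen(request) would send it and return the response.
```

Changing the `accept` argument to "application/rdf+xml" gives the equivalent of the second curl command.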

Conclusion

As you probably learned from this blog post, one of the powers of structWSF is that it creates networks of structured content nodes that can be accessed by anything, from anywhere, publicly or privately.

As you noticed, all this is not only about integrating any kind of data, but also about publishing it in a flexible way.

Release of structWSF, conStruct and the Community Web Site

The last few months have been challenging in terms of the amount of work to get done, of focusing on deliverables and of getting ready for the release of the conStruct and structWSF source code, documentation, tutorials, web sites and demos.

I am now really happy to finally announce the release of the source code of both packages, along with a new development community website where users and developers can exchange ideas about these two new projects.

The biggest milestone of the last months is now behind us. However, this is just the beginning of everything!

I think that many things have been written about these two projects already. I don’t want to write any tutorial at this point. So the only thing I will do right now is point you to the most relevant documentation, web sites, blog posts and demos about each project. The next step will be to write about specific use cases, features, etc.

Community Web Site

The community Web site is a place where developers and users of structWSF and conStruct can meet to talk about both projects, to report bugs and issues, to submit new enhancements, to find tips and tricks, etc.

I suggest creating a user profile on the community Web site if you are interested in communicating with other members.

structWSF

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies).

The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import. All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and optionally a document of resultsets. Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.

conStruct

conStruct is a distro of the Drupal framework that aims to set a new standard in data integration and as a structured content system (SCS). With conStruct, you can let your data and its structure drive your applications. You can easily interoperate your diverse internal information with public content on the Web. And you can leverage a platform designed from the ground up for knowledge management and collaboration.

structWSF and conStruct websites unveiled

I am proud to announce the release of the websites of two of our upcoming products: structWSF and conStruct. Both products will be available in open source under the Apache 2 license. Mike just unveiled and demoed the two projects in his talk at SemTech 2009.

As we describe them on Structured Dynamics’ website:

structWSF

structWSF is a platform-independent Web services framework for accessing and exposing structured RDF data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies).

The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The initial structWSF framework comes packaged with a baseline set of about a dozen Web services in CRUD, browse, search and export and import.

All Web services are exposed via APIs and SPARQL endpoints. Each request to an individual Web service returns an HTTP status and optionally a document of resultsets. Each results document can be serialized in many ways, and may be expressed as either RDF or pure XML.

In its initial release, structWSF has direct interfaces to the Virtuoso RDF triple store (via ODBC, and later HTTP) and the Solr faceted, full-text search engine (via HTTP). However, structWSF has been designed to be fully platform-independent. Support for additional datastores and engines is planned. The design also allows other specialized systems to be included, such as analysis or advanced inference engines.

The framework is open source (Apache 2 license) and designed for extensibility. structWSF and its extensions and enhancements are distributed and documented on the OpenStructs Web site.

conStruct

conStruct SCS is a structured content system that extends the basic Drupal content management framework. conStruct enables structured data and its controlling vocabularies (ontologies) to drive applications and user interfaces.

Users and groups can flexibly access and manage any or all datasets exposed by the system depending on roles and permissions. Report and presentation templates are easily defined, styled or modified based on the underlying datasets and structure. Collaboration networks can readily be established across multiple installations and non-Drupal endpoints. Powerful linked data integration can be included to embrace data anywhere on the Web.

Depending on roles and permissions, a given user may or may not see specific datasets or tools within the Drupal interface. Search and browse results are similarly sequestered depending on access rights.

conStruct provides Drupal-level CRUD (create – read – update – delete), data display templating, faceted browsing, full-text search, and import and export over structured data stores based on RDF. It also provides a system for adding and extending tools that work with this structured data. conStruct SCS is built on the platform-independent structWSF Web services framework.

Like Drupal and structWSF, conStruct is free and open source (GPL license). Versions of conStruct SCS are planned that adapt it to other content management systems (CMS).

Next

The alpha version of the code with all the proper documentation will be released later this summer. Everybody will be able to contribute to the project by enhancing/developing the core code or by extending it with new modules and web services. Stay tuned!