Benchmark of PHP’s main String Search Functions

I am currently upgrading the structWSF ontologies related web service endpoints along with the structOntology conStruct module to make them more performing so that we can load ontologies that have thousands of classes and properties (at least up to 30 000 of them).

While testing these new upgrades with them UMBEL ontology, I noticed that much of the time was spent by a few number of stripos() calls located in the loadXML() function of the ProcessorXML.php internal structXML parser. They were used to extract the prefixes in the header of the structXML files, and then to resolve them into the XML file. I was using stripos() instead of strpos() to make the parsing of these structXML files case-insensitive even if XML is case-sensitive itself. However, due to their processing cost, I did change this behaviors by using the strpos() function instead. Here are the main reasons to this change:

  • XML is itself case-sensitive, so don’t try to be too clever
  • These structXML files that are exchanged are mostly internal to structXML
  • Their parsing performances is critical

The Tests

This is a non-scientific post about some experimentation I made related to the various PHP 5.3 string search functions. These tests have been performed on a small Amazon EC2 instance using DBG and PHPeD.

[cc lang=’php’ line_numbers=’true’]
[raw]

[/raw]
[/cc]

The first test uses a text of 138 words. That text get exploded into an array where each value is a word of that text. Then, before each iteration, we randomly select a word that we will search, within the text, using each of the 4 search functions.

Note that in the result images below, each of the line in the left-most column are the ones of the PHP code above.

That first test starts with 10 000 iterations. Here are the results of the first run:


The second test uses the same 138 words, but the test is performed 100 000 times:

As we can see, strpos() and strstr() are clearly faster than their case-insensitive counterparts.

Now, let’s see what is the impact of the size of the text to search. We will now perform the two tests with 10 000 and 100 000 iterations but with a text that has 497 words.

[cc lang=’php’ line_numbers=’true’]
[raw]

[/raw]
[/cc]

That third test starts with 10 000 iterations. Here are the results of the third run:

The fourth test uses the same 497 words, but the test is performed 100 000 times:

As we can see, even if we add more words, the same kind of performances are experienced.

Conclusion

After many runs (I only demonstrated a few here). I think I can affirm that strpos() and strstr() are way faster than their case-insensitive counterparts. However, strpos() seems a little bit faster than strstr(), but it seems to depends of the context, and which random words are being searched for. In any cases, according to PHP’s documentation, we should always use strpos() instead of strstr() because it supposedly use less memory.

There may also be some unknown memory considerations that may affect the code I used to test these functions. In any case, I can affirm that in a real context, where queries are sent to the Ontology: Read web service endpoint that hosts the UMBEL ontology, that strpos() is a way faster than stripos().

UMBEL Blooms with New Colors

We are happy to announce the new, intermediary, UMBEL version 0.80. This is a major upgrade of the UMBEL ontology: both its vocabulary and its reference structure have been greatly enhanced, an upper structure called the SuperTypes has been added and everything got updated to OWL 2. You can read more about the overall changes on Mike’s blog post.

In this blog post I will focus on two topics: using some existing tools and frameworks to view and manage the reference concepts structure, and how one can use and leverage the coherency of the reference structure.

Navigating and Updating the Reference Structure

One thing that was lacking with the previous version of UMBEL was to have access to a user interface tool that would let you navigate and update the reference structure as you want. Because of the way the conceptual structure was created, it was hard for tools such as Protégé to load it because of all the individuals that were created (such as the SemSet individuals, etc.).

As stated in Mike’s blog post, we made significant changes to the UMBEL vocabulary, and how we instantiate the reference structure. Along with the OWL 2 upgrade, we made sure that the Protégé version 4.1 and the latest version of the OWLAPI could easily load both the UMBEL vocabulary and the reference structure.

Reasoning

One of the major additions to UMBEL v080 is the SuperTypes upper structure, an organizational layer above the UMBEL reference structure. We created these SuperTypes because we found that we could effectively cluster most UMBEL reference concepts into a small set of mostly distinct upper concepts (33 in fact, 29 of which are designed as disjoint).

This new SuperTypes structure helps us mine external sources of information by leveraging related concepts in the reference structure. Moreover, SuperTypes also help us perform easier, simpler, better and faster reasoning over the entire 21 K reference concepts structure.

Thus, SuperTypes provide a new tool to help determine if the UMBEL reference structure is consistent and coherent within itself. This is important, of course, to ensure that linkages between UMBEL and external ontologies is consistent and coherent as well.

So far, the entire reference concepts structure has been tested for its coherency according to the restrictions we defined at the level of the SuperTypes upper structure. Using different reasoners such as Pellet, Fact++ and Hermit (available by default with Protégé 4.1), we made sure that all the statements made between all the RefConcept classes and individuals, and all the statements made between these and the SuperTypes upper structure, are consistent within themselves. This method enabled us to find and fix some early assignment issues.

This new upper structure, along with its now consistent reference structure, helps provide confidence that statements based on UMBEL reference concepts are also consistent. And, all of this is made more testable by virtue of being able to use the OWL API and Protégé with its embedded reasoners.

How is Coherency Tested?

This is the core question. In fact, the more informative answer to this question will be part of a forthcoming blog post. But let’s start here.

The current way to check if the structure is coherent is by making sure that we don’t have an individual that belongs to two different SuperTypes that are stated to be disjoint. What we did with the SuperType upper structure is really simple: we categorized each and every RefConcept (using rdfs:subClassOf) under a SuperType. Most of the SuperTypes are disjoint: this means that if an individual is of rdf:type for two SuperTypes that are stated to be disjoint, then you will end-up with an incoherent structure because you are making a statement that is not permitted by the reference structure.

So, the way to check if your statements are coherent according to this structure, is to make your statements (right now, in terms of individual instantiation), and then to check using a reasoner such as Pellet. There is now a general testing structure to see if any ontology is coherent with respect to the UMBEL reference structure.

In the next blog post in this series, I will tell you how to use exactly the same method for coherency testing, but now for testing if linkages between external ontologies and the UMBEL reference structure are consistent. In that case, you will make the class-to-class assertions you want, and then you will instantiate individuals of these classes, then run the reasoner. Then, the reasoner will tell you if your ontology is still consistent according to the structure and the new statements you created.

Next Step

In parallel with these tutorials, we are also working hard on the next version of UMBEL. As outlined in the Next Changes section of the new UMBEL website, the next step is to release UMBEL v1.0, with a set of new features, before Christmas.

A New Home for UMBEL Web Services

umbel_wsEight months ago we announced the dissolution of Zitgist LLC. This event led to the creation of a sandbox to keep alive all the online assets of the company. Since this sandbox server was not owned by Structured Dynamics, it was becoming hard for us to update UMBEL and its online services. It is why we took the time to move the services back on to our new servers.

A New Home

sd_logo_260Structured Dynamics LLC now hosts a new version for the UMBEL Web services. From the main menu at the SD Web site you can access these services under the “umbel ws” menu option (you can also bookmark the Web services site at umbel.structureddynamics.com or ws.umbel.org.)

This move of UMBEL’s Web services to a new home will make the future upgrade of UMBEL easier, and this will make the maintenance of the Web services endpoints easier as well. With this move, I am pleased to announce the release of five initial Web services and one visualization tool:

Lookup Web Services:

Inference Engine Web Services:

SPARQL endpoint Web Service:

Visual Tool:

Note that the visual tool is using Moritz Stefaner’s Relation Browser.


Ping the Semantic Web

ptswlogo160.gifAdditionally, the Ping the Semantic Web RDF pinging service is now the property of OpenLink Software Inc. OpenLink is now hosting, maintaining and developing the service.

New release of UMBEL: v072

umbel_medium.pngI am pleased to announce that we resumed our work with UMBEL. We just released the version v0.72, which is based on the OpenCyc version 2009-01-31. This new version is intermediary and has been created mostly to check the evolution of OpenCyc vis-à-vis UMBEL. Within the next month or so, we will release a new version (v.080), which will introduce a major new concept that should help systems and users manipulating the entire UMBEL Subject Concepts structure.

For them who want to know what changed between versions v071 and v072, here is CVS file that list all the changes between the versions. There are four columns: (1) source node, (2) attribute, (3) target node and (4) version number. This file list all triples that are present in a version, but not in the other. So, you have all changes (nodes & arcs) between the two versions. Mostly all the changes come from internal changes to OpenCyc. We did fix a couple of things such as removing cycles in the graph, etc. But 99% of the changes come from changes within OpenCyc.

Finally note that the web services endpoints will be updated with this new version of UMBEL subject concepts in the coming week along with the dereferencing of their URIs. Stay tuned!

UMBEL Web Services Endpoints Released

After some delay, we are pleased to finally release the UMBEL Web services endpoints to the public. We have re-organized the Web services we introduced three months ago to add coherency and flexibility to the model.

The goal remains the same, but with a different flavor: these tools let ontologists and Web developers search, discover and use the UMBEL subject concept and named entity structures. The added flavor is that these Web services now fully embrace the HTTP 1.1 protocol and are provided via a series of well established data and serialization formats.

We now have RESTful Web services to add to our RESTful linked data. Pretty cool combination!

We are introducing two kinds of Web services: (1) atomic Web services and (2) compound Web services. An atomic Web service only performs one action: It takes some inputs and then outputs a resultset of the action. A compound Web service takes multiple atomic Web services, plugs them together in a pipeline model, and then takes some inputs and outputs a resultset arising from the compound action.

The communication between each of these Web service instances and the external World is the same: communication is governed by the HTTP 1.1 protocol. HTTP is generally RESTful and used to establish the communication, to determine mime type and serialization, to get inputs, to return status of the communication and possible errors, and to send back the resultset of the computation of the Web service.

That way, we can easily, within hours, programmatically pipeline these atomic Web services together to create new Web services. We can integrate external Web services endpoints into the same pipeline without modifying anything to the architecture. Status, errors and resultsets are propagated along the line, directly to the data consumer. This is the flexibility part of the story.

Now, how cool is that?

Overview of the UMBEL Web Services Endpoints

We are today releasing a couple of these atomic and compound Web service endpoints to the public, but others will follow in the coming weeks and months. Four families of Web services have been released that total seven Web service endpoints:

If you don’t know what UMBEL is, I would suggest you read a background information page that talks about the project.

The most important reading related to this blog post is the API philosophy documentation page that talks about the details of the design of this Web services architecture.

For Web developers that want to integrate these Web services endpoints within their application, an API documentation page explains how to communicate with these endpoints for each of the services.

Example of an Atomic Web Service

The Inference: Lister Web service is a good example of an atomic Web service. It takes a subject concept URI as the input and outputs a series of super-class-of, sub-class-of or equivalent-class-of classes for that concept. As an atomic service it does one thing and one thing only: Inferring relationships of a given subject concept URI.

Example of a Compound Web Service

The Reporter: Named Entity Web service is a good example of a compound Web service. This Web service displays full of information about a UMBEL named entity URI. However, not all the information returned by this Web service is directly computed by it. In fact, the information about broader and equivalent classes and subject concepts come from the Inference: Lister Web service. Results coming from this Web service are immediately integrated in the Reporter’s resultset. This is easily done considering that they share the same communication language (HTTP 1.1) and the same data and serialization formats (XML, RDF+XML and RDF+N3). This flexibility is priceless to quickly create resourceful compound Web services.

Conclusion

After some months to get the design right, we have finally released some of the UMBEL Web services to the public. These Web services can easily be integrated in current software architectures to leverage UMBEL’s vision of the World. The architecture underlying what we have released today will help to easily integrate UMBEL’s principles and concepts within new and existing projects. This will ultimately help people to quickly react to the changing World of needs and expectations of data users and consumers.

I hope you will enjoy using these new Web services, which Zitgist is freely hosting. The data you get from the Web service is open data and can be used freely with attribution.

Please do report any issues you may encounter. We also welcome any advice or suggestions that you would care to provide to enhance the overall system.