Benchmark of PHP’s main String Search Functions

I am currently upgrading the structWSF ontologies related web service endpoints along with the structOntology conStruct module to make them more performing so that we can load ontologies that have thousands of classes and properties (at least up to 30 000 of them).

While testing these new upgrades with them UMBEL ontology, I noticed that much of the time was spent by a few number of stripos() calls located in the loadXML() function of the ProcessorXML.php internal structXML parser. They were used to extract the prefixes in the header of the structXML files, and then to resolve them into the XML file. I was using stripos() instead of strpos() to make the parsing of these structXML files case-insensitive even if XML is case-sensitive itself. However, due to their processing cost, I did change this behaviors by using the strpos() function instead. Here are the main reasons to this change:

  • XML is itself case-sensitive, so don’t try to be too clever
  • These structXML files that are exchanged are mostly internal to structXML
  • Their parsing performances is critical

The Tests

This is a non-scientific post about some experimentation I made related to the various PHP 5.3 string search functions. These tests have been performed on a small Amazon EC2 instance using DBG and PHPeD.

[cc lang=’php’ line_numbers=’true’]
[raw]

[/raw]
[/cc]

The first test uses a text of 138 words. That text get exploded into an array where each value is a word of that text. Then, before each iteration, we randomly select a word that we will search, within the text, using each of the 4 search functions.

Note that in the result images below, each of the line in the left-most column are the ones of the PHP code above.

That first test starts with 10 000 iterations. Here are the results of the first run:


The second test uses the same 138 words, but the test is performed 100 000 times:

As we can see, strpos() and strstr() are clearly faster than their case-insensitive counterparts.

Now, let’s see what is the impact of the size of the text to search. We will now perform the two tests with 10 000 and 100 000 iterations but with a text that has 497 words.

[cc lang=’php’ line_numbers=’true’]
[raw]

[/raw]
[/cc]

That third test starts with 10 000 iterations. Here are the results of the third run:

The fourth test uses the same 497 words, but the test is performed 100 000 times:

As we can see, even if we add more words, the same kind of performances are experienced.

Conclusion

After many runs (I only demonstrated a few here). I think I can affirm that strpos() and strstr() are way faster than their case-insensitive counterparts. However, strpos() seems a little bit faster than strstr(), but it seems to depends of the context, and which random words are being searched for. In any cases, according to PHP’s documentation, we should always use strpos() instead of strstr() because it supposedly use less memory.

There may also be some unknown memory considerations that may affect the code I used to test these functions. In any case, I can affirm that in a real context, where queries are sent to the Ontology: Read web service endpoint that hosts the UMBEL ontology, that strpos() is a way faster than stripos().

commON and irJSON PHP parsers released

iron_logo_235Two days ago we released irON: Instance Record and Object Notation (irON) Specification. irON is a new notation that has been created to describe instance records. irON records can be serialized in 3 different formats: irXML (XML), irJSON (JSON) and commON (CSV: mainly for spreadsheet manipulations).

The release of irON has already been covered at length on Mike’s blog and in Structure Dynamics’s press room; so I won’t talk more about it here.

irON Parsers

What I am happy to release today are the first two parsers that can be used to parse and validate irON datasets of instance records. The first two parsers that have been developed so far are the ones for irJSON and commON. Each parser has been developed in PHP and is available under the Apache 2 licence. Now, lets take a look at each of them

irJSON Parser

The irJSON parser package can be downloaded here. Additionally, the source code can be browsed here.

First of all, to understand the code, you have to understand the specification of the irJSON serialization.

The irON parser package is everything you need to test and use the parser. The package is composed of the following files:

  • test.php – If you want to quick-start with this package, just run this test.php script and you will have an idea of what it can do for you. This script just runs the parser over a irJSON test file, and shows you some validation errors along with the internal parsed structure of the file. From there, you can simply use the irJSONParser class, with the structure that is returned to do whatever is needed for you: adding the information in you database, converting the data to another format, etc.
  • irJSONParser.php – This is the irJSON parser class. It parses the irJSON file and populates its internal structure that is composed of instances of the classes below.
  • Dataset.php – This class defines a Dataset records with all its attributes. It is the object that the developed has to manipulate that comes from the parser.
  • InstanceRecord.php – This class defines an Instance Records with all its attributes. It is the object that the developed has to manipulate that comes from the parser.
  • StructureSchema.php – This class defines a Structure Schema records with all its attributes. It is the object that the developed has to manipulate that comes from the parser.
  • LinkageSchema.php – This class defines a Linkage Schema records with all its attributes. It is the object that the developed has to manipulate that comes from the parser.

The irJSON parser also validates the incoming irJSON files according to these three levels of validation:

  1. JSON well-formedness validation – The first validation test occurs on the JSON serialization itself. A JSON file has to be a well formed in order to be processed. An error at this level will raise an error to the user.
  2. irJSON well-formedness validation – Once JSON is parsed and well formed, the parser make sure that the file is irJSON well-formed. If it is not well formed according to the irJSON spec, an error will be raised to the user.
  3. Structure Schema validation – The last validation that occurs is between instance records, and their related (if available) Structure Schema. If a validation error happens at this level, a notice will be raised to the user.

You can experiment with some of these validation errors and notices by running the test.php script in the package.

With this package, developers can already start to parse irJSON files and to integrate them with some of their prototype projects.

commON Parser

The commON parser package can be downloaded here. Additionally, the source code can be browsed here.

To understand the code, you have to understand the specification of the commON serialization.

The commON parser package is everything you need to test the parser. The package is composed of the following files:

  • test.php – If you want to quick-start with this package, just run this test.php script and you will have an idea of what it can do for you. This script just run the parser over a file, and shows you some validation errors along with the internal parsed structure of the file. From there, you can simply use the CommonParser class, with the structure that is returned to do whatever is needed for you: adding the information in you database, converting the data to another format, etc.
  • CommonParser.php – This is the commON parser class. It parses the commON file and populates its internal structure that is described in the code. the parser.

The commON parser also validates the incoming commON files according to these two levels:

  1. CSV well-formedness validation – The first validation test occurs on the CSV serialization itself. A CSV file has to be a well formed in order to be processed. An error at this level will raise an error to the user.
  2. commON well-formedness validation – Once CSV is parsed and well formed, the parser make sure that the file is CSV well-formed. If it is not well formed according to the CSV RFC, an error will be raised to the user.

You can experiment some of these validation errors and notices by running the test.php script in the package.

With this package, developers can already start to parsing commON files and to integrate them with some prototypes of their projects.

The commON parser is less advanced than the irJSON one. For example, the implementation of the “dataset” and the “schema” processor keywords are not yet done. Other keywords haven’t (yet) been integrated too. Take a look at the source code to know what is currently missing.

In any case, a lot of things can currently be done with this parser. We will publish specific commON usage use-cases in the coming weeks that will shows people are we are using commON internally and how we will expect our customers to use it to create and maintain different smaller datasets.

Conclusion

These are the first versions of the irJSON and commON parsers. We have to continue to development to make them perfectly reflecting the current and future irON specification. We yet have to write the irXML parser too.

I would encourage reporting any issues with these parsers, or any enhancement suggestions, on this issue tracked.

All discussions regarding these parsers and the irON specification document should happen on the irON group mailing list here.

Finally, another step for us will be to embed these parsers in converter web services for structWSF.