We just released a new UMBEL web service endpoint and online tool: the Concept Tagger Plain.
This plain tagger uses UMBEL reference concepts to tag an input text. The OBIE (Ontology-Based Information Extraction) method is used, driven by the UMBEL reference concept ontology. By plain we mean that the words (tokens) of the input text are matched to either the preferred labels or alternative labels of the reference concepts. The simple tagger is merely making string matches to the possible UMBEL reference concepts.
This tagger matches the plain labels of the reference concepts against the input text. No manipulations, such as stemming, are performed on the reference concept labels or on the input text. Also, the tagger performs NO disambiguation when multiple concepts are tagged for a given keyword.
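The plain matching described above can be sketched in a few lines of Python. This is only an illustration, not the tagger's actual implementation: the concept identifiers, the labels, and the function names (`build_label_index`, `tag_plain`) are hypothetical, and the use of word-boundary regular expressions is an assumption about how tokens are delimited.

```python
import re
from collections import defaultdict

# Hypothetical mini-ontology: concept id -> preferred and alternative labels.
# These ids and labels are made up for illustration; they are not real UMBEL data.
CONCEPTS = {
    "rc:Dog": {"pref": "dog", "alt": ["domestic dog", "canine"]},
    "rc:Canine": {"pref": "canine", "alt": []},
    "rc:Cat": {"pref": "cat", "alt": ["house cat"]},
}

def build_label_index(concepts):
    """Map each label string to the (concept id, label type) pairs that carry it."""
    index = defaultdict(list)
    for concept_id, labels in concepts.items():
        index[labels["pref"]].append((concept_id, "pref"))
        for alt in labels["alt"]:
            index[alt].append((concept_id, "alt"))
    return index

def tag_plain(text, concepts):
    """Exact string matching only: no stemming, no case folding, no disambiguation."""
    matches = []
    for label, targets in build_label_index(concepts).items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text):
            # Every concept sharing the label is reported; nothing is disambiguated.
            for concept_id, label_type in targets:
                matches.append({
                    "concept": concept_id,
                    "label": label,
                    "type": label_type,
                    "start": m.start(),
                    "end": m.end(),
                })
    return sorted(matches, key=lambda x: (x["start"], x["concept"]))

result = tag_plain("The canine chased a cat", CONCEPTS)
```

In this sketch the word "canine" is tagged twice, once through a preferred label and once through an alternative label, which is the kind of ambiguity a later disambiguation step would have to resolve.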
Intended Users
This tool is intended for those who want to focus on UMBEL and do not care about more complicated matches. The output of the tagger can be used as-is, but it is intended to be the initial input to more sophisticated reference concept matching and disambiguation methods. Expect additional tagging methods to follow (see conclusion).
The Web Service Endpoint
The web service endpoint is freely available. It can return its resultset in JSON, Clojure code or EDN (Extensible Data Notation).
This endpoint will return a list of matches on the preferred and alternative labels of the UMBEL reference concepts that match the tokens of an input text. It will also return the number of matches and the position of the tokens that match the concepts.
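To make the shape of such a result more concrete, here is a minimal Python sketch that groups raw matches by label type and by concept, keeping the number of matches and the token positions. The field names (`concept`, `type`, `start`, `end`), the concept identifiers, and the function `summarize_matches` are assumptions made for illustration; the actual structure of the endpoint's resultset may differ.

```python
from collections import defaultdict

def summarize_matches(matches):
    """Group matches by label type, then by concept, with counts and positions."""
    def new_section():
        return defaultdict(lambda: {"count": 0, "positions": []})

    sections = {"pref": new_section(), "alt": new_section()}
    for m in matches:
        entry = sections[m["type"]][m["concept"]]
        entry["count"] += 1
        entry["positions"].append((m["start"], m["end"]))
    # Convert the defaultdicts to plain dicts for a predictable result.
    return {kind: dict(by_concept) for kind, by_concept in sections.items()}

# Hypothetical matches, as a plain tagger might produce them.
sample_matches = [
    {"concept": "rc:Cat", "type": "pref", "start": 4, "end": 7},
    {"concept": "rc:Cat", "type": "pref", "start": 20, "end": 23},
    {"concept": "rc:Feline", "type": "alt", "start": 4, "end": 7},
]

summary = summarize_matches(sample_matches)
```

Separating preferred-label matches from alternative-label matches lets a consumer treat the two with different levels of trust.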
The Online Tool
We also provide an online tagging tool that people can use to experience interacting with the web service.
The results are presented in two sections depending on whether the preferred or alternative label(s) were matched. Multiple matches, either by concept or label type, are coded by color. Source words with matches and multiple source occurrences are ranked first; thereafter, all source words are presented alphabetically.
Click any tagged concept to see its full description.
EDN and ClojureScript
An interesting aspect of this user interface is that it is implemented in ClojureScript, and the data exchanged between the user interface and the tagger web service endpoint is serialized in EDN. As a result, when the UI receives the resultset from the endpoint, it only has to read the EDN using the ClojureScript reader (cljs.reader/read-string) to treat the output of the web service endpoint as data native to the application.
No parsing of a non-native data format is necessary, which makes the UI code simpler and makes data manipulation much more natural for the developer, since no external API is needed.
What is Next?
This is the first of a series of tagging web service endpoints that will be released. Our intent is to release UMBEL tagging services with different levels of sophistication. Depending on how they want to use UMBEL, users will have access to different tagging services that they can use and supplement with their own techniques to get their desired results.
The next taggers (not in order) that are planned to be released are:
- Plain tagger – no weighting or classification except by occurrence count
  - Entity plain tagger (using the Wikidata dictionary)
  - Scones plain tagger – concept + entity
- Noun tagger – with POS, only tags the nouns; generally, the preferred, simplest baseline tagger
  - Concept noun tagger
  - Entity noun tagger
  - Scones noun tagger
- N-gram tagger – a phrase-based tagger
  - Concept n-gram tagger
  - Entity n-gram tagger
  - Scones n-gram tagger
- Complete tagger – combinations of above with different machine learning techniques
  - Concept complete tagger
  - Entity complete tagger
  - Scones complete tagger.
So, we welcome you to try out the system online and we welcome your comments and suggestions.