Freshmeat.net now available in DOAP: 43 000 new DOAP projects

Three weeks ago, Rob Cakebread contacted me about some possible issues with the PingtheSemanticWeb.com web service. Rob is involved in the development of Gentoo Linux, and he wanted to come up with a method to let their development teams know when new packages are released for some projects. Since the RSS feeds of Freshmeat.net are limited in length and do not support DOAP, he started to think about a way to convert Freshmeat.net’s project data into RDF using the DOAP ontology. That way, he could easily create services to track the release of new packages and thus increase the effectiveness of their development teams.

Rob wrote:

I came across ptsw.com when trying to determine how many package indexes are using DOAP right now (so far I’ve only found Python Package Index and O’Reilly’s Code Zoo).

I’m a developer for Gentoo Linux. A website I created (Meatoo) checks Freshmeat twice a day and finds packages that have new releases available. We have a database of all our maintainers grouped into ‘herds’, a Python herd, desktop herd, PHP herd etc. The developers in the herds can query my website by a command-line client that uses XML-RPC, or subscribe to RSS feeds by herd or package name, or read the website itself and see which packages have new releases.

DOAP fits into this because I was thinking about creating DOAP records for each release from each package index and making this available so people can write tools to find out information about software packages easily.

That’s how we got in contact. Rob had a practical problem, looked for a way to solve it (and to help other people solve it too), and in doing so found PingtheSemanticWeb.com and other Semantic Web projects and communities (such as the Linking Open Data project).

Freshmeat.net in DOAP

Then, a couple of days ago, Rob contacted me again to let me know that descriptions of Freshmeat.net’s 43 000 projects are now available in RDF.

He created a prototype service that converts the data he has aggregated from Freshmeat.net into DOAP. His project is called DOAPSpace. The idea is to make the Freshmeat.net projects available in DOAP, then to ping PingtheSemanticWeb.com so that they ultimately become available on Doapstore.org, which is fed by PTSW.

People can get the DOAP description of each project by going to:

http://doapspace.gentooexperimental.org/doap/<project_name>

Here are some example URIs:

http://doapspace.gentooexperimental.org/doap/0verkill
http://doapspace.gentooexperimental.org/doap/amanda
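
As a quick illustration, here is a minimal sketch of how a client could fetch and inspect one of these DOAP descriptions with Python and rdflib. It assumes the service returns RDF/XML; the exact serialization DOAPSpace emits may differ.

    # Minimal sketch: fetch a DOAPSpace record and print a few DOAP properties.
    # Assumes the service returns RDF/XML; the exact serialization may differ.
    from rdflib import Graph, Namespace, RDF

    DOAP = Namespace("http://usefulinc.com/ns/doap#")

    g = Graph()
    g.parse("http://doapspace.gentooexperimental.org/doap/0verkill", format="xml")

    for project in g.subjects(RDF.type, DOAP.Project):
        print(project, g.value(project, DOAP.name), g.value(project, DOAP.homepage))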

RDF dump

I was really pleased to see how Rob managed to generate that data, so I asked him whether an RDF dump of it would eventually be available for download. That is exactly what he is working on at the moment; as soon as he sends me the dump, I will make it available via PingtheSemanticWeb.com. Then it will be ready to be integrated into the Linking Open Data project.

Content Negotiation

At the same time, Rob added a new feature to his service: a user only has to append the “?zg=1” parameter to the URL to get redirected to the Zitgist Browser. It was really nice of him to think of that, and I appreciate it.

However, I also showed him how he could use content negotiation to do that and make his service compatible with other tools, such as other RDF browsers. I pointed him to the How to Publish Linked Data on the Web draft document so that he could get a better understanding of the content-negotiation process.
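
To give an idea of what content negotiation looks like in practice, here is a rough sketch (not Rob’s actual code): a service inspects the HTTP Accept header and either returns the RDF document or redirects a human visitor to an HTML view. The HTML-viewer URL and the storage path are hypothetical placeholders.

    # Rough sketch of content negotiation (not the actual DOAPSpace code).
    # Clients asking for RDF get the DOAP file; ordinary browsers are
    # redirected (303 See Other) to an HTML view of the same resource.
    HTML_VIEWER = "http://browser.example.org/?uri="   # hypothetical RDF-browser URL

    def app(environ, start_response):
        accept = environ.get("HTTP_ACCEPT", "")
        project = environ.get("PATH_INFO", "").rsplit("/", 1)[-1]

        if "application/rdf+xml" in accept:
            with open("/var/doap/%s.rdf" % project, "rb") as f:   # hypothetical storage path
                body = f.read()
            start_response("200 OK", [("Content-Type", "application/rdf+xml")])
            return [body]

        rdf_uri = "http://doapspace.gentooexperimental.org/doap/" + project
        start_response("303 See Other", [("Location", HTML_VIEWER + rdf_uri)])
        return [b""]

    if __name__ == "__main__":
        from wsgiref.simple_server import make_server
        make_server("", 8000, app).serve_forever()

The key point is that the same URI serves both audiences: RDF-aware tools negotiate for application/rdf+xml, while everyone else ends up on a human-readable page.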

Linking Open Data Community and Early Developers

Rob is certainly an early adopter of the Semantic Web. He is a developer who wants to solve problems with methods and technologies. He had the intuition that the DOAP ontology and other Semantic Web principles and technologies could help him solve one of his problems. That intuition led him to discover what the Semantic Web community could do to help him.

These are the kind of users we have to take care of and help to get their projects released. It’s people like Rob who will make the Semantic Web a reality. Without such early adopters from outside the Semantic Web community, the Semantic Web is probably doomed. We are there now: ready to help developers integrate Semantic Web technologies into their projects, to generate their data as RDF, and to link it with other data sources. That is the goal of communities like the Linking Open Data community, and it is what we are about to do.

News at Zitgist: the Browser, PTSW, the Bibliographic Ontology and the Query Service

The issues we had with the Zitgist Browser‘s server do not mean that things stopped at Zitgist. In fact, many projects evolved at the same time, and I outline some of these evolutions below.

New version of the Zitgist Browser

A new version of the browser is already on the way. In fact, the pre-release version of the browser was a use case, a prototype. Now that we know it works, and that we have faced most of the issues that have to be taken into account to develop such a service, we hired Christopher Stewart to work on the next version of the browser. He is already well into the problem, so you can expect a release of this new version sooner than you might think. At first there won’t be many modifications at the user-interface level; however, many things will be introduced in this version that will help us push the service to another level in the future.

New version of Ping the Semantic Web

Version 3.0 of the PingtheSemanticWeb.com web service should be put online next week. It will be a totally new version of the service. It won’t use MySQL anymore; Virtuoso has replaced it. The service will now fully validate RDF files before including them in the index. More stats will be available too. It is much faster (as long as remote servers are fast too), and I estimate that this single server could handle between 5 and 10 million pings per day (enough for the next year’s expansion). This said, the service will be pushed to another level and be ready for more serious traffic. After its release, a daily dump of all links will be produced as well.
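
The validation step is conceptually simple. Here is a small sketch of the idea (not the actual PTSW code, which runs on Virtuoso): a pinged URL is only accepted if its content actually parses as RDF.

    # Sketch of the idea behind validating pinged documents (not the PTSW code):
    # only keep a ping if the document really parses as RDF.
    from rdflib import Graph

    def is_valid_rdf(url):
        g = Graph()
        try:
            g.parse(url)      # rdflib guesses the serialization from the response
        except Exception:
            return False
        return len(g) > 0     # at least one triple was found

    if __name__ == "__main__":
        print(is_valid_rdf("http://doapspace.gentooexperimental.org/doap/amanda"))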

The first draft of the Bibliographic Ontology

The Bibliographic Ontology Specification Group is on fire. We now have 55 members who generated 264 posts in July alone. Many things are going on there, and the ontology is well underway. We should expect to release a first draft of the ontology sometime in August. If you are interested in bibliographic matters, I think it’s a good place to be.

The Zitgist Semantic Web Query Service

Finally, Zitgist’s Semantic Web Query Service should be available to alpha subscribers sometime in September. You can register to get your account here. Also, take a look at what I wrote about this search module (many things have evolved since, but it’s a good introduction to the service).

Conclusion

So, many things are going on at Zitgist, and many exciting things should happen this autumn, so stay tuned!

Zitgist Browser’s server stabilized

Five weeks ago I introduced the Zitgist Browser on this blog. At that time, I talked about a pre-release of the service. These two little words probably help to explain what happened in the weeks that followed.

Some of you probably noticed that the Zitgist Browser was down half of the time for a couple of weeks. We found many issues, at many levels, that made the browser’s server unstable. Over the last weeks we performed a battery of tests to fix all the issues that appeared. Now, about three weeks later, the server is stable again; at least, it has been online for the last couple of days without any issues.

Thanks to the OpenLink Software Inc. development team, we have been able to stabilize the service; it wouldn’t have been possible without their help and expertise.

Finally, stay tuned for the next release of this service, and continue to use it and report any issues you encounter while browsing the Semantic Web (more information about the next version in the next blog post). Sorry for any frustration you may have experienced while using the unstable version of the service.

UMBEL: Upper-level Mapping and Binding Exchange Layer


Mike Bergman released the first draft of his UMBEL ontology. Some other people and I helped him come up with this new ontology. What is UMBEL? UMBEL is a lightweight subject reference structure. You can see it as a pool of subjects. Subjects are related to one another at the level of synonymy, so subjects with related meanings are bound together.

The objectives

The objectives of this new ontology are:

  • A reference umbrella subject binding ontology, with its own pool of high-level binding subjects
  • Lightweight mechanisms for binding subject-specific ontologies to this structure
  • A standard listing of subjects that can be referenced by resources described by other ontologies (e.g., dc:subject)
  • Provision of a registration and look-up service for finding appropriate subject ontologies
  • Identification of existing data sets for high-level subject extraction
  • Codification of high-level subject structure extraction techniques
  • Identification and collation of tools to work with this subject structure, and
  • A public Web site for related information, collaboration and project coordination.

Main applications

Given these objectives, I see a couple of main applications where such an ontology could be used:

  • Helping systems find data sources for a given ontology. UMBEL is much more than a subject structure: it will bind subjects with related ontologies, and with data sources for those ontologies. So, for a given subject, people will be able to find related ontologies, and then related data sources.
  • Acting as a subject reference backbone. It could be used to link resources, via dc:subject, to their subject resources (the UMBEL subject proxy resources); see the sketch after this list.
  • Being used by user interfaces to handle subject (keyword) references and find related ontologies (ones expressive enough to describe those subjects).
  • Eventually it should be used by PingtheSemanticWeb to bind pinged data to the subject reference structure.
  • And probably many others.
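
For the backbone application above, tagging a resource with a subject from the reference structure could look roughly like this. The UMBEL namespace and subject URI below are hypothetical placeholders made up for illustration; only the use of dc:subject comes from the text.

    # Illustration only: link a document to an UMBEL subject proxy via dc:subject.
    # The UMBEL namespace and subject URI are hypothetical placeholders.
    from rdflib import Graph, Namespace, URIRef, Literal

    DC = Namespace("http://purl.org/dc/elements/1.1/")
    UMBEL_SUBJECTS = Namespace("http://umbel.org/subjects/")   # hypothetical

    g = Graph()
    g.bind("dc", DC)

    doc = URIRef("http://example.org/articles/semantic-web-overview")
    g.add((doc, DC.title, Literal("An overview of the Semantic Web")))
    g.add((doc, DC.subject, UMBEL_SUBJECTS["SemanticWeb"]))     # subject proxy resource

    print(g.serialize(format="turtle"))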

Creation of the Ontology

A procedure will be created to automatically generate the ontology. The rough idea is to reuse existing knowledge bases to create the set of subjects, and their relationships, that will make up the ontology. The goal is to come up with a representative set of subjects, neither too general nor too specialized. For that, we will work with knowledge bases such as WordNet, Wikipedia, Dmoz, etc. We will try to find out how we could prune unnecessary subjects out of them, how we could create such a subject reference framework by looking at the intersections of the data sets, etc. The procedure is not yet developed, but this is what the first experiments will look like.

As explained in the draft:

The acceptance of the actual subjects and their structure is one key to the acceptance — and thus use and usefulness — of the UMBEL ontology. (The other key is simplicity and ease-of-use of tools.) A suitable subject structure must be adaptable and self-defining. It should reflect expressions of actual social usage and practice, which of course changes over time as knowledge increases and technologies evolve.

A premise of the UMBEL project is that suitable subject content and structures already exist within widely embraced knowledge bases. A further premise is that the ongoing use of these popular knowledge bases will enable them to grow and evolve as societal needs and practices grow and evolve.

The major starting point for the core subject pool is WordNet. It is universally accepted, has complete noun and class coverage, has an excellent set of synonyms, and has frequency statistics. It also has data regarding hierarchies and relationships useful to the UMBEL look-up reference structure, the ‘unofficial’ complement to the core ontology.

A second obvious foundation to building a subject structure is Wikipedia. Wikipedia’s topic coverage has been built entirely from the bottom up by 75,000 active contributors writing articles on nearly 1.8 million subjects in English alone, with versions in other degrees of completeness for about 100 different languages. There is also a wealth of internal structure within Wikipedia’s templates.

These efforts suggest a starting formula for the UMBEL project of W + W + S + ? (for WordNet + Wikipedia + SKOS + other?). Other potential data sets with rich subject coverage include existing library classification systems, upper-level ontologies such as SUMO, Proton or DOLCE, the contributor-built Open Directory Project, subject ‘primitives’ in other languages such as Chinese, or the other sources listed in Appendix 2 – Candidate Subject Data Sets.

Though the choice of the contributing data sets from which the UMBEL subject structure is to be built will never be unanimous, using sources that have already been largely selected by large portions of the Web-using public will go a long ways to establishing authoritativeness. Moreover, since the subject structure is only intended as a lightweight reference — and not a complete closed-world definition — the UMBEL project is also setting realistic thresholds for acceptance.

Conclusion

If you are interested in such an ontology project, please join us on the mailing list of the ontology’s development group, ask questions, and write comments and suggestions.

Next step is to start creating a first version of the subject proxies.

The Bibliographic Ontology: a first proposition

This document is about the creation of The Bibliographic Ontology. It is the first proposition from Bruce D’Arcus and me, and it should lead to the writing of the first draft of the ontology. Some things have been developed, many questions have been raised, and the discussion that arises from this first proposition will set the basis for the first draft of the ontology.

The goal of this ontology is simple: to create a bibliographic ontology that sets the basis for describing a document, that is, a piece of writing that provides information. If well done, it will enable other people or organizations to create extension modules that make it expressive enough to describe more specialized sub-domains, such as legal documents. It also re-uses existing ontologies that already define some properties of documents.

Related materials

1. The proposed OWL/N3 file describing The Bibliographic Ontology (note: read the comments; those marked FG are from me, and those marked BD are from Bruce)
2. An enhanced version of the Zotero RDF dump of the book “Spinning the Semantic Web”, which shows the expressive power of the ontology by extending its content using the bibo:Part class and the locator properties (RDF/XML)
3. Other examples that show other possible descriptions, such as events, places, etc. (RDF/N3)

Main concept of the ontology: a Document

The main concept of the ontology is bibo:Document. This class is described as “writing that provides information” (from WordNet). So, basically, any writing is a Document. It is equivalent to a foaf:Document and a dcterms:BibliographicResource. These two links are quite important since they will enable us to re-use these two widely used ontologies: FOAF and DCTERMS.
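
As an illustration, a minimal description of a document could look like this. The bibo namespace URI is an assumption on my part (the proposition may settle on another one), and the document URI is a made-up example.

    # Minimal sketch of describing a document with the proposed ontology.
    # The bibo namespace URI is assumed; the document URI is a placeholder.
    from rdflib import Graph, Namespace, URIRef, Literal, RDF

    BIBO = Namespace("http://purl.org/ontology/bibo/")   # assumed namespace
    DCTERMS = Namespace("http://purl.org/dc/terms/")

    g = Graph()
    g.bind("bibo", BIBO)
    g.bind("dcterms", DCTERMS)

    doc = URIRef("http://example.org/documents/spinning-the-semantic-web")
    g.add((doc, RDF.type, BIBO.Document))
    g.add((doc, DCTERMS.title, Literal("Spinning the Semantic Web")))

    print(g.serialize(format="turtle"))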

Second main concept: Contributions to these Documents

The second main concept of the ontology is bibo:Contribution. This class is described as “a part played by a person in bringing about a resulting Document”. The goal of this concept is to relate people, through their contributions, to the documents they wrote or helped to write. For now, contributions are defined by three properties (see the sketch after this list):

  1. bibo:role, which defines the role of the contributor: author, translator, publisher, distributor, etc.
  2. bibo:contributor, which links a contribution to its contributor
  3. bibo:position, which associates a “contribution” level with each contributor. This property is mainly used to order the multiple authors who worked on the writing of a document. More about that in the examples document.
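
Here is a rough sketch of how a contribution could be described with these three properties. The namespaces, the property linking the document to its contribution, the “author” role individual, and the person are all placeholders based on my reading of the proposition, not its official vocabulary.

    # Rough sketch of a bibo:Contribution linking a person to a document.
    # Namespace URIs, the linking property and the role individual are placeholders.
    from rdflib import Graph, Namespace, URIRef, Literal, RDF, BNode

    BIBO = Namespace("http://purl.org/ontology/bibo/")   # assumed namespace
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    g = Graph()

    doc = URIRef("http://example.org/documents/some-book")
    person = URIRef("http://example.org/people/jane-doe")

    contribution = BNode()                                # the contribution itself
    g.add((doc, BIBO.contribution, contribution))         # hypothetical linking property
    g.add((contribution, RDF.type, BIBO.Contribution))
    g.add((contribution, BIBO.contributor, person))
    g.add((contribution, BIBO.role, BIBO.author))         # placeholder role individual
    g.add((contribution, BIBO.position, Literal(1)))      # first listed author

    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal("Jane Doe")))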

With these two concepts, you can describe any Document and any Contribution to any document. So you can relate any piece of writing to its contributors.

What is really interesting about this concept (in my opinion) is that it opens the door to much more. By using it, we can extend the idea and describe many more things about how people contributed to the writing of a document.

From these two concepts, we extended the idea to be able to cope with a larger range of use-cases.

Extensions of bibo:Document

The Document class has been specialized into a series of more specific document types, each with restrictions of their own:

  • Article
  • LegalCase
  • Manuscript
  • Book
  • Manual
  • Legislation
  • Patent
  • Report
  • Thesis
  • Transcript
  • Note
  • Law

Classes or individuals?

This proposition has been developed with this quote from Lee W. Lacy’s OWL book in mind:

Individuals often mirror “real world” objects. When you stop having different property attributes (and just have different values) you have often identified an object (individual)

This means that if a subclass of a class didn’t have specific restrictions, or if no properties were restricted by using this class in their domain, then the class was dropped and an individual of the super-class was created instead.

One example of this is the type bibo_types:dissertation. Since it differs from bibo:Thesis only in its meaning, and not in its restrictions or properties, we created it as an individual of the class bibo:Thesis rather than as a subclass. Check the examples document to see what this means concretely.
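
In RDF/OWL terms, the distinction is simply that of declaring an individual rather than a subclass. A sketch of the dissertation case might look like this; the namespace URIs are placeholders based on my reading of the proposition.

    # Sketch of the "individual rather than subclass" approach for dissertation.
    # Namespace URIs are placeholders, not the official ones.
    from rdflib import Graph, Namespace, RDF

    BIBO = Namespace("http://purl.org/ontology/bibo/")          # assumed namespace
    BIBO_TYPES = Namespace("http://purl.org/ontology/bibo/types/")  # assumed namespace

    g = Graph()

    # dissertation is declared as an individual of bibo:Thesis...
    g.add((BIBO_TYPES.dissertation, RDF.type, BIBO.Thesis))
    # ...instead of declaring a Dissertation class via rdfs:subClassOf bibo:Thesis,
    # since it carries no restrictions of its own, only a more specific meaning.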

Collections of documents

Another main concept of the ontology is bibo:Collection. This concept has an inherent aggregation property: its only purpose is to aggregate bibo:Document instances. Entities of this class act as hubs in the RDF graph (network) created out of bibliographic relations (properties).

Other types of collections, with some restrictions of their own, have also been created. These other collections, such as bibo:CourtReporter, are intended to be anchor points that can be extended by Bibliographic Ontology extension modules for particular specialized sub-domains, such as legal documents.

Here is the current list of specialized collections:

  • InternetSite
  • Series
  • Periodical
    • Journal
    • Magazine
    • CourtReporter

Parts of Documents

Another important concept is bibo:Part. This concept, along with locators (more about them in the next section), enables us to specify the components of a Document. Sometimes documents are aggregated to create collections, such as journals, magazines or court reporters; at other times, documents are embedded within a document (embedded versus aggregated). This is the purpose of bibo:Part: a bibo:Part is a document, but one that is part of another document. The special property of a bibo:Part is dcterms:hasPart, which is used to relate a bibo:Part to the document it belongs to. Check the examples document to see how bibo:Part can be used.

Locating Parts

To support the concept of Parts, a set of properties called “locators” has been created. These locator properties help to describe the relation between a Part and its related Document.

Three of these locators are bibo:volume, bibo:chapter and bibo:page. These properties locate Parts inside documents: for example, a chapter within a book, or a volume within a document that is a set of volumes.

Check the example about the document “The Art of Computer Programming” by Donald Knuth for a good example of how locators can be used.

This said, we can now think of describing a document by its parts, recursively from its volumes down to its pages.
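
Putting Parts and locators together, describing a chapter of “The Art of Computer Programming” could look roughly like this. The namespace, the URIs and the exact shape of the description are my assumptions based on the text above, not the official examples document.

    # Rough sketch of a bibo:Part located inside a document with locator properties.
    # The bibo namespace and the example URIs are assumed placeholders.
    from rdflib import Graph, Namespace, URIRef, Literal, RDF

    BIBO = Namespace("http://purl.org/ontology/bibo/")   # assumed namespace
    DCTERMS = Namespace("http://purl.org/dc/terms/")

    g = Graph()

    book = URIRef("http://example.org/documents/taocp")
    chapter = URIRef("http://example.org/documents/taocp/vol1/ch2")

    g.add((book, RDF.type, BIBO.Book))
    g.add((book, DCTERMS.title, Literal("The Art of Computer Programming")))
    g.add((book, DCTERMS.hasPart, chapter))           # relates the whole to its part

    g.add((chapter, RDF.type, BIBO.Part))
    g.add((chapter, DCTERMS.title, Literal("Information Structures")))
    g.add((chapter, BIBO.volume, Literal("1")))       # locators: where the part sits
    g.add((chapter, BIBO.chapter, Literal("2")))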

Open questions

  1. Should we develop the ontology such that we can describe the entire workflow that leads to the creation and (possibly) the publication of a document? This whole workflow would be supported by the FRBR principles. At the moment, the ontology only describes the manifestation of a work, and not the work itself or its expression. Take a look at The Music Ontology (its workflow) to see how it could be done for the bibliographic ontology.
  2. Is the creation of classes, and of individuals of classes, the right way to describe types of documents?
  3. Is this the right way, or are there other ways, to describe people’s contributions to the elaboration of documents?

Re-used ontologies

  • DCTERMS: re-used to describe the main properties of documents.
  • FOAF: re-used to describe people and organizations.
  • EVENT: re-used to describe events (example: conferences)
  • TIME: re-used to describe temporal properties
  • wgs84_pos: re-used to describe geographical entities

Conclusion

Please give any feedback, suggestions or comments directly on the mailing list of the group that is developing this ontology. This group is intended to create an ontology that builds some kind of consensus among the people and organizations working with bibliographic data.

Note: I disabled comments on this post only, to make sure that people comment on the mailing list.