
“What if there was a library which held every book? Not every book on sale, or every important book, or even every book in English, but simply every book-a key part of our planet’s cultural legacy.” — The OpenLibrary Project

This is what I wanted to take part in.

The Open Library is a project that aims to archive information about every book (and probably every piece of writing) created by mankind. Such a strong vision is naturally closely related to the semantic web.

I contacted Aaron Swartz about this project. I wanted to know what their plans were for making all this data available on the semantic web, and how they planned to describe these books in RDF.

I wanted to contribute to the project by describing its information in RDF using the Bibliographic Ontology.

So that is what I started to do. Aaron sent me some snapshots of data using their current database schema (this schema should be updated soon). Then I described one of them using BIBO. As you will see below, the ontology neatly describes the Open Library data and enables us to query, at the same time, the Open Library’s data, the data about the articles I wrote, possibly the Zotero citations if they choose to use BIBO, etc.

So, below is my proposal to Aaron and to the Open Library Project. From this post, we will be able to discuss the implications, how this could be done, how the data could be made available for querying and browsing, etc.

How to Cook Revised Edition described using RDF and BIBO

The current use case is a book by Raymond Sokolov: “How to Cook Revised Edition”. It was straightforward to map this data to BIBO using the current proposal.

The RDF/N3 example is available here: How to Cook Revised Edition in RDF/N3

Describing this data using BIBO led me to work out how to describe the topical subjects of documents. It is a discussion we (the BIBO development community) have already had, and I think I have found a solution here.

Describing topical subjects for a bibo:Document

The goal is to relate a document resource to the concepts describing its topics. There are many ways to describe the subjects of documents: it could be with a literal, a class, an individual, etc.

What I am proposing here is to re-use the dcterms:subject property (as we already do) to relate a bibo:Document to a concept of a taxonomy that will act as the topical subject of the document.

The Open Library is using the BISAC subject standard to relate books with their topics. What I have done is to describe the BISAC standard as a taxonomy in RDF using SKOS. The resulting RDF is: BISAC taxonomy snapshot.

As you can see, the BISAC taxonomy structure is well described using SKOS concepts. The relations between these concepts are described as well. Also, the dcterms:identifier property is used to link a concept to its BISAC identifier.

From there, we only have to use the BISAC URIs to link a bibo:Document to its subjects, like this:

dcterms:subject <http://purl.org/ontology/bibo/bisac#Cooking_Regional_and_Ethnic_American_General> ;
dcterms:subject <http://purl.org/ontology/bibo/bisac#Cooking_General> ;

This is simple and effective. Also, we are not limited to the BISAC taxonomy; one can use whichever taxonomy one wants to describe the subjects of one’s documents.
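As a rough illustration of this structure, the sketch below models two taxonomy concepts and one document as plain (subject, predicate, object) tuples in Python, without any RDF library. The two concept URIs come from the snippet above; the document URI, the labels, the identifier strings and the helper name subjects_of are assumptions made up for the example, not the actual Open Library or BISAC data.

```python
# Triples as plain (subject, predicate, object) tuples; no RDF library needed.
BISAC = "http://purl.org/ontology/bibo/bisac#"
BOOK = "http://example.org/book/how-to-cook"  # hypothetical document URI

GENERAL = BISAC + "Cooking_General"
AMERICAN = BISAC + "Cooking_Regional_and_Ethnic_American_General"

triples = [
    # Two SKOS concepts (labels and identifiers are sample values).
    (GENERAL, "rdf:type", "skos:Concept"),
    (GENERAL, "skos:prefLabel", "Cooking / General"),
    (GENERAL, "dcterms:identifier", "CKB000000"),
    (AMERICAN, "rdf:type", "skos:Concept"),
    (AMERICAN, "skos:prefLabel", "Cooking / Regional and Ethnic / American / General"),
    (AMERICAN, "dcterms:identifier", "CKB002000"),
    # One document linked to both concepts through dcterms:subject.
    (BOOK, "rdf:type", "bibo:Book"),
    (BOOK, "dcterms:title", "How to Cook Revised Edition"),
    (BOOK, "dcterms:subject", GENERAL),
    (BOOK, "dcterms:subject", AMERICAN),
]


def subjects_of(document, triples):
    """Return the sorted preferred labels of the concepts a document is about."""
    concepts = {o for s, p, o in triples if s == document and p == "dcterms:subject"}
    return sorted(o for s, p, o in triples if s in concepts and p == "skos:prefLabel")


print(subjects_of(BOOK, triples))
```

Because the subject is a concept URI rather than a literal, swapping BISAC for another taxonomy only changes the URIs on the right-hand side of dcterms:subject.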

Some SPARQL queries

Nothing is better than SPARQL queries to “feel” the power of these RDF descriptions.

Queries related to contributions

The following query displays each document’s title together with Raymond Sokolov’s contribution role. So, if Raymond contributed to some documents as an author and to others as an editor, all of these documents will be returned in the result set:

Finding documents where Raymond Sokolov contributed

The following query is a variant of the one above. It returns the titles of all the documents where Raymond is an author.

Finding documents where Raymond Sokolov contributed as an author
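
The logic of these two contribution queries can be sketched in plain Python over a small list of triples. This is only an illustration under stated assumptions: the way a contribution is modeled here (a document pointing to a contribution node, which in turn has a contributor and a role), the property names bibo:contribution, bibo:contributor and bibo:role, the URIs, and the second document are all hypothetical and may differ from the actual BIBO draft and from the linked SPARQL queries.

```python
SOKOLOV = "http://example.org/person/raymond-sokolov"  # hypothetical URI

# Assumed modeling: document -> contribution node -> contributor + role.
triples = [
    ("doc:1", "dcterms:title", "How to Cook Revised Edition"),
    ("doc:1", "bibo:contribution", "contrib:1"),
    ("contrib:1", "bibo:contributor", SOKOLOV),
    ("contrib:1", "bibo:role", "author"),
    ("doc:2", "dcterms:title", "A Hypothetical Anthology"),  # made-up document
    ("doc:2", "bibo:contribution", "contrib:2"),
    ("contrib:2", "bibo:contributor", SOKOLOV),
    ("contrib:2", "bibo:role", "editor"),
]


def contributions_of(person, triples, role=None):
    """Return sorted (title, role) pairs for a person's contributions.

    When `role` is given, only contributions with that role are kept.
    """
    results = []
    for contrib, pred, obj in triples:
        if pred != "bibo:contributor" or obj != person:
            continue
        roles = [o for s, p, o in triples if s == contrib and p == "bibo:role"]
        docs = [s for s, p, o in triples if p == "bibo:contribution" and o == contrib]
        for doc in docs:
            titles = [o for s, p, o in triples if s == doc and p == "dcterms:title"]
            for title in titles:
                for r in roles:
                    if role is None or r == role:
                        results.append((title, r))
    return sorted(results)


print(contributions_of(SOKOLOV, triples))                 # every role
print(contributions_of(SOKOLOV, triples, role="author"))  # authors only
```

The second call differs from the first only by one extra constraint on the role, which mirrors how the second query is a variant of the first.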

We could also use bibo:position to find all the documents written by Raymond where his author position is less than 2 (so, where he is a primary or secondary author of a document).

Queries related to documents and their subjects

If a user only has the BISAC identification number of a concept and needs to find books about this topic, he only has to run this query to get the titles with that topic:

Finding documents related to a BISAC identifier
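
In plain Python, the lookup behind such a query amounts to two joins: from the identifier to the concept, then from the concept to the documents. The sketch below is a hedged illustration, not the linked query; the short prefixes, the identifier strings, the second document and the function name titles_for_identifier are sample values chosen for the example.

```python
# "bisac:" is used as shorthand for http://purl.org/ontology/bibo/bisac#
triples = [
    ("bisac:Cooking_General", "rdf:type", "skos:Concept"),
    ("bisac:Cooking_General", "dcterms:identifier", "CKB000000"),      # sample value
    ("bisac:Gardening_General", "rdf:type", "skos:Concept"),           # made-up concept
    ("bisac:Gardening_General", "dcterms:identifier", "GAR000000"),    # sample value
    ("doc:1", "dcterms:title", "How to Cook Revised Edition"),
    ("doc:1", "dcterms:subject", "bisac:Cooking_General"),
    ("doc:2", "dcterms:title", "A Hypothetical Garden Book"),          # made-up document
    ("doc:2", "dcterms:subject", "bisac:Gardening_General"),
]


def titles_for_identifier(identifier, triples):
    """Return sorted titles of documents whose subject concept has this identifier."""
    skos_concepts = {s for s, p, o in triples if p == "rdf:type" and o == "skos:Concept"}
    # Step 1: identifier -> concept (restricted to SKOS concepts, since
    # documents may carry their own dcterms:identifier values too).
    concepts = {
        s for s, p, o in triples
        if p == "dcterms:identifier" and o == identifier and s in skos_concepts
    }
    # Step 2: concept -> documents -> titles.
    docs = {s for s, p, o in triples if p == "dcterms:subject" and o in concepts}
    return sorted(o for s, p, o in triples if s in docs and p == "dcterms:title")


print(titles_for_identifier("CKB000000", triples))
```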

However, this is not really handy. What if I only want books about “cooking”? There is a way to do that:

Finding documents about “cooking”

That way, you will get all the “cooking” related concepts from the taxonomy, and you will find all the related books.
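
One plausible reading of this approach is a case-insensitive match on the concepts' labels, followed by the same join to documents. The Python sketch below illustrates that reading only; the use of skos:prefLabel for the match, the labels, the extra concepts and documents, and the function name titles_about are assumptions for the example rather than the content of the linked query.

```python
# "bisac:" is used as shorthand for http://purl.org/ontology/bibo/bisac#
triples = [
    ("bisac:Cooking_General", "skos:prefLabel", "Cooking / General"),
    ("bisac:Cooking_Baking", "skos:prefLabel", "Cooking / Baking"),        # made-up concept
    ("bisac:Gardening_General", "skos:prefLabel", "Gardening / General"),  # made-up concept
    ("doc:1", "dcterms:title", "How to Cook Revised Edition"),
    ("doc:1", "dcterms:subject", "bisac:Cooking_General"),
    ("doc:2", "dcterms:title", "A Hypothetical Bread Book"),
    ("doc:2", "dcterms:subject", "bisac:Cooking_Baking"),
    ("doc:3", "dcterms:title", "A Hypothetical Garden Book"),
    ("doc:3", "dcterms:subject", "bisac:Gardening_General"),
]


def titles_about(keyword, triples):
    """Return sorted titles of documents whose subject label contains `keyword`."""
    needle = keyword.lower()
    # Every concept whose preferred label mentions the keyword.
    concepts = {
        s for s, p, o in triples
        if p == "skos:prefLabel" and needle in o.lower()
    }
    # Every document linked to one of those concepts.
    docs = {s for s, p, o in triples if p == "dcterms:subject" and o in concepts}
    return sorted(o for s, p, o in triples if s in docs and p == "dcterms:title")


print(titles_about("cooking", triples))
```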

Note that there are many other ways to do this, such as browsing the graph of concepts using the skos:narrower and skos:broader properties from a given skos:Concept. However, the query above is simple and effective.
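
For comparison, the graph-browsing alternative can be sketched as a walk down the hierarchy from a starting concept, collecting every narrower concept and then the documents attached to any of them. The hierarchy, the concept names other than those derived from the BISAC URIs shown earlier, the documents and the function names below are hypothetical; the sketch treats skos:broader as the inverse of skos:narrower, which is how SKOS defines the two properties.

```python
# "bisac:" is used as shorthand for http://purl.org/ontology/bibo/bisac#
triples = [
    # A made-up slice of hierarchy, stated with both properties on purpose.
    ("bisac:Cooking", "skos:narrower", "bisac:Cooking_Regional_and_Ethnic"),
    ("bisac:Cooking_Regional_and_Ethnic_American", "skos:broader",
     "bisac:Cooking_Regional_and_Ethnic"),
    ("bisac:Gardening", "skos:narrower", "bisac:Gardening_Herbs"),
    ("doc:1", "dcterms:title", "How to Cook Revised Edition"),
    ("doc:1", "dcterms:subject", "bisac:Cooking_Regional_and_Ethnic_American"),
    ("doc:2", "dcterms:title", "A Hypothetical Herb Book"),
    ("doc:2", "dcterms:subject", "bisac:Gardening_Herbs"),
]


def narrower_closure(concept, triples):
    """Return `concept` plus every concept below it in the hierarchy."""
    seen = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if p == "skos:narrower" and s == current:
                child = o
            elif p == "skos:broader" and o == current:
                child = s
            else:
                continue
            if child not in seen:  # guards against cycles in the data
                seen.add(child)
                stack.append(child)
    return seen


def titles_under(concept, triples):
    """Return sorted titles of documents about `concept` or any narrower concept."""
    concepts = narrower_closure(concept, triples)
    docs = {s for s, p, o in triples if p == "dcterms:subject" and o in concepts}
    return sorted(o for s, p, o in triples if s in docs and p == "dcterms:title")


print(titles_under("bisac:Cooking", triples))
```

Unlike the label match, this walk does not depend on how the concepts are named, only on how they are linked.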

Other queries

Beyond these, you can create a full set of other simple and effective queries, such as searching for all the published books, all the published books by a given author or editor, etc.

There is no limit when all that information is available in RDF and BIBO.

More descriptions of the Open Library using BIBO

If you take a closer look at the current database schemas of the Open Library Project, you will notice that they have data about “series”, “notes”, and other things. I don’t have such an example at hand at the moment, but we have to keep in mind that we can easily describe them using BIBO as well.

Conclusion

I described how RDF and The Bibliographic Ontology could be used to describe data from The Open Library Project. Doing this would enable them to easily and effectively publish their data so that other people and applications could take advantage of it.

We also found that it is a powerful method that we can easily use to search complex graphs of relations created by such data described in RDF and BIBO.

Finally, having all this data available in BIBO will enable us to easily merge it with other document data sources such as Zotero or any other writings described using RDF and BIBO. As a final example, we could find all the documents that Raymond Sokolov contributed to, as an author, an editor, or in any other role. With a single query, one could find out that he wrote some published books and that he authored some posts on his blog. All that thanks to RDF, BIBO, SPARQL and all the data sources exporting their data using RDF and BIBO.

17 thoughts on “The Open Library in RDF using The Bibliographic Ontology”

  1. And what about using Z39.50 bibliographic information worldwide servers to syndicate / aggregate metadata?

    “All those -metadata- will be lost in time like tears in rain”…

    (Paraphrasing Replicant Roy)

    Mmmm, there must be a way to coordinate so many separated efforts, don´t you think so?

    Jorge Serrano-Cobos

  2. Hi Jorge,

    I never played with the Z39.50 gateways, however it would be a wonderful source of bibliographic data for sure 🙂

    [quote post=”833″]Mmmm, there must be a way to coordinate so many separated efforts, don´t you think so?[/quote]

    What if we could create an addon for these gateways such that they can export/publish this data in RDF using bibo?

    This is how these many separated efforts could be coordinated!

    In fact, think about it. You have the Open Library data available in RDF BIBO. Then you have all these gateways. Then you have all these people describing articles and papers, and books, etc. using RDF BIBO. Then you have ontologies such as SIOC that use BIBO as their basis and use its concepts as super-classes of their ontology classes.

    Then you end up with all these separated efforts coordinated using a commonly agreed vocabulary: BIBO.

    That way, you can search all these data sources at once, with the same query.

    It is how it will be done.

    What do you think?

    Take care,

    Fred

  3. On the Open Library site we have a copy of the LC class number outline. How hard/easy would it be to skos-ify this? And if we did so, how would it help us create applications around it? I’d like to be able to work with the hierarchy in a variety of points within the application — such as in facets.

    http://www.archive.org/details/LcClassificationA-z

  4. Hi Karen,

    Thanks for your comment on this thread, I was waiting for a first contact for some time now 🙂

    Okay, first, I have no knowledge of the LC Classification system. It seems much more complete than the BISAC subject system, probably because the two systems have completely different goals.

    After a quick look at the system and the file description of the system, I would say that it won’t be difficult to skos-ify. Some text file handling with a basic mapping.

    However, we should first think about how and why people would use these classification systems. In fact, Wordnet could be seen as such a classification system that could be used to relate a resource (a book, a document) to its subjects. Cyc (OpenCyc) could be a really good candidate as well. These systems have a clear hierarchy of concepts: classes and instances. So, I would see the LC Classification as yet another of these systems.

    What interests me here is the possibility to express all of the OpenLibrary in RDF using the bibliographic ontology. This would give an awesome foundation for future work in that direction (representation of books and documents in rdf).

    [quote post=”833″]And if we did so, how would it help us create applications around it?[/quote]

    Well sure. First, it could help to search documents depending on topics (subjects) and their other characteristics (title, author, creation date, etc, etc, etc).

    It could also help with browsing the “Document Data Space”. For example, say that you are browsing the rdf data of a document that is linked to the “History of Italy”. Then, with a single click, you could start browsing the “History of Italy” subject resource and have access to all documents linking to that subject resource. Documents could come from your personal library, the library of congress or any other website that publishes rdf data about documents and relates documents to this subject resource. This is all about the “linking data” portion of the semantic web (take a look at the new semweb logo: it suggests opening (and linking) data on the Web).

    Also, this could be used, as you suggested, within a faceted browsing interface. This could be used within Zotero to classify documents, etc.

    There is a lot of potential; we only have to do it now.

    Take care,

    Fred

    [quote post=”833″]However, we should first think about how and why people would use these classification systems.[/quote]

    The reason to work with LCC is that the bibliographic records created by libraries contain these classification numbers, since they are assigned to each book as a shelf location. However, we do not have the text that goes with the numbers in the records so there isn’t a good way to exploit them for searching. These class numbers are the only real hierarchical subjects in the records, so this is the only way to allow users to expand or narrow searches. The outline that we have is very simple; the full classification schedules are in about 20 large volumes and aren’t available in machine-readable form (only as printed text).

    [quote post=”833″]What interests me here is the possibility to express all of the OpenLibrary in RDF using the bibliographic ontology.[/quote]

    Hmmm. We have much to talk about, but let me say that there are different kinds of data that make up the bibliographic universe. Here are some:

    – description: this is where you describe the resource, its title, publisher, extent, format, etc. Much of this is not done using a controlled vocabulary, but there are controlled vocabularies for some aspects, such as language of text, type of musical composition, physical format, etc.

    – identification: this is where you identify creators and resources in a way that allows you to say: “this is the same as that”

    – access: this is where you assign classifications or other topical vocabularies to a resource. These can be controlled or not, but in libraries are almost always from a controlled vocabulary.

    The controlled vocabularies in these may be “skos” or “rdf” -able. The vocabulary for description has not been formalized, although there is a proposal to do so. And we are sorely lacking in identifiers for this area. My feeling is that we need to do some preliminary work before we can start porting our data to RDF. But I’m still game to start trying, even if we’ll need to make changes down the road.

    What’s the best way to get this going? Wiki?

  6. Hi Karen,

    [quote post=”833″]The reason to work with LCC is that the bibliographic records created by libraries contain these classification numbers, since they are assigned to each book as a shelf location.[/quote]

    So, this is really a library classification. In that case, it is probably not the best system to use when talking about relating a document resource to its subject (topic) resources.

    We had many discussions about how to describe the location (physical location of a document within a library, etc). I think the LCC would best fit this need.

    [quote post=”833″]However, we do not have the text that goes with the numbers in the records so there isn’t a good way to exploit them for searching.[/quote]

    Well, this is strictly an RDF engineering consideration. In fact, if you skos-ify these LCC categories, then each category will be a resource. This means that each will have a URI identifying the category resource, plus other properties. Two of these properties are (1) a dcterms:identifier that would be B9 (or whatever the code is) and (2) a skos:prefLabel that would be “History of Italy”.

    Mapping LCC into skos this way makes all LCC categories queryable on all of these properties.

    [quote post=”833″]These class numbers are the only real hierarchical subjects in the records, so this is the only way to allow users to expand or narrow searches.[/quote]

    Well, now the hierarchy would be made explicit using skos properties such as skos:broader, skos:narrower, etc.

    [quote post=”833″]20 large volumes and aren’t available in machine-readable form (only as printed text).[/quote]

    As long as there is some consistency in these text files it won’t be a problem.

    [quote post=”833″]- description: this is where you describe the resource, its title, publisher, extent, format, etc. Much of this is not done using a controlled vocabulary, but there are controlled vocabularies for some aspects, such as language of text, type of musical composition, physical format, etc.[/quote]

    I would say that this is the first goal of BIBO: describing documents and their relationships with other resources (entities), which can be other documents, authors, publishers, etc, etc, etc.

    [quote post=”833″]- identification: this is where you identify creators and resources in a way that allows you to say: “this is the same as that”[/quote]

    This is achieved through the inherent role of RDF.

    [quote post=”833″]- access: this is where you assign classifications or other topical vocabularies to a resource. These can be controlled or not, but in libraries are almost always from a controlled vocabulary.[/quote]

    Okay, and this is what we talked about with LCC, BISAC, etc. So, from my standpoint, it is only a matter of getting these controlled vocabularies, rdfizing them using some suitable ontologies, and then linking document resources to them.

    There are two main categories here: (1) localization (LCC) and (2) topical (BISAC and others).

    [quote post=”833″]And we are sorely lacking in identifiers for this area.[/quote]

    This is one of the big questions of the semantic web: how to identify things uniquely. There are always questions such as: what makes the best URIs to identify some classes and instances of these classes (City (class) and Paris (instance))? So, what is the URI for Paris the city? Some will tell you that it is wikipedia or dbpedia, others will tell you that it is geonames, etc.

    However, I think we are wrong if we think that there exists only one identifier per thing. I mean, it is unworkable. We can’t get a unique identifier for a given book.

    The idea here is to link identical resources together, so that I can surf from one representation of a thing (one URI) to another URI that describes the same thing.

    Given that, I think that you (OpenLibrary) should develop your own URI system, and create links to others (eventually, since you would be the first real source of document resources).

    The links would be created with a technique that we could call graph (rdf graph) intersection. So, knowing which relations (properties) are the same for two given graphs, and given some threshold, they are, or are not, the same entity. If so, we link them together using some property (the trend right now is to use owl:sameAs). If BIBO is mostly used to describe documents and other bibliographic things, this linkage will be much easier.

    [quote post=”833″]My feeling is that we need to do some preliminary work before we can start porting our data to RDF. But I’m still game to start trying, even if we’ll need to make changes down the road.[/quote]

    In fact, this would be great for both OpenLibrary and BIBO. OL would be a great usecase for BIBO; and both projects would benefit from these tests and experiences.

    [quote post=”833″]What’s the best way to get this going? Wiki?[/quote]

    Wiki could be good. Along with the BIBO mailing list.

    It really depends on what you want.

    Take care,

    Fred

  7. As long as there is some consistency in these text files it won’t be a problem.

    I think Karen’s point is it’s not available in electronic form at all (!).

    Karen, why is that? Is there some business reason for it that would preclude getting it in electronic form?

    Really, we need to move standard stuff like these subject headings, periodicals, etc., etc. to being on the semantic web. I should be able to do …


    <dc:subject rdf:resource="http://loc.gov/subject/1"/>

    … in some description, to ping that URI, and to get localized descriptions, links, etc.

  8. I couldn’t agree with you more about needing these resources to be available, but the classification schedules are available only in a Web service (http://www.loc.gov/cds/classweb/) that allows lookup of individual class numbers (and I’ve never seen this in action – minimum price is $375/yr for one person). I don’t see a way to purchase the whole file, although I assume it would be expensive. The individual hard copy books are $35-50 each and it looks like there are ~40 of them.

    There are various reasons why this is, but not the least of which is that as a US Federal agency LoC is required to do “cost recovery” for any service they provide to others. Some of us are trying to get grant money to start the process of converting some of the library data to a semantic web format since it seems obvious that LoC will not be able to do so on its own. It’s not been an easy sell to granters since they tend not to understand the underlying value of the semantic web formats. We’re still waiting to hear if we can proceed, but have had some disappointments.

    It would be ideal if we could start with data that we already have, such as the data that Simon captured. Even a not very sophisticated transform would allow us to work with the data in an experimental way. The catch, as always, is that we at least want the identifiers to be stable and somewhat durable. I agree with Fred, however, that we have to accept that there may be more than one identifier for a resource, especially over time, and that just means that we will have to manage equivalences and even “fuzzy” equivalences (“x” *might* be equivalent to “y”). That shouldn’t be a barrier.

    Would it help to have a brief description of the characteristics of, say, the subject or name authorities data in MARC21 format? Or is that already obvious?

  9. Hi Karen,

    [quote post=”833″]It would be ideal if we could start with data that we already have, such as the data that Simon captured. Even a not very sophisticated transform would allow us to work with the data in an experimental way.[/quote]

    All projects, even the most ambitious ones, have to start somewhere. Doing a first mapping between some of your data and bibo is certainly a good start. From there, we will be able to check whether the created graphs can be easily queried and browsed. For browsing purposes, we will create a new bibo template within the Zitgist Browser (browser.zitgist.com). That way, converted data should be easily browsable (this new template will come along with a new version of the browser that will be released soon).

    As for the queryability of the generated rdf graphs, we will be able to put triple stores with this information online for the community so that people can query the data and test things.

    [quote post=”833″]The catch, as always, is that we at least want the identifiers to be stable and somewhat durable. I agree with Fred, however, that we have to accept that there may be more than one identifier for a resource,[/quote]

    Well, are the openlibrary URLs unique? Then I think they make great URIs 🙂 From there, we will be able to dereference URIs once we are set up and the OpenLibrary data is available in rdf. This is a purely technical consideration.

    [quote post=”833″]especially over time, and that just means that we will have to manage equivalences and even “fuzzy” equivalences (“x” *might* be equivalent to “y”). That shouldn’t be a barrier.[/quote]

    Exactly. This is why I personally dislike the use of the owl:sameAs property for that task (using that property says that a resource is exactly the same as the other, given some set theory considerations). Fuzzy relations can be described using other techniques, but I think we shouldn’t think about that stuff for now. There are many other considerations on the table first.

    [quote post=”833″]Would it help to have a brief description of the characteristics of, say, the subject or name authorities data in MARC21 format? Or is that already obvious?[/quote]

    Well, it could certainly help. In fact, I think that bibo can already be used easily to map openlibrary’s schema to RDF. One easy way for me to proceed is to get a snapshot of the openlibrary db (PostgreSQL). I would load it on our servers and create an rdf view (an rdf view is a view that takes relational data and maps it, on the fly, into RDF according to some ontologies). The server used would be virtuoso (available as open source if you would like to try it). So, data migration and rdf view creation would be easy to do. This view enables us to query the relational data as if it were native RDF data. So you can send SPARQL queries, and such. And then, the data can easily be converted into native rdf (xml or n3).

    This could be an easy and somewhat quick way to start using an openlibrary (or whatever the document dataset) mapping into rdf according to bibo.

    Take care,

    Fred

  10. I haven’t read the whole thread here, but I’ve taken a stab at creating a SKOS version of the LC Classification Outline. It was a pain in the neck because it involved extracting information from the freely available PDFs, but it got done. If you don’t want to figure out how to run the code you can check out the SKOS RDF at a temporary location here.

    It still could use some work. Primarily, it needs to use proper URIs. Since I worked on this I’ve become an employee of the Library of Congress, and I’ve run into other people there who are interested in making data sets like LCCO/SKOS available to the public. So I think the reports that LC will be able to participate in the emerging web of linked data aka the semantic-web. At least that is my sincere hope, and plan 🙂

  11. Way cool Ed!

    So the obvious question is, what next? How are we going to get this up, with nice stable URIs? Is this something you are planning to host at the loc?

  12. Ah … nevermind; I guess you’re “working on it”?

  13. Well, yeah — in a large organization like LC it’s sometimes difficult to get things done quickly. But key people are sold on the idea, so it’s probably just a matter of getting things done ™.

    I realize there are already several discussion lists where we talk about these matters, but I wonder…would it make sense to start a semweb for libraries discussion list where we could talk about stuff like this? Or is bibont ok? I wasn’t sure if we could discuss issues that didn’t relate specifically to that ontology there.

  14. Ed,

    We do need a place for this discussion, because we are duplicating work ;-). There’s a tab-delimited version of the LC outline at:

    http://www.archive.org/details/LcClassificationA-z

    and someone on code4lib has created it in a database format (I don’t know if it differs at all from this).

    As you know, the real meat, and overwhelming complication of the “divide like” instructions, is in the full classification, which does not seem to be readily available in machine-readable form. I’m also not sure what we would do with that level of detail, so the next discussion I think has to be about how we see ourselves using this data. As I said, the Open Library is thinking of doing both faceting and browsing, although at this point we’re only thinking of using the upper two levels (mostly single and double letters). Those can’t be extracted algorithmically because there are some inconsistencies between notation and levels, but I created an un-authoritative version here:

    http://www.archive.org/details/LcClassificationLevels1-2

  15. Hi all,

    [quote post=”833″]Way cool Ed!

    So the obvious question is, what next? How are we going to get this up, with nice stable URIs? Is this something you are planning to host at the loc?[/quote]

    Bruce: I talked a lot about that with Ed over IRC. I think that Ed first has to get his hands on the data and then get the rights to publish it. Once he is able to do this, we should check if we can help him achieve his goal.

    In the meantime, many things can be done with bibo openlibrary.

    In fact, if I check the meta-data related to the openlibrary, I only see a BISAC subject field. This is why I proposed to describe bisac in rdf in this blog post.

    Once Ed is ready, we could eventually link the bisac structure to the LCSH one (normal linking of two datasets). And we could check with openlibrary if we could do something with that.

    [quote post=”833″]would it make sense to start a semweb for libraries discussion list where we could talk about stuff like this? Or is bibont ok? I wasn’t sure if we could discuss issues that didn’t relate specifically to that ontology there.[/quote]

    Well Ed, I think that the BIBO mailing list is there for this 🙂 In fact, bibo is not only a place to develop a new ontology for bibliographic things, but also a place to check for best practices and to develop such initiatives! 🙂

    So I would encourage you to start the conversation on the list!

    [quote post=”833″]As I said, the Open Library is thinking of doing both faceting and browsing, although at this point we’re only thinking of using the upper two levels (mostly single and double letters).[/quote]

    Karen: great! However, faceting and browsing are not only done over the subject properties of the document resources. In fact, I really think that the first step is to map the openlibrary in rdf according to bibo. Once it is done, many projects will be able to spawn from this initiative: browsing interfaces; faceted interfaces; linkage with other datasets; extension of subject backbones (like lcsh), etc.

    So, would you be willing to start this project?

    Take care,

    Fred

  16. [quote post=”833″]In fact, I really think that the first step is to map the openlibrary in rdf according to bibo. [/quote]

    Not yet. We’re working on a new record structure that will have many more fields. You can see the almost finished schema at http://demo.openlibrary.org:9021/file/tip/catalog/schema.py. Some of those fields are for internal processing only, however.

    Also, we are talking here about at least three different things. The files that Simon downloaded are subject authority records. Those do not represent books, they represent subject headings. Then there is the LC classification schedule, which is also subject-oriented, but entirely different from the subjects in the subject authority records. And these two are very different from the bibliographic data in OL or in library catalogs. Those might have subject headings from the subject heading file and classifications from the LC classification schedules.

    The upshot is that this is a multi-part activity, with some interlocking parts.

  17. Hi Karen,

    [quote post=”833″]Not yet. We’re working on a new record structure that will have many more fields. You can see the almost finished schema at[/quote]

    Nice! I just took a quick look at it and it seems that almost everything can be described using BIBO. Except for a couple of really specialized fields, everything is in place. In fact, the exercise we should do is getting some records using this new schema, and trying to describe them in rdf using bibo.

    One thing I was surprised to see is the “contributions” field. Are the “kinds” of contributions limited, or is it an open field where people can write “anything” (any terms)? These kinds of things will have to be sorted out. But all in all everything is looking really good.

    [quote post=”833″]The upshot is that this is a multi-part activity, with some interlocking parts.[/quote]

    Yeah sure. How we have to see that, I think, is as 3 sources of information; and we will be able to inter-link each of these sources together.

    When do you think we could get some records with this new schema? And when are you planning to implement this new schema in OpenLibrary?

    Thanks,

    Take care,

    Fred
