Sparql recipes for bibliographic data

One of the difficulties in searching RDF data is knowing what the data looks like. For instance, finding a book by its title means knowing something about what how a dataset has recorded the relationship between a book and its title. There is no real standard for publishing MARC/AACR2-style bibliographic data as RDF: it seems libraries publishing RDF are approaching this largely individually, although they are using many of the same vocabularies, dc, bibo, etc. This was one reason why I wanted to create Lodopac: to present some kind of interface so that searchers didn’t need to know these different models but could start to explore them. Below are the Sparql recipes for the different search criteria I used for the BNB and the Cambridge University Library datasets, so they can be compared, re-used, or corrected. All examples use prefixes, which are defined anew in each example. The examples are of course fragments and don’t have all the necessary SELECT and WHERE clauses.

By the way, for an excellent Sparql tutorial with ample opportunity to play as you go along, do have a look at the Cambridge University Library’s SPARQL tutorial. It also gives clues to the way their data is structured. Of use for the BNB is their data model (PDF), which is not nearly as scary as it looks at first, and incredibly helpful.

Author keyword search

This would be relatively straightforward-the unavoidable regular expression being the main complication- but for the fact that the traditional author/editor/etc of bibliographic records can be found in dc:creator as well as dc:contributor which necessitates a UNION. The BNB used foaf:name:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?book

WHERE {

{?book dc:creator ?author} UNION {?book dc:contributor ?author} .
?author foaf:name ?name .
FILTER regex(?name, “author”, “i”) .

}

Cambridge uses much the same recipe except that it uses rdfs:label instead of foaf:name:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?book

WHERE {

{?book dc:creator ?author} UNION {?book dc:contributor ?author} .
?author rdfs:label ?name .
FILTER regex(?name, “author”, “i”) .

}

Title keyword searches

This is more straightforward and is in fact the same for both the BNB and Cambridge University Library:

PREFIX dc: <http://purl.org/dc/terms/>

SELECT ?book

WHERE {

?book dc:title ?title .
FILTER regex(?title, “title”, “i”) .

}

Date of publication (year)

I imagined this one being simple and for Cambridge University Library it is. However the BNB took some unravelling as they have modelled publication as an event related to a book. The various elements of publication are then related to the event. So, for the BNB we have this:

PREFIX bibliographic: <http://data.bl.uk/schema/bibliographic#>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?book

WHERE  {

?book bibliographic:publication ?pub .
?pub event:time ?year .
?year rdfs:label “date” .

}

By contrast, Cambridge University Library has it in one line:

PREFIX dc: <http://purl.org/dc/terms/>

SELECT ?book

WHERE {

?book dc:created “date” .

}

ISBN

As an identifier, ISBN is relatively straightforward in both models, although care must be taken with the BNB as 10 and 13 digit ISBNs are treated as separate properties and the following assumes that the search will cover both:

PREFIX bibo: <http://purl.org/ontology/bibo/>

SELECT ?book

WHERE {

{?book bibo:isbn10 “isbn”} UNION {?book bibo:isbn13 “isbn”} .

}

For Cambridge University Library, also using the bibo ontology, this is:

PREFIX bibo: <http://purl.org/ontology/bibo/>

SELECT ?book

WHERE {

?book bibo:isbn “isbn” .

}

Conclusion

I didn’t set up to provide ground-breaking conclusions. However, it is remarkable how different data models can be formulated for modelling the same type of data by similar organisations. The real question is whether this is a good, a bad thing, or doesn’t really mattter. Will it need to be standardised? My understanding of how this works is probably not. I think the days of monolithic library standards are probably now gone. I wonder, for instance, if there ever will be a single MARC22 (or whatever you like to call it) and doubt RDA will ever completely replace AACR2 in the way we imagine. What will emerge I suspect will be a soup of various standards and data models, some of which will be more prevalent. One thing I picked up from various linked data talks is that information has frequently been published then re-used in ways that the issuers never imagined; if that is the case, the precise modelling and format is probably not as important as the fact that it is of good quality and intelligently put together. The BNB and Cambridge University Library models are clearly quite different but quite capable of being mapped and used despite this.

If there are any other bibliographic data Sparql endpoints, I would like to include them in a future version of the Lodopac search. Do let me know if you come across them.

More mundanely, do say if there are errors in my Sparql recipes or if there are ways they could be done more efficiently.