MRV MARC Record Viewer

I have finally completed a multiple record MARC Record Viewer. This has been rather long in the making but is essentially a quick and practical tool for looking at and assessing MARC records without having to load them into specialist software like MarcEdit or an LMS. It is much the same as the viewer built for my Codecademy project except that:

  • It reads multiple records in one file, rather than just one, and provides a count.
  • It has an input box so the records don’t have to be hard-coded into the script.

Some example .mrc records of varying lengths can be found here.

It is written in client-side Javascript, so you can view source and see how it works, copy it, and do what you like with it (although I would love to know if you do so). I quite defiantly haven’t used jQuery for this, which would probably have made the whole thing a bit easier; instead it uses proper old skool DOM scripting. It uses a minimal amount of CSS, in two files: a generic one, and one that roughly mimics how MARC records look in an Aleph editing screen. It should be fairly trivial to change this file to suit other purposes.
Thank you to those who have already had a shufti at earlier versions of this, especially on different browsers, and provided feedback! Please do let me know if you have any comments on this, suggestions for improvements, or if you come across errors. I have some ideas for improvements, mainly for making user input easier, and offering different formatting of results. I hope to start using jQuery for these too, and perhaps a later conversion of the whole thing would be in order.
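
For anyone curious how a viewer like this can pull records apart, here is a minimal sketch rather than the viewer's actual code. It relies on the standard MARC21/ISO 2709 layout: a 24-character leader, a directory of 12-character entries (tag, field length, start position), and the usual record, field, and subfield separators. The function names split_records and parse_record are made up for illustration, and the sketch assumes the file has been read as a plain string in which one character equals one byte, which will not hold for records containing multi-byte characters.

```javascript
// Standard MARC21 / ISO 2709 separators
var RECORD_END = "\x1d";
var FIELD_END = "\x1e";
var SUBFIELD = "\x1f";

// Split a file's contents into individual records (hypothetical helper)
function split_records(raw) {
  return raw.split(RECORD_END).filter(function (record) {
    return record.length > 0;
  });
}

// Read the leader and directory of one record (hypothetical helper)
function parse_record(record) {
  var leader = record.substring(0, 24);
  // Leader positions 12-16 hold the base address of the data
  var base = parseInt(leader.substring(12, 17), 10);
  // The directory runs from the end of the leader to the field
  // terminator that sits just before the base address
  var directory = record.substring(24, base - 1);
  var fields = [];
  for (var i = 0; i + 12 <= directory.length; i += 12) {
    var tag = directory.substring(i, i + 3);
    var length = parseInt(directory.substring(i + 3, i + 7), 10);
    var start = parseInt(directory.substring(i + 7, i + 12), 10);
    // Drop the field terminator at the end of each field
    var data = record.substring(base + start, base + start + length - 1);
    fields.push({ tag: tag, data: data });
  }
  return { leader: leader, fields: fields };
}
```

The record count is then just split_records(raw).length, and each field's data can be split again on the subfield delimiter to separate indicators from subfields.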

One record in lots of data formats

For a Dev8d session I did with Owen Stephens in February I presented data for a single book and followed how it had changed as standards changed, trying above all to explain to non-cataloguers why catalogue records look and work the way they do. At least one person found it useful. I am now drafting an internal session at work on the future of cataloguing and am planning to take a similar approach to briefly explain how we got to AACR2 and MARC21, and where we are heading. I took the example I used at Dev8d and hand-crafted some RDA examples, obtained a raw .mrc MARC21 file, and used the RDF from Worldcat to come up with a linked data example.

I have tried to avoid notes on the examples themselves. However, do note the following: the examples only generally use the same simple set of data elements, basically the bits you might find on a basic catalogue card: no subjects, few notes, etc.; the book is quite old so there is no ISBN anyway. The original index card is from our digitised card catalogue. The linked data example was compiled by copying the RDFa from the Worldcat page for the book; this was then put into this RDFa viewer (suggested by Manu Sporny) to extract the raw RDF/Turtle; I manually hacked this further to replace full URIs with prefixes as much as possible in an attempt to make it more readable (I suspect this is where some errors may have crept in). The example itself is of course a conversion from an AACR2/MARC21 record. C.M. Berners-Lee is Tim’s dad.

Feel free to use this and to point out mistakes. I would particularly welcome anyone spotting anything amiss in the RDA and linked data, where I am sure I have mangled the punctuation in both.

Harvard Citation

Berners-Lee, C.M. (ed.) 1965, Models For Decision: a Conference under the Auspices of the United Kingdom Automation Council Organised by the British Computer Society and the Operational Research Society, English Universities Press, London.

Pre-AACR2 on Index Card

BERNERS-LEE, C.M., [ed.].

Models for decision; a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society.

London, 1965.

x, 149p. illus. 22cm.

AACR2 on Index Card

Models for decision : a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society / edited by C.M. Berners-Lee. -- London : English Universities Press, 1965.

x, 149 p. : ill. ; 23 cm.

Includes bibliographical references.

-       Berners-Lee, C. M.

AACR2 in MARC21 (raw .mrc)

00788nam a2200181 a 4500001002700000005001700027008004100044024001500085245021000100260004900310300003200359504004100391650003300432700002300465710003900488710003000527710004900557_UCL01000000000000000477125_20061112120300.0_850710s1965    enka     b    000 0 eng  _8 _ax280050495_00_aModels for decision :_ba conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society /_cedited by C.M. Berners-Lee._  _aLondon :_bEnglish Universities Press,_c1965._  _ax, 149 p. :_bill. ;_c23 cm._  _aIncludes bibliographical references._ 0_aDecision making_vCongresses._1 _aBerners-Lee, C. M._2 _aUnited Kingdom Automation Council._2 _aBritish Computer Society._2 _aOperational Research Society (Great Britain)__

AACR2 in MARC21

245 00 $a Models for decision :
$b a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society /
$c edited by C.M. Berners-Lee.
260 __ $a London :
$b English Universities Press,
$c 1965.
300 __ $a x, 149 p. :
$b ill. ;
$c 23 cm.
504 __ $a Includes bibliographical references.
700 1_ $a Berners-Lee, C. M.

RDA

Title proper Models for decision
Other title information a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society
Statement of responsibility relating to title proper edited by C.M. Berners-Lee
Place of publication London
Publisher’s name The English Universities Press Limited
Date of publication 1965
Copyright date ©1965
Media type unmediated
Carrier type volume
Extent x, 149 pages
Dimensions 23 cm
Content type text
Illustrative content Illustrations
Supplementary content Includes bibliographical references.
Contributor Berners-Lee, C. M.
Relationship designator editor of compilation

RDA in MARC21

245 00 $a Models for decision :
$b a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society /
$c edited by C.M. Berners-Lee.
264 _1 $a London :
$b The English Universities Press Limited,
$c 1965.
264 _4 $c ©1965
300 __ $a x, 149 pages :
$b illustrations ;
$c 23 cm.
336 __ $a text
$2 rdacontent
337 __ $a unmediated
$2 rdamedia
338 __ $a volume
$2 rdacarrier
504 __ $a Includes bibliographical references.
700 1_ $a Berners-Lee, C. M.,
$e editor of compilation.

Linked data


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix worldcat: <http://www.worldcat.org/oclc/> .
@prefix library: <http://purl.org/library/> .
@prefix viaf: <http://viaf.org/viaf/> .
@prefix lc_authorities: <http://id.loc.gov/authorities/names/> .
@prefix madsrdf: <http://www.loc.gov/mads/rdf/v1#> .

worldcat:221944758
  rdf:type schema:Book;
  library:oclcnum "221944758";
  schema:name "Models for decision : a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society";
  library:placeOfPublication _:1;
  schema:publisher _:4;
  schema:datePublished "[1965]";
  schema:numberOfPages "149";
  schema:contributor viaf:149407214;
  schema:contributor viaf:130073090;
  schema:contributor viaf:137135158;
  schema:contributor viaf:36887201 .
_:1
  rdf:type schema:Place;
  schema:name "London :" .
_:4
  rdf:type schema:Organization;
  schema:name "English Universities Press" .
viaf:149407214
  rdf:type schema:Organization;
  madsrdf:isIdentifiedByAuthority lc_authorities:n79056431;
  schema:name "British Computer Society." .
viaf:130073090
  rdf:type schema:Organization;
  madsrdf:isIdentifiedByAuthority lc_authorities:n85076053;
  schema:name "Operational Research Society." .
viaf:137135158
  rdf:type schema:Organization;
  madsrdf:isIdentifiedByAuthority lc_authorities:n79063901;
  schema:name "Institution of Electrical Engineers." .
viaf:36887201
  rdf:type schema:Person;
  schema:name "Berners-Lee, C. M." .
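
The prefix substitution mentioned in the notes above was done by hand. As a rough sketch of how it could be scripted instead, the function below swaps a full URI for a prefixed name using a table of namespaces. The name compact_uri is invented for this illustration, the table only repeats some of the prefixes from the example, and the sketch makes no attempt to check that the local part is a legal Turtle name.

```javascript
// Some of the prefixes used in the Turtle example
var prefixes = {
  rdf: "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  schema: "http://schema.org/",
  worldcat: "http://www.worldcat.org/oclc/",
  viaf: "http://viaf.org/viaf/"
};

// compact_uri is a hypothetical helper: it picks the longest matching
// namespace and returns prefix:local, or <uri> if nothing matches
function compact_uri(uri, prefixes) {
  var best = null;
  for (var name in prefixes) {
    if (Object.prototype.hasOwnProperty.call(prefixes, name)) {
      var ns = prefixes[name];
      if (uri.indexOf(ns) === 0 &&
          (best === null || ns.length > prefixes[best].length)) {
        best = name;
      }
    }
  }
  if (best === null) {
    return "<" + uri + ">";
  }
  return best + ":" + uri.substring(prefixes[best].length);
}

var short_name = compact_uri("http://viaf.org/viaf/36887201", prefixes);
// short_name is "viaf:36887201"
```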

How big is my book: Mashcat session

At Mashcat on 5 July in Cambridge I gave an afternoon session on getting computer readable information from the textual information held in MARC21 300 fields using Javascript and regular expressions. I intended this to be useful for cataloguers who might have done some of Codecademy’s Code Year programme as well as an exploration of how data is entered into catalogue records, its problems, and potential solutions.

AACR2/MARC (and RDA) records store much quantitative information as text, usually as a number followed by units, e.g. “31 cm.” or “xi, 300 p”. This is not easy for computers to deal with. For instance, a computer programme cannot compare two sizes, e.g. “23 cm.” and “25 cm.”, without first extracting a number out of the string (23 and 25) as well as determining the units used (cm). In some cases, units might vary: in AACR2 books below 10 cm. are measured in mm., and non-book materials are often measured in inches (abbreviated to in.). Potential uses for better quantitative data in the 300$c include planning shelving for reclassification and more easily finding books by size or range.
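
A small illustration of the comparison problem, as a sketch only: compared as text, a 9 cm. book appears bigger than a 23 cm. one, because the comparison goes character by character. The helper leading_number is a made-up name, and it deliberately ignores units, which is only safe here because both values are in cm.

```javascript
var a = "9 cm.";
var b = "23 cm.";

// As strings, "9 cm." sorts after "23 cm." because "9" comes after "2"
var text_says_a_is_bigger = a > b; // true

// leading_number is a hypothetical helper: pull out the first number
function leading_number(text) {
  var found = text.match(/\d+(\.\d+)?/);
  return found ? parseFloat(found[0]) : NaN;
}

// As numbers, 9 is correctly smaller than 23
var numbers_say_a_is_bigger = leading_number(a) > leading_number(b); // false
```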

Before the session, I sketched out a possible solution using Javascript and regular expressions to make this conversion for dimensions in the 300$c. I have put up a version of a script to find the size of an item in mm. based on the 300$c, with the addition of an extra row which you can fill in to test your own examples without having to edit the script.

If you do want to look at how it works or try editing it yourself you can view source, copy all the HTML, then paste it into a text editor. Save it, then open the file using a browser to test it. Refresh the browser when you change the file.

The heart of the script looks like this:

var dollar_c = [
  "9 mm",
  "4 in.",
  "4 3/4 in.",
  "30 cm.",
  "1/2 in.",
  "20 x 40 cm."
];

// Convert text to mm
function text_to_mm (text) {
  var size, units, mm;
  // Convert mixed fractions (e.g. 4 3/4) to decimals
  text = text.replace(/(\d*) (\d)\/(\d)/, function(str, p1, p2, p3) {return parseFloat(p1) + p2/p3});
  // Convert simple fractions (e.g. 1/2) to decimals
  text = text.replace(/(\d)\/(\d)/, function(str, p1, p2) {return parseFloat(p1/p2)});
  // Extract the size of the book (the number at the start of the text)
  size = text.replace(/([\d\.]*).*/, "$1");
  // Extract the units (the last pair of lower case letters)
  units = text.replace(/.*([a-z]{2}).*/g, "$1");
  // Convert from various units to mm
  if (units === "mm") {
    mm = size;
  } else if (units === "cm") {
    mm = size * 10;
  } else if (units === "in") {
    mm = size * 25.4;
  }
  // Round down to a whole number of mm (NaN if the units were not recognised)
  mm = Math.floor(mm);
  return mm;
}

It starts with a declaration of an array of examples to be tested: you can alter this with your own if you prefer. text_to_mm is the function that does all the work. It takes in the text from a 300$c, converts fractions (e.g. 4 3/4) to decimals (4.75), finds a number, finds a unit, then performs calculations on the size depending on what the unit is to produce a standard figure in mm. At Mashcat, Owen Stephens managed to plug an adaptation of this script into Blacklight to create an index of book sizes. Using this he could do things like find the most common sizes or the largest book in a collection.
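
To give a flavour of what can be done once sizes are plain numbers, here is a small sketch that finds the largest and the most common size in a list of mm values. This is not Owen's Blacklight code; summarise_sizes is an invented name and the list of sizes is made up.

```javascript
// summarise_sizes is a hypothetical helper working on sizes already in mm
function summarise_sizes(sizes_mm) {
  var counts = {};
  var largest = null;
  var most_common = null;
  for (var i = 0; i < sizes_mm.length; i++) {
    var size = sizes_mm[i];
    // Tally how often each size occurs
    counts[size] = (counts[size] || 0) + 1;
    if (largest === null || size > largest) {
      largest = size;
    }
    if (most_common === null || counts[size] > counts[most_common]) {
      most_common = size;
    }
  }
  return { largest: largest, most_common: most_common, counts: counts };
}

// Made-up sizes, as if produced by running text_to_mm over a set of 300$c values
var summary = summarise_sizes([230, 250, 230, 300, 180, 230, 250]);
// summary.largest is 300 and summary.most_common is 230
```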

The main focus of my session, however, was on a similar script to figure out how many actual pages there are in a book, given the contents of a 300$a, e.g. “300 p.”, “ix, 350 p.”, “100 p., [45] leaves of plates”  (a page being one side of a sheet of paper; a leaf being a sheet of paper only printed on one side, so therefore counting as two pages). I have also published a version of A script to find the absolute no. of pages based on the 300$a with the similar addition of a row for easy user testing. Potential uses for recording page numbers rather than pagination include planning shelving space, easier to understand displays for users, and finding books of specified lengths.

The script starts with a similar array of examples to be tested:

// An array of test examples
var dollar_a = [
  "9 p.",
  "9p",
  "30 leaves",
  "30 p., 20 leaves",
  "xiv, 20 p.",
  "20, 30 p.",
  "20, 30, 40 p.",
  "xv, 20, 30, 40 p., 5, 5 leaves of plates",
  "clviii, ii, 4, vi p."
];

The main function is called text_to_pages. The first thing it does is convert any Roman numerals to Arabic ones. The heavy lifting for this is a function by Steven Levithan which does the actual number conversion. However, we still need to identify and extract the Roman numerals from the pagination in order to convert them. This line does the extraction and makes a list of the Roman numerals:

var roman_texts=text.match(/[ivxlc]*[, ]/g);

The session I gave concentrated on regular expressions (a bit like the wildcards you use on library databases but turned up to eleven) which in all cases here are contained within slashes, and I made a simple introductory guide to regular expressions (.docx). There are many guides to regular expressions on the web too, and useful testers to play with such as this one. The regular expression in the line above can be broken down as follows:

  • [ivxlc] uses square brackets to look for any one of the characters listed within them.
  • The following * means to look for any number of these in a row
  • [, ] any of a comma or a space, again using square brackets. Obviously these characters are not used in Roman numerals but they are a convenient method of isolating these characters as numbers rather than, say, the “l” in leaves which would also match otherwise.
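
It can help to see exactly what this expression returns for one of the test examples. Because * also allows zero characters, a space or comma on its own counts as a match too, so the list contains a couple of bare separators as well as the numeral. The second expression is only a suggested alternative, not part of the script: it insists on at least one numeral character and uses a lookahead so that the separator is checked but not captured.

```javascript
var text = "xiv, 20 p.";

// The expression from the script
var loose = text.match(/[ivxlc]*[, ]/g);
// loose is ["xiv,", " ", " "]

// A stricter alternative (an assumption, not the script's own expression)
var strict = text.match(/[ivxlc]+(?=[, ])/g);
// strict is ["xiv"]
```

The stricter version would still be fooled by any word ending in one of these letters before a space, so it is a sketch rather than a cure.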

The next few lines work through the list, replace any instances of [, ] with “” (i.e. nothing) to leave the bare Roman numerals, convert all the numbers in the list using Steven Levithan’s functions, then do the replacements on the pagination given in text:

if (roman_texts) {
  for (var i=0; i<roman_texts.length; i++) {
    // Remove the trailing comma or space
    roman_texts[i]=roman_texts[i].replace(/[, ]/,"");
    // Convert the Roman numeral to an Arabic number
    var arabic_text = deromanize(roman_texts[i])+" ";
    text = text.replace(roman_texts[i],arabic_text+" ");
  }
}
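
The deromanize function called above is not reproduced here. For readers who want something to experiment with, the following is a minimal stand-in of my own devising, not Steven Levithan's function: it adds up the letter values, subtracting any letter that is smaller than the one after it. The name simple_deromanize is invented, and unlike a proper converter it does no validation, so nonsense like “iiii” is happily read as 4.

```javascript
// simple_deromanize is a hypothetical stand-in for the real converter
function simple_deromanize(roman) {
  var values = { i: 1, v: 5, x: 10, l: 50, c: 100, d: 500, m: 1000 };
  var letters = roman.toLowerCase();
  var total = 0;
  for (var i = 0; i < letters.length; i++) {
    var current = values[letters.charAt(i)];
    var next = values[letters.charAt(i + 1)];
    if (current === undefined) {
      // Not a Roman numeral letter
      return NaN;
    }
    if (next !== undefined && current < next) {
      // A smaller value before a larger one is subtracted, e.g. the i in iv
      total -= current;
    } else {
      total += current;
    }
  }
  return total;
}

var fourteen = simple_deromanize("xiv");   // 14
var preface = simple_deromanize("clviii"); // 158
```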

Like the size script above, the rest of the conversion needs to do two things: find the numbers and find the units. To do this we need to find the sequences involved. While this is easy with something like “24 p.” (number is 24, unit is p) or even “xv leaves” (number is 15, unit is leaves), it becomes troublesome when you get something like “23, 100 p.”: the first number is 23 but there is no unit associated with it, only a comma to signify that it is a sequence at all. The following lines try to get round this problem by looking for sequences where the comma appears to be the unit and then looking ahead to find the next unit. In the “23, 100 p.” example the script would keep looking forward past the 100 until it gets to the “p”.

// Convert 20, 30 p. to 20 p. 30 p
  while (text.match(/\d*,/)) {
    text = text.replace(/(\d*),(.*?(p|leaves))/, "$1 $3 $2");
  }

The first regular expression in the while line looks for:

  • \d* any number of digits. \d is any digit and * looks for any number of them, followed by
  • , a comma

So as long as the script finds any sequences of numbers followed by a comma, it will carry on making the replacement underneath it. The replacement line itself looks for

  • \d* any number of digits again, followed by
  • , a comma
  • .*? which is . any character * any number of times. The ? makes sure that the smallest matching group of characters is matched; otherwise the expression will think that the units corresponding to the number 15 in the pagination “15, 25 p., 50 leaves” is “leaves” rather than “p”.
  • p|leaves either p or leaves. The pipe means either match on the left of it or the right of it. Because this is in a set of round brackets, the pipe only applies there, rather than the whole expression.

Brackets also capture subsets which is really useful here: the first set of () brackets captures the number of pages and stores it as $1, the second set captures everything between the comma and the end of the units as $2, the third set captures the units only, either “p” or “leaves”, and stores it as $3. So in the example “15, 25 p., 50 leaves”, $1 is “15”, $2 is “ 25 p”, and $3 is “p”. The replacement puts these back in a different order, i.e. “$1 $3 $2” which would be “15 p 25 p”.
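
The captured groups can be inspected directly with match, which is a handy way of checking a regular expression while you build it. The sketch below runs the replacement expression once over the example, and then repeats it without the ? to show what greedy matching would have captured instead.

```javascript
var text = "15, 25 p., 50 leaves";

// The expression from the script, with its non-greedy .*?
var groups = text.match(/(\d*),(.*?(p|leaves))/);
// groups[1] is "15", groups[2] is " 25 p", groups[3] is "p"

// One pass of the replacement
var swapped = text.replace(/(\d*),(.*?(p|leaves))/, "$1 $3 $2");
// swapped is "15 p  25 p., 50 leaves" (with a doubled space, which is harmless)

// Without the ? the match runs on to the last possible unit
var greedy = text.match(/(\d*),(.*(p|leaves))/);
// greedy[3] is "leaves"
```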

Now that all the sequences will be in number-unit pairs, we can get on with making a list of them to work through:

 // Find sequences
  var sequences = text.match(/\d+.*?(,|p|leaves)/g);

This looks for:

  • \d+ at least one digit
  • .*? any number of any characters, although not being greedy
  • (,|p|leaves) any of a comma, “p”, or “leaves”. Obviously, if the while loop above has worked, then the comma isn’t needed, but I’ll confess this is a hangover from a previous version of the script…

The next section goes through each of the sequences found and extracts the number and then the unit:

// go through sequences
  var pages = 0;
  for (var i=0; i<sequences.length; i++) {
    // Extract no
    var number = parseFloat(sequences[i].match(/\d+/g)[0]);
    var units = sequences[i].match(/(p|leaves)/g)[0];
    if (units == "p") {
      pages+=number;
    }
    if (units == "leaves") {
      pages+=number*2;
    }
  }

The regular expression to find the number is straightforward:

  • \d+ at least one digit

The parseFloat converts the digits as a string to a Javascript number. The regular expression to find the unit is also simple:

  • (p|leaves) either “p” or “leaves”

If the units are “p”, then the variable pages is incremented by the value of the number found; if “leaves”, then pages is incremented by twice that number.

The programme should cope with the loss of abbreviations in RDA as “p.” is expanded to “pages” but the regular expression to find the units will still find the “p” at the beginning much as it isn’t put off by the full stop after the “p”. It could be expanded to look for other variations and I will do so if I can:

  • “S.” for German “Seite” or “Seiten”.
  • “leaf”, as in “1 leaf of plates”
  • sequences which start in the middle of larger ones, like journal issues with “xii, p. 546-738”. This one will be the most complicated as it goes against the basic flow of the existing code.

I also haven’t properly tested folded sheets or multiple volume works. Other improvements are needed in failing more gracefully when it doesn’t find what it’s expecting: the programme should really test the existence of the arrays it makes before looping through them, but this would make it harder to understand at a glance or demonstrate on screen so I didn’t do it.
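
As a sketch of where some of these improvements might go, the version of the counting step below checks that match actually found something before looping, and widens the units to take in “leaf” and the German “S.”. It is an illustration rather than a tested replacement: count_pages is an invented name, and it assumes the Roman numerals and comma sequences have already been dealt with by the earlier steps.

```javascript
// count_pages is a hypothetical, more defensive version of the counting loop
function count_pages(text) {
  // "leaves" is listed before "leaf" and "p" so the longest unit wins
  var sequences = text.match(/\d+.*?(leaves|leaf|p|S)/g);
  if (!sequences) {
    // Nothing recognisable as a number followed by a unit
    return null;
  }
  var pages = 0;
  for (var i = 0; i < sequences.length; i++) {
    var number = parseInt(sequences[i], 10);
    var units = sequences[i].match(/leaves|leaf|p|S/)[0];
    if (units === "p" || units === "S") {
      pages += number;
    } else {
      // A leaf is counted as two pages, as in the original script
      pages += number * 2;
    }
  }
  return pages;
}

var total = count_pages("30 p., 20 leaves"); // 70
```

Returning null rather than 0 keeps “could not work it out” distinct from “no pages”.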

The scripts are written in Javascript for several reasons: it is the language that Codecademy focusses on for beginners; it requires no specialist environment, server, or even a web connection: you just need a basic text editor and a browser; it is easy to adapt for a web page if you do manage to build something; and, it is the language I am most confident working in. It would be fairly easy to port to other languages though, and Owen changed the size script with some other modifications to work in BeanShell/Java in Blacklight.

I can’t speak for the attendees, but I learnt a lot, and much was made more clear, from playing around with these scripts and talking to people at Mashcat:

  • Quite how dependent AACR2 and RDA (and consequently MARC21) are on textual information, even for what appears to be quantitative data.
  • That even for what appears to be standard number-unit data, there are too many complications that make it non-trivial to extract data:
    • fractions (not even decimals) in 300$c
    • differing units: book sizes in mm. or cm. depending on how big the book is; disc sizes in in.; extent in pages or leaves (or volumes or atlases or sheets…)
    • sequences with implied units, such as those with commas.
  • There is frequently a lack of clarity, and some ambiguity, about what is actually being measured:
    • for books the dimension recorded is normally height (although this is not explicit from a user’s point of view,  sometimes it’s height and width, and for a folded sheet it could be all sorts of things); for a disc it’s the diameter.
    • For the 300$a what’s being recorded is pagination, something entirely different from number of pages. Although important for things like rare books, how important is complete pagination for most users compared to a robust idea of how large a book is? Amazon provide a number of pages. More importantly, how understandable is pagination? During my demonstration, some of my audience of librarians were left cold by the meanings of square brackets for example (and square brackets can mean any number of things depending on context). Perhaps there is room for both.

I suppose this latter point is a potential conclusion. Ed Chamberlain asked me what I thought should be done. I don’t know to be honest. I think, like much of the catalogue record, lots more research is needed to see what users (both human and computer) actually want or need. It should be said that entering pagination is in many ways easier for the cataloguer. However, I do think we need:

  • quantitative data entered as numbers with clear and standard units. For instance, record all book heights as mm. and convert to cm. for display if needed.
  • more data elements to properly make clear what is being recorded. Instead of a generic dimension, we need height, width, depth?, diameter, etc. Instead of pagination, we could have separate elements for pagination, number of pages, and number of volumes (50 volumes each of 10 pages is not the same as 4 volumes of 1000 pages each). Obviously all of them wouldn’t be needed for all items.
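
To make this concrete, here is a sketch of what such a record might look like as data, with the display form generated from the numbers rather than typed in. All of the property names are invented for illustration and the height of 225 mm is a made-up value. The rounding up to the next whole centimetre follows the usual AACR2 practice for book heights.

```javascript
// A hypothetical record with separate, explicitly named elements
var item = {
  height_mm: 225,          // made-up measurement
  width_mm: null,          // not recorded for this item
  pagination: "x, 149 p.", // kept as text for those who need it
  number_of_pages: 159,    // 10 + 149
  number_of_volumes: 1
};

// display_height is a hypothetical helper: convert stored mm to cm for display
function display_height(item) {
  if (item.height_mm === null) {
    return "";
  }
  // Round up to the next whole centimetre
  return Math.ceil(item.height_mm / 10) + " cm";
}

var shown = display_height(item); // "23 cm"
```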

The research to enable us to choose what to record, why we’re recording it, and for whose benefit would be the best starting point for this as well as many other questions in cataloguing and metadata.

Cataloguing coding

I am not a trained programmer, coding is not part of my job description, and I have little direct access to cataloguing and metadata databases at work outside of normal catalogue editing and talking to the systems team, but I thought it might be worth making the point of how useful programming can be in all sorts of little ways. Of course, the most useful way is in gaining an awareness of how computers work, appreciating why some things might be more tricky than others for the systems team to implement, seeing why MARC21 is a bastard to do anything with even if editing it in a cataloguing module is not really that bad, and how the new world of FRDABRDF is going to be glued together. However, some more practical examples that I managed to cobble together include:

  • Customizing Classification Web with Greasemonkey. This is a couple of short scripts using Javascript, which is what the default Codecademy lessons use. Javascript is designed for browsers and is a good one to start with as you can do something powerful very quickly with a short script or even a couple of lines (think of all the 90s image rollovers). It’s also easy to have a go if you don’t have your own server, or even if you’re confined to your own PC.
  • Aleph-formatted country and language codes. I wrote a small PHP script to read the XML files for the MARC21 language and country codes and convert them into an up to date list of preferred codes in a format that Aleph can read, basically a text file which needs line breaks and spaces in the right places. It is easy to tweak or run again in the event of any minor changes. I don’t have this publicly available anywhere though. PHP is not the most elegant language but is relatively easy to dip into if you ever want to go beyond Javascript and do more fancy things, although it can be harder to get access to a server running PHP.
  • MARC21 .mrc file viewer. I occasionally need to quickly look at raw .mrc files to assess their quality and to figure out what batch changes we want to make before importing them into our catalogue. This is an attempt to create something that I could copy and paste snippets of .mrc files into for a quick look. It is written in PHP and is still under construction. There are other better tools for doing much the same thing to be honest, but coding this myself has had the advantages of forcing me to see how a MARC21 file is put together and realising how fiddly it can be. Try this with an .mrc which has some large 520 or 505 fields in it (there are some zipped ones here, to pick at random) and watch the indicators mysteriously degrade thereafter. I will get to the bottom of this…

The following examples are less useful for my own practical purposes but have been invaluable for learning about metadata and cataloguing, in particular, RDF/linked data. I was very interested in LD when I first heard about it. Being able to actually try something out with it (even if the results are not mind-blowing) rather than just read about it, has been very useful. Both are written in PHP and further details are available from the links:

Nothing to do with cataloguing, but what I am most proud of is this, written in Javascript: Cowthello. Let me know if you beat it.

Update: Shana McDanold also wrote an excellent post on why a cataloguer should learn to code with lots of practical examples.

Lodopac example searches

Yay, my entry for the Discovery & DevCSI Developers Competition- Lodopac- was awarded a commendation for its use of the Cambridge University Library (CUL) dataset. During the judging I was asked for searches which were known to work well- the timeout issues I discussed under Limitations being not insignificant, especially with author or title searches. I submitted a version of the following brief general notes which I hope are helpful to anyone else who wants to play:

The British National Bibliography (BNB) server is generally more responsive than the Cambridge University Library one; title seems to work better than author. The following are hopefully useful examples:

I would really like to try and think of ways of improving free text regular expression search times for things like author and title in Sparql* although I doubt there is one that doesn’t rely on the configuration, processing power, or indexing of the server being searched.

* thinking aloud, some ideas might include: downloading a larger imprecise set for further local searching (e.g. for an author/title search downloading the title matches and searching the authors locally: although this would also be slow, it would get round the timeout at least); forcing a look-up in a controlled vocab first in order to get an exact string match (esp for authors, although even if this is possible, this forces a user to do more work, which isn’t the point);  local indexing of the triple store (this is probably the best way but I’m not sure how to go about it, whether I really have the server capabilities to do it, and can be committed to the updating required).

Sparql recipes for bibliographic data

One of the difficulties in searching RDF data is knowing what the data looks like. For instance, finding a book by its title means knowing something about how a dataset has recorded the relationship between a book and its title. There is no real standard for publishing MARC/AACR2-style bibliographic data as RDF: it seems libraries publishing RDF are approaching this largely individually, although they are using many of the same vocabularies, dc, bibo, etc. This was one reason why I wanted to create Lodopac: to present some kind of interface so that searchers didn’t need to know these different models but could start to explore them. Below are the Sparql recipes for the different search criteria I used for the BNB and the Cambridge University Library datasets, so they can be compared, re-used, or corrected. All examples use prefixes, which are defined anew in each example. The examples are of course fragments and don’t have all the necessary SELECT and WHERE clauses.

By the way, for an excellent Sparql tutorial with ample opportunity to play as you go along, do have a look at the Cambridge University Library’s SPARQL tutorial. It also gives clues to the way their data is structured. Of use for the BNB is their data model (PDF), which is not nearly as scary as it looks at first, and incredibly helpful.

Author keyword search

This would be relatively straightforward (the unavoidable regular expression being the main complication) but for the fact that the traditional author/editor/etc of bibliographic records can be found in dc:creator as well as dc:contributor, which necessitates a UNION. The BNB uses foaf:name:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

{?book dc:creator ?author} UNION {?book dc:contributor ?author} .
?author foaf:name ?name .
FILTER regex(?name, "author", "i") .

Cambridge uses much the same recipe except that it uses rdfs:label instead of foaf:name:

PREFIX dc: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

{?book dc:creator ?author} UNION {?book dc:contributor ?author} .
?author rdfs:label ?name .
FILTER regex(?name, "author", "i") .

Title keyword searches

This is more straightforward and is in fact the same for both the BNB and Cambridge University Library:

PREFIX dc: <http://purl.org/dc/terms/>

?book dc:title ?title .
FILTER regex(?title, "title", "i") .

Date of publication (year)

I imagined this one being simple and for Cambridge University Library it is. However the BNB took some unravelling as they have modelled publication as an event related to a book. The various elements of publication are then related to the event. So, for the BNB we have this:

PREFIX bibliographic: <http://data.bl.uk/schema/bibliographic#>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

?book bibliographic:publication ?pub .
?pub event:time ?year .
?year rdfs:label "date" .

By contrast, Cambridge University Library has it in one line:

PREFIX dc: <http://purl.org/dc/terms/>

?book dc:created "date" .

ISBN

As an identifier, ISBN is relatively straightforward in both models, although care must be taken with the BNB as 10 and 13 digit ISBNs are treated as separate properties and the following assumes that the search will cover both:

PREFIX bibo: <http://purl.org/ontology/bibo/>

{?book bibo:isbn10 "isbn"} UNION {?book bibo:isbn13 "isbn"} .

For Cambridge University Library, also using the bibo ontology, this is:

PREFIX bibo: <http://purl.org/ontology/bibo/>

?book bibo:isbn "isbn" .
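
Since the recipes above are only fragments, here is a sketch of how one of them might be wrapped into a complete query and sent to an endpoint, using the title recipe as the example. The function names title_query and query_url are invented, and http://example.org/sparql is a placeholder rather than a real endpoint. Passing the query in a query parameter on the URL is the standard way of sending a Sparql query over HTTP GET.

```javascript
// title_query is a hypothetical helper that builds a full query
// around the title keyword recipe
function title_query(keyword, limit) {
  // Escape backslashes and double quotes so the keyword stays inside
  // its string literal (it is still treated as a regular expression)
  var safe = keyword.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return [
    "PREFIX dc: <http://purl.org/dc/terms/>",
    "SELECT ?book ?title",
    "WHERE {",
    "  ?book dc:title ?title .",
    '  FILTER regex(?title, "' + safe + '", "i") .',
    "}",
    "LIMIT " + limit
  ].join("\n");
}

// query_url is a hypothetical helper that builds the GET request URL
function query_url(endpoint, query) {
  return endpoint + "?query=" + encodeURIComponent(query);
}

// The endpoint address here is a placeholder, not a real service
var url = query_url("http://example.org/sparql", title_query("models for decision", 10));
```

A LIMIT is worth including when experimenting, given how slow the regular expression filters can be.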

Conclusion

I didn’t set out to provide ground-breaking conclusions. However, it is remarkable how different data models can be formulated for modelling the same type of data by similar organisations. The real question is whether this is a good thing, a bad thing, or doesn’t really matter. Will it need to be standardised? My understanding of how this works is probably not. I think the days of monolithic library standards are probably now gone. I wonder, for instance, if there ever will be a single MARC22 (or whatever you like to call it) and doubt RDA will ever completely replace AACR2 in the way we imagine. What will emerge I suspect will be a soup of various standards and data models, some of which will be more prevalent. One thing I picked up from various linked data talks is that information has frequently been published then re-used in ways that the issuers never imagined; if that is the case, the precise modelling and format is probably not as important as the fact that it is of good quality and intelligently put together. The BNB and Cambridge University Library models are clearly quite different but quite capable of being mapped and used despite this.

If there are any other bibliographic data Sparql endpoints, I would like to include them in a future version of the Lodopac search. Do let me know if you come across them.

More mundanely, do say if there are errors in my Sparql recipes or if there are ways they could be done more efficiently.

Lodopac : simple Linked Open Data OPAC

Lodopac is my entry to the UK Discovery Developer Competition. Aside from obvious mocking of the name, comments on Lodopac are very welcome. If anyone installs it locally, I’d also be very interested to know.

About Lodopac

Lodopac is a simple linked open data opac using Sparql to search remote bibliographic RDF data. By default it is set up to search the BNB and Cambridge University Library datasets, but is designed to allow easy setup of additional datasets with Sparql endpoints (see Installation, source code, and configuration below). It was written in response to the UK Discovery Developer Competition.

The purpose of Lodopac is to provide a simple standard OPAC-style interface to perform searches of various bibliographic RDF datasets without having to know how to formulate a Sparql query and without having to know the structure of the database. I hope it is especially useful for people wanting to get a grip on how bibliographic RDF is put together, what it looks like, and what a Sparql query looks like. For example, an author search is possible without knowing about dc:creator and dc:contributor, or how these need to be linked together in a Sparql search. Similarly, a searcher wouldn’t need to worry about how to construct date searches in different datasets. For the BNB and CUL, these are very different (three lines in the BNB, one for CUL), but in Lodopac there is only one search box to search both. Lodopac displays the Sparql query it constructed to perform the search, as well as the combined RDF for all results found in XML, JSON, N3, and TTL.

How to search Lodopac

Select one or more of the available datasets using the checkboxes.

Author and Title searches are free text phrase searches. In other words, a string you search for will match with any exact match, including spacing and punctuation, and in the middle of words. E.g. searching for “shake” will match “Shakespeare”, “milkshakes”, and “More hits than you can shake a stick at”. Searching is case insensitive. The following punctuation is removed from searches: \”‘<>$^%.

You are strongly advised to keep author and title searches simple: e.g. one word of a title or a surname only.
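The Author and Title behaviour described above (punctuation stripped, then a case-insensitive match anywhere in the string) can be sketched in JavaScript as follows. This is not Lodopac's actual code, which is PHP; the function names are made up for the example. I have read the list of stripped characters as backslash, double quote, single quote, <, >, $, ^ and %, and left the full stop alone because it is not clear whether it belongs to the list. The FILTER uses the standard Sparql regex() function with the "i" flag.

```javascript
// Hypothetical helper: remove the punctuation listed above from a search term.
function cleanSearchTerm(term) {
  return String(term).replace(/[\\"'<>$^%]/g, "").trim();
}

// Hypothetical helper: build a case-insensitive Sparql regex filter.
// "variable" is a Sparql variable name such as ?title (an assumed name).
function regexFilter(variable, term) {
  return 'FILTER regex(str(' + variable + '), "' +
    cleanSearchTerm(term) + '", "i")';
}

// The same matching rule applied locally, to show what a search would hit.
function wouldMatch(term, value) {
  var cleaned = cleanSearchTerm(term).toLowerCase();
  return String(value).toLowerCase().indexOf(cleaned) !== -1;
}

console.log(regexFilter("?title", 'sha"ke'));
// FILTER regex(str(?title), "shake", "i")
console.log(wouldMatch("shake", "Shakespeare")); // true
console.log(wouldMatch("shake", "milkshakes"));  // true
```

Because a regex filter like this has to be tested against every candidate string, it also shows why short, simple search terms are less likely to time out.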

ISBN searches accept 10 or 13 digit ISBNs. Any dashes or other non-digit characters are stripped from the search.
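A minimal JavaScript sketch of that ISBN clean-up, combined with the two ISBN recipes for the BNB and Cambridge University Library. The function names and the dataset keys "bnb" and "cul" are assumptions for the example. Note that stripping every non-digit, as described, would also remove the X check character that some 10 digit ISBNs end with; the sketch follows the description literally and does nothing special for that case.

```javascript
// Hypothetical helper: keep digits only, as described above.
function cleanIsbn(input) {
  return String(input).replace(/\D/g, "");
}

// Hypothetical helper: build the ISBN pattern for a dataset.
function isbnPattern(dataset, input) {
  var isbn = cleanIsbn(input);
  if (dataset === "bnb") {
    // BNB keeps 10 and 13 digit ISBNs in separate properties, so try both
    return '{?book bibo:isbn10 "' + isbn + '"} UNION ' +
      '{?book bibo:isbn13 "' + isbn + '"} .';
  }
  if (dataset === "cul") {
    return '?book bibo:isbn "' + isbn + '" .';
  }
  throw new Error("Unknown dataset: " + dataset);
}

console.log(cleanIsbn("978-0-14-143951-8")); // 9780141439518
console.log(isbnPattern("cul", "978-0-14-143951-8"));
```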

Date search will accept a year.

N.B. Keep searches as simple as possible, especially with author and title searches, to avoid them timing out. ISBN and date searches are generally quicker.

Limitations

A bad workman blames his tools and I’m no exception. The greatest limitation is the time taken by Sparql endpoints to perform a Sparql query, especially one that involves a regular expression, such as the Author and Title searches. What is needed is some more robust indexing or some cheat like Virtuoso’s unorthodox bif:contains, which the old version of the linked data BNB used. I touched on this in a blog post about the In Our Time Booklist script I wrote (see section 6).

The load and current capacity on the Sparql endpoints at the time the query is made is another important factor. A search which times out one minute can work fine the next.

The search options are obviously limited but do, I hope, represent the most common methods of searching normal library catalogues aside from, of course, a general keyword search. The manipulation of results is also rather sparse but allows click-through to the full data associated with a book, the structure and contents of which can then be explored more fully. The aggregation of RDF data in various formats is, I hope, useful illustratively as well as having potential for further manipulation.

Installation, source code, and configuration

The source code for Lodopac is available as a zip file, which contains all the necessary PHP, Javascript, and CSS files. In addition, you will need to install ARC2, which makes the Sparql queries and manipulates the resultant RDF. Edit the first line of lodopac.php so that it points at your local installation of ARC2.

The programme is basically one long script (there is only one page) but is split into several files for convenience of editing. The key file is lodopac.php, which includes the other files as it goes along. The core of the script, which builds the queries and does the searching, is all in lodopac.php.

I have attempted to make the script as easily configurable as possible so that additional Sparql endpoints can be added. There are probably more components hard-coded into the script that I have overlooked, but all the setup for the endpoints is in the file setup_endpoints.php. The first part of this file is a list of the prefixes needed for any possible query to any of the endpoints and, although not ideal, all these prefixes are sent with every Sparql query. Following that and the declaration of an array of the endpoints, each endpoint has a dedicated block with the information added to a hash. To add another endpoint, duplicate a block and configure the search recipes as appropriate. The keys marked “brief_” are used to fetch information for the brief results display. I have conspicuously chickened out of providing an author in the brief display, given the attendant main entry and multiple author headaches involved.
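To give a feel for the shape of that configuration, here is a loose JavaScript analogue. It is not the contents of setup_endpoints.php, which is PHP: the endpoint key, URL, field names, "%s" placeholder, and the buildQuery function are all invented for illustration. Only the general idea (shared prefixes, one block of search recipes per endpoint, and "brief_" keys for the brief display) comes from the description above.

```javascript
// Prefixes sent with every query, whichever endpoint is searched.
var PREFIXES = [
  "PREFIX bibo: <http://purl.org/ontology/bibo/>",
  "PREFIX dc: <http://purl.org/dc/terms/>"
].join("\n");

// One block per endpoint; duplicate and edit a block to add another.
var endpoints = {};
endpoints.example = {                       // hypothetical endpoint
  name: "Example Library",
  url: "http://example.org/sparql",         // placeholder URL
  isbn: '?book bibo:isbn "%s" .',           // search recipes
  date: '?book dc:created "%s" .',
  brief_title: "?book dc:title ?title ."    // used for the brief display
};

// Hypothetical: assemble a query from the prefixes and one recipe.
function buildQuery(endpointKey, searchType, value) {
  var endpoint = endpoints[endpointKey];
  if (!endpoint || !endpoint[searchType]) {
    throw new Error("No recipe for " + endpointKey + "/" + searchType);
  }
  var pattern = endpoint[searchType].replace("%s", function () {
    return value;                           // avoid special "$" handling
  });
  return PREFIXES + "\nSELECT ?book ?title WHERE {\n" +
    pattern + "\n" + endpoint.brief_title + "\n}";
}

console.log(buildQuery("example", "isbn", "9780141439518"));
```

Keeping the recipes as data rather than code is what makes adding an endpoint a matter of copying a block.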

Customizing Classification Web with Greasemonkey

Classification Web is ace, but there are a couple of things about the interface that annoy me and, in one colleague’s case, seriously put him off using it, in particular:

  • The opening of a new tab/window when you click on the MARC view for a subject or name.
  • The confusing menu. We don’t use LCC or DDC, and the browse options don’t really add much, so we only really need two options: Search LC Subject Headings and Search LC Name Headings.

I managed to work out a simple way of modifying how Classification Web works on Firefox using the Greasemonkey add-on and a couple of simple scripts, all of which is quick and easy to install:

  1. Install Greasemonkey: https://addons.mozilla.org/en-US/firefox/addon/greasemonkey/
  2. Make sure the monkey in the bottom-right corner is happy and colourful. Click on it if not.
  3. If you want to prevent the MARC view opening a new window, install the classweb_no_new_window script by going to http://www.aurochs.org/zlib/js/userjs/classweb_no_new_window.user.js then
  4. Click on the Install button
  5. If you want to reduce the main menu, install the classweb_prune_menu script by going to http://www.aurochs.org/zlib/js/userjs/classweb_prune_menu.user.js then
  6. Click on the Install button
  7. Reload/refresh Classweb if it’s still open and it should work.

If you want to turn Greasemonkey off altogether, click on the monkey so he’s sad and grey. If you want to stop individual scripts, right click on the monkey, click on Manage User Scripts, select a script from the list, and un-tick the Enabled box in the lower left corner.

These instructions were tested on Firefox 3.5.3 although I imagine they would be fine on any recent version of Firefox. I would be interested to hear anything confirming or undermining that assertion.
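For anyone curious what a user script of this kind looks like, here is a minimal sketch of the general technique. It is not the classweb_no_new_window script itself: it assumes the new window is caused by a target attribute on the links, which may not be how Classification Web actually does it, and the @include address is a placeholder.

```javascript
// ==UserScript==
// @name        no_new_window_sketch
// @include     http://example.org/*
// ==/UserScript==

// Remove the target attribute from each link so it opens in the same
// window. Returns the number of links changed.
function stripTargets(links) {
  var changed = 0;
  for (var i = 0; i < links.length; i++) {
    if (links[i].hasAttribute("target")) {
      links[i].removeAttribute("target");
      changed += 1;
    }
  }
  return changed;
}

// Only touch the page when actually running in a browser.
if (typeof document !== "undefined") {
  stripTargets(document.getElementsByTagName("a"));
}
```

The metadata block at the top is what Greasemonkey reads to decide which pages the script runs on.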

If you’re happy to play around, these scripts can be further altered. In particular, you can choose which menu items appear in the pruned menu script:

  1. Right click on the monkey
  2. Click on Manage User Scripts
  3. Select classweb_prune_menu from the list
  4. Click on Edit (you will probably have to select a text editor at this point)
  5. Edit the list of pages under the line var menu_items_to_keep = Array (. Enter each page you want to appear on the menu on a separate line in quotes, with a comma at the end of each line except the last line. The menu item must appear exactly as it does on the Classification Web menu, including capitals. E.g., the default set up looks like this:
    var menu_items_to_keep = Array ( // end each line with a comma except the last line
      "Search LC Subject Headings",
      "Search LC Name Headings"
    );
  6. Save the file, and reload Classification Web.
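The reason the menu items must be typed exactly, capitals included, is easiest to see in a sketch of the kind of comparison such a list implies. This is a guess at the general approach, not the actual classweb_prune_menu code; the function names are invented, and "Some Other Page" is a made-up label.

```javascript
var menu_items_to_keep = Array( // end each line with a comma except the last line
  "Search LC Subject Headings",
  "Search LC Name Headings"
);

// Hypothetical: an item is kept only if its label is in the list exactly.
function shouldKeep(label) {
  return menu_items_to_keep.indexOf(label) !== -1;
}

// Hypothetical: given all the menu labels, return the ones that survive.
function pruneLabels(labels) {
  return labels.filter(shouldKeep);
}

console.log(pruneLabels([
  "Search LC Subject Headings",
  "search lc name headings",   // wrong capitals, so it is dropped
  "Some Other Page"            // not in the list, so it is dropped
]));
// [ 'Search LC Subject Headings' ]
```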

If anyone else finds this useful or can think of more customizations, let me know.