Archives

Categories

Cataloguing coding

I am not a trained programmer, coding is not part of my job description, and I have little direct access to cataloguing and metadata databases at work outside of normal catalogue editing and talking to the systems team, but I thought it might be worth making the point of how useful programming can be in all sorts of little ways. Of course, the most useful way is in gaining an awareness of how computers work, appreciating why some things might be more tricky than others for the systems team to implement, seeing why MARC21 is a bastard to do anything with even if editing it in a cataloguing module is not really that bad, and how the new world of FRDABRDF is going to be glued together. However, some more practical examples that I managed to cobble together include:

  • Customizing Classification Web with Greasemonkey. This is a couple of short scripts using Javascript, which is what the default Codeacademy lessons use. Javascript is designed for browers and is a good one to start with as you can do something powerful very quickly with a short script or even a couple of lines (think of all the 90s image rollovers). It’s also easy to have a go if you don’t have your own server, or even if you’re confined to your own PC.
  • Aleph-formatted country and language codes. I wrote a small PHP script to read the XML files for the MARC21 language and country codes and convert them into an up to date list of preferred codes in a format that Aleph can read, basically a text file which needs line breaks and spaces in the right places. It is easy to tweak or run again in the event of any minor changes. I don’t have this publicly available anywhere though. PHP is not the most elegant language but is relatively easy to dip into if you ever want to go beyond Javascript and do more fancy things, although it can be harder to get access to a server running PHP.
  • MARC21 .mrc file viewer. I occasionally need to quickly look at raw .mrc files to assess their quality and to figure out what batch changes we want to make before importing them into our catalogue. This is an attempt to create something that I could copy and paste snippets of .mrc files into for a quick look. It is written in PHP and is still under construction. There are other better tools for doing much the same thing to be honest, but coding this myself has had the advantages of forcing me to see how a MARC21 file is put together and realising how fiddly it can be. Try this with an .mrc which has some large 520 or 505 fields in it (there are some zipped ones here, to pick at random) and watch the indicators mysteriously degrade thereafter. I will get to the bottom of this…

The following examples are less useful for my own practical purposes but have been invaluable for learning about metadata and cataloguing, in particular, RDF/linked data. I was very interested in LD when I first heard about it. Being able to actually try something out with it (even if the results are not mind-blowing) rather than just read about it, has been very useful. Both are written in PHP and further details are available from the links:

Nothing to do with cataloguing, but what I am most proud of is this, written in Javascript: Cowthello. Let me know if you beat it.

Update: Shana McDanold also wrote an excellent post on why a cataloguer should learn to code with lots of practical examples.

In Our Time booklist

I have written a script which takes an unstructured reading list on the BBC’s In Our Time website, searches the British National Bibliography (BNB) using bibliographica for the books on the list, and returns structured metadata for the records it found.

This script was written in response to an idea raised by psychemedia for the Open Bibliographic Data Challenge: the BBC “In Our Time” Reading List:

The BBC “In Our Time” radio programme publishes suggested recommending reading in the programme data in an unstructured and citation style way: author, title (publisher, year), with what looks to be conventional character string separators between references (at least on the pages I looked at).

The idea is to extract and link suggested readings for the In Our time programmes to open, structured bibliographic data. This would make the In Our Time archive even more useful as an informal (open-ish) educational resource, especially as academic libraries start to release data relating to books used on courses. (So for example, this approach might help provide a link from a course to a relevant In Our Time broadcast via a common book.)

I was drawn to this idea as I like the idea of turning unstructured data into structured data: I have for example had some previous fun converting HTML pages into RSS feeds (e.g. CILIP Lisjobnet, Big Brother). I think something similar for any reading list (e.g. a Word document produced by a lecturer) would be an interesting idea.

The programme is written in PHP and is designed to be fired from a Javascript bookmarklet from a page on the In Our Time site, or by appending the In Our Time URL to the end of the URL for this page: http://www.aurochs.org/inourtime_booklist/inourtime_booklist_v1.php?. For example, to use it on the page for The Mexican Revolution (which I used a lot in testing), add the URL http://www.bbc.co.uk/programmes/b00xhz8d to produce http://www.aurochs.org/inourtime_booklist/inourtime_booklist_v1.php?http://www.bbc.co.uk/programmes/b00xhz8d.

The script follows the following steps:

  1. Set up ARC2 to enable SPARQL searching and RDF processing
  2. Extract Further Reading section
  3. Separate out Raw Data for each book
  4. Determine pattern used in citation then extract Basic Data, e.g. author, title, article, publication, using regular expressions
  5. Further refine elements to make searching easier, i.e. one surname for author, only title proper for titles
  6. Construct a SPARQL Query using author surname and title regular expressions pre-filtered for speed by a significant word using bif:contains
  7. Filter Hits by date of publication
  8. Obtain and display metadata from BNB
More details of these steps are below:

1. Set up ARC2 to enable SPARQL searching and RDF processing

ARC2 is a simple-to-use system for using RDF and SPARQL in PHP. I had previously played with it here when experimenting with creating my own RDF. The Sandy site uses SPARQL to populate the See Also sections.

2. Extract Further Reading section

A simple regular expression identifies the div in the HTML code that contains the reading list, which enables the next stage of the script to look for individual books.

3. Separate out Raw Data for each book

Another regular expression pulls out the paragraphs containing books and puts them in an array. You can see this by viewing the Raw Data.
4. Determine pattern used in citation then extract Basic Data, e.g. author, title, article, publication using regular expressions

As the In Our Time site does not use a single standard form of citation, the script has to try and determine which of several possible patterns a citation is using with regular expressions and extract the correct bits of data. This only works as well as it is possible to identify all the patterns, which effectively means looking at as many In Our Time pages as possible. This is one area that would certainly reward more work. It also points out how difficult it would be to extrapolate this into a script  that could read any citations. The In Our Time booklist currently uses five citations each identified with a number (1, 2, 3, 4, 15, 5). If you look at the Book Data for a particular book you will see the citation style number given. The regular expressions capture author, title, and publication.
The author information in citations on In Our Time is  unpredictable. Sometimes surnames are first, sometimes they are last.  The citation patterns take care of this if possible and try to extract one significant name.
5. Further refine elements to make searching easier, i.e. one surname for author, only title proper for titles.

The script removes things like “(ed.)” from the author, which would obviously throw off a catalogue search. Subtitles- everything after and including semi-colons- are also removed from titles to lessen any chance of variation and lost matches.

6. Construct a SPARQL Query using author surname and title regular expressions pre-filtered for speed by a significant word using bif:contains

Constructing the SPARQL query was the most tricky part. Ignoring the various standard prefixes pilfered from the standard example, the most important part is the title search. There are three unsatisfactory options:
  • Match the extracted title directly to a dc:title. This doesn’t work as the cited title is unlikely to be exactly the same in all matters of words included, spacing, punctuation, etc.
  • Use bif:contains for keyword searching as used in the BNB SPARQL example. This is certainly quick, but has a number of drawbacks: it can only be used once for a single keyword (any one of the two significant words in The history of Mexico, for example, will produce a huge number of hits). It is also not standard SPARQL. I was happy to overlook this if it worked, but ARC2 didn’t like it at all until I worked out it has to be used in angle brackets e.g. ?title <bif:contains> “Mexico”.
  • Use regular expressions (e.g. FILTER regex (?title, “The History of Mexico”, “i”)) for keyword searching. This is extremely powerful: you can easily construct searches but it is so slow as to routinely time out, so rendering it effectively useless.
The In Our Time script uses a combination of the last two techniques to get a result. First, it finds a significant word in the title, ideally the first four letter word after the first word (i.e. to avoid “The”, “A”, “That”, etc.) or, failing that, the first word. The SPARQL query then uses bif:contains to search for that word. The query then does a regular expression filter on the whole title. I don’t know if this is how SPARQL endpoints would generally work, but the BNB appears to only look for regular expression matches on the records already filtered by the bif:contains. In any case, it works.
In addition, the script also uses a regular expression to search by the author’s surname. It doesn’t search by date as the date of publication on BNB (dc:issued) is not in a standard format (e.g. “1994-01-01 00:00:00″, “2005 printing”, “c2006″). It is also not keyword searchable. You can see all the author-title hits with links to BNB records by viewing Hits.
7. Filter Hits by date of publication

You can, however, retrieve the date from the BNB and process it afterwards, which is what the script does. It finds the four digit year and compares this to the four digit year it found on the In Our Time site. You can see all the author-title-date hits with links to BNB records by viewing Date Hits. Perhaps rather arbitrarily, the first book in the resulting array is selected as the result.

8. Obtain and display metadata from BNB

When the search took place, the matching title, author (only one), and date is obtained from BNB. This title and author are displayed, as are the stripped down year of publication, and a link to the full BNB record. For records that returned no hits on the BNB, the Basic Data is simply regurgitated.
The script also downloads the full combined RDF for all the hits is displayed at the bottom of the page, viewable in a several formats.
Further work

I think a lot more work could be done on this given time, both to improve it and to extend it. In no particular order:
  • Make it more pretty. It is currently designed to look merely acceptable while I concentrated on functionality. I have also tried to show much of my working, which a finished version would obviously hide.
  • Refinement of the detection of citation style. This is probably the most critical improvement, and ultimately decides if this approach would be useful outside of In Our Time on other reading lists. There are more patterns that need to be added, especially for older pages.
  • Further preparation of data for searching. Currently, for example, a book on the Mexico reading list doesn’t return any hits because of the exclamation mark in “Zapata!”. This could be stripped, and there are lots of similar refinements no doubt.
  • More interesting/useful output. The script’s outputs are currently quite raw or basic as I concentrated on the mechanisms for pulling information from free text for automatic catalogue searching. It might be useful to output proper standalone RDF files, references in standard reference formats (e.g. Harvard) in HTML or text files, files in standard reference management formats, perhaps even MARC, and so on. Some of these would perhaps be fairly straightforward.
  • Links to catalogues or online bookshops so you could borrow or buy the books from the reading list based on ISBNs taken from the BNB record.
  • Searching more catalogues. If a search fails on the BNB, the script could search other open catalogues, e.g. the Cambridge catalogue.
  • Greasemonkey script or plugin so that a button appears next to the Further Reading section when you view an In Our Time page. This could even appear next to individual books. Ideally (pie in the sky) such a plugin would have a stab at finding books on any web page.
  • Other ways of firing the script not requiring manual addition to the URL or use of a bookmarklet, e.g. a searchbox of some kind (either accepting a URL as input or keywords for the titles of broadcasts).

Please do leave comments or questions.