Archives

Categories

In Our Time booklist

I have written a script which takes an unstructured reading list on the BBC’s In Our Time website, searches the British National Bibliography (BNB) using bibliographica for the books on the list, and returns structured metadata for the records it found.

This script was written in response to an idea raised by psychemedia for the Open Bibliographic Data Challenge: the BBC “In Our Time” Reading List:

The BBC “In Our Time” radio programme publishes suggested recommending reading in the programme data in an unstructured and citation style way: author, title (publisher, year), with what looks to be conventional character string separators between references (at least on the pages I looked at).

The idea is to extract and link suggested readings for the In Our time programmes to open, structured bibliographic data. This would make the In Our Time archive even more useful as an informal (open-ish) educational resource, especially as academic libraries start to release data relating to books used on courses. (So for example, this approach might help provide a link from a course to a relevant In Our Time broadcast via a common book.)

I was drawn to this idea as I like the idea of turning unstructured data into structured data: I have for example had some previous fun converting HTML pages into RSS feeds (e.g. CILIP Lisjobnet, Big Brother). I think something similar for any reading list (e.g. a Word document produced by a lecturer) would be an interesting idea.

The programme is written in PHP and is designed to be fired from a Javascript bookmarklet from a page on the In Our Time site, or by appending the In Our Time URL to the end of the URL for this page: http://www.aurochs.org/inourtime_booklist/inourtime_booklist_v1.php?. For example, to use it on the page for The Mexican Revolution (which I used a lot in testing), add the URL http://www.bbc.co.uk/programmes/b00xhz8d to produce http://www.aurochs.org/inourtime_booklist/inourtime_booklist_v1.php?http://www.bbc.co.uk/programmes/b00xhz8d.

The script follows the following steps:

  1. Set up ARC2 to enable SPARQL searching and RDF processing
  2. Extract Further Reading section
  3. Separate out Raw Data for each book
  4. Determine pattern used in citation then extract Basic Data, e.g. author, title, article, publication, using regular expressions
  5. Further refine elements to make searching easier, i.e. one surname for author, only title proper for titles
  6. Construct a SPARQL Query using author surname and title regular expressions pre-filtered for speed by a significant word using bif:contains
  7. Filter Hits by date of publication
  8. Obtain and display metadata from BNB
More details of these steps are below:

1. Set up ARC2 to enable SPARQL searching and RDF processing

ARC2 is a simple-to-use system for using RDF and SPARQL in PHP. I had previously played with it here when experimenting with creating my own RDF. The Sandy site uses SPARQL to populate the See Also sections.

2. Extract Further Reading section

A simple regular expression identifies the div in the HTML code that contains the reading list, which enables the next stage of the script to look for individual books.

3. Separate out Raw Data for each book

Another regular expression pulls out the paragraphs containing books and puts them in an array. You can see this by viewing the Raw Data.
4. Determine pattern used in citation then extract Basic Data, e.g. author, title, article, publication using regular expressions

As the In Our Time site does not use a single standard form of citation, the script has to try and determine which of several possible patterns a citation is using with regular expressions and extract the correct bits of data. This only works as well as it is possible to identify all the patterns, which effectively means looking at as many In Our Time pages as possible. This is one area that would certainly reward more work. It also points out how difficult it would be to extrapolate this into a script  that could read any citations. The In Our Time booklist currently uses five citations each identified with a number (1, 2, 3, 4, 15, 5). If you look at the Book Data for a particular book you will see the citation style number given. The regular expressions capture author, title, and publication.
The author information in citations on In Our Time is  unpredictable. Sometimes surnames are first, sometimes they are last.  The citation patterns take care of this if possible and try to extract one significant name.
5. Further refine elements to make searching easier, i.e. one surname for author, only title proper for titles.

The script removes things like “(ed.)” from the author, which would obviously throw off a catalogue search. Subtitles- everything after and including semi-colons- are also removed from titles to lessen any chance of variation and lost matches.

6. Construct a SPARQL Query using author surname and title regular expressions pre-filtered for speed by a significant word using bif:contains

Constructing the SPARQL query was the most tricky part. Ignoring the various standard prefixes pilfered from the standard example, the most important part is the title search. There are three unsatisfactory options:
  • Match the extracted title directly to a dc:title. This doesn’t work as the cited title is unlikely to be exactly the same in all matters of words included, spacing, punctuation, etc.
  • Use bif:contains for keyword searching as used in the BNB SPARQL example. This is certainly quick, but has a number of drawbacks: it can only be used once for a single keyword (any one of the two significant words in The history of Mexico, for example, will produce a huge number of hits). It is also not standard SPARQL. I was happy to overlook this if it worked, but ARC2 didn’t like it at all until I worked out it has to be used in angle brackets e.g. ?title <bif:contains> “Mexico”.
  • Use regular expressions (e.g. FILTER regex (?title, “The History of Mexico”, “i”)) for keyword searching. This is extremely powerful: you can easily construct searches but it is so slow as to routinely time out, so rendering it effectively useless.
The In Our Time script uses a combination of the last two techniques to get a result. First, it finds a significant word in the title, ideally the first four letter word after the first word (i.e. to avoid “The”, “A”, “That”, etc.) or, failing that, the first word. The SPARQL query then uses bif:contains to search for that word. The query then does a regular expression filter on the whole title. I don’t know if this is how SPARQL endpoints would generally work, but the BNB appears to only look for regular expression matches on the records already filtered by the bif:contains. In any case, it works.
In addition, the script also uses a regular expression to search by the author’s surname. It doesn’t search by date as the date of publication on BNB (dc:issued) is not in a standard format (e.g. “1994-01-01 00:00:00″, “2005 printing”, “c2006″). It is also not keyword searchable. You can see all the author-title hits with links to BNB records by viewing Hits.
7. Filter Hits by date of publication

You can, however, retrieve the date from the BNB and process it afterwards, which is what the script does. It finds the four digit year and compares this to the four digit year it found on the In Our Time site. You can see all the author-title-date hits with links to BNB records by viewing Date Hits. Perhaps rather arbitrarily, the first book in the resulting array is selected as the result.

8. Obtain and display metadata from BNB

When the search took place, the matching title, author (only one), and date is obtained from BNB. This title and author are displayed, as are the stripped down year of publication, and a link to the full BNB record. For records that returned no hits on the BNB, the Basic Data is simply regurgitated.
The script also downloads the full combined RDF for all the hits is displayed at the bottom of the page, viewable in a several formats.
Further work

I think a lot more work could be done on this given time, both to improve it and to extend it. In no particular order:
  • Make it more pretty. It is currently designed to look merely acceptable while I concentrated on functionality. I have also tried to show much of my working, which a finished version would obviously hide.
  • Refinement of the detection of citation style. This is probably the most critical improvement, and ultimately decides if this approach would be useful outside of In Our Time on other reading lists. There are more patterns that need to be added, especially for older pages.
  • Further preparation of data for searching. Currently, for example, a book on the Mexico reading list doesn’t return any hits because of the exclamation mark in “Zapata!”. This could be stripped, and there are lots of similar refinements no doubt.
  • More interesting/useful output. The script’s outputs are currently quite raw or basic as I concentrated on the mechanisms for pulling information from free text for automatic catalogue searching. It might be useful to output proper standalone RDF files, references in standard reference formats (e.g. Harvard) in HTML or text files, files in standard reference management formats, perhaps even MARC, and so on. Some of these would perhaps be fairly straightforward.
  • Links to catalogues or online bookshops so you could borrow or buy the books from the reading list based on ISBNs taken from the BNB record.
  • Searching more catalogues. If a search fails on the BNB, the script could search other open catalogues, e.g. the Cambridge catalogue.
  • Greasemonkey script or plugin so that a button appears next to the Further Reading section when you view an In Our Time page. This could even appear next to individual books. Ideally (pie in the sky) such a plugin would have a stab at finding books on any web page.
  • Other ways of firing the script not requiring manual addition to the URL or use of a bookmarklet, e.g. a searchbox of some kind (either accepting a URL as input or keywords for the titles of broadcasts).

Please do leave comments or questions.