I have written a script which takes an unstructured reading list on the BBC’s In Our Time website, searches the British National Bibliography (BNB) using bibliographica for the books on the list, and returns structured metadata for the records it found.
The BBC “In Our Time” radio programme publishes suggested recommending reading in the programme data in an unstructured and citation style way: author, title (publisher, year), with what looks to be conventional character string separators between references (at least on the pages I looked at).
The idea is to extract and link suggested readings for the In Our time programmes to open, structured bibliographic data. This would make the In Our Time archive even more useful as an informal (open-ish) educational resource, especially as academic libraries start to release data relating to books used on courses. (So for example, this approach might help provide a link from a course to a relevant In Our Time broadcast via a common book.)
I was drawn to this idea as I like the idea of turning unstructured data into structured data: I have for example had some previous fun converting HTML pages into RSS feeds (e.g. CILIP Lisjobnet, Big Brother). I think something similar for any reading list (e.g. a Word document produced by a lecturer) would be an interesting idea.
The script follows the following steps:
- Set up ARC2 to enable SPARQL searching and RDF processing
- Extract Further Reading section
- Separate out Raw Data for each book
- Determine pattern used in citation then extract Basic Data, e.g. author, title, article, publication, using regular expressions
- Further refine elements to make searching easier, i.e. one surname for author, only title proper for titles
- Construct a SPARQL Query using author surname and title regular expressions pre-filtered for speed by a significant word using bif:contains
- Filter Hits by date of publication
- Obtain and display metadata from BNB
1. Set up ARC2 to enable SPARQL searching and RDF processing
ARC2 is a simple-to-use system for using RDF and SPARQL in PHP. I had previously played with it here when experimenting with creating my own RDF. The Sandy site uses SPARQL to populate the See Also sections.
2. Extract Further Reading section
A simple regular expression identifies the div in the HTML code that contains the reading list, which enables the next stage of the script to look for individual books.
The script removes things like “(ed.)” from the author, which would obviously throw off a catalogue search. Subtitles- everything after and including semi-colons- are also removed from titles to lessen any chance of variation and lost matches.
6. Construct a SPARQL Query using author surname and title regular expressions pre-filtered for speed by a significant word using bif:contains
- Match the extracted title directly to a dc:title. This doesn’t work as the cited title is unlikely to be exactly the same in all matters of words included, spacing, punctuation, etc.
- Use bif:contains for keyword searching as used in the BNB SPARQL example. This is certainly quick, but has a number of drawbacks: it can only be used once for a single keyword (any one of the two significant words in The history of Mexico, for example, will produce a huge number of hits). It is also not standard SPARQL. I was happy to overlook this if it worked, but ARC2 didn’t like it at all until I worked out it has to be used in angle brackets e.g. ?title <bif:contains> “Mexico”.
- Use regular expressions (e.g. FILTER regex (?title, “The History of Mexico”, “i”)) for keyword searching. This is extremely powerful: you can easily construct searches but it is so slow as to routinely time out, so rendering it effectively useless.
8. Obtain and display metadata from BNB
- Make it more pretty. It is currently designed to look merely acceptable while I concentrated on functionality. I have also tried to show much of my working, which a finished version would obviously hide.
- Refinement of the detection of citation style. This is probably the most critical improvement, and ultimately decides if this approach would be useful outside of In Our Time on other reading lists. There are more patterns that need to be added, especially for older pages.
- Further preparation of data for searching. Currently, for example, a book on the Mexico reading list doesn’t return any hits because of the exclamation mark in “Zapata!”. This could be stripped, and there are lots of similar refinements no doubt.
- More interesting/useful output. The script’s outputs are currently quite raw or basic as I concentrated on the mechanisms for pulling information from free text for automatic catalogue searching. It might be useful to output proper standalone RDF files, references in standard reference formats (e.g. Harvard) in HTML or text files, files in standard reference management formats, perhaps even MARC, and so on. Some of these would perhaps be fairly straightforward.
- Links to catalogues or online bookshops so you could borrow or buy the books from the reading list based on ISBNs taken from the BNB record.
- Searching more catalogues. If a search fails on the BNB, the script could search other open catalogues, e.g. the Cambridge catalogue.
- Greasemonkey script or plugin so that a button appears next to the Further Reading section when you view an In Our Time page. This could even appear next to individual books. Ideally (pie in the sky) such a plugin would have a stab at finding books on any web page.
- Other ways of firing the script not requiring manual addition to the URL or use of a bookmarklet, e.g. a searchbox of some kind (either accepting a URL as input or keywords for the titles of broadcasts).
Please do leave comments or questions.