Lodopac : simple Linked Open Data OPAC

Lodopac is my entry to the UK Discovery Developer Competition. Aside from obvious mocking of the name, comments on Lodopac are very welcome. If anyone installs it locally, I’d also be very interested to know.

About Lodopac

Lodopac is a simple linked open data OPAC using Sparql to search remote bibliographic RDF data. By default it is set up to search the BNB and Cambridge University Library datasets, but is designed to allow easy setup of additional datasets with Sparql endpoints (see Installation, source code, and configuration below). It was written in response to the UK Discovery Developer Competition.

The purpose of Lodopac is to provide a simple standard OPAC-style interface to perform searches of various bibliographic RDF datasets without having to know how to formulate a Sparql query and without having to know the structure of the database. I hope it is especially useful for people wanting to get a grip on how bibliographic RDF is put together, what it looks like, and what a Sparql query looks like. For example, an author search is possible without knowing about dc:creator and dc:contributor, or how these need to be linked together in a Sparql search. Similarly, a searcher wouldn’t need to worry about how to construct date searches in different datasets. For the BNB and CUL, these are very different (three lines in the BNB, one for CUL), but in Lodopac, there is only one search box to search both. Lodopac displays the Sparql query it constructed to perform the search, as well as the combined RDF for all results found in XML, JSON, N3, and TTL.

How to search Lodopac

Select one or more of the available datasets using the checkboxes.

Author and Title searches are free text phrase searches. In other words, the string you search for must appear exactly as typed, including spacing and punctuation, but it can appear anywhere, even in the middle of a word. E.g. searching for “shake” will match “Shakespeare”, “milkshakes”, and “More hits than you can shake a stick at”. Searching is case insensitive. The following punctuation is removed from searches: \”‘<>$^%.
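As a rough illustration of these matching rules, here is a minimal Python sketch. It is not Lodopac's own code (which is PHP), the function names are made up, and the exact set of stripped characters is an assumption: the quote marks in the list above are read as plain single and double quotes, and the final full stop as the end of the sentence.

```python
# Hypothetical sketch of the Author/Title matching rules described above.
# Assumption: the stripped characters are \ " ' < > $ ^ %
STRIPPED = set('\\"\'<>$^%')


def clean_search(term):
    """Remove the punctuation that is stripped from searches."""
    return "".join(ch for ch in term if ch not in STRIPPED)


def phrase_matches(term, value):
    """Case-insensitive substring match; spacing and other punctuation count."""
    return clean_search(term).lower() in value.lower()


for candidate in ["Shakespeare", "milkshakes", "Shaw"]:
    print(candidate, phrase_matches("shake", candidate))
```

In the real thing the matching is done by the Sparql endpoint rather than locally, which is why these searches can be slow.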

You are strongly advised to keep author and title searches simple: e.g. one word of a title or a surname only.

ISBN search accepts 10 or 13 digit ISBNs. Any dashes or other non-digit characters are stripped from the search.
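The ISBN clean-up could look something like the following Python sketch. The function names are hypothetical and the ISBN used is a dummy value, not a real book.

```python
import re


def clean_isbn(raw):
    """Strip dashes and any other non-digit characters, as described above."""
    # Note: read literally, this also drops the "X" check digit that some
    # 10-digit ISBNs end with. How Lodopac treats that case is not stated.
    return re.sub(r"\D", "", raw)


def is_searchable_isbn(raw):
    """Accept only values that come out as 10 or 13 digits."""
    return len(clean_isbn(raw)) in (10, 13)


print(clean_isbn("978-0-00-000000-2"))          # dummy ISBN for illustration
print(is_searchable_isbn("978-0-00-000000-2"))
```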

Date search will accept a year.

N.B. Keep searches as simple as possible, especially with author and title searches, to avoid them timing out. ISBN and date searches are generally quicker.

Limitations

A bad workman blames his tools and I’m no exception. The greatest limitation is the time taken by Sparql endpoints to perform a Sparql query, especially one that involves a regular expression, such as the Author and Title searches. What is needed is some more robust indexing or some cheat like Virtuoso’s unorthodox bif:contains, which the old version of the linked data BNB used. I touched on this in a blog post about the In Our Time Booklist script I wrote (see section 6).

The load and current capacity on the Sparql endpoints at the time the query is made is another important factor. A search which times out one minute can work fine the next.

The search options are obviously limited but do, I hope, represent the most common methods of searching normal library catalogues aside from, of course, a general keywords search. The manipulation of results is also rather sparse but allows click through to full data associated with a book, the structure and contents of which can be more fully explored. The aggregation of RDF data in various formats is, I hope, useful illustratively as well as having potential for further manipulation.

Installation, source code, and configuration

The source code for Lodopac is available as a zip file, which contains all the necessary PHP, Javascript, and CSS files. In addition, you will need to install ARC2, which makes the Sparql queries and manipulates the resultant RDF. Edit the first line of lodopac.php so that it points at your local installation of ARC2.

The programme is basically one long script (there is only one page) but is split for convenience of editing. The key file is lodopac.php, which includes the other files as it goes along. The main core of the script, which builds the queries and does the searching, is all in lodopac.php.

I have attempted to make the script as easily configurable as possible so that additional Sparql endpoints can be added. There are probably more components hard-coded into the script that I have overlooked, but all the setup for the endpoints is in the file setup_endpoints.php. The first part of this file is a list of the prefixes needed for any possible query to any of the endpoints and, although not ideal, all these prefixes are sent with every Sparql query. Following that and the declaration of an array of the endpoints, each endpoint has a dedicated block with the information added to a hash. To add another endpoint, duplicate a block and configure the search recipes as appropriate. The keys marked “brief_” are used to fetch information for the brief results display. I have conspicuously chickened out of providing an author in the brief display, given the attendant main entry and multiple author headaches involved.
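To give a feel for the shape of this configuration, here is a hedged Python analogue. It is not the contents of setup_endpoints.php: the endpoint, its URL, the key names other than the “brief_” prefix, and the graph patterns are all invented for illustration, and binding dc: to the Dublin Core terms namespace is an assumption.

```python
# Hypothetical Python analogue of an endpoint setup file.
# Prefixes are shared: all of them are sent with every query.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",   # assumed binding
    "foaf": "http://xmlns.com/foaf/0.1/",
}

ENDPOINTS = {}

# One dedicated block per endpoint; duplicate and edit to add another.
ENDPOINTS["example"] = {
    "label": "Example Library",           # invented
    "url": "http://example.org/sparql",   # invented
    # Search recipe: a graph pattern with a {term} placeholder.
    "title": '?book dc:title ?title . FILTER regex(?title, "{term}", "i")',
    # Keys starting "brief_" fetch fields for the brief results display.
    "brief_title": "?book dc:title ?brief_title .",
}


def build_query(endpoint_key, search_type, term):
    """Assemble one query: all prefixes, the search recipe, the brief_ patterns.

    The term is assumed to have been cleaned of quotes and similar already.
    """
    endpoint = ENDPOINTS[endpoint_key]
    lines = [f"PREFIX {name}: <{uri}>" for name, uri in PREFIXES.items()]
    lines.append("SELECT * WHERE {")
    lines.append("  " + endpoint[search_type].format(term=term))
    for key, pattern in endpoint.items():
        if key.startswith("brief_"):
            lines.append("  " + pattern)
    lines.append("}")
    return "\n".join(lines)


print(build_query("example", "title", "Mexico"))
```

Keeping each recipe as a template string is what makes the very different date searches for different datasets hide behind one search box: only the recipe changes, not the code that builds the query.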

In Our Time booklist

I have written a script which takes an unstructured reading list on the BBC’s In Our Time website, searches the British National Bibliography (BNB) using bibliographica for the books on the list, and returns structured metadata for the records it found.

This script was written in response to an idea raised by psychemedia for the Open Bibliographic Data Challenge: the BBC “In Our Time” Reading List:

The BBC “In Our Time” radio programme publishes suggested recommending reading in the programme data in an unstructured and citation style way: author, title (publisher, year), with what looks to be conventional character string separators between references (at least on the pages I looked at).

The idea is to extract and link suggested readings for the In Our time programmes to open, structured bibliographic data. This would make the In Our Time archive even more useful as an informal (open-ish) educational resource, especially as academic libraries start to release data relating to books used on courses. (So for example, this approach might help provide a link from a course to a relevant In Our Time broadcast via a common book.)

I was drawn to this idea as I like the idea of turning unstructured data into structured data: I have for example had some previous fun converting HTML pages into RSS feeds (e.g. CILIP Lisjobnet, Big Brother). I think something similar for any reading list (e.g. a Word document produced by a lecturer) would be an interesting idea.

The programme is written in PHP and is designed to be fired from a Javascript bookmarklet from a page on the In Our Time site, or by appending the In Our Time URL to the end of the URL for this page: http://www.aurochs.org/inourtime_booklist/inourtime_booklist_v1.php?. For example, to use it on the page for The Mexican Revolution (which I used a lot in testing), add the URL http://www.bbc.co.uk/programmes/b00xhz8d to produce http://www.aurochs.org/inourtime_booklist/inourtime_booklist_v1.php?http://www.bbc.co.uk/programmes/b00xhz8d.
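Building that address is just string concatenation, as in this small Python sketch (the function name is made up; the two URLs are the ones given above):

```python
SCRIPT_URL = "http://www.aurochs.org/inourtime_booklist/inourtime_booklist_v1.php?"


def booklist_url(programme_url):
    """Append an In Our Time page URL to the end of the script URL."""
    return SCRIPT_URL + programme_url


print(booklist_url("http://www.bbc.co.uk/programmes/b00xhz8d"))
```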

The script works through the following steps:

  1. Set up ARC2 to enable SPARQL searching and RDF processing
  2. Extract Further Reading section
  3. Separate out Raw Data for each book
  4. Determine pattern used in citation then extract Basic Data, e.g. author, title, article, publication, using regular expressions
  5. Further refine elements to make searching easier, i.e. one surname for author, only title proper for titles
  6. Construct a SPARQL Query using author surname and title regular expressions pre-filtered for speed by a significant word using bif:contains
  7. Filter Hits by date of publication
  8. Obtain and display metadata from BNB

More details of these steps are below:

1. Set up ARC2 to enable SPARQL searching and RDF processing

ARC2 is a simple-to-use system for using RDF and SPARQL in PHP. I had previously played with it here when experimenting with creating my own RDF. The Sandy site uses SPARQL to populate the See Also sections.

2. Extract Further Reading section

A simple regular expression identifies the div in the HTML code that contains the reading list, which enables the next stage of the script to look for individual books.

3. Separate out Raw Data for each book

Another regular expression pulls out the paragraphs containing books and puts them in an array. You can see this by viewing the Raw Data.
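Steps 2 and 3 can be sketched in Python roughly as follows. The sample HTML, including the id on the div, is invented and is not the BBC's actual markup, and stripping the inline tags is a choice made for this sketch rather than something the script is known to do.

```python
import re

# Invented stand-in for an In Our Time page; the real markup will differ.
SAMPLE_HTML = """
<div id="intro"><p>Programme description.</p></div>
<div id="further-reading">
<h3>Further reading</h3>
<p>Author One, <em>First Title</em> (Publisher, 2001)</p>
<p>Author Two, <em>Second Title</em> (Publisher, 2002)</p>
</div>
<div id="footer"><p>Footer text.</p></div>
"""


def extract_reading_section(html):
    """Step 2: find the div that holds the reading list."""
    match = re.search(r'<div id="further-reading">(.*?)</div>', html, re.DOTALL)
    return match.group(1) if match else ""


def extract_raw_books(section):
    """Step 3: pull each paragraph out into a list of raw citations."""
    paragraphs = re.findall(r"<p>(.*?)</p>", section, re.DOTALL)
    # Drop inline tags such as <em> so only the citation text remains.
    return [re.sub(r"<[^>]+>", "", p).strip() for p in paragraphs]


raw_books = extract_raw_books(extract_reading_section(SAMPLE_HTML))
for book in raw_books:
    print(book)
```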

4. Determine pattern used in citation then extract Basic Data, e.g. author, title, article, publication, using regular expressions

As the In Our Time site does not use a single standard form of citation, the script has to try to determine, using regular expressions, which of several possible patterns a citation follows and then extract the correct bits of data. This only works as well as it is possible to identify all the patterns, which effectively means looking at as many In Our Time pages as possible. This is one area that would certainly reward more work. It also points out how difficult it would be to extrapolate this into a script that could read any citations. The In Our Time booklist currently uses several citation patterns, each identified with a number (1, 2, 3, 4, 15, 5). If you look at the Book Data for a particular book you will see the citation style number given. The regular expressions capture author, title, and publication.

The author information in citations on In Our Time is unpredictable. Sometimes surnames are first, sometimes they are last. The citation patterns take care of this if possible and try to extract one significant name.
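The general technique (try each numbered pattern in turn and keep the first that fits) might be sketched like this in Python. The two patterns and the sample citations are invented for illustration and are not the script's actual patterns or numbering.

```python
import re

# Hypothetical patterns, tried in order; the number is the citation style.
PATTERNS = [
    # Style 1 (invented): Surname, Forename, Title (Publisher, Year)
    (1, re.compile(
        r"^(?P<surname>[A-Z][\w'-]+), (?P<forename>[A-Z][\w.' -]*?), "
        r"(?P<title>.+) \((?P<publication>.+, \d{4})\)$")),
    # Style 2 (invented): Forename Surname, Title (Publisher, Year)
    (2, re.compile(
        r"^(?P<forename>[^,]+) (?P<surname>[^ ,]+), "
        r"(?P<title>.+) \((?P<publication>.+, \d{4})\)$")),
]


def parse_citation(raw):
    """Return the Basic Data for the first pattern that matches, else None."""
    for style, pattern in PATTERNS:
        match = pattern.match(raw)
        if match:
            return {
                "style": style,
                "author": match.group("surname"),  # one significant name
                "title": match.group("title"),
                "publication": match.group("publication"),
            }
    return None


print(parse_citation("Smith, Jane, The History of Mexico (Example Press, 2001)"))
print(parse_citation("Jane Smith, The History of Mexico (Example Press, 2001)"))
```

The order of the patterns matters: the more specific surname-first pattern has to be tried before the looser one, and a title containing commas would still confuse both, which is exactly the kind of fragility described above.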

5. Further refine elements to make searching easier, i.e. one surname for author, only title proper for titles

The script removes things like “(ed.)” from the author, which would obviously throw off a catalogue search. Subtitles (everything after and including semi-colons) are also removed from titles to lessen any chance of variation and lost matches.
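A minimal Python sketch of this refinement, assuming only the two rules just described (the function names and the sample values are made up):

```python
import re


def refine_author(author):
    """Drop editorial markers such as "(ed.)" or "(eds.)" from the author."""
    return re.sub(r"\s*\(eds?\.?\)", "", author, flags=re.IGNORECASE).strip()


def refine_title(title):
    """Keep only the title proper: cut at the first semi-colon."""
    return title.split(";", 1)[0].strip()


print(refine_author("Smith (ed.)"))
print(refine_title("The History of Mexico; An Invented Subtitle"))
```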

6. Construct a SPARQL Query using author surname and title regular expressions pre-filtered for speed by a significant word using bif:contains

Constructing the SPARQL query was the trickiest part. Ignoring the various standard prefixes pilfered from the standard example, the most important part is the title search. There are three unsatisfactory options:
  • Match the extracted title directly to a dc:title. This doesn’t work as the cited title is unlikely to be exactly the same in all matters of words included, spacing, punctuation, etc.
  • Use bif:contains for keyword searching as used in the BNB SPARQL example. This is certainly quick, but has a number of drawbacks: it can only be used once for a single keyword (any one of the two significant words in The history of Mexico, for example, will produce a huge number of hits). It is also not standard SPARQL. I was happy to overlook this if it worked, but ARC2 didn’t like it at all until I worked out it has to be used in angle brackets e.g. ?title <bif:contains> "Mexico".
  • Use regular expressions (e.g. FILTER regex (?title, "The History of Mexico", "i")) for keyword searching. This is extremely powerful: you can easily construct searches but it is so slow as to routinely time out, so rendering it effectively useless.

The In Our Time script uses a combination of the last two techniques to get a result. First, it finds a significant word in the title, ideally the first four-letter word after the first word (i.e. to avoid “The”, “A”, “That”, etc.) or, failing that, the first word. The SPARQL query then uses bif:contains to search for that word. The query then does a regular expression filter on the whole title. I don’t know if this is how SPARQL endpoints would generally work, but the BNB appears to only look for regular expression matches on the records already filtered by the bif:contains. In any case, it works.

In addition, the script also uses a regular expression to search by the author’s surname. It doesn’t search by date as the date of publication on BNB (dc:issued) is not in a standard format (e.g. “1994-01-01 00:00:00”, “2005 printing”, “c2006”). It is also not keyword searchable. You can see all the author-title hits with links to BNB records by viewing Hits.
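Putting the pieces of this step together, a query builder might look like the following Python sketch. Several things here are assumptions rather than the script's actual behaviour: “four letter word” is read as “at least four letters”, the author is assumed to be reachable via dc:creator and foaf:name, the prefix bindings are guesses, and the surname “Smith” is a placeholder. The sketch only builds the query text; it does not send it anywhere.

```python
import re


def significant_word(title):
    """First word of four or more letters after the first word, else the first word."""
    words = re.findall(r"[A-Za-z]+", title)
    for word in words[1:]:
        if len(word) >= 4:
            return word
    return words[0] if words else ""


def escape_regex(text):
    """Backslash-escape regular expression metacharacters."""
    return re.sub(r"([.^$*+?()\[\]{}|\\])", r"\\\1", text)


def sparql_string(text):
    """Quote text as a SPARQL string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_query(title, surname):
    """bif:contains pre-filter on one word, then regex filters on title and author."""
    keyword = significant_word(title)
    return "\n".join([
        "PREFIX dc: <http://purl.org/dc/terms/>",      # assumed binding
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>",
        "SELECT ?book ?title ?name ?issued WHERE {",
        "  ?book dc:title ?title .",
        f"  ?title <bif:contains> {sparql_string(keyword)} .",
        "  ?book dc:creator ?creator .",                # assumed graph shape
        "  ?creator foaf:name ?name .",
        "  OPTIONAL { ?book dc:issued ?issued }",
        f'  FILTER regex(?title, {sparql_string(escape_regex(title))}, "i")',
        f'  FILTER regex(?name, {sparql_string(escape_regex(surname))}, "i")',
        "}",
    ])


print(build_query("The History of Mexico", "Smith"))
```

The date is fetched but deliberately not filtered on in the query, for the reason given above: the dc:issued values are too irregular to match reliably.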

7. Filter Hits by date of publication

You can, however, retrieve the date from the BNB and process it afterwards, which is what the script does. It finds the four-digit year and compares this to the four-digit year it found on the In Our Time site. You can see all the author-title-date hits with links to BNB records by viewing Date Hits. Perhaps rather arbitrarily, the first book in the resulting array is selected as the result.
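This post-filtering is easy to sketch in Python. The date strings below are the three example formats quoted earlier; the hit URIs, the field names, and the cited publication string are invented.

```python
import re


def extract_year(text):
    """Return the first standalone four-digit run in the text, or None."""
    match = re.search(r"(?<!\d)\d{4}(?!\d)", text)
    return match.group(0) if match else None


def filter_by_year(hits, cited_year):
    """Keep only the hits whose issued date contains the cited year."""
    return [hit for hit in hits if extract_year(hit["issued"]) == cited_year]


# Invented author-title hits carrying the kinds of date the BNB returns.
hits = [
    {"uri": "http://example.org/book/1", "issued": "1994-01-01 00:00:00"},
    {"uri": "http://example.org/book/2", "issued": "2005 printing"},
    {"uri": "http://example.org/book/3", "issued": "c2006"},
]

cited_year = extract_year("(Example Press, 2005)")
date_hits = filter_by_year(hits, cited_year)
# Rather arbitrarily, the first remaining hit is taken as the result.
result = date_hits[0] if date_hits else None
print(result)
```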

8. Obtain and display metadata from BNB

When the search took place, the matching title, author (only one), and date were obtained from BNB. This title and author are displayed, as are the stripped down year of publication and a link to the full BNB record. For records that returned no hits on the BNB, the Basic Data is simply regurgitated.

The script also downloads the full combined RDF for all the hits, which is displayed at the bottom of the page and is viewable in several formats.

Further work

I think a lot more work could be done on this given time, both to improve it and to extend it. In no particular order:
  • Make it prettier. It is currently designed to look merely acceptable while I concentrated on functionality. I have also tried to show much of my working, which a finished version would obviously hide.
  • Refinement of the detection of citation style. This is probably the most critical improvement, and ultimately decides if this approach would be useful outside of In Our Time on other reading lists. There are more patterns that need to be added, especially for older pages.
  • Further preparation of data for searching. Currently, for example, a book on the Mexico reading list doesn’t return any hits because of the exclamation mark in “Zapata!”. This could be stripped, and there are lots of similar refinements no doubt.
  • More interesting/useful output. The script’s outputs are currently quite raw or basic as I concentrated on the mechanisms for pulling information from free text for automatic catalogue searching. It might be useful to output proper standalone RDF files, references in standard reference formats (e.g. Harvard) in HTML or text files, files in standard reference management formats, perhaps even MARC, and so on. Some of these would perhaps be fairly straightforward.
  • Links to catalogues or online bookshops so you could borrow or buy the books from the reading list based on ISBNs taken from the BNB record.
  • Searching more catalogues. If a search fails on the BNB, the script could search other open catalogues, e.g. the Cambridge catalogue.
  • Greasemonkey script or plugin so that a button appears next to the Further Reading section when you view an In Our Time page. This could even appear next to individual books. Ideally (pie in the sky) such a plugin would have a stab at finding books on any web page.
  • Other ways of firing the script not requiring manual addition to the URL or use of a bookmarklet, e.g. a searchbox of some kind (either accepting a URL as input or keywords for the titles of broadcasts).

Please do leave comments or questions.