Lodopac example searches

Yay, my entry for the Discovery & DevCSI Developers Competition, Lodopac, was awarded a commendation for its use of the Cambridge University Library (CUL) dataset. During the judging I was asked for searches which were known to work well, since the timeout issues I discussed under Limitations are not insignificant, especially with author or title searches. I submitted a version of the following brief general notes, which I hope are helpful to anyone else who wants to play:

The British National Bibliography (BNB) server is generally more responsive than the Cambridge University Library one; title searches seem to work better than author searches. The following are hopefully useful examples:

I would really like to try and think of ways of improving free text regular expression search times for things like author and title in Sparql* although I doubt there is one that doesn’t rely on the configuration, processing power, or indexing of the server being searched.

* thinking aloud, some ideas might include: downloading a larger imprecise set for further local searching (e.g. for an author/title search, downloading the title matches and searching the authors locally: although this would also be slow, it would get round the timeout at least); forcing a look-up in a controlled vocab first in order to get an exact string match (esp. for authors, although even if this is possible, it forces a user to do more work, which isn’t the point); local indexing of the triple store (this is probably the best way but I’m not sure how to go about it, whether I really have the server capabilities to do it, or whether I can commit to the updating required).
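The first idea, fetching the title matches and then narrowing by author locally, might look something like the following minimal Python sketch. It is not part of Lodopac: the record layout and the `filter_by_author` name are hypothetical, and the sample rows simply stand in for whatever a title-only Sparql query would return.

```python
def filter_by_author(records, author):
    """Keep records where any author name contains the search string.

    Matching is a case-insensitive substring test, mirroring the
    free text behaviour of the remote author search.
    """
    needle = author.lower()
    return [
        record for record in records
        if any(needle in name.lower() for name in record["authors"])
    ]


# Hypothetical rows standing in for the results of a title-only query.
title_matches = [
    {"title": "Hamlet", "authors": ["Shakespeare, William"]},
    {"title": "Hamlet: a study guide", "authors": ["Smith, Jane"]},
]

for record in filter_by_author(title_matches, "shakespeare"):
    print(record["title"])
```

The cost is simply moved: the title query still has to complete and the whole result set has to be downloaded, so this only helps where the combined author and title query is the one that times out.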

Lodopac : simple Linked Open Data OPAC

Lodopac is my entry to the UK Discovery Developer Competition. Aside from obvious mocking of the name, comments on Lodopac are very welcome. If anyone installs it locally, I’d also be very interested to know.

About Lodopac

Lodopac is a simple linked open data OPAC using Sparql to search remote bibliographic RDF data. By default it is set up to search the BNB and Cambridge University Library datasets, but is designed to allow easy setup of additional datasets with Sparql endpoints (see Installation, source code, and configuration below). It was written in response to the UK Discovery Developer Competition.

The purpose of Lodopac is to provide a simple standard OPAC-style interface to perform searches of various bibliographic RDF datasets without having to know how to formulate a Sparql query and without having to know the structure of the database. I hope it is especially useful for people wanting to get a grip on how bibliographic RDF is put together, what it looks like, and what a Sparql query looks like. For example, an author search is possible without knowing about dc:creator and dc:contributor, or how these need to be linked together in a Sparql search. Similarly, a searcher wouldn’t need to worry about how to construct date searches in different datasets. For the BNB and CUL, these are very different (three lines in the BNB, one for CUL), but in Lodopac there is only one search box to search both. Lodopac displays the Sparql query it constructed to perform the search, as well as the combined RDF for all results found in XML, JSON, N3, and TTL.
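To give a flavour of the kind of query an author search hides from the user, here is a minimal Python sketch that assembles one. It is not the query Lodopac actually generates. It assumes that the dc prefix is bound to Dublin Core terms, that creators and contributors are resources carrying a foaf:name, and that the two are combined with UNION; real datasets may model this differently. The `build_author_query` name is made up for illustration.

```python
def build_author_query(author, limit=20):
    """Build an illustrative Sparql author search as a string."""
    # Escape backslashes and double quotes so the term sits safely
    # inside a Sparql string literal.
    safe = author.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX dc: <http://purl.org/dc/terms/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?book ?name WHERE {{
  {{ ?book dc:creator ?agent . }} UNION {{ ?book dc:contributor ?agent . }}
  ?agent foaf:name ?name .
  FILTER regex(?name, "{safe}", "i")
}} LIMIT {limit}"""


print(build_author_query("shake"))
```

The FILTER with the "i" flag is what makes the match case insensitive and unanchored, and it is also the expensive part: without a text index the endpoint has to test every candidate name against the pattern.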

How to search Lodopac

Select one or more of the available datasets using the checkboxes.

Author and Title searches are free text phrase searches. In other words, the string you search for will match wherever it occurs exactly, including spacing and punctuation, even in the middle of words. E.g. searching for “shake” will match “Shakespeare”, “milkshakes”, and “More hits than you can shake a stick at”. Searching is case insensitive. The following punctuation is removed from searches: \”‘<>$^%.
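The matching rules above can be mimicked locally with a short Python sketch. This is an illustration of the behaviour described, not Lodopac's own code: the `clean_term` and `matches` names are hypothetical, both straight and curly quote marks are stripped as a guess at what the list intends, and the final full stop is assumed to end the sentence rather than belong to the list.

```python
# Characters removed from a search term: backslash, quote marks
# (straight and curly forms, an assumption), and < > $ ^ %
STRIPPED = "\\\"'\u201c\u201d\u2018\u2019<>$^%"


def clean_term(term):
    """Drop the stripped punctuation, leaving everything else intact."""
    return "".join(ch for ch in term if ch not in STRIPPED)


def matches(term, value):
    """Case-insensitive substring test, spacing included."""
    return clean_term(term).lower() in value.lower()


print(matches("shake", "Shakespeare"))   # match at the start of a word
print(matches("shake", "milkshakes"))    # match in the middle of a word
print(matches("shake ", "milkshakes"))   # trailing space is significant
```

Because spacing is significant, a stray trailing space is enough to turn a hit into a miss, which is one more reason to keep terms to a single word.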

You are strongly advised to keep author and title searches simple: e.g. one word of a title or a surname only.

The ISBN search accepts 10 or 13 digit ISBNs. Any dashes or other non-digit characters are stripped from the search.
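As a rough illustration of that clean-up step, the Python sketch below strips everything that is not a digit. The `clean_isbn` name and the length check are assumptions added for the example; the description above only says that non-digits are removed. Taken literally, that would also drop the X check character some 10 digit ISBNs end in, and how Lodopac treats that case is not stated.

```python
import re


def clean_isbn(raw):
    """Strip non-digits; return None unless 10 or 13 digits remain."""
    digits = re.sub(r"\D", "", raw)
    # The length check is an assumption, not documented behaviour.
    return digits if len(digits) in (10, 13) else None


print(clean_isbn("978-0-14-043908-4"))
print(clean_isbn("0 14 043908 5"))
print(clean_isbn("12345"))
```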

Date search will accept a year.

N.B. Keep searches as simple as possible, especially with author and title searches, to avoid them timing out. ISBN and date searches are generally quicker.

Limitations

A bad workman blames his tools and I’m no exception. The greatest limitation is the time taken by Sparql endpoints to perform a Sparql query, especially one that involves a regular expression, such as the Author and Title searches. What is needed is some more robust indexing or some cheat like Virtuoso’s unorthodox bif:contains, which the old version of the linked data BNB used. I touched on this in a blog post about the In Our Time Booklist script I wrote (see section 6).

The load and current capacity on the Sparql endpoints at the time the query is made is another important factor. A search which times out one minute can work fine the next.

The search options are obviously limited but do, I hope, represent the most common methods of searching normal library catalogues aside from, of course, a general keywords search. The manipulation of results is also rather sparse but allows click through to the full data associated with a book, the structure and contents of which can be more fully explored. The aggregation of RDF data in various formats is, I hope, useful illustratively as well as having potential for further manipulation.

Installation, source code, and configuration

The source code for Lodopac is available as a zip file, which contains all the necessary PHP, Javascript, and CSS files. In addition, you will need to install ARC2, which makes the Sparql queries and manipulates the resultant RDF. Edit the first line of lodopac.php so that it points at your local installation of ARC2.

The programme is basically one long script (there is only one page) but is split for convenience of editing. The key file is lodopac.php, which includes the other files as it goes along. The core of the script, which builds the queries and does the searching, is all in lodopac.php.

I have attempted to make the script as easily configurable as possible so that additional Sparql endpoints can be added. There are probably more components hard-coded into the script that I have overlooked, but all the setup for the endpoints is in the file setup_endpoints.php. The first part of this file is a list of the prefixes needed for any possible queries from any of the endpoints and, although not ideal, all these prefixes are sent with every Sparql query. Following that and the declaration of an array of the endpoints, each endpoint has a dedicated block with the information added to a hash. To add another endpoint, duplicate a block and configure the search recipes as appropriate. The keys marked “brief_” are used to fetch information for the brief results display. I have conspicuously chickened out of providing an author there, given the attendant main entry and multiple author headaches involved.
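The general shape of that setup (shared prefixes, then one block per endpoint holding search recipes and “brief_” keys) can be sketched in Python as follows. The real file is PHP and its key names are not reproduced here: apart from the “brief_” convention, every key, URL, and recipe below is hypothetical, as is the `build_query` helper.

```python
# Prefixes sent with every query, whichever endpoint is searched.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

endpoints = {}

# One block per endpoint; to add another, copy this block and
# adjust the recipes. All values here are made up for illustration.
endpoints["example"] = {
    "name": "Example Library",
    "url": "http://example.org/sparql",
    "title": '?book dc:title ?title . FILTER regex(?title, "%s", "i")',
    "brief_title": "?book dc:title ?brief_title .",
}


def build_query(endpoint, search_type, term):
    """Combine the shared prefixes with one endpoint's search recipe."""
    lines = [f"PREFIX {name}: <{uri}>" for name, uri in PREFIXES.items()]
    lines.append("SELECT * WHERE {")
    lines.append("  " + endpoint[search_type].replace("%s", term))
    lines.append("  " + endpoint["brief_title"])
    lines.append("}")
    return "\n".join(lines)


print(build_query(endpoints["example"], "title", "hamlet"))
```

Keeping the recipes as data rather than code is what makes a new endpoint a matter of copying a block: the query builder never needs to know how a particular dataset models titles or dates.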