I have come up with two bookmarklets that allow you to search for an author’s works in a library catalogue from the author’s Wikipedia page in one click. A bookmarklet is a browser bookmark that does something with the page you’re looking at rather than just taking you to a web page: see the helpful Firefox guide for more information. The bookmarklets are identical, except that one searches UCL’s Explore (Primo) service, and the other searches COPAC. To try them:
Install one of the bookmarklets by dragging the link to your bookmarks toolbar:
You can rename them to something more snappy if you like. Next, go to a Wikipedia page for an author. The bookmarklets only work on Wikipedia pages with VIAF or LC Authorities links in them, but most major authors should be fine. Some examples to try:
How it works. The bookmarklet itself is only a short snippet of JavaScript; all it does is look for any links on the page that might be VIAF or LC Authorities links. It then appends this information as a query string to the URL of a remote PHP script. The PHP script does all the hard work. It first works out the URI for the VIAF entry and has a look at it using ARC2. It looks in the RDF for the authorised LC heading, constructs a search URL for either UCL or COPAC, then redirects the browser there, at which point its work is done. If there is a problem with the VIAF entry, it tries the LC link, if there is one, in a similar way. If there is nothing, it will fail and offer to go back to the Wikipedia page or forward to the catalogue you wanted to search.
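The bookmarklet half of that flow can be sketched roughly as follows. This is in Python purely for illustration (the real bookmarklet is JavaScript), and the script URL, the query parameter names, and the substring checks used to spot VIAF and LC links are all assumptions rather than details taken from the actual code.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real PHP script's URL is not given in the post.
SCRIPT_URL = "https://example.org/author_search.php"


def find_authority_links(hrefs):
    """Return the first VIAF and LC Authorities links found in a list of hrefs."""
    found = {}
    for href in hrefs:
        # Assumed link shapes; real pages may use other forms.
        if "viaf.org/viaf/" in href and "viaf" not in found:
            found["viaf"] = href
        elif "id.loc.gov/authorities/" in href and "lc" not in found:
            found["lc"] = href
    return found


def build_script_url(hrefs, page_url, catalogue):
    """Build the URL to hand over to the remote script, or None if no links exist."""
    links = find_authority_links(hrefs)
    if not links:
        return None  # nothing to look up, so the script is never called
    # Parameter names ("catalogue", "from", "viaf", "lc") are made up for this sketch.
    params = {"catalogue": catalogue, "from": page_url, **links}
    return SCRIPT_URL + "?" + urlencode(params)
```

Returning None when no links are found keeps the remote script from being called for pages it could do nothing with.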
Why. One of the promises of linked data, BIBFRAME, and all the rest of it is that data from different sources can be linked together and work with each other. Since VIAF links were recently added to Wikipedia, I’ve wondered what could be done to take advantage of this in a practical way. The link does mean that from a Wikipedia page and its uncontrolled (or at least only consistent within Wikipedia) names, you can find out the authorised form of an author’s name. Charles Darwin (the famous one) is only called Charles Darwin in the title of his Wikipedia article. Search for that on a library catalogue and you’ll get all his works plus the stuff written by other Charles Darwins. With the VIAF data, we know that he is known in most (or at least a huge number of) English-language catalogues as “Darwin, Charles, 1809-1882”, as opposed to one of those other Charles Darwins, who is “Darwin, Charles, 1758-1778”. Although most catalogues or discovery systems don’t use linked data and non-textual identifiers, the ubiquity and uniqueness of an LC heading does almost perform a similar thing (although there are caveats galore).
Many of the caveats are in the way library systems search, and both the examples used are imperfect. The UCL one, as I’ve done it, uses a facet search on top of the bare search. This eliminates some incorrect hits (where “Steve”, “Jones”, and “1944-” appear coincidentally as author elements in a search) but also misses a few hits, depending on which field the author appears in in the record (this is, I think, a fixable glitch, which I intend to get fixed). The COPAC one is an author free-text search, but I’ve tried to reduce the potential for false hits by putting all searches in quotes.
Improvements. These are legion, but a few sketched ideas below:
- Implement this as a browser extension. This was my original intention, so that someone could be browsing any old Wikipedia page and when they come across one with a VIAF (or other service) link at the bottom, a search link is created at the top of the page for them to click on. This could easily be extended in several ways:
- Add subject searches. Should be straightforward, although would require moar bookmarklets, relying on the PHP script offering options, or a proper browser extension
- Add more catalogues/discovery interfaces. This is again straightforward to add to the PHP script if you can figure out the web API for a search service but is subject to the same caveats as subject searches.
- Add more than just VIAF and LC Authorities. There are other links appearing at the bottom of some Wikipedia pages, most notably Worldcat. The bookmarklet itself could easily be adapted to accommodate these, which would avoid a further profusion of bookmarklets and provide more backup when services are down (VIAF went down twice while I was testing). Adding services to the PHP is a matter of knowing the structure of the RDF, which shouldn’t be too painful.
- Improve how errors are dealt with and reported, especially so the bookmarklet handles more of them and prevents the PHP script being called unnecessarily.
Feedback. I appreciate this is highly unlikely to set the world on fire, but I would be interested in any feedback or ideas of how it could be developed. Of course, please do let me know if you come across any mistakes or problems: it’s becoming almost traditional for me to get the most crucial link wrong in blog posts.
I am not a trained programmer, coding is not part of my job description, and I have little direct access to cataloguing and metadata databases at work outside of normal catalogue editing and talking to the systems team, but I thought it might be worth making the point of how useful programming can be in all sorts of little ways. Of course, the most useful way is in gaining an awareness of how computers work, appreciating why some things might be more tricky than others for the systems team to implement, seeing why MARC21 is a bastard to do anything with even if editing it in a cataloguing module is not really that bad, and how the new world of FRDABRDF is going to be glued together. However, some more practical examples that I managed to cobble together include:
- Customizing Classification Web with Greasemonkey. This is a couple of short scripts using JavaScript, which is what the default Codecademy lessons use. JavaScript is designed for browsers and is a good one to start with as you can do something powerful very quickly with a short script or even a couple of lines (think of all the 90s image rollovers). It’s also easy to have a go if you don’t have your own server, or even if you’re confined to your own PC.
- Aleph-formatted country and language codes. I wrote a small PHP script to read the XML files for the MARC21 language and country codes and convert them into an up to date list of preferred codes in a format that Aleph can read, basically a text file which needs line breaks and spaces in the right places. It is easy to tweak or run again in the event of any minor changes. I don’t have this publicly available anywhere though. PHP is not the most elegant language but is relatively easy to dip into if you ever want to go beyond Javascript and do more fancy things, although it can be harder to get access to a server running PHP.
- MARC21 .mrc file viewer. I occasionally need to quickly look at raw .mrc files to assess their quality and to figure out what batch changes we want to make before importing them into our catalogue. This is an attempt to create something that I could copy and paste snippets of .mrc files into for a quick look. It is written in PHP and is still under construction. There are other better tools for doing much the same thing to be honest, but coding this myself has had the advantages of forcing me to see how a MARC21 file is put together and realising how fiddly it can be. Try this with an .mrc which has some large 520 or 505 fields in it (there are some zipped ones here, to pick at random) and watch the indicators mysteriously degrade thereafter. I will get to the bottom of this…
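To give a flavour of how a MARC21 file is put together, here is a minimal sketch of a record parser. It is in Python rather than the PHP used for the viewer, it is not the viewer's code, and it assumes a single well-formed, UTF-8 encoded record (MARC-8 is not handled). The function name and the shape of the returned tuples are made up for the example. The layout itself (a 24-byte leader, a directory of 12-byte entries, then the field data) comes from the MARC 21 record structure.

```python
SUBFIELD = "\x1f"  # delimiter that precedes each subfield code


def parse_record(raw: bytes):
    """Split one MARC21 record into (tag, indicators, subfields) tuples."""
    leader = raw[:24]
    base = int(leader[12:17])        # leader 12-16: where the field data starts
    directory = raw[24:base - 1]     # the byte just before the data is a terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])     # counted in bytes, field terminator included
        start = int(entry[7:12])     # counted in bytes, relative to base
        body = raw[base + start:base + start + length - 1]
        if tag < "010":              # control fields: no indicators, no subfields
            fields.append((tag, "", [("", body.decode("utf-8"))]))
            continue
        indicators = body[:2].decode("ascii")
        chunks = body[2:].decode("utf-8").split(SUBFIELD)[1:]
        fields.append((tag, indicators, [(c[:1], c[1:]) for c in chunks]))
    return fields
```

One fiddly point the sketch makes visible is that every length and offset in the directory counts bytes, not characters, so the slicing has to happen before any decoding. Whether that has anything to do with the indicator glitch described above is not something this sketch can settle.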
The following examples are less useful for my own practical purposes but have been invaluable for learning about metadata and cataloguing, in particular, RDF/linked data. I was very interested in LD when I first heard about it. Being able to actually try something out with it (even if the results are not mind-blowing) rather than just read about it, has been very useful. Both are written in PHP and further details are available from the links:
Nothing to do with cataloguing, but what I am most proud of is this, written in Javascript: Cowthello. Let me know if you beat it.
Update: Shana McDanold also wrote an excellent post on why a cataloguer should learn to code with lots of practical examples.
I have written a script which takes an unstructured reading list on the BBC’s In Our Time website, searches the British National Bibliography (BNB) using bibliographica for the books on the list, and returns structured metadata for the records it found.
This script was written in response to an idea raised by psychemedia for the Open Bibliographic Data Challenge: the BBC “In Our Time” Reading List:
The BBC “In Our Time” radio programme publishes suggested recommending reading in the programme data in an unstructured and citation style way: author, title (publisher, year), with what looks to be conventional character string separators between references (at least on the pages I looked at).
The idea is to extract and link suggested readings for the In Our time programmes to open, structured bibliographic data. This would make the In Our Time archive even more useful as an informal (open-ish) educational resource, especially as academic libraries start to release data relating to books used on courses. (So for example, this approach might help provide a link from a course to a relevant In Our Time broadcast via a common book.)
I was drawn to this idea as I like the idea of turning unstructured data into structured data: I have for example had some previous fun converting HTML pages into RSS feeds (e.g. CILIP Lisjobnet, Big Brother). I think something similar for any reading list (e.g. a Word document produced by a lecturer) would be an interesting idea.
The programme is written in PHP and is designed to be fired from a Javascript bookmarklet from a page on the In Our Time site, or by appending the In Our Time URL to the end of the URL for this page: http://www.aurochs.org/inourtime_booklist/inourtime_booklist_v1.php?. For example, to use it on the page for The Mexican Revolution (which I used a lot in testing), add the URL http://www.bbc.co.uk/programmes/b00xhz8d to produce http://www.aurochs.org/inourtime_booklist/inourtime_booklist_v1.php?http://www.bbc.co.uk/programmes/b00xhz8d.
The script works through the following steps:
- Set up ARC2 to enable SPARQL searching and RDF processing
- Extract Further Reading section
- Separate out Raw Data for each book
- Determine pattern used in citation then extract Basic Data, e.g. author, title, article, publication, using regular expressions
- Further refine elements to make searching easier, i.e. one surname for author, only title proper for titles
- Construct a SPARQL Query using author surname and title regular expressions pre-filtered for speed by a significant word using bif:contains
- Filter Hits by date of publication
- Obtain and display metadata from BNB
More details of these steps are below:
1. Set up ARC2 to enable SPARQL searching and RDF processing
ARC2 is a simple-to-use system for using RDF and SPARQL in PHP. I had previously played with it here when experimenting with creating my own RDF. The Sandy site uses SPARQL to populate the See Also sections.
2. Extract Further Reading section
A simple regular expression identifies the div in the HTML code that contains the reading list, which enables the next stage of the script to look for individual books.
3. Separate out Raw Data for each book
Another regular expression pulls out the paragraphs containing books and puts them in an array. You can see this by viewing the Raw Data.
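Steps 2 and 3 might look something like the sketch below. It is in Python rather than PHP, and the markup it expects (a div with id "further-reading" containing one paragraph per book) is a guess for illustration, not the actual structure of the In Our Time pages.

```python
import re


def extract_raw_books(html, div_id="further-reading"):
    """Return the text of each paragraph inside the reading-list div."""
    # Step 2: find the div holding the reading list (hypothetical id).
    section = re.search(
        r'<div[^>]*id="' + re.escape(div_id) + r'"[^>]*>(.*?)</div>',
        html,
        re.S | re.I,
    )
    if section is None:
        return []
    # Step 3: one array entry per paragraph, with any inline tags stripped.
    paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", section.group(1), re.S | re.I)
    return [re.sub(r"<[^>]+>", "", p).strip() for p in paragraphs]
```

A regular expression like this stops at the first closing div, so nested divs inside the reading list would need a proper HTML parser instead.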
4. Determine pattern used in citation then extract Basic Data, e.g. author, title, article, publication using regular expressions
As the In Our Time site does not use a single standard form of citation, the script has to try to determine, using regular expressions, which of several possible patterns a citation follows, and then extract the correct bits of data. This only works as well as it is possible to identify all the patterns, which effectively means looking at as many In Our Time pages as possible. This is one area that would certainly reward more work. It also points out how difficult it would be to extrapolate this into a script that could read any citations. The In Our Time booklist script currently recognises a small set of citation patterns, each identified with a number (1, 2, 3, 4, 15, 5). If you look at the Book Data for a particular book you will see the citation style number given. The regular expressions capture author, title, and publication.
The author information in citations on In Our Time is unpredictable. Sometimes surnames are first, sometimes they are last. The citation patterns take care of this if possible and try to extract one significant name.
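The pattern-matching idea can be illustrated with a small sketch, again in Python rather than PHP. The two patterns below are invented for the example (the first follows the "author, title (publisher, year)" shape quoted earlier, the second is a made-up variant) and are not the script's real patterns or numbering.

```python
import re

# Hypothetical patterns, tried in order; each is tagged with a style number.
CITATION_PATTERNS = [
    (1, re.compile(
        r"^(?P<author>[^,]+),\s*(?P<title>.+?)\s*"
        r"\((?P<publisher>[^,()]+),\s*(?P<year>\d{4})\)\s*$")),
    (2, re.compile(
        r"^(?P<title>.+?)\s+by\s+(?P<author>.+?)\s*"
        r"\((?P<publisher>[^,()]+),\s*(?P<year>\d{4})\)\s*$")),
]


def parse_citation(raw):
    """Return the Basic Data for a citation, or None if no pattern fits."""
    for number, pattern in CITATION_PATTERNS:
        match = pattern.match(raw)
        if match:
            return {"style": number, **match.groupdict()}
    return None
```

Recording the style number alongside the extracted data makes it easy to see which pattern fired when a citation is parsed wrongly.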
5. Further refine elements to make searching easier, i.e. one surname for author, only title proper for titles.
The script removes things like “(ed.)” from the author, which would obviously throw off a catalogue search. Subtitles (everything after and including a semi-colon) are also removed from titles to lessen any chance of variation and lost matches.
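As a rough sketch of this tidying step (Python, not the script's PHP): the function names are invented, and the author function simply assumes "Forename Surname" order, whereas the real script relies on the citation pattern to know where the surname is.

```python
import re


def refine_author(author):
    """Drop markers such as "(ed.)" and keep a single surname."""
    cleaned = re.sub(r"\(\s*eds?\.?\s*\)", "", author, flags=re.I).strip()
    # Assumption: the surname is the last word once the marker is removed.
    return cleaned.split()[-1] if cleaned else ""


def refine_title(title):
    """Keep only the title proper: everything before the first semi-colon."""
    return title.split(";", 1)[0].strip()
```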
6. Construct a SPARQL Query using author surname and title regular expressions pre-filtered for speed by a significant word using bif:contains
Constructing the SPARQL query was the most tricky part. Ignoring the various standard prefixes pilfered from the standard example, the most important part is the title search. There are three unsatisfactory options:
- Match the extracted title directly to a dc:title. This doesn’t work as the cited title is unlikely to be exactly the same in all matters of words included, spacing, punctuation, etc.
- Use bif:contains for keyword searching as used in the BNB SPARQL example. This is certainly quick, but has a number of drawbacks: it can only be used once for a single keyword (any one of the two significant words in The history of Mexico, for example, will produce a huge number of hits). It is also not standard SPARQL. I was happy to overlook this if it worked, but ARC2 didn’t like it at all until I worked out it has to be used in angle brackets e.g. ?title <bif:contains> “Mexico”.
- Use regular expressions (e.g. FILTER regex (?title, “The History of Mexico”, “i”)) for keyword searching. This is extremely powerful: you can easily construct searches but it is so slow as to routinely time out, so rendering it effectively useless.
The In Our Time script uses a combination of the last two techniques to get a result. First, it finds a significant word in the title, ideally the first four letter word after the first word (i.e. to avoid “The”, “A”, “That”, etc.) or, failing that, the first word. The SPARQL query then uses bif:contains to search for that word. The query then does a regular expression filter on the whole title. I don’t know if this is how SPARQL endpoints would generally work, but the BNB appears to only look for regular expression matches on the records already filtered by the bif:contains. In any case, it works.
In addition, the script also uses a regular expression to search by the author’s surname. It doesn’t search by date as the date of publication on BNB (dc:issued) is not in a standard format (e.g. “1994-01-01 00:00:00”, “2005 printing”, “c2006”). It is also not keyword searchable. You can see all the author-title hits with links to BNB records by viewing Hits.
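Putting step 6 together, the query construction might be sketched like this in Python. Several things here are assumptions rather than details from the script: "four letter word" is read as "a word of at least four letters", the dc prefix is taken to be Dublin Core terms, and the path from a book to its author's name (dc:creator then foaf:name) is a guess at how the BNB data is shaped.

```python
import re


def significant_word(title):
    """First word of 4+ letters after the first word, else the first word."""
    words = re.findall(r"[A-Za-z]+", title)
    for word in words[1:]:
        if len(word) >= 4:
            return word
    return words[0] if words else ""


def regex_escape(text):
    """Backslash-escape regular expression metacharacters."""
    return re.sub(r"([\\.^$*+?()\[\]{}|])", r"\\\1", text)


def sparql_literal(text):
    """Quote text as a SPARQL string literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_query(title, surname):
    """Keyword pre-filter with bif:contains, then regex filters on title and name."""
    return "\n".join([
        "PREFIX dc: <http://purl.org/dc/terms/>",       # assumed namespace
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>",
        "SELECT ?book ?title ?name ?date WHERE {",
        "  ?book dc:title ?title .",
        "  ?title <bif:contains> " + sparql_literal(significant_word(title)) + " .",
        "  ?book dc:creator ?creator .",                # assumed author path
        "  ?creator foaf:name ?name .",
        "  ?book dc:issued ?date .",
        "  FILTER regex(?title, " + sparql_literal(regex_escape(title)) + ', "i")',
        "  FILTER regex(?name, " + sparql_literal(regex_escape(surname)) + ', "i")',
        "}",
    ])
```

The date is retrieved but deliberately not filtered on in the query, for the reason given above.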
7. Filter Hits by date of publication
You can, however, retrieve the date from the BNB and process it afterwards, which is what the script does. It finds the four digit year and compares this to the four digit year it found on the In Our Time site. You can see all the author-title-date hits with links to BNB records by viewing Date Hits. Perhaps rather arbitrarily, the first book in the resulting array is selected as the result.
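The date comparison can be sketched as below, in Python for illustration. The shape of a hit (a dictionary with an "issued" key) is invented for the example; the date strings in the usage lines are the BNB formats mentioned earlier.

```python
import re


def extract_year(text):
    """Return the first standalone four-digit run in text, or None."""
    match = re.search(r"(?<!\d)(\d{4})(?!\d)", text)
    return match.group(1) if match else None


def filter_by_year(hits, cited_year):
    """Keep only hits whose dc:issued value contains the cited year."""
    return [hit for hit in hits if extract_year(hit["issued"]) == cited_year]


hits = [
    {"uri": "hit-1", "issued": "1994-01-01 00:00:00"},
    {"uri": "hit-2", "issued": "2005 printing"},
    {"uri": "hit-3", "issued": "c2006"},
]
date_hits = filter_by_year(hits, "2006")
result = date_hits[0] if date_hits else None  # first match wins, as in the script
```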
8. Obtain and display metadata from BNB
During the search, the matching title, author (only one), and date are obtained from BNB. The title and author are displayed, as are the stripped-down year of publication and a link to the full BNB record. For records that returned no hits on the BNB, the Basic Data is simply regurgitated.
The script also downloads the full combined RDF for all the hits, which is displayed at the bottom of the page and is viewable in several formats.
Further work
I think a lot more work could be done on this given time, both to improve it and to extend it. In no particular order:
- Make it prettier. It is currently designed to look merely acceptable, as I concentrated on functionality. I have also tried to show much of my working, which a finished version would obviously hide.
- Refinement of the detection of citation style. This is probably the most critical improvement, and ultimately decides if this approach would be useful outside of In Our Time on other reading lists. There are more patterns that need to be added, especially for older pages.
- Further preparation of data for searching. Currently, for example, a book on the Mexico reading list doesn’t return any hits because of the exclamation mark in “Zapata!”. This could be stripped, and there are lots of similar refinements no doubt.
- More interesting/useful output. The script’s outputs are currently quite raw or basic as I concentrated on the mechanisms for pulling information from free text for automatic catalogue searching. It might be useful to output proper standalone RDF files, references in standard reference formats (e.g. Harvard) in HTML or text files, files in standard reference management formats, perhaps even MARC, and so on. Some of these would perhaps be fairly straightforward.
- Links to catalogues or online bookshops so you could borrow or buy the books from the reading list based on ISBNs taken from the BNB record.
- Searching more catalogues. If a search fails on the BNB, the script could search other open catalogues, e.g. the Cambridge catalogue.
- Greasemonkey script or plugin so that a button appears next to the Further Reading section when you view an In Our Time page. This could even appear next to individual books. Ideally (pie in the sky) such a plugin would have a stab at finding books on any web page.
- Other ways of firing the script not requiring manual addition to the URL or use of a bookmarklet, e.g. a searchbox of some kind (either accepting a URL as input or keywords for the titles of broadcasts).
Please do leave comments or questions.