One record in lots of data formats

For a Dev8d session I did with Owen Stephens in February I presented data for a single book and followed how it had changed as standards changed, trying above all to explain to non-cataloguers why catalogue records look and work the way they do. At least one person found it useful. I am now drafting an internal session at work on the future of cataloguing and am planning to take a similar approach to briefly explain how we got to AACR2 and MARC21, and where we are heading. I took the example I used at Dev8d and hand-crafted some RDA examples, obtained a raw .mrc MARC21 file, and used the RDF from Worldcat to come up with a linked data example.

I have tried to avoid notes on the examples themselves. However, do note the following: the examples generally use only the same simple set of data elements, basically the bits you might find on a basic catalogue card: no subjects, few notes, etc.; the book is quite old so there is no ISBN anyway. The original index card is from our digitised card catalogue. The linked data example was compiled by copying the RDFa from the Worldcat page for the book; this was then put into this RDFa viewer (suggested by Manu Sporny) to extract the raw RDF/Turtle; I manually hacked this further to replace full URIs with prefixes as much as possible in an attempt to make it more readable (I suspect this is where some errors may have crept in). The example itself is of course a conversion from an AACR2/MARC21 record. C.M. Berners-Lee is Tim's dad.

Feel free to use this and to point out mistakes. I would particularly welcome anyone spotting anything amiss in the RDA and linked data, where I am sure I have mangled the punctuation in both.

Harvard Citation

Berners-Lee, C.M. (ed.) 1965, Models For Decision: a Conference under the Auspices of the United Kingdom Automation Council Organised by the British Computer Society and the Operational Research Society, English Universities Press, London.

Pre-AACR2 on Index Card

BERNERS-LEE, C.M., [ed.].

Models for decision; a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society.

London, 1965.

x, 149p. illus. 22cm.

AACR2 on Index Card

Models for decision : a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society / edited by C.M. Berners-Lee. -- London : English Universities Press, 1965.

x, 149 p. : ill. ; 23 cm.

Includes bibliographical references.

-       Berners-Lee, C. M.

AACR2 in MARC21 (raw .mrc)

00788nam a2200181 a 4500001002700000005001700027008004100044024001500085245021000100260004900310300003200359504004100391650003300432700002300465710003900488710003000527710004900557_UCL01000000000000000477125_20061112120300.0_850710s1965    enka     b    000 0 eng  _8 _ax280050495_00_aModels for decision :_ba conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society /_cedited by C.M. Berners-Lee._  _aLondon :_bEnglish Universities Press,_c1965._  _ax, 149 p. :_bill. ;_c23 cm._  _aIncludes bibliographical references._ 0_aDecision making_vCongresses._1 _aBerners-Lee, C. M._2 _aUnited Kingdom Automation Council._2 _aBritish Computer Society._2 _aOperational Research Society (Great Britain)__
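The raw record above is one long string with a fixed structure: a 24-character leader, a directory of 12-character entries (tag, field length, starting position), then the variable fields themselves; the underscores in the dump stand in for the unprintable field and subfield delimiters. As a rough sketch only (not a production MARC parser), splitting such a record apart might look like this:

```javascript
// A minimal sketch of pulling a raw MARC21 record apart into its
// leader, directory, and variable fields. Assumes `record` is the
// raw record as a string with the usual 0x1E field terminators.
function parseMarc(record) {
  var leader = record.substring(0, 24);
  // Characters 12-16 of the leader give the base address of the data
  var baseAddress = parseInt(leader.substring(12, 17), 10);
  // The directory sits between the leader and the base address;
  // each 12-character entry is tag (3), field length (4), start (5)
  var directory = record.substring(24, baseAddress - 1);
  var fields = [];
  for (var i = 0; i < directory.length; i += 12) {
    var tag = directory.substring(i, i + 3);
    var len = parseInt(directory.substring(i + 3, i + 7), 10);
    var start = parseInt(directory.substring(i + 7, i + 12), 10);
    // Slice the field data out, dropping the trailing field terminator
    var data = record.substr(baseAddress + start, len - 1);
    fields.push({tag: tag, data: data});
  }
  return {leader: leader, fields: fields};
}
```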

AACR2 in MARC21 (formatted)

245 00 $a Models for decision :
$b a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society /
$c edited by C.M. Berners-Lee.
260 __ $a London :
$b English Universities Press,
$c 1965.
300 __ $a x, 149 p. :
$b ill. ;
$c 23 cm.
504 __ $a Includes bibliographical references.
700 1_ $a Berners-Lee, C. M.

RDA

Title proper: Models for decision
Other title information: a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society
Statement of responsibility relating to title proper: edited by C.M. Berners-Lee
Place of publication: London
Publisher's name: The English Universities Press Limited
Date of publication: 1965
Copyright date: ©1965
Media type: unmediated
Carrier type: volume
Extent: x, 149 pages
Dimensions: 23 cm
Content type: text
Illustrative content: Illustrations
Supplementary content: Includes bibliographical references.
Contributor: Berners-Lee, C. M.
Relationship designator: editor of compilation

RDA in MARC21

245 00 $a Models for decision :
$b a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society /
$c edited by C.M. Berners-Lee.
264 _1 $a London :
$b The English Universities Press Limited,
$c 1965.
264 _4 $c ©1965
300 __ $a x, 149 pages :
$b illustrations ;
$c 23 cm.
336 __ $a text
$2 rdacontent
337 __ $a unmediated
$2 rdamedia
338 __ $a volume
$2 rdacarrier
504 __ $a Includes bibliographical references.
700 1_ $a Berners-Lee, C. M.,
editor of compilation.

Linked data

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix worldcat: <http://www.worldcat.org/oclc/> .
@prefix library: <http://purl.org/library/> .
@prefix viaf: <http://viaf.org/viaf/> .
@prefix lc_authorities: <http://id.loc.gov/authorities/> .
@prefix madsrdf: <http://www.loc.gov/mads/rdf/v1#> .

worldcat:221944758
  rdf:type schema:Book;
  library:oclcnum "221944758";
  schema:name "Models for decision : a conference under the auspices of the United Kingdom Automation Council organised by the British Computer Society and the Operational Research Society";
  library:placeOfPublication _:1;
  schema:publisher _:4;
  schema:datePublished "[1965]";
  schema:numberOfPages "149";
  schema:contributor viaf:149407214;
  schema:contributor viaf:130073090;
  schema:contributor viaf:137135158;
  schema:contributor viaf:36887201 .

_:1
  rdf:type schema:Place;
  schema:name "London :" .

_:4
  rdf:type schema:Organization;
  schema:name "English Universities Press" .
  rdf:type schema:Organization;
  madsrdf:isIdentifiedByAuthority lc_authorities:n79056431;
  schema:name "British Computer Society." .
  rdf:type schema:Organization;
  madsrdf:isIdentifiedByAuthority lc_authorities:n85076053;
  schema:name "Operational Research Society." .
  rdf:type schema:Organization;
  madsrdf:isIdentifiedByAuthority lc_authorities:n79063901;
  schema:name "Institution of Electrical Engineers." .
  rdf:type schema:Person;
  schema:name "Berners-Lee, C. M." .

How big is my book: Mashcat session

At Mashcat on 5 July in Cambridge I gave an afternoon session on getting computer-readable information from the textual information held in MARC21 300 fields using Javascript and regular expressions. I intended this to be useful for cataloguers who might have done some of Codecademy's Code Year programme, as well as an exploration of how data is entered into catalogue records, its problems, and potential solutions.

AACR2/MARC (and RDA) records store much quantitative information as text, usually as a number followed by units, e.g. "31 cm." or "xi, 300 p.". This is not easy for computers to deal with. For instance, a computer programme cannot compare two sizes, e.g. "23 cm." and "25 cm.", without first extracting a number from each string (23 and 25) as well as determining the units used (cm). In some cases, units might vary: in AACR2, books below 10 cm. are measured in mm., and non-book materials are often measured in inches (abbreviated to in.). Potential uses for better quantitative data in the 300$c include planning shelving for reclassification and more easily finding books of a particular size or within a range of sizes.
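To see the problem concretely, a naive comparison of two raw 300$c strings gives the wrong answer, because it compares characters rather than quantities:

```javascript
// Comparing the raw strings compares characters, not quantities:
// "9" sorts after "3" as a character, so "9 mm." wrongly counts
// as larger than "30 cm."
var naive = "9 mm." > "30 cm.";
console.log(naive); // true, although 9 mm is far smaller than 30 cm
```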

Before the session, I sketched out a possible solution using Javascript and regular expressions to make this conversion for dimensions in the 300$c. I have put up a version of A script to find the size of an item in mm. based on the 300$c, with the addition of an extra row which you can fill in to test your own examples without having to edit the script.

If you do want to look at how it works or try editing it yourself you can view source, copy all the HTML, then paste it into a text editor. Save it, then open the file using a browser to test it. Refresh the browser when you change the file.

The heart of the script looks like this:

// An array of test examples
var dollar_c = [
  "9 mm",
  "4 in.",
  "4 3/4 in.",
  "30 cm.",
  "1/2 in.",
  "20 x 40 cm."
];

// Convert text to mm
function text_to_mm (text) {
  // Convert fractions to decimals, e.g. "4 3/4" to 4.75
  text = text.replace(/(\d*) (\d)\/(\d)/, function(str, p1, p2, p3) {return parseFloat(p1) + p2/p3});
  text = text.replace(/(\d)\/(\d)/, function(str, p1, p2) {return parseFloat(p1/p2)});
  // Extract the size of the book
  var size = text.replace(/([\d\.]*).*/, "$1");
  // Extract the units
  var units = text.replace(/.*([a-z]{2}).*/g, "$1");
  // Convert from various units to mm
  var mm;
  if (units === "mm") {
    mm = size * 1; // multiplying by 1 turns the string into a number
  }
  if (units === "cm") {
    mm = size * 10;
  }
  if (units === "in") {
    mm = size * 25.4;
  }
  return mm;
}

It starts with a declaration of an array of examples to be tested: you can alter this with your own if you prefer. text_to_mm is the function that does all the work. It takes in the text from a 300$c, converts fractions (e.g. 4 3/4) to decimals (4.75), finds a number, finds a unit, then performs calculations on the size depending on what the unit is to produce a standard figure in mm. At Mashcat, Owen Stephens managed to plug an adaptation of this script into Blacklight to create an index of book sizes. Using this he could do things like find the most common sizes or the largest book in a collection.

The main focus of my session, however, was on a similar script to figure out how many actual pages there are in a book, given the contents of a 300$a, e.g. "300 p.", "ix, 350 p.", "100 p., [45] leaves of plates" (a page being one side of a sheet of paper; a leaf being a sheet of paper printed on only one side, therefore counting as two pages). I have also published a version of A script to find the absolute no. of pages based on the 300$a with the similar addition of a row for easy user testing. Potential uses for recording the number of pages rather than pagination include planning shelving space, displays that are easier for users to understand, and finding books of specified lengths.

The script starts with a similar array of examples to be tested:

// An array of test examples
var dollar_a = [
  "9 p.",
  "30 leaves",
  "30 p., 20 leaves",
  "xiv, 20 p.",
  "20, 30 p.",
  "20, 30, 40 p.",
  "xv, 20, 30, 40 p., 5, 5 leaves of plates",
  "clviii, ii, 4, vi p."
];

The main function is called text_to_pages. The first thing it does is convert any Roman numerals to Arabic ones. The heavy lifting for this is a function by Steven Levithan which does the actual number conversion. However, we still need to identify and extract the Roman numerals from the pagination in order to convert them. This line does the extraction and makes a list of the Roman numerals:

var roman_texts=text.match(/[ivxlc]*[, ]/g);

The session I gave concentrated on regular expressions (a bit like the wildcards you use on library databases but turned up to eleven) which in all cases here are contained within slashes, and I made a simple introductory guide to regular expressions (.docx). There are many guides to regular expressions on the web too, and useful testers to play with such as this one. The regular expression in the line above can be broken down as follows:

  • [ivxlc] uses square brackets to look for any one of the characters listed within them.
  • The following * means look for any number of these in a row.
  • [, ] matches either a comma or a space, again using square brackets. Obviously these characters are not used in Roman numerals, but they are a convenient way of isolating these characters as numerals rather than, say, the "l" in "leaves", which would otherwise also match.

The next few lines work through the list, replace any instances of [, ] with "" (i.e. nothing) to leave the bare Roman numerals, convert all the numbers in the list using Steven Levithan's function, then do the replacements on the pagination given in text:

if (roman_texts) {
    for (i=0; i<roman_texts.length; i++) {
      // Remove the trailing comma or space
      roman_texts[i]=roman_texts[i].replace(/[, ]/,"");
      var arabic_text = deromanize(roman_texts[i])+" ";
      text = text.replace(roman_texts[i],arabic_text+" ");
    }
  }

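The deromanize function itself isn't shown here: the script uses Steven Levithan's version. Purely as an illustration of the idea, though, a minimal stand-in (my own sketch, not Levithan's code) might look like this: work through the numeral adding each letter's value, subtracting instead when a smaller value comes before a larger one (as in "iv").

```javascript
// A simple stand-in for Roman numeral conversion (not Levithan's
// original): add each letter's value, subtracting when a smaller
// value precedes a larger one, e.g. the "i" in "iv"
function deromanize(roman) {
  var values = {i: 1, v: 5, x: 10, l: 50, c: 100, d: 500, m: 1000};
  var total = 0;
  for (var i = 0; i < roman.length; i++) {
    var current = values[roman.charAt(i)];
    var next = values[roman.charAt(i + 1)];
    total += (next && current < next) ? -current : current;
  }
  return total;
}
```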
Like the size script above, the rest of the conversion needs to do two things: find the numbers and find the units. To do this we need to find the sequences involved. While this is easy with something like "24 p." (number is 24, unit is p) or even "xv leaves" (number is 15, unit is leaves), it becomes troublesome when you get something like "23, 100 p.": the first number is 23 but there is no unit associated with it, only a comma to signify that it is a sequence at all. The following lines try to get round this problem by looking for sequences where the comma appears to be the unit and then looking ahead to find the next unit. In the "23, 100 p." example the script would keep looking forward past the 100 until it gets to the "p".

// Convert 20, 30 p. to 20 p. 30 p.
  while (text.match(/\d*,/)) {
    text = text.replace(/(\d*),(.*?(p|leaves))/, "$1 $3 $2");
  }

The first regular expression in the while line looks for:

  • \d* any number of digits. \d is any digit and * looks for any number of them, followed by
  • , a comma

So as long as the script finds any sequence of digits followed by a comma (and since \d* can match zero digits, in practice any remaining comma), it will carry on making the replacement underneath it. The replacement line itself looks for:

  • \d* any number of digits again, followed by
  • , a comma
  • .*? which is . any character, * any number of times. The ? makes sure that the smallest possible group of characters is matched; otherwise the expression would think that the unit corresponding to the number 15 in the pagination "15, 25 p., 50 leaves" is "leaves" rather than "p".
  • p|leaves either p or leaves. The pipe means match either what is on the left of it or what is on the right. Because this is in a set of round brackets, the pipe only applies there rather than to the whole expression.

Round brackets also capture groups, which is really useful here: the first set of brackets captures the number of pages and stores it as $1, the second set captures everything between the comma and the end of the units as $2, and the third set captures the units only, either "p" or "leaves", and stores them as $3. So in the example "15, 25 p., 50 leaves", $1 is "15", $2 is " 25 p", and $3 is "p". The replacement puts these back in a different order, i.e. "$1 $3 $2", which would be "15 p 25 p".
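You can check this replacement directly in a Javascript console (note that $2 keeps its leading space, so the result gains a double space):

```javascript
// The reordering replacement from the script, run once on the
// example from the text: the 15 gains its "p" unit
var text = "15, 25 p., 50 leaves";
text = text.replace(/(\d*),(.*?(p|leaves))/, "$1 $3 $2");
// text is now "15 p  25 p., 50 leaves"
```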

Now that all the sequences will be in number-unit pairs, we can get on with making a list of them to work through:

 // Find sequences
  var sequences = text.match(/\d+.*?(,|p|leaves)/g);

This looks for:

  • \d+ at least one digit
  • .*? any number of any characters, although not being greedy
  • (,|p|leaves) any of a comma, “p”, or “leaves”. Obviously, if the while loop above has worked, then the comma isn’t needed, but I’ll confess this is a hangover from a previous version of the script…

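On a string where the commas have already been dealt with, the match produces a neat list of number-unit pairs:

```javascript
// Matching number-unit sequences once the commas have been
// rewritten into explicit units
var text = "15 p  25 p., 50 leaves";
var sequences = text.match(/\d+.*?(,|p|leaves)/g);
// sequences is ["15 p", "25 p", "50 leaves"]
```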
The next section goes through each of the sequences found and extracts the number and then the unit:

// go through sequences
  var pages = 0;
  for (var i=0; i<sequences.length; i++) {
    // Extract the number
    var number = parseFloat(sequences[i].match(/\d+/g)[0]);
    // Extract the units
    var units = sequences[i].match(/(p|leaves)/g)[0];
    if (units == "p") {
      pages = pages + number;
    }
    if (units == "leaves") {
      pages = pages + number * 2;
    }
  }

The regular expression to find the number is straightforward:

  • \d+ at least one digit

The parseFloat converts the digits as a string to a Javascript number. The regular expression to find the unit is also simple:

  • (p|leaves) either “p” or “leaves”

If the units are “p”, then the variable pages is incremented by the value of the number found; if “leaves”, then pages is incremented by twice that number.
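Pulling all the pieces above together, here is a condensed, self-contained version of the whole routine. It uses the simple stand-in for Steven Levithan's deromanize function rather than his original, and tightens the Roman numeral match to [ivxlc]+ so it requires at least one numeral; otherwise the logic follows the script as described.

```javascript
// Stand-in Roman numeral converter (not Levithan's original)
function deromanize(roman) {
  var values = {i: 1, v: 5, x: 10, l: 50, c: 100};
  var total = 0;
  for (var i = 0; i < roman.length; i++) {
    var current = values[roman.charAt(i)];
    var next = values[roman.charAt(i + 1)];
    // Subtract when a smaller numeral precedes a larger one (e.g. "iv")
    total += (next && current < next) ? -current : current;
  }
  return total;
}

// Condensed version of the page-counting logic described above
function text_to_pages(text) {
  // Convert Roman numeral runs (followed by a comma or space) to Arabic
  var roman_texts = text.match(/[ivxlc]+[, ]/g);
  if (roman_texts) {
    for (var i = 0; i < roman_texts.length; i++) {
      roman_texts[i] = roman_texts[i].replace(/[, ]/, "");
      text = text.replace(roman_texts[i], deromanize(roman_texts[i]) + " ");
    }
  }
  // Give comma-only sequences an explicit unit: "20, 30 p." -> "20 p  30 p."
  while (text.match(/\d*,/)) {
    text = text.replace(/(\d*),(.*?(p|leaves))/, "$1 $3 $2");
  }
  // Work through the number-unit sequences, counting leaves twice
  var sequences = text.match(/\d+.*?(,|p|leaves)/g) || [];
  var pages = 0;
  for (var j = 0; j < sequences.length; j++) {
    var number = parseFloat(sequences[j].match(/\d+/)[0]);
    var units = sequences[j].match(/p|leaves/)[0];
    pages += (units === "leaves") ? number * 2 : number;
  }
  return pages;
}
```

For example, "xiv, 20 p." gives 34 (14 preliminary pages plus 20), and "30 p., 20 leaves" gives 70 (30 pages plus 20 leaves counted twice).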

The programme should cope with the loss of abbreviations in RDA, as "p." is expanded to "pages" but the regular expression to find the units will still find the "p" at the beginning, just as it isn't put off by the full stop after the "p". It could be expanded to look for other variations, and I will do so if I can:

  • “S.” for German “Seite” or “Seiten”.
  • “leaf”, as in “1 leaf of plates”
  • sequences which start in the middle of larger ones, like journal issues with "xii, p. 546-738". This one will be the most complicated as it goes against the basic flow of the existing code.

I also haven’t properly tested folded sheets or multiple volume works. Other improvements are needed in failing more gracefully when it doesn’t find what it’s expecting: the programme should really test the existence of the arrays it makes before looping through them, but this would make it harder to understand at a glance or demonstrate on screen so I didn’t do it.

The scripts are written in Javascript for several reasons: it is the language that Codecademy focusses on for beginners; it requires no specialist environment, server, or even a web connection: you just need a basic text editor and a browser; it is easy to adapt for a web page if you do manage to build something; and, it is the language I am most confident working in. It would be fairly easy to port to other languages though, and Owen changed the size script with some other modifications to work in Beanscript/Java in Blacklight.

I can't speak for the attendees, but I learnt a lot, and much was made clearer, from playing around with these scripts and talking to people at Mashcat:

  • Quite how dependent AACR2 and RDA (and consequently MARC21) are on textual information, even for what appears to be quantitative data.
  • That even for what appears to be standard number-unit data, there are many complications that make it non-trivial to extract the data:
    • fractions (not even decimals) in 300$c
    • differing units: book sizes in mm. or cm. depending on how big the book is; disc sizes in in.; extent in pages or leaves (or volumes or atlases or sheets…)
    • sequences with implied units, such as those with commas.
  • there is frequently a lack of clarity, and outright ambiguity, about what is actually being measured:
    • for books the dimension recorded is normally height (although this is not explicit from a user's point of view; sometimes it's height and width, and for a folded sheet it could be all sorts of things); for a disc it's the diameter.
    • For the 300$a, what's being recorded is pagination, something entirely different from the number of pages. Although important for things like rare books, how important is complete pagination for most users compared to a robust idea of how large a book is? Amazon provide a number of pages. More importantly, how understandable is pagination? During my demonstration, some of my audience of librarians were left cold by the meanings of square brackets, for example (and square brackets can mean any number of things depending on context). Perhaps there is room for both.

I suppose this latter point is a potential conclusion. Ed Chamberlain asked me what I thought should be done. I don’t know to be honest. I think, like much of the catalogue record, lots more research is needed to see what users (both human and computer) actually want or need. It should be said that entering pagination is in many ways easier for the cataloguer. However, I do think we need:

  • quantitative data entered as numbers with clear and standard units. For instance, record all book heights as mm. and convert to cm. for display if needed.
  • more data elements to properly make clear what is being recorded. Instead of a generic dimension, we need height, width, depth?, diameter, etc. Instead of pagination, we could have separate elements for pagination, number of pages, and number of volumes (50 volumes of 10 pages each is not the same as 4 volumes of 1000 pages each). Obviously not all of them would be needed for all items.

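As a sketch of the first suggestion, assuming a hypothetical record structure with heights stored as plain numbers of millimetres (the field name here is mine, not from any standard), storage and display separate cleanly and comparison becomes trivial:

```javascript
// Hypothetical record: height stored as a plain number in mm
var book = {height_mm: 228};

// Convert to cm only at display time, rounding to the nearest cm
function displayHeight(height_mm) {
  return Math.round(height_mm / 10) + " cm";
}

// Numeric comparison now needs no parsing at all
var heights = [228, 185, 312];
var tallest = Math.max.apply(null, heights);
```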
The research to enable us to choose what to record, why we’re recording it, and for whose benefit would be the best starting point for this as well as many other questions in cataloguing and metadata.

RDA as a closed standard

Resource Description and Access (RDA), the new bibliographic standard to replace AACR2, was released in 2010 on the web as a closed standard sitting behind a paywall. This really worries me. I strongly believe it should be an open standard.

What do I mean by closed?
By closed I basically mean that you have to pay or subscribe to access it. In many ways, this is not dissimilar to AACR2. For decades, libraries (and individuals) paid for various editions of AACR2, which has always been primarily a print product, as well as for various updates when it changed to looseleaf format. Recently it has also been available on the web via Cataloger's Desktop. RDA is primarily a web product via the RDA Toolkit, although a concession was eventually made to release it in print as well.

An open standard would be one that, according to Wikipedia, "is publicly available" even if it "has various rights to use associated with it". This would be one which any cataloguer, librarian, or crucially, non-librarian, could see and benefit from. A practical definition for the purposes of this post would be a standard I could go and look at right now without a subscription: I can, if I wish, apply the standard without hindrance, I can assess it with ease, and, ideally, build on it without restriction. RDA is not open, although, to be fair, one part of RDA has been released openly: the element set and vocabularies.

Other open standards
Open (publicly available) standards are quite common. Some well-known examples:

The following are open, although I'd have a lawyer to hand if you wanted to do anything with them:

Some closed ones for comparison:

These are of course more subjective lists than they look, but you get the idea. The closed list was actually bigger until, upon examination, I found that the JPG, GIF, and even MS Office standards are publicly readable, even if I'm not sure what more you could legally do with them. I'd be happy to add more to the closed list to balance things out a little.

Why is RDA not open?
Money. This is a delicate matter that I don't want to delve into too much, although it is obviously central to the openness of the standard. It's also hard to talk about without appearing to make wild assertions, and I hope I haven't been unfair. I've heard Alan Danskin of the JSC explaining that they'd thought about releasing RDA openly but that they had to cover costs. I'm not exactly sure what the costs of production were, although they presumably included expenses and staff costs, and production of the product itself. The last is, I think, unfortunate, as I would like to have seen a far simpler publication of RDA without all the bells and whistles, login barriers, and the need to learn a new interface as well as a new standard. Compare with the HTML4 standard, which is a set of simple HTML documents with normal links. I don't need to learn how to use that. Or, come to that, the MARC21 site. I wonder how much of the fee goes towards setting up and maintaining the RDA Toolkit platform.

With my tin foil hat on, I also wonder how much of the fee is needed to restore revenue to the co-publishers, since AACR2 has been in unrevised abeyance.

Why does it matter?
It matters because RDA (and with it all the high quality traditional cataloguing techniques) will not be widely used without being open. I think you can divide the potential RDA userbase as follows:

  1. Libraries with enough money to switch to RDA
  2. Libraries without enough money to switch to RDA
  3. Non-libraries dealing with metadata

Those in group 1 will buy RDA, but some libraries (Group 2) will not see the benefit for the costs of conversion and training, let alone the costs of subscription. For 'traditional' cataloguing to thrive, therefore, we need to involve Group 3. However, those in Group 3 will not be able to even have a look at RDA to see if it meets their needs. I think RDA will be lucky to retain the same user base as AACR2, let alone break into new areas and influence the way other metadata is carried out. Those in the metadata community who, I suspect, have already been put off by AACR2, are unlikely to even try looking at RDA if it involves forking out a subscription.

I recently sat in a room with about 15 or so people mostly involved in metadata for institutional repositories and the like. During some discussions they flagged up two problems they were having: establishing a consistent form of name, and a standard set of data elements. I asked myself, would I recommend RDA to them to help solve these problems? Even if I thought it met their needs, could they even have a look to see if it did? No. They will either come up with their own solution or look elsewhere for it, which is already what they have been doing. I can’t see us taking more people with us, just a proportion of the people already using AACR2.

Openness also matters because having a closed standard doesn't reflect terribly well on librarianship in general. I have a friend in IT who Laughed Out Loud* when I said the new library metadata standard was behind a paywall. In the new world of openness, where even Microsoft loosely adhere to web standards, traditionally closed governments are leading the charge to release more data, and the world has been transformed by the open standards of the web, are we to follow The Times behind a paywall? Personally I feel libraries, librarians, and library data should be at the forefront of openness, not grudgingly following behind or not following at all.

What could be done?
This is the nub of the matter. I’m no marketing expert and maybe I’m naive and there is nothing that could be done. However, working on the assumption that all that needs to be done is to break even and pay the costs of production for RDA, I would suggest the following ideas for a start:

  • Make a flashy web product anyway and charge lots more for it. Many more well-off libraries would pay for a product like the Toolkit if it’s good enough.
  • There is a need for a more accessible version or versions of RDA, e.g. just for books or in a more convenient format like, say, the Chan books on LCSH or the green editions of AACR2. The co-publishers could fulfill this need which I imagine would be easily done by re-using the data they already have.
  • Explanatory books. There are a number of these on the market or on the way already. The co-publishers could publish an official companion.
  • Consultancy and training. There is going to be a big demand for this soon in any case.
  • Involve more organisations in the drafting and publishing of RDA to share the costs, e.g. publishers, LMS vendors, commercial metadata suppliers, other metadata initiatives. I think it would be a positive and pragmatic move to have these parties on board anyway. They would be more likely to use the high quality standard produced and we would be more likely to be using metadata that meets all our purposes.

See Also
I notice a post covering some of these issues by carolslib from a few days ago. From the Catalogs of Babes also has a similar post, RDA: why it won’t work, from a few weeks ago which much more succinctly makes some of the same points:

Many librarians are balking at the cost of implementing RDA, I think rightfully so, although not for the same reasons. I’m not bitching about it because it’s unaffordable for smaller libraries, or because it’s a subscription rather than a one-time printed book cost (although I think those are valid points). I’m bitching because putting a dollar amount on something, no matter how low it is, will stop people from using something, especially if there’s a free alternative. In this case, I see the free alternative as ‘ignoring rules altogether and/or making up your own standards.’ Requiring a price makes adhering to standards–a key value-added service of libraries and librarians–inaccessible. Which is pretty ironic, considering that libraries are supposed to be all about access. We’re all proactive about offering access to our patrons, but we can’t extend that same philosophy to ourselves, to help us do a better job??

[Updated 23 November 2012] Terry Reese asks Can we have open library standards, please? Free RDA/AACR2.

* He literally LOL’d, although no ROFLing took place admittedly.