ISKO-UK linked data day

On 14 September I went to the ISKO-UK one day conference on Linked Data: the Future of Knowledge Organisation on the Web.  For me, this followed on from a previous Talis session on Linked Data and Libraries I attended at the British Library in June, which I found really very interesting and informative.

The ISKO conference was a lot broader in scope (several speakers discussing the BBC’s use of linked data noted that there were 22 attendees from the BBC) and included talks about local and national government, business, and libraries, as well as the Ordnance Survey. The following is a brief and personal overview, pausing in more detail over aspects that interested me more. It assumes a passing acquaintance with linked data and basic RDF.

Professor Nigel Shadbolt from the University of Southampton, a colleague of Tim Berners-Lee both at Southampton and in advising the British Government on developing the data.gov.uk site, opened with a talk about Government Linked Data: A Tipping Point for the Semantic Web. There were two interesting points from this (there were many, but you know what I mean). First were the speedy and incredible effects of openly releasing government data. Professor Shadbolt used the analogy of John Snow’s mapping of the 1854 cholera epidemic, which identified a pump as the source and led to the realisation that water carried cholera. He mentioned the release of government bike accident data, which was little used by the government but was taken up by coders within days to produce maps of accident hotspots in London and guides to avoiding them.
The second point was the notion of the “tipping point” for the semantic web and linked data referred to in the talk’s title. Several speakers and audience members referred to the similar idea of the “killer implementation”, a killer app for the semantic web that would push it into the mainstream. The sheer quantity of data and the use it is quickly put to, often beyond the imagination of those who created and initially stored it, was quite compelling. Richard Wallis made a similar point when discussing the relative position of the semantic web compared to the old-fashioned web in the 1990s. He noted that it is now becoming popular to the extent that it is nearly impossible to realistically list semantic web sites, and predicted that it will explode in the next year or so. Common to both Nigel Shadbolt’s and Richard Wallis’s talks was a feeling almost of evangelism: Richard Wallis explicitly refers to himself as a technology evangelist; Nigel Shadbolt referred to open government data as “a gift”. Despite being relatively long in the tooth, RDF, linked data, and all that have not yet taken off and both seemed keen to push it: when people see the benefits, it won’t fail to take off. There were interesting dissenting voices to this. Martin Hepp, who had spent over eight years coming up with the commercial GoodRelations ontology, was strongly of the opinion that it is not enough to merely convince people of the social or governmental benefits, but rather the linked data community should demonstrate that it can directly help commerce and make money. The fact that GoodRelations apparently accounts for 16% of all RDF triples in existence and is being used by corporations such as BestBuy and O’Reilly (IT publishers) seems to point to a different potential tipping point. Interestingly, Andreas Blumauer in a later talk said that SKOS (an RDF schema to be discussed in the next paragraph) could “introduce Web 2.0 mechanisms to the ‘web of data’”.
Perhaps, then, SKOS is the killer app for linked data (rather than government data or commercial data as suggested elsewhere), although Andreas Blumauer also agreed with Martin Hepp in saying that “If enterprises are not involved, there is no future for linked data”. In my own ignorant judgement, I would suggest government data is probably a more likely tipping point for linked data, closely followed by Martin Hepp’s commercial data. It is government data that is making people aware of linked data, and especially open data, in the first place. This is more likely to recruit and enthuse. I think the commercial data will be the one that provides the jobs based on the foregoing: it may change the web more profoundly but in ways fewer people will even notice. I suppose it all depends on how you define tipping points or killer apps, which I don’t intend to think about for much longer.

The second talk, and the start of a common theme, was about SKOS and linked data, by Antoine Isaac. This was probably the most relevant talk for librarians and was for me a simple introduction to SKOS, which seems to be an increasingly common acronym. SKOS stands for Simple Knowledge Organisation System and is designed for representing (simply) things like thesauruses* and classification schemes, based around concepts. These concepts have defined properties such as preferred name (“skos:prefLabel”), non-preferred term (“skos:altLabel”), narrower term (“skos:narrower”), broader term (“skos:broader”), and related term (“skos:related”). The example I’ve been aware of for some time is the representation of Library of Congress Subject Headings (LCSH) in SKOS, where all the SKOS ideas I’ve just mentioned will be recognisable to a lot of librarians. In the LCSH red books, for example, preferred terms are in bold, non-preferred terms are not in bold and are preceded by UF, and the relationships between concepts are represented by the abbreviations NT, BT, and RT. In SKOS, concepts and labels are more clearly distinct. An example of SKOS using abbreviated linked data might be (stolen and adapted from the W3C SKOS primer):

ex:animals rdf:type skos:Concept;
    skos:prefLabel "animals";
    skos:altLabel "creatures";
    skos:narrower ex:mammals.

This means that ex:animals is a SKOS concept; that the preferred term for ex:animals is “animals”; a non-preferred term is “creatures”; and, that a narrower concept is ex:mammals. In a mock LCSH setting this might look something like this:

Animals
UF Creatures
NT Mammals
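
As a rough illustration of how the SKOS statements flatten into the red book layout, here is a minimal Python sketch. The dictionary structure, the function name red_book_entry, and the order in which the abbreviations are printed are all assumptions made for the example; they are not part of SKOS or of any LCSH tooling.

```python
# Concepts modelled as plain dicts keyed by a (hypothetical) identifier.
# The keys inside each concept mirror the SKOS property names.
concepts = {
    "ex:animals": {
        "prefLabel": "Animals",
        "altLabel": ["Creatures"],
        "narrower": ["ex:mammals"],
    },
    "ex:mammals": {
        "prefLabel": "Mammals",
        "broader": ["ex:animals"],
    },
}

# SKOS property -> red book abbreviation (print order is an assumption).
ABBREVIATIONS = [
    ("altLabel", "UF"),
    ("broader", "BT"),
    ("narrower", "NT"),
    ("related", "RT"),
]


def red_book_entry(concept_id, concepts):
    """Render one concept as a mock red book entry."""
    concept = concepts[concept_id]
    lines = [concept["prefLabel"]]
    for prop, abbreviation in ABBREVIATIONS:
        for value in concept.get(prop, []):
            # altLabel values are already strings; the other properties
            # point at concepts, so their preferred label is looked up.
            if prop == "altLabel":
                label = value
            else:
                label = concepts[value]["prefLabel"]
            lines.append(f"{abbreviation} {label}")
    return "\n".join(lines)


print(red_book_entry("ex:animals", concepts))
```

Note that the identifiers (ex:animals, ex:mammals) disappear in the output: only the labels survive the rendering.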

In the LCSH example, however, the distinction between concepts and terms is lost. One aspect of SKOS that Antoine Isaac spent some time on is the idea of equivalent concepts, especially across languages. In RDF you can bind terms to languages using an @ sign, something like this:

ex:animals rdf:type skos:Concept;
    skos:prefLabel "animals"@en;
    skos:prefLabel "animaux"@fr.

However, you can also link concepts more directly using skos:exactMatch, skos:closeMatch, skos:broadMatch, skos:narrowMatch, and skos:relatedMatch to link thesauruses and schemes together. These are admittedly a bit nebulous. He mentioned work that had been done on linking LCSH to the French Rameau and from there to the German subject thesaurus SWD. For example:

Go to http://lcsubjects.org/subjects/sh85005249 which is the LCSH linked data page for “Animals”. (You can view the raw SKOS RDF using the links at the top right, although sadly not in n3 or turtle format which I have used above). At the bottom of the page there are links to “similar concepts” in other vocabularies, in this case Rameau.
Go to the first one, http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119328694, and you see the Rameau linked data page for “Animaux”.

In the LCSH RDF you can pick out the following RDF/XML triples:

<rdf:Description rdf:about="http://lcsubjects.org/subjects/sh85005249#concept">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <skos:prefLabel>Animals</skos:prefLabel>
    <skos:altLabel xml:lang="en">Beasts</skos:altLabel>
    <skos:narrower rdf:resource="http://lcsubjects.org/subjects/sh95005559#concept"/>
    <skos:closeMatch rdf:resource="http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119328694"/>
</rdf:Description>

which is basically saying the same as (clipping the URIs for the sake of clarity):

lcsh:sh85005249#concept rdf:type skos:Concept;
    skos:prefLabel "Animals"@en;
    skos:altLabel "Beasts"@en;
    skos:narrower lcsh:sh95005559#concept;
    skos:closeMatch rameau:cb119328694.

Not too far from the first example I gave, with the addition of a mapping to a totally different scheme. Or in mock red book format again but with unrepresentable information missing:

Animals
UF Beasts
NT Food animals
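
The claim that the RDF/XML and the abbreviated form say the same thing can be checked mechanically. The following is a minimal sketch using only Python's standard library: it wraps the LCSH fragment in an rdf:RDF element with namespace declarations (added here so that it parses on its own) and lists one subject, predicate, object triple per child element. It only copes with this flat shape of RDF/XML and is not a general RDF parser; a dedicated library such as rdflib would be the usual choice for real data. The function name triples is made up for the example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# The LCSH fragment, wrapped in rdf:RDF so it is a complete XML document.
RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://lcsubjects.org/subjects/sh85005249#concept">
    <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    <skos:prefLabel>Animals</skos:prefLabel>
    <skos:altLabel xml:lang="en">Beasts</skos:altLabel>
    <skos:narrower rdf:resource="http://lcsubjects.org/subjects/sh95005559#concept"/>
    <skos:closeMatch rdf:resource="http://stitch.cs.vu.nl/vocabularies/rameau/ark:/12148/cb119328694"/>
  </rdf:Description>
</rdf:RDF>"""


def triples(rdf_xml):
    """List (subject, predicate, object) for flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    result = []
    for description in root.findall(f"{{{RDF}}}Description"):
        subject = description.get(f"{{{RDF}}}about")
        for child in description:
            # ElementTree writes tags as "{namespace}local"; joining the
            # two parts gives the full predicate URI.
            predicate = child.tag[1:].replace("}", "", 1)
            # The object is either a linked resource or a literal string.
            resource = child.get(f"{{{RDF}}}resource")
            obj = resource if resource is not None else child.text
            result.append((subject, predicate, obj))
    return result


for triple in triples(RDF_XML):
    print(triple)
```

Each printed tuple corresponds to one line of the abbreviated form above, with the URIs written out in full. Language tags are ignored in this sketch.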

Oh that some mapping like this were available to link LCSH and MeSH…!

Several other talks touched on SKOS, such is its impact on knowledge management. Andreas Blumauer talked about it in demonstrating a service provided by punkt. netServices, called PoolParty.** I don’t want to go into depth about it, but it seemed to offer a very quick and easy way to manage a thesaurus of terms without having to deal directly with SKOS or RDF. During the talk, Andreas Blumauer briefly showed us an ontology based around breweries, then asked for suggestions for local breweries. Consequently, he added information for Fullers and published it right away. To see linked data actually being created and published (if not hand-crafted) was certainly unusual and refreshing. Most of what I’ve read and seen has talked about converting large amounts of data from other sources, such as MARC records, EAD records, Excel files, Access databases, or Wikipedia. I’ve had a go at hand-coding RDF myself, which I intend to write about if/when I ever get this post finished.

I don’t want to go into detail too much about it***, but another SKOS-related talk was the final one from Bernard Vatant who drew on his experience in a multi-national situation in Europe to promote the need for systems such as SKOS to deal more rigorously with terms, as opposed to concepts. Although SKOS would appear to be about terms, in many ways it is not clear on many matters of context. For instance, using skos:altLabel “Beasts” for the concept of Animals as in the examples given above gives no real idea of what the context of the term is. Here is a theoretical made-up example of some potential altLabels for the concept of Animals which I think makes some of the right points:

Animal (a singular)
Beasts (synonym)
Animaux (French term)
Animalia (scientific taxonomic term)

These could all be UF or altLabels but using UF or altLabel gives no idea about the relationship between the terms, and why one term is a non-preferred term. He gave another instance of where this might be important in a multinational and multilingual context, where the rather blunt instrument of adding @en or @fr is not enough, when a term is different in Belgian, French, or Canadian varieties of French. This has obvious parallels in English, where we often bemoan the use of American terms in LCSH. Whether embedded in LCSH or as a separate list, it might be possible to better tailor the catalogue for local conditions if non-preferred terms were given some context. Perhaps “Cellular telephones” could be chosen by a computer to be displayed in a US context, but “mobile telephones” could be chosen in a UK context if the context of those terms were known and specified in the thesaurus.
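
One way the “Cellular telephones” versus “mobile telephones” choice might work, assuming labels were tagged with a regional variant (such as en-US or en-GB) rather than a bare language, is sketched below in Python. This is only an illustration of the idea, not something proposed in the talk: the LABELS table, the function name choose_label, and the fallback order are all assumptions.

```python
# Hypothetical labels for a single concept, keyed by language-region tag.
LABELS = {
    "en-US": "Cellular telephones",
    "en-GB": "Mobile telephones",
}


def choose_label(labels, locale, default="en-US"):
    """Pick the label that best fits the reader's locale.

    Tries the full tag first (e.g. "en-GB"), then the bare language
    (e.g. "en"), then the default tag.
    """
    language = locale.split("-")[0]
    for tag in (locale, language, default):
        if tag in labels:
            return labels[tag]
    raise KeyError(f"no label available for {locale!r}")


print(choose_label(LABELS, "en-GB"))  # UK catalogue
print(choose_label(LABELS, "en-US"))  # US catalogue
```

A regional tag still says nothing about why a term is non-preferred (singular form, synonym, scientific name), so it addresses only the multilingual part of the problem.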

Moving away from SKOS, Andy Powell talked about the Dublin Core Metadata Initiative (DCMI). I’ll admit I’ve always been slightly confused as to what the precise purpose of Dublin Core (DC) is and how one should go about using it. Andy Powell’s talk explained a lot of this confusion by detailing how much DC had changed and reshaped itself over the years. To be honest, in many ways I found it surprising that it is still active and relevant given the summary I heard. The most interesting part of his talk for me was his description of the mistakes or limitations of the DCMI caused by its library heritage. Another confession (my notes here are awful), but the most important point that stuck out for me was the library use of a record-centric approach, e.g.:

  • each book needs a self-contained record
  • this record has all the details of the author, title, etc.
  • this record is used to ship the record from A to B (e.g. from bibliographic utility to library catalogue),
  • this record also tracks the provenance of the data within the record, such as within the 040 field: it all moves together as one unit.

Contrast this with the semantic web approach where data is carried in triples. A ‘record’, such as an RDF file, might only contain a sameAs triple which relates data about a thing to a data store elsewhere; many triples from multiple sources could be merged together and information about a thing could be enriched or added to. This kind of merging is not particularly easy with, or encouraged by, MARC records (although the RLUK database does something similar and quite tortuously when it deduplicates records). There’s a useful summary of all this at all things cataloged which opens thus:

Despite recent efforts by libraries to release metadata as linked data, library records are still perceived as monolithic entities by many librarians. In order to open library data up to the web and to other communities, though, records should be seen as collections of chunks of data that can be separated / parsed out and modeled. Granted, the way we catalog at the moment makes us hold on to the idea of a “record” because this is how current systems represent metadata to us both on the back- and front-end. However with a bit of abstraction we can see that a library record is essentially nothing but a set of pieces of data.
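
The idea of merging triples from several sources, with a sameAs statement tying two identifiers together, can be sketched in a few lines of Python. Everything here is made up for illustration: the identifiers (lib:, ext:, ex:), the data, and the function names merge and describe. The sameAs handling follows a single hop in one direction only, which is far cruder than what a real triple store would do.

```python
OWL_SAME_AS = "owl:sameAs"

# Triples held by a (hypothetical) library catalogue.
library_triples = {
    ("lib:book1", "dc:title", "Hamlet"),
    ("lib:book1", "dc:creator", "lib:shakespeare"),
    ("lib:shakespeare", OWL_SAME_AS, "ext:William_Shakespeare"),
}

# Triples published elsewhere by a (hypothetical) external source.
external_triples = {
    ("ext:William_Shakespeare", "ex:birthPlace", "Stratford-upon-Avon"),
}


def merge(*sources):
    """Merging sets of triples is just a set union."""
    merged = set()
    for source in sources:
        merged |= source
    return merged


def describe(subject, triples):
    """Collect statements about a subject and anything it is sameAs."""
    identifiers = {subject} | {
        o for s, p, o in triples if s == subject and p == OWL_SAME_AS
    }
    return {
        (p, o) for s, p, o in triples
        if s in identifiers and p != OWL_SAME_AS
    }


everything = merge(library_triples, external_triples)
print(describe("lib:shakespeare", everything))
```

The library holds no birthplace for its author, but after the merge the statement from the external source is reachable through the sameAs link.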

One problem with the linked data approach though is the issue of provenance which was referred to above as one of the roles the MARC record undertakes (ask OCLC, e.g. http://community.oclc.org/metalogue/archives/2008/11/notes-on-oclcs-updated-record.html). If you take a triple out of its original context or host, how can you tell who created the triple? Is it important? Richard Wallis always makes the point that triples are merely statements: like other web content they are not necessarily true at all. Some uneasiness on the trustworthiness or quality of data turned up at various points during the day. I think it is an interesting issue, not that I know what the answer is, especially when current cataloguing practices largely rely on double checking work that has already been done by other institutions because that work cannot really be trusted. There are other issues and possible solutions that are a little outside my comfort zone at the moment, including excellent buzzwords like named graphs and bounded graphs.

Andy Powell also mentioned, among other things:

  • the “broad semantics” or “fuzzy buckets” of DC which derive in large part from the library catalogue card, where, for instance, “title” or “creator” can mean all sorts of imprecise things;
  • flat world modelling where two records are needed to describe say, a painting and a digital image of the painting. This sounds to me like the kind of thing RDA/FRBR is attempting (awkwardly in my view) to deal with.
  • the use of strings instead of things, such as describing an author as “Shakespeare, William” rather than <http://www.example/authors/data/williamshakespeare>. This mirrors one of the bizarre features of library catalogues where authorities matching is generally done by matching exact strings of characters rather than identifiers. See Karen Coyle for an overview of the headaches involved.
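
The difference between strings and things in the last point can be shown with a small Python sketch. The two catalogue records, the field names, and the second form of the heading are invented for the example; the URI is the placeholder used above.

```python
SHAKESPEARE = "http://www.example/authors/data/williamshakespeare"

# Two hypothetical records whose author headings differ as strings
# but which point at the same identifier.
catalogue = [
    {
        "title": "Hamlet",
        "author_string": "Shakespeare, William",
        "author_uri": SHAKESPEARE,
    },
    {
        "title": "Macbeth",
        "author_string": "Shakespeare, William, 1564-1616",
        "author_uri": SHAKESPEARE,
    },
]


def titles_by_string(records, heading):
    """Match on the exact heading string, as string-based matching does."""
    return [r["title"] for r in records if r["author_string"] == heading]


def titles_by_uri(records, uri):
    """Match on the identifier, regardless of how the name is written."""
    return [r["title"] for r in records if r["author_uri"] == uri]


print(titles_by_string(catalogue, "Shakespeare, William"))
print(titles_by_uri(catalogue, SHAKESPEARE))
```

String matching finds only the record whose heading is character-for-character identical, while matching on the identifier finds both.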

There were three other talks which I don’t propose to go into in much detail. I’ve touched on Richard Wallis’s excellent (and enthusiastic) introduction to the whole idea of linked data and RDF, a version of which I found dangerously intriguing at a previous event given by Talis. He talked about, among other things, the BBC’s use of linked data to power its wildlife pages (including drawing content from Wikipedia) and World Cup 2010 site; in fact, how linked data is making the BBC look at the whole way it thinks about data.

His other big message for me was to compare the early take-up of the web with the current take-up of linked data in order to suggest that we are on the cusp of something big: see above for my discussion of the tipping point.

* I don’t like self-conscious classical plurals where I can help it, not that there’s anything wrong with them as such.
** I can’t help but find this name a little odd, if not actually quite camp. I expect there’s some pun or reference I’m not getting. Anyway. Incidentally, finding information about PoolParty from the punkt. website is not easy, which I find hard to understand given that it is a company wanting to promote its products; and, more specifically, it is a knowledge management (and therefore also information retrieval) company.
*** Partly because I don’t think I could do it justice, partly also because it was the most intellectual talk and took place at the end of the day.