Publishing ORO as Linked Data « The LUCERO Project

Publishing ORO as Linked Data

26 Nov 2010 at 11:54

ostephens

article bibo jiscEXPO luceroproject oro publication research

The data

One of the first data sets to be made available on http://data.open.ac.uk is the contents of ORO (Open Research Online), the Open University’s repository of research publications and other research outputs. The software behind ORO is EPrints, open source software developed at the School of Electronics and Computer Science at the University of Southampton and used widely for similar repositories across UK Higher Education (and beyond).

ORO contains a mixture of metadata for items and full-text items (often as PDF). The repository includes a mixture of journal articles, conference papers, book chapters and theses. The data we are taking and presenting on http://data.open.ac.uk is just the metadata records – not any of the full-text items. Typical information for a record includes:

title
author/editor(s)
abstract
type (e.g. article, book section, conference, thesis)
date of publication

The process

We had initially expected to extract data from ORO in an XML format (possibly RSS) and transform it into RDF. However, Chris Gutteridge, the lead developer for EPrints, added an RDF export option to version 3.2.1 of EPrints, and since we could get this installed on a test server we decided to make use of this native RDF support. We did make a few small changes to the data before we published it, mainly to replace some of the URIs assigned by EPrints with data.open.ac.uk URIs, as noted in the blog post ‘First version of data.open.ac.uk’.
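As a rough illustration of the URI replacement step, the sketch below rewrites the subject and object of simple (subject, predicate, object) tuples. Both URI prefixes are hypothetical stand-ins: this post does not give the actual patterns used by EPrints or data.open.ac.uk, and the real change may well have been made differently.

```python
# Hypothetical URI prefixes: the real patterns used by EPrints and
# data.open.ac.uk are not given in this post.
EPRINTS_PREFIX = "http://oro.open.ac.uk/id/eprint/"
LOCAL_PREFIX = "http://data.open.ac.uk/oro/"


def rewrite_uri(term: str) -> str:
    """Swap the EPrints prefix for the local one; leave other terms alone."""
    if term.startswith(EPRINTS_PREFIX):
        return LOCAL_PREFIX + term[len(EPRINTS_PREFIX):]
    return term


def rewrite_triples(triples):
    """Apply rewrite_uri to the subject and object of each (s, p, o) triple."""
    return [(rewrite_uri(s), p, rewrite_uri(o)) for s, p, o in triples]


# Invented sample triple for illustration only.
sample = [
    (
        "http://oro.open.ac.uk/id/eprint/123",
        "http://purl.org/dc/terms/title",
        "An example title",
    ),
]
rewritten = rewrite_triples(sample)
print(rewritten[0][0])
```

Predicates and literal values pass through unchanged, so only identifiers carrying the assumed EPrints prefix are affected.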

Issues

In general, the process of publishing the data was quite smooth. However, once we had published the data it quickly became apparent there were some issues with the data. Most notably we found that in some cases two or more individual author details were merged together into a single ‘person’ in the RDF. Investigation showed that the problems were in the source data, and were caused by a number of issues:

Incorrectly allocated author IDs in ORO

ORO allows an (Open University) ID to be recorded for Open University authors, and we use this ID as a way of linking together works by an author. Unfortunately, in some cases incorrect IDs had been entered, leading to two separate identities becoming confused in our data.

Name changes

In some cases the author had changed their name, resulting in two names against the same author ID. While all the information is correct, it leads to a slightly confusing representation in the RDF (e.g. Alison Ault changed her name to Alison Twiner).

Name variations

In some cases the author uses different versions of their name in different publications. Good practice for ORO is to use the name as printed on the publication, which can result in different versions of the same name – for example, in most papers Maria Velasco-Garcia’s name is spelt with a single ‘s’ in Velasco, but in one paper it is spelt Velassco with a double ‘s’.

A particularly common inconsistency was around the use of accents on characters, where sometimes a plain character was used instead of the accented character. This seemed to be down to a mixture of data entry errors and variations in the use of accents in publications.
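One way to surface accent-only variations like these is to compare names after stripping accents. The sketch below is not what we ran against ORO; it is a minimal example using Python's standard `unicodedata` module, and the sample names are invented.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Return name with combining accent marks removed."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_variants(names):
    """Group names that differ only by accents (and letter case)."""
    groups = {}
    for name in names:
        key = strip_accents(name).casefold()
        groups.setdefault(key, set()).add(name)
    # Keep only keys that more than one distinct spelling maps to.
    return {key: found for key, found in groups.items() if len(found) > 1}


# Invented sample names, not taken from ORO.
sample = ["José García", "Jose Garcia", "Ann Smith"]
print(accent_variants(sample))
```

A match here only flags candidates for checking: two different people can share a name once accents are removed.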

Incorrect publisher data

There were a couple of examples where the publisher had incorrect data in their systems, which had been brought through into ORO. One particular example split a single author with several parts to their name into two separate authors.

Having identified the records affected, the next challenge was correcting them. First, each error had to be investigated (this could be challenging: where name changes had occurred it was sometimes difficult to know whether this was the same person or not). Second, there was the question of where the errors should be corrected. In this case we were given edit access to ORO so we could make the corrections directly, but the question does arise – what happens if you can’t get the errors corrected in the source data set?

Conclusions

One of the interesting things for me is that these small errors in data would be unlikely to be spotted easily in ORO. For example, when you browse papers by author in ORO, behind the scenes ORO uses the author ID while presenting the user with the names associated with that ID. Because of this, you would be hard pushed to notice a single instance of a mis-assigned identifier. However, once the data was expressed as RDF triples, the problem became immediately apparent. This means that a very low error rate in ORO data is magnified into obvious errors on http://data.open.ac.uk.

I suspect that this ‘magnification’ of errors will lead to some debate over the urgency of fixing errors. While for http://data.open.ac.uk fixing the data errors becomes important (because they are very obvious), it may be that for the contributing dataset (perhaps especially large datasets of heterogeneous data such as bibliographic data) fixing these errors is of lower priority.

On the upside, using the data on data.open.ac.uk we can start to run queries that will help us clean the data – for example, you can find people with more than one family name in ORO.
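The query itself is not reproduced in this post, and in practice it would run over the published RDF. The logic behind it can be sketched in a few lines of Python: group family names by author ID and keep the IDs that have more than one. The IDs below are placeholders, and the pairs are invented apart from the Ault/Twiner name change mentioned above.

```python
from collections import defaultdict


def ids_with_multiple_family_names(records):
    """Map each author ID to its family names, keeping IDs with more than one."""
    names_by_id = defaultdict(set)
    for author_id, family_name in records:
        names_by_id[author_id].add(family_name)
    return {
        author_id: sorted(names)
        for author_id, names in names_by_id.items()
        if len(names) > 1
    }


# Invented (author ID, family name) pairs; the IDs are placeholders.
sample = [
    ("id-001", "Ault"),
    ("id-001", "Twiner"),
    ("id-002", "Smith"),
    ("id-002", "Smith"),
]
print(ids_with_multiple_family_names(sample))
```

An ID that turns up in the result is not necessarily an error: a legitimate name change looks the same as a mis-assigned ID, so each hit still needs checking by hand.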

3 Comments

Pingback

by Tweets that mention Publishing ORO as Linked Data « The LUCERO Project -- Topsy.com

26 Nov 2010 at 12:38

[...] This post was mentioned on Twitter by Andy Powell and ostephens, Mathieu d'Aquin. Mathieu d'Aquin said: RT @ostephens: New #luceroproject blogpost: Publishing ORO (Eprints repository) as Linked Data http://is.gd/hOEH7 – what we found … [...]

by Adrian Stevenson

21 Feb 2011 at 16:57

Hi Owen

I see from http://data.open.ac.uk/ that you’re using CC-BY for your Linked Data. I wondered how you came to that decision, and if you foresee any attribution trail problems? I’m looking at CC0 and ODC Public Domain Dedication and Licence (PDDL) right now, and the ‘no going back’ aspect is making me think twice.

Cheers, Adrian

by Owen Stephens

25 Feb 2011 at 09:53

Hi Adrian,

Essentially we found the bottom line for many data owners we talked to was that they wanted attribution if data was used in other contexts. While some data owners might be happy with CC0 or PDDL, we decided at this stage it was easier to apply one license across the piece.

I’m not sure that CC-BY is the correct one – I need to discuss with the rest of the team, but I know (apart from CC0) CC licenses aren’t really suitable for data – we may need to look at something like the Open Government Licence instead (http://www.nationalarchives.gov.uk/doc/open-government-licence/) – but I’ll need to look at this further and talk to others.

In terms of the attribution trail – my personal view is that we probably shouldn’t get too worked up about this (for the moment at least). While I know some feel that we should look at how we track provenance for individual pieces of data, my view is that we shouldn’t worry about this level of detail and should think about attributing overall sources (e.g. “This data was compiled from locah.ac.uk and data.open.ac.uk”) rather than tracking where each piece of data came from. I’m aware there are going to be issues if you only track at a high level, but I find it hard to get too worried about this. However, it is early days and I may well change my mind!
