First version of data.open.ac.uk « The LUCERO Project

First version of data.open.ac.uk

11 Oct 2010 at 14:51

Mathieu

LUCERO is all about making University wide resources available to everyone in an open, linked data approach. We are building the technical and organisational infrastructure for institutional repositories and research projects to expose their data on the Web, as linked data. It is therefore natural for the interface to this data, the SPARQL endpoint and server addressing URIs in this data to be hosted under http://data.open.ac.uk. The first version of the components underlying this site, as well as a small part of the data which will be ultimately exposed there have gone live last week, with a certain level of excitement from all involved.

What is there? The data

The “launch” of data.open.ac.uk happened relatively shortly after the beginning of the LUCERO project. Indeed, we take the approach that the basic data exposure architecture have to be in place, to incrementally integrate data into it. As a first step, we developed extraction and update mechanisms (see the previous blog post of about the LUCERO workflow) for two important repositories at the Open University: ORO, our publication repository, and podcast, the collection of podcasts produced by the Open University, including the ones being distributed through iTunes U.

ORO data concerns scientific publications with at least one member of the Open University as co-author. The source of the data is a repository based on the EPrints open source publication repository system. EPrints already integrates a function to export information as RDF, using the BIBO ontology. We of course used this function, post-processing what is obtained to obtain a representation consistent with the other (future) datasets in data.open.ac.uk, in particular in terms of URI Scheme. The ORO data represents at the moment 13,283 Articles and 12 Patents, in approximately 340,000 triples (see for example the article “Molecular parameters of post impact cooling in the Boltysh impact structure”).

Podcast data is extracted from the collection of RSS feeds obtained from podcast.open.ac.uk, using a variety of ontologies, including the W3C media ontology and FOAF (see for example the podcast “Great-circle distance”). An interesting element of this dataset is that it provides connections to other types of resources at the Open University, including courses (see for example the course MU120, which is being referred to in a number of podcasts). Podcasts are also classified into categories, using the same topics used to classify courses at the Open University, as well as the iTunesU categories, which we represent in SKOS (see for example the category “Mathematics”).

While representing only a small fraction of the data we will ultimately expose through data.open.ac.uk, the new possibilities obtained by exposing openly these datasets in RDF, with a SPARQL endpoint and resolvable URIs are very exciting already. In a blog post, Tony Hirst has shown some initial examples and encouraged others to share their queries to the Open University’s linked data. Richard Cyganiak has also kindly created a CKAN description of our datasets, for others to find and exploit.

The technical aspects

In a previous blog post, we gave an overview of the technical workflow by which data from the original sources would end up being exposed as linked data. The current platform implements parts of this workflow, including updaters and extractors for the two considered datasets. At the centre of the platform is the triple store. After trying several options, including Sesame, Jena TDB and 4Store, we settled for SwiftOWLIM, which is free, scalable and efficient, and includes limited reasoning capabilities, which might end up being useful in the future.

The current platform also implements the mechanisms by which URIs in the http://data.open.ac.uk namespaces are being resolves. Very simply, a URI such as http://data.open.ac.uk/course/a330 can either be re-directed to http://data.open.ac.uk/page/course/a330 or to http://data.open.ac.uk/resource/course/a330 depending on the content being requested by the client. http://data.open.ac.uk/page/course/a330 shows a browsable webpage linking the considered resource to related one, while http://data.open.ac.uk/resource/course/a330 provides the RDF representation of this resource.

A SPARQL endpoint is also available, which allows to query the whole set of data, or individual datasets through their namespaces, http://data.open.ac.uk/context/oro and http://data.open.ac.uk/context/podcast.

What’s next?

Of course, this first version of data.open.ac.uk is only the beginning of the story. We are currently actively looking at the way to represent and extract information about courses and qualifications from the Study At the OU website, as well as at information about places in the OU campus and regional centres (building, car parks, etc.)

More ways to access will also be soon made available, including faceted search/browsing, and links to external datasets are being investigated. All this is going to be gradually integrated into the platform while the existing data is being constantly updated.

4 Comments

by Adrian Stevenson

19 Oct 2010 at 16:47

Hi LUCERO People.

You’re making great progress here. Good to see.

It all looks like it’s going smoothly. Any snags, issues, quick wins to report at this stage? I’m personally involved in the LOCAH project, and we’ve found challenges in modelling the data for starters. There’s more on our experiences at http://blogs.ukoln.ac.uk/locah/2010/09/22/creating-linked-data-more-reflections-from-the-coal-face/. I’d be interested to know if you’ve had any of the same sort of issues, or any different ones?

We’re very keen to identify any common themes that emerge across the projects so we can get as much learning from the programme as possible. Please keep us informed on what’s working and what isn’t.

Cheers, Adrian
JiscEXPO Synthesis Liaison
by Chr

26 Oct 2010 at 13:56

Darn, I like it when Southampton does stuff first, but congratulations to you guys for getting this going.

I’ve been mulling what EPrints should do “out of the box” and feedback is welcome.
http://blogs.ecs.soton.ac.uk/webteam/2010/10/26/local-or-canonical-uris/

BTW, it’s quite easy to override what URIs EPrints generates for ‘external’ entities such as people, publications etc.
Pingback

by Aims, Objectives, Outcomes | Linked Data Focus

22 Nov 2010 at 13:53

[...] Lucero project at the Open University is an early example of a JISC-supported effort to make information [...]
Pingback

by Publishing ORO as Linked Data « The LUCERO Project

26 Nov 2010 at 11:54

[...] We had initially expected to extract data from ORO in an XML format (possibly RSS) and transform into RDF. However, Chris Gutteridge, the lead developer for the EPrints, added an RDF export option to version 3.2.1 of EPrints, and since we could get this installed on a test server we decided we would make use of this native RDF support. We did make a few small changes to the data before we published it, mainly to replace some of the URIs assigned by EPrints with data.open.ac.uk URIs as noted the blog post ‘First version of data.open.ac.uk‘. [...]

Click here to cancel reply.

The LUCERO Project