The LUCERO Project » publication

The Commonwealth of Learning publishes a report on Linked Data for education, based on the experience in LUCERO

Mathieu — Wed, 25 Jul 2012 12:33:20 +0000

The Commonwealth of Learning (COL) is an intergovernmental organisation including more than 50 independent sovereign states, created to encourage the development and sharing of open learning/distance education knowledge, resources and technologies. Impressed by the work realised in the LUCERO project, by the deployment of data.open.ac.uk (the world’s first university linked data platform), and by the impact it had on The Open University and the broader higher education community, COL commissioned a report on the use and deployment of Linked Data principles and technologies for open and distance learning, which was published last week.

This report is based in large parts on the experience built in the last months of LUCERO, especially in connecting with other organisations and trying to gather common issues and practices, through LinkedUniversities.org.

The report covers the general principles underlying Linked Data technologies and their relevance to the field of education, especially focusing on open and distance learning. It illustrates use case scenarios of Linked Data for learning and teaching through describing existing applications, and details the process of adopting and deploying Linked Data for educational resources and learning-related information. Through publishing it on its website under an open license, COL hopes that this report will become a valuable resource to a wide variety of organisations, raising the general awareness of the benefits of using open Web technologies such as Linked Data for educational purposes.

What to ask linked data

Mathieu — Fri, 24 Jun 2011 15:54:37 +0000

Publishing linked data is becoming easier, and we now come across new RDF datasets almost everyday. One question that keeps being asked however is “what can I do with it?” More or less everybody understand the general advantages of linked data, in terms data access, integration, mash-up, etc., but getting to know and use a particular dataset is far from trivial: “What does it say? What can I ask it?”

You can look at the ontology to get an idea of the data model used there, send a couple of SPARQL queries to `explore’ the data, look at example objects. etc. We also provide example SPARQL queries to help people getting the point of our datasets. Of course, not everybody is proficient enough in SPARQL, RDF-S and OWL to really get it using this sort of clues. Also, datasets might be heterogeneous in the representation of objects, in the distribution of values, or simply very big and broad.

To help people who don’t necessarily know/care about SPARQL `getting into’ a complex dataset, we developed a system (whatoask) that automatically extract a set of questions that a dataset is good at answering. The technical aspects of realising that are a tiny bit sophisticated (i.e., it uses formal concept analysis) and are detailed in a paper I will present next week at the K-CAP conference. What is interesting however is how such a technique can provide a navigation and querying interface of top of a linked dataset, providing a simple overview of the data and a way to drill down to particular areas of interest. In essence, it can be seen as an FAQ for a dataset, not presenting frequently asked questions, but the questions the dataset is specially good at answering.

What the tool does is creating a hierarchy of all the simple questions an RDF dataset can answer, and presents to the user a subset that, according to a set of metrics described in the paper, are believed to be more likely of interest. The questions are displayed in a pseudo natural language, in a format where for example “What are the (Person/*) that (knows Tom) and that (KMi hasEmployee)?” can be interpreted as the question “What are the people who know Tom and are employed in KMi?”. Questions can be selected, and displayed with their answers, and the question hierarchy can be navigated, selecting more specific and more general questions than the selected one.

To clarify what that means, let’s look at what it does on the data.open.ac.uk OpenLearn dataset. The initial screen shows a list of questions, the first one (“What are the (Document/*/OpenLearnUnit) that (subject Concept, relatesToCourse Course, relatesToCourse Module)?”, i.e., “What are the OpenLearn Units that are related to courses and have a topic?”) being selected. More general and more specific questions are also shown, such as “What are the OpenLearn Units that have a topic?” (more general) and “What are the OpenLearn Units that relate to a course and have for topic `Education Research’?” (more specific).

We can select alternative questions, such as the second in the list — “What are the OpenLearn Units in english distributed under a creative commons licence and that talk about Science?”, obtain a new list of answers (quite a few), as well as more general and more specific questions. We can then specialise the question to “What are the OpenLearn Unit in english under a CC licence that talk about science and family?” and carry-on with a more general question looking at the `family topic’ without science, to finally ask “What are the OpenLearn units about family?” (independently of the licence and language).

As can be seen from the example, the system is not meant for people who know in advance what they want to ask, but to provide a level of serendipitous navigation amongst the queries the dataset can answer, with the goal of giving a general overview of what the dataset is about and what it can be used for. The same demo is also available using the set of reading experiences from the RED dataset and the datasets regarding buildings and places at the OU. The interface is not the most straightforward at the moment, but we are thinking about ways by which the functionalities of the system could be integrated in a more compelling manner, as a basic `presentation’ layer on top of a linked dataset.

Publishing ORO as Linked Data

ostephens — Fri, 26 Nov 2010 11:54:19 +0000

The data

One of the first data sets to be made available on http://data.open.ac.uk is the contents of ORO (Open Research Online), the Open University’s repository of research publications and other research outputs. The software behind ORO is EPrints, open source software developed at the School of Electronics and Computer Science and is used widely for similar repositories across UK Higher Education (and beyond).

ORO contains a mixture of metadata for items and full text items (often as PDF). The repository includes a mixture of journal articles, conference papers, book chapters andtheses. The data we are taking and presenting on http://data.open.ac.uk is just the metadata records – not any of the full-text items. Typical information for a record includes:

title
author/editor(s)
abstract
type (e.g. article, book section, conference, thesis)
date of publication

The process

We had initially expected to extract data from ORO in an XML format (possibly RSS) and transform into RDF. However, Chris Gutteridge, the lead developer for the EPrints, added an RDF export option to version 3.2.1 of EPrints, and since we could get this installed on a test server we decided we would make use of this native RDF support. We did make a few small changes to the data before we published it, mainly to replace some of the URIs assigned by EPrints with data.open.ac.uk URIs as noted the blog post ‘First version of data.open.ac.uk‘.

Issues

In general, the process of publishing the data was quite smooth. However, once we had published the data it quickly became apparent there were some issues with the data. Most notably we found that in some cases two or more individual author details were merged together into a single ‘person’ in the RDF. Investigation showed that the problems were in the source data, and were caused by a number of issues:

Incorrectly allocated author IDs in ORO

ORO allows an (Open University) ID to be recorded for Open University authors, and we use this ID as a way of linking together works by an author. Unfortunately in some cases incorrect IDs had been entered, leading to two separate identities to become con-fused in our data

Name changes

In some cases the author had changed their name, resulting in two names against the same author ID. While all the information is correct, it leads to slightly confusing representation in the RDF (e.g. Alison Ault changed her name to Alison Twiner)

Name variations

In some cases the author uses different versions of their name in different publications. Good practice for ORO is to use the name as printed on the publication, which can result in different versions of the same name – for example in most papers, Maria Velsco-Garcia’s name is spelt with a single ’s’ in Velasco, but in one paper, it is spelt Velassco with a double ’s’.

A particularly common inconsistency was around the use of accents on characters – where sometimes a plain character was used instead of the accented character – this seemed to be down to a mixture of data entry errors and variations in the use of accents in publications

Incorrect publisher data

There were a couple of examples where the publisher had incorrect data in their systems, which had been brought through into ORO. One particular example split a single author with several parts to their name into two separate authors.

Having identified the records effected, the next challenge was correcting them – firstly investigation into each error (this could be challenging – especially where name changes had occurred it was sometimes difficult to know if this was the same person or not), secondly the question of where these are corrected. In this case we were given edit access to ORO so we could make the corrections directly, but the question does arise – what happens if you can’t get the errors corrected in the source data set?

Conclusions

One of the interesting things for me is that these small errors in data would be unlikely to be spotted easily in ORO. For example, when you browse papers by author in ORO, behind the scenes, ORO uses the author ID while presenting the user with the names associated with that ID. Because of this, you would be hard pushed to notice a single instance of a mis-assigned identifier. However, once the data was expressed as RDF triples, the problem became immediately apparent. This means that a very low error rate in ORO data, is magnified into obvious errors on http://data.open.ac.uk

I suspect that this ‘magnification’ of errors will lead to some debate over the urgency of fixing errors. While for http://data.open.ac.uk fixing the data errors becomes important (because they are very obvious), it may be that for the contributing dataset (perhaps especially large datasets of heterogeneous data such as bibliographic data) fixing these errors is of lower priority.

On the upside, using the data on data.open.ac.uk we can start to run queries that will help us clean the data – for example, you can find people with more than 1 family name in ORO.