The LUCERO Project » Datasets
Linking University Content for Education and Research Online

So, what’s in linked datasets for education?
Wed, 18 Apr 2012, by Mathieu
http://lucero-project.info/lb/2012/04/so-whats-in-linked-datasets-for-education/

Since the first push when we deployed data.open.ac.uk, the area of linked data for education, especially in universities, has been growing slowly but steadily. This is rather good news, as a critical benefit of linked data in education (some would say, the only one worth considering) is that it creates a common, public information space for education that crosses the boundaries of individual institutions. However, this will only happen with a certain level of convergence: shared vocabularies and schema elements need to be commonly used, so that data provided by different parties can be aggregated and jointly queried. Here, we try to get an overview of the current landscape of existing linked datasets in the education sector, to see how much of this convergence is happening, which areas show clear agreement, and where more effort might be required.

The Datasets

To look at the current state of linked data in education, we considered 8 different datasets, some provided by universities and some by specific projects. We looked at datasets explicitly dedicated to education (as opposed to ones containing information that could be used for educational purposes, such as library and museum data, or ones that have a connection with education but focus on other aspects, such as datasets from pure research institutions). Also, we view datasets in a very coarse-grained way, for example considering the whole of data.open.ac.uk as one dataset, rather than each of its sub-datasets separately. Finally, we could only process datasets with a functioning SPARQL endpoint that works properly with common SPARQL clients (in our case ARC2); a minimal probe of this kind is sketched just after the lists below.

From Universities:

  • data.open.ac.uk, whose SPARQL endpoint is available at http://data.open.ac.uk/sparql
  • data.bris from the University of Bristol. SPARQL endpoint: http://resrev.ilrt.bris.ac.uk/data-server-workshop/sparql
  • University of Southampton Open Data. SPARQL endpoint: http://sparql.data.southampton.ac.uk/
  • LODUM from the University of Muenster, Germany. SPARQL endpoint: http://data.uni-muenster.de/sparql

Others should be included eventually, but we could not access them at the time of writing.

From projects and broader institutions:

  • mEducator, a European project aggregating learning resources. SPARQL endpoint: http://meducator.open.ac.uk/resourcesrestapi/rest/meducator/sparql
  • OrganicEduNet, a European project that aggregated learning resources from LOM repositories (see this post). SPARQL endpoint: http://knowone.csc.kth.se/sparql/ariadne-big
  • LinkedUniversities Video Dataset, which aggregates video resources from various repositories (see this paper). SPARQL endpoint: http://smartproducts1.kmi.open.ac.uk:8080/openrdf-sesame/repositories/linkeduniversities
  • Data.gov.uk Education, which aggregates information about schools in the UK. SPARQL endpoint: http://services.data.gov.uk/education/sparql
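
As a side note, “functioning SPARQL endpoint” above simply means that the endpoint answers queries sent by a standard client. A minimal probe along those lines (a sketch, not necessarily the exact check we ran) is a trivial ASK query:

    # Minimal liveness probe, sent to each endpoint listed above:
    # does it answer a trivial ASK query?
    ASK { ?s ?p ?o }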

Common Vocabularies

As everybody always says: what matters is the reuse of shared, common vocabularies! Since they talk about similar things, education-related datasets would be expected to share vocabularies, and their overlaps should allow joint use of the exposed data. The chart above shows the namespaces that are used by more than one of the considered datasets.

Unsurprisingly, FOAF is almost omnipresent. One reason is that FOAF is the unquestioned common vocabulary for representing information about people, and it is quite rare for an education-related dataset not to need to represent information about people. FOAF also includes high-level classes that are very common, especially in this sort of dataset, namely Document and Organisation.

In clear second place come vocabularies for representing information about bibliographic resources and other published artifacts: Dublin Core and BIBO. Dublin Core is the de-facto standard for metadata about just about anything that can be published. BIBO, the bibliographic ontology, is more specialised (and actually relies on both Dublin Core and FOAF) and represents, in particular, academic publications.

Other vocabularies used include generic “representation languages” such as RDF, RDFS, OWL and SKOS (often used to represent topics), as well as specific vocabularies related to the description of multimedia resources, events and places (including buildings, addresses and geo-location).
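
For readers who would like to reproduce this kind of analysis on an endpoint, here is a sketch of a query listing the namespaces of all predicates used in a store. It is not the actual script we used, and it assumes a SPARQL 1.1 endpoint for the string functions:

    # Sketch: extract the namespace of every predicate and count its uses.
    SELECT ?ns (COUNT(?p) AS ?uses)
    WHERE {
      ?s ?p ?o .
      # The namespace is everything up to and including the last '#' or '/'.
      BIND(REPLACE(STR(?p), "[^#/]+$", "") AS ?ns)
    }
    GROUP BY ?ns
    ORDER BY DESC(?uses)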

Common Classes

At a more granular level, it is interesting to look at the types of entities that can be found in the considered datasets. The chart above shows the classes that are used by at least 2 datasets. This confirms in particular the strong focus on people and bibliographic/learning resources (Article, Book, Document, Thesis, Podcast, Recording, Image, Patent, Report, Slideshow).

In second place comes information about educational institutions as organisations and physical places (Organization, Institution, Building, Address, VCard).

Besides generic, language-level classes, other areas such as events, courses and vacancies tend to be considered only by a very small number of datasets.
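
A sketch of a query that reproduces this kind of count on any of the endpoints above (not necessarily the exact one behind the chart):

    # Sketch: which classes are instantiated in a dataset, and how often?
    SELECT ?class (COUNT(?instance) AS ?instances)
    WHERE { ?instance a ?class . }
    GROUP BY ?class
    ORDER BY DESC(?instances)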

Common Properties

Finally, going a step further in granularity, we look, through the chart above, at the way common types of entities are represented. This chart shows the properties used by more than 3 datasets. Once again, besides generic properties, the focus on people (name) and media/bibliographic resources (title, date, subject) is obvious, especially with properties connecting the two (contributor, homepage).

The representation of institutions as physically located places is also clearly reflected here (lat, long, postal-code, street-address, adr).
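
To illustrate the person-to-resource pattern these properties form, here is a hedged sketch of a query that should work on several of the datasets above; the exact property choices (Dublin Core elements and foaf:name) are assumptions based on the chart:

    # Sketch: documents joined to the people who contributed to them.
    PREFIX dc:   <http://purl.org/dc/elements/1.1/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?doc ?title ?name
    WHERE {
      ?doc dc:title ?title ;
           dc:contributor ?person .
      ?person foaf:name ?name .
    }
    LIMIT 20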

Doing More with the Collected Data

Of course, the considered datasets represent only a small sample, and we should be able to draw more definitive conclusions as the number of education-related datasets grows and more are included. Indeed, to realise the analysis in this post, we created a script that generates VOID-based descriptions of the datasets. The created descriptions are available on a public SPARQL endpoint, which will be extended as we find more datasets to include. Please let us know if there are datasets you would like to see taken into account. The charts above are dynamically generated from SPARQL queries to the aforementioned endpoint.
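
As an example of what these descriptions support, here is a sketch of a query over the VOID data (assuming class partitions are used) that asks which datasets describe people, and how many:

    # Sketch: datasets containing foaf:Person entities, according to the
    # VOID descriptions (assumes void:classPartition is used).
    PREFIX void: <http://rdfs.org/ns/void#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?dataset ?entities
    WHERE {
      ?dataset void:classPartition ?partition .
      ?partition void:class foaf:Person ;
                 void:entities ?entities .
    }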

Also, we will look at reflecting the elements discussed here on the vocabulary page of LinkedUniversities.org. The nice thing about having a SPARQL endpoint for the collected data is that it makes it easy to create a simple tool to explore the “Vocabulary Space” of educational datasets. This might also prove useful as a way to provide federated querying services for common types of entities (see this recent paper about using VOID for that), which might end up being a useful feature for the recently launched data.ac.uk initiative (?). Another interesting thing to do would be to apply a bit of data mining to check, for example, which elements tend to appear together, and see whether there are common patterns in the use of some vocabularies.

PRONOM and linked data
Thu, 26 May 2011, by Mathieu
http://lucero-project.info/lb/2011/05/pronon-and-linked-data/

PRONOM is The National Archives’ technical registry, and it is currently being ‘transformed’ to be exposed as linked data. We can of course only welcome such an initiative and be very enthusiastic about this potentially valuable resource. Now, because we are the kind of people who like to criticise (or, more seriously, because we were asked by our programme manager to give feedback), here are a few comments regarding things that could be done better.

Most of the description and technical specification of the work relates to the specification of a vocabulary. Apart from the low-level, boring issues (such as “it is in PDF”, “it is not really clear”, etc.), there are major issues in its definition: mostly, (1) it is not really good modelling, and (2) it does not reuse other vocabularies enough. Funnily enough, these two criticisms could be applied to many vocabularies created ad hoc for a particular project.

A big example of bad modelling concerns the classes used to represent file formats. First, their names are quite seriously misleading. ‘Video’ is not a video; it is a video type of file format. ‘GIS’ is the type of file format used by a geographic information system, etc. I really don’t understand how these things could be classes. It seems that the intention was that a class such as ‘Video’ would correspond to what should be called ‘VideoFormat’. In that case, for example, <http://reference.data.gov.uk/id/file-format/13>, which corresponds to the PNG image format, should be an instance of <http://reference.data.gov.uk/technical-registry/formatType/Image_(Raster)>. However, it is not. It is connected to it through the triple {<http://reference.data.gov.uk/id/file-format/13> <http://reference.data.gov.uk/technical-registry/formatType> <http://reference.data.gov.uk/technical-registry/formatType/Image_(Raster)>}, in which case <http://reference.data.gov.uk/technical-registry/formatType/Image_(Raster)> should really be an individual (and have another name, e.g., <http://reference.data.gov.uk/technical-registry/formatType/raster-image-format>). Now, if that wasn’t confusing enough, <http://reference.data.gov.uk/id/file-format/13> is also a class. This one, I have no explanation for. I don’t know either why things such as <http://reference.data.gov.uk/technical-registry/Big_endian> are described as properties.
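
To make the point concrete, under the first reading (format types as classes), the typing that seems to have been intended could be materialised from the existing triples. A sketch, using the URIs quoted above:

    # Sketch: materialise the apparently intended rdf:type links from the
    # formatType property (one of the two consistent readings discussed).
    PREFIX tr: <http://reference.data.gov.uk/technical-registry/>
    CONSTRUCT { ?format a ?type . }
    WHERE { ?format tr:formatType ?type . }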

I’m sure there are quite a few other issues (even if the vocabulary seems in itself rather simple; I haven’t found the RDF-S version of it), including underspecified domains, ranges and classes, untyped objects, etc. I might have missed something, but the naming conventions used seem to have been made deliberately confusing. The four core classes are not capitalised and use ‘-’ as a separator. The other classes are capitalised and use ‘_’. Some properties are fully upper-case (MIMETYPE), some have the first letter capitalised, and some capitalise only the first letter of the second word (with no word separator). The file formats are associated with numbers in the namespace ‘http://reference.data.gov.uk/id/file-format/’, while a human-readable ID (e.g. ‘png1.2’) could easily have been created. Other things such as ‘internal signatures’ are also associated with numbers, in namespaces such as ‘http://reference.data.gov.uk/technical-registry/internalSignature/’. I never understand why many people seem to want ‘id’ in their namespaces, but if it is done for one, they might as well do it for the others. ‘Big_endian’, as mentioned above, has a nice capital letter for the first word but not the second, while it is described as a property and used as an individual.

Finally, this vocabulary reuses almost nothing. The examples promote the use of the Dublin Core vocabulary, and a tiny bit of SKOS is used for labels (I’m personally not too sure whether you can use SKOS label properties on things other than SKOS concepts, but that is really only a detail). Dublin Core could certainly be used more (e.g., dcterms:issued instead of releaseDate?). I’m also reasonably convinced that the W3C Ontology for Media Resources should at least be connected to this vocabulary.

In a nutshell, I like this vocabulary and the data based on it, and I will use them. They provide a great resource illustrating how easy it is to make wrong modelling choices.

Publishing OpenLearn metadata as linked data
Thu, 21 Apr 2011, by Mathieu
http://lucero-project.info/lb/2011/04/publishing-openlearn-metadata-as-linked-data/

OpenLearn is a website giving free access to Open University course material. We look especially at the “LearningSpace”, where hundreds of HTML documents, called OpenLearn units, are made available. These units represent very valuable resources for students, as they provide entry points into specific topics, useful in particular when deciding whether or not to enroll in a course on a given topic. A lot of these units relate directly to specific courses, as their content is obtained from the corresponding course material. Being able to query and use such metadata in connection with other sources of information can be very useful in applications supporting students in the discovery of learning resources, as demonstrated by the OpenLearn Linked Data application developed by Fouad Zablith.

An OpenLearn unit is represented through a specific class, OpenLearnUnit, which is a subclass of foaf:Document. Most of the common fields, such as the title, subject and description of the unit, are represented through common Dublin Core properties. A specific property, relatesToCourse, relates a unit to the corresponding course in the Course Description dataset. We also use the Creative Commons Rights Expression vocabulary to express the licence attached to the content of the unit (mostly the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 Licence) and the Nice Tag Ontology to connect units to the keywords they have been tagged with.
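
A sketch of how this data can then be queried; note that the full class and property URIs below are assumptions, since only the local names are given above:

    # Sketch: OpenLearn units with their titles and related courses.
    # The openlearn: namespace is a guess; only local names are documented.
    PREFIX dcterms:   <http://purl.org/dc/terms/>
    PREFIX openlearn: <http://data.open.ac.uk/openlearn/ontology/>
    SELECT ?unit ?title ?course
    WHERE {
      ?unit a openlearn:OpenLearnUnit ;
            dcterms:title ?title ;
            openlearn:relatesToCourse ?course .
    }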

While all this information is already available in structured form from the OpenLearn websites (through XML descriptions and RSS feeds), having it directly accessible, Web-addressable and queryable makes it easier to create new interfaces, new links and new processes that facilitate the use of this information for resource discovery. Some elements are still being investigated, regarding in particular the complex connection that might exist between an OpenLearn unit and the corresponding course material as described in the library catalogue.

Connecting the Reading Experience Database to the Web of Data
Tue, 08 Mar 2011, by Mathieu
http://lucero-project.info/lb/2011/03/connecting-the-reading-experience-database-to-the-web-of-data/

The Reading Experience Database (RED) project is dedicated to collecting and using evidence of reading experiences for teaching and research. The project has created a large and very rich database describing specific situations in which a person has read a text, and how such an experience was evidenced.

RED is one of the projects from the Open University’s Faculty of Arts working with LUCERO, as an early example of how linked data can be applied to research in the humanities, and in general. And it really is a very good example! We have been working on an initial method to extract the content of the RED database into RDF, combining several well-known vocabularies (see the figure below). While we are still at an early stage in the whole process, this has given us great insight into the challenges and potential of linked data in such a domain.

Data cleaning is clearly one of our biggest issues. The RED database is mostly based on contributions from various people, from researchers in the humanities connected to the project to interested individuals. As a result, many entities are duplicated, misspelled, or mistakenly aggregated. A lot of these problems can be addressed automatically through filters, but the major part has to be addressed by the RED team, who are currently involved in a cleaning, normalisation and restructuring process.

Unsurprisingly, where the linked data approach really creates novelty here is in the links. We have published a “preview” of the dataset on data.open.ac.uk, with initial sets of links for people and places to their (supposed) equivalents in DBPedia. For example, Virginia Woolf, who is both an author and a reader in the RED database, is represented as http://data.open.ac.uk/page/red/person/woolf-virginia, which is linked to the corresponding DBPedia resource http://dbpedia.org/page/Virginia_Woolf.

This might not look like much in principle, but in reality it opens up new ways to look at the data that couldn’t be anticipated even by the researchers involved in modelling it. I gave a quick talk at a workshop organised two weeks ago by the RED team, to an audience of researchers and lecturers in the humanities (see the picture above). Showing the benefit of linked data to such an audience is clearly not the most trivial task. I therefore developed a small demonstrator that presents in one page the information about a given person from the RED database (here, Virginia Woolf), together with some information from DBPedia (abstract, categories and influences). Now, where it becomes interesting is that the information from DBPedia can be used to filter and browse the information in the RED database. What this demonstration can do, through clicking on the corresponding categories, is tell you what other people in RED are, according to DBPedia, People from Kensington, People With Bipolar Disorder, Bisexual Writers, Writers Who Committed Suicide, etc. Looking at this, through one simple set of links to one dataset, we can already see brand new research questions and a new set of research practices emerge, together with the data to start exploring them. We can only be overwhelmed thinking about what will happen when the approach is generalised to more links, more datasets and more research projects.
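
The demonstrator itself is a small web application, but the idea behind it can be sketched as a federated query. This assumes the RED-to-DBPedia links are owl:sameAs links and a SPARQL 1.1 endpoint that supports SERVICE; the category URI is illustrative:

    # Sketch: people in the RED data who, according to DBPedia, fall under
    # a given category. owl:sameAs linking is an assumption.
    PREFIX owl:     <http://www.w3.org/2002/07/owl#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?person
    WHERE {
      ?person owl:sameAs ?dbpPerson .
      SERVICE <http://dbpedia.org/sparql> {
        ?dbpPerson dcterms:subject
          <http://dbpedia.org/resource/Category:Writers_who_committed_suicide> .
      }
    }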

Publishing ORO as Linked Data
Fri, 26 Nov 2010, by ostephens
http://lucero-project.info/lb/2010/11/publishing-oro-as-linked-data/

The data

One of the first datasets to be made available on http://data.open.ac.uk is the contents of ORO (Open Research Online), the Open University’s repository of research publications and other research outputs. The software behind ORO is EPrints, open source software developed at the School of Electronics and Computer Science at the University of Southampton and used widely for similar repositories across UK Higher Education (and beyond).

ORO contains a mixture of metadata records and full-text items (often as PDF). The repository includes journal articles, conference papers, book chapters and theses. The data we are taking and presenting on http://data.open.ac.uk is just the metadata records, not any of the full-text items. Typical information for a record includes:

  • title
  • author/editor(s)
  • abstract
  • type (e.g. article, book section, conference, thesis)
  • date of publication

The process

We had initially expected to extract data from ORO in an XML format (possibly RSS) and transform it into RDF. However, Chris Gutteridge, the lead developer of EPrints, added an RDF export option to version 3.2.1 of EPrints, and since we could get this installed on a test server, we decided to make use of this native RDF support. We did make a few small changes to the data before publishing it, mainly to replace some of the URIs assigned by EPrints with data.open.ac.uk URIs, as noted in the blog post ‘First version of data.open.ac.uk‘.
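
The replacement itself can be done in several ways; one hedged sketch, as a SPARQL 1.1 update in which both URI patterns are illustrative assumptions rather than the ones actually used:

    # Sketch: rewrite EPrints-assigned URIs (object positions; subject
    # positions are handled analogously) into data.open.ac.uk URIs.
    DELETE { ?s ?p ?oldUri . }
    INSERT { ?s ?p ?newUri . }
    WHERE {
      ?s ?p ?oldUri .
      FILTER(isIRI(?oldUri) &&
             STRSTARTS(STR(?oldUri), "http://oro.open.ac.uk/id/eprint/"))
      BIND(IRI(REPLACE(STR(?oldUri),
                       "^http://oro.open.ac.uk/id/eprint/",
                       "http://data.open.ac.uk/oro/")) AS ?newUri)
    }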

Issues

In general, the process of publishing the data was quite smooth. However, once we had published the data, it quickly became apparent that there were some issues with it. Most notably, we found that in some cases two or more individual authors’ details were merged into a single ‘person’ in the RDF. Investigation showed that the problems were in the source data, and were caused by a number of issues:

Incorrectly allocated author IDs in ORO

ORO allows an (Open University) ID to be recorded for Open University authors, and we use this ID as a way of linking together works by an author. Unfortunately, in some cases incorrect IDs had been entered, leading two separate identities to become con-fused in our data.

Name changes

In some cases the author had changed their name, resulting in two names against the same author ID. While all the information is correct, it leads to a slightly confusing representation in the RDF (e.g. Alison Ault changed her name to Alison Twiner).

Name variations

In some cases the author uses different versions of their name in different publications. Good practice for ORO is to use the name as printed on the publication, which can result in different versions of the same name. For example, in most papers Maria Velasco-Garcia’s name is spelt with a single ‘s’ in Velasco, but in one paper it is spelt Velassco, with a double ‘s’.

A particularly common inconsistency was the use of accents on characters: sometimes a plain character was used instead of the accented character. This seemed to be down to a mixture of data entry errors and variations in the use of accents in the publications themselves.

Incorrect publisher data

There were a couple of examples where the publisher had incorrect data in their systems, which had been brought through into ORO. One particular example split a single author with several parts to their name into two separate authors.

Having identified the records affected, the next challenge was correcting them: first, investigating each error (this could be challenging, especially where name changes had occurred, as it was sometimes difficult to know whether this was the same person or not); second, the question of where these are corrected. In this case we were given edit access to ORO so we could make the corrections directly, but the question does arise: what happens if you can’t get the errors corrected in the source dataset?

Conclusions

One of the interesting things for me is that these small errors in the data would be unlikely to be spotted easily in ORO. For example, when you browse papers by author in ORO, behind the scenes ORO uses the author ID while presenting the user with the names associated with that ID. Because of this, you would be hard pushed to notice a single instance of a mis-assigned identifier. However, once the data was expressed as RDF triples, the problem became immediately apparent. This means that a very low error rate in ORO data is magnified into obvious errors on http://data.open.ac.uk.

I suspect that this ‘magnification’ of errors will lead to some debate over the urgency of fixing them. While for http://data.open.ac.uk fixing the data errors becomes important (because they are very obvious), it may be that for the contributing dataset (perhaps especially large datasets of heterogeneous data, such as bibliographic data) fixing these errors is of lower priority.

On the upside, using the data on data.open.ac.uk, we can start to run queries that will help us clean the data. For example, you can find people with more than one family name in ORO.
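
A sketch of such a query, assuming the EPrints export uses foaf:familyName and querying the ORO context on data.open.ac.uk:

    # Sketch: people recorded with more than one family name, a symptom of
    # the merged-identity problems described above.
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person (COUNT(DISTINCT ?surname) AS ?surnames)
    FROM <http://data.open.ac.uk/context/oro>
    WHERE { ?person foaf:familyName ?surname . }
    GROUP BY ?person
    HAVING (COUNT(DISTINCT ?surname) > 1)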

First version of data.open.ac.uk
Mon, 11 Oct 2010, by Mathieu
http://lucero-project.info/lb/2010/10/first-version-of-data-open-ac-uk/

LUCERO is all about making University-wide resources available to everyone in an open, linked data approach. We are building the technical and organisational infrastructure for institutional repositories and research projects to expose their data on the Web as linked data. It is therefore natural for the interfaces to this data, the SPARQL endpoint and the server resolving URIs in this data, to be hosted under http://data.open.ac.uk. The first version of the components underlying this site, as well as a small part of the data that will ultimately be exposed there, went live last week, with a certain level of excitement from all involved.

What is there? The data

The “launch” of data.open.ac.uk happened relatively shortly after the beginning of the LUCERO project. Indeed, we take the approach that the basic data exposure architecture has to be in place first, so that data can be incrementally integrated into it. As a first step, we developed extraction and update mechanisms (see the previous blog post about the LUCERO workflow) for two important repositories at the Open University: ORO, our publication repository, and Podcasts, the collection of podcasts produced by the Open University, including the ones distributed through iTunes U.

The ORO data concerns scientific publications with at least one member of the Open University as co-author. The source of the data is a repository based on the EPrints open source publication repository system. EPrints already integrates a function to export information as RDF, using the BIBO ontology. We of course used this function, post-processing its output to obtain a representation consistent with the other (future) datasets in data.open.ac.uk, in particular in terms of the URI scheme. The ORO data represents at the moment 13,283 articles and 12 patents, in approximately 340,000 triples (see for example the article “Molecular parameters of post impact cooling in the Boltysh impact structure”).

The podcast data is extracted from the collection of RSS feeds obtained from podcast.open.ac.uk, using a variety of ontologies, including the W3C media ontology and FOAF (see for example the podcast “Great-circle distance”). An interesting element of this dataset is that it provides connections to other types of resources at the Open University, including courses (see for example the course MU120, which is referred to in a number of podcasts). Podcasts are also classified into categories, using the same topics used to classify courses at the Open University, as well as the iTunes U categories, which we represent in SKOS (see for example the category “Mathematics”).
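
A sketch of a query over these connections; the property linking podcasts to courses is left as a wildcard, and the title property is an assumption:

    # Sketch: podcasts referring to the course MU120.
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT DISTINCT ?podcast ?title
    WHERE {
      ?podcast ?link <http://data.open.ac.uk/course/mu120> ;
               dcterms:title ?title .
    }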

While they represent only a small fraction of the data we will ultimately expose through data.open.ac.uk, the new possibilities opened by exposing these datasets in RDF, with a SPARQL endpoint and resolvable URIs, are already very exciting. In a blog post, Tony Hirst has shown some initial examples and encouraged others to share their queries over the Open University’s linked data. Richard Cyganiak has also kindly created a CKAN description of our datasets, for others to find and exploit.

The technical aspects

In a previous blog post, we gave an overview of the technical workflow by which data from the original sources ends up being exposed as linked data. The current platform implements parts of this workflow, including updaters and extractors for the two considered datasets. At the centre of the platform is the triple store. After trying several options, including Sesame, Jena TDB and 4store, we settled on SwiftOWLIM, which is free, scalable and efficient, and includes limited reasoning capabilities that might end up being useful in the future.

The current platform also implements the mechanisms by which URIs in the http://data.open.ac.uk namespace are resolved. Very simply, a URI such as http://data.open.ac.uk/course/a330 is redirected either to http://data.open.ac.uk/page/course/a330 or to http://data.open.ac.uk/resource/course/a330, depending on the content requested by the client. http://data.open.ac.uk/page/course/a330 shows a browsable web page linking the considered resource to related ones, while http://data.open.ac.uk/resource/course/a330 provides the RDF representation of the resource.

A SPARQL endpoint is also available, which allows querying the whole set of data, or individual datasets through their namespaces, http://data.open.ac.uk/context/oro and http://data.open.ac.uk/context/podcast.
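
For instance, a sketch that lists the named datasets in the store together with their sizes (assuming contexts are exposed as named graphs and the endpoint supports aggregates):

    # Sketch: datasets (named graphs) in the store and their sizes.
    SELECT ?dataset (COUNT(*) AS ?triples)
    WHERE { GRAPH ?dataset { ?s ?p ?o } }
    GROUP BY ?dataset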

What’s next?

Of course, this first version of data.open.ac.uk is only the beginning of the story. We are currently actively looking at how to represent and extract information about courses and qualifications from the Study At the OU website, as well as information about places on the OU campus and in regional centres (buildings, car parks, etc.).

More ways to access the data will also soon be made available, including faceted search/browsing, and links to external datasets are being investigated. All this will be gradually integrated into the platform while the existing data is constantly updated.
