First version of data.open.ac.uk

LUCERO is all about making University-wide resources available to everyone through an open, linked data approach. We are building the technical and organisational infrastructure for institutional repositories and research projects to expose their data on the Web as linked data. It is therefore natural for the interface to this data, the SPARQL endpoint and the server resolving URIs in this data, to be hosted under http://data.open.ac.uk. The first version of the components underlying this site, as well as a small part of the data that will ultimately be exposed there, went live last week, to a certain level of excitement from all involved.

What is there? The data

The “launch” of data.open.ac.uk happened relatively shortly after the beginning of the LUCERO project. Indeed, our approach is that the basic data exposure architecture has to be in place first, so that data can be integrated into it incrementally. As a first step, we developed extraction and update mechanisms (see the previous blog post about the LUCERO workflow) for two important repositories at the Open University: ORO, our publication repository, and podcast, the collection of podcasts produced by the Open University, including the ones distributed through iTunes U.

ORO data concerns scientific publications with at least one member of the Open University as co-author. The source of the data is a repository based on the EPrints open source publication repository system. EPrints already integrates a function to export information as RDF, using the BIBO ontology. We naturally used this function, post-processing its output to obtain a representation consistent with the other (future) datasets in data.open.ac.uk, in particular in terms of URI scheme. The ORO data currently represents 13,283 articles and 12 patents, in approximately 340,000 triples (see for example the article “Molecular parameters of post impact cooling in the Boltysh impact structure”).
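To give a flavour of what this post-processing involves, here is a minimal sketch of rewriting EPrints URIs into a data.open.ac.uk URI scheme. The exact patterns LUCERO uses are not shown in this post, so both the source pattern and the target namespace below are illustrative assumptions, not the actual code.

```python
import re

# Hypothetical example: map an EPrints item URI such as
# http://oro.open.ac.uk/id/eprint/22296 into an (assumed)
# data.open.ac.uk namespace, working directly on N-Triples text.
def rewrite_oro_uris(ntriples: str) -> str:
    return re.sub(
        r"http://oro\.open\.ac\.uk/id/eprint/(\d+)",
        r"http://data.open.ac.uk/oro/document/\1",
        ntriples,
    )

triple = ('<http://oro.open.ac.uk/id/eprint/22296> '
          '<http://purl.org/dc/terms/title> "Some article title" .')
print(rewrite_oro_uris(triple))
# -> <http://data.open.ac.uk/oro/document/22296> <http://purl.org/dc/terms/title> "Some article title" .
```

In practice the post-processing is richer than a single regex (it also aligns properties and classes across datasets), but the principle is the same: deterministic rewriting of the exported RDF into the house URI scheme.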

Podcast data is extracted from the collection of RSS feeds obtained from podcast.open.ac.uk, using a variety of ontologies, including the W3C media ontology and FOAF (see for example the podcast “Great-circle distance”). An interesting element of this dataset is that it provides connections to other types of resources at the Open University, including courses (see for example the course MU120, which is referred to by a number of podcasts). Podcasts are also classified into categories, using the same topics used to classify courses at the Open University, as well as the iTunes U categories, which we represent in SKOS (see for example the category “Mathematics”).
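Representing a category in SKOS essentially means minting a URI for it and describing it as a skos:Concept with a label. The sketch below shows the idea in plain N-Triples; the /topic/ URI pattern and the exact properties emitted are assumptions for illustration, not LUCERO's actual output.

```python
# Standard SKOS and RDF namespaces.
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def category_triples(slug: str, label: str) -> list:
    """Build N-Triples describing a category as a SKOS concept.
    The URI pattern is a hypothetical one for this example."""
    uri = f"<http://data.open.ac.uk/topic/{slug}>"
    return [
        f"{uri} <{RDF}type> <{SKOS}Concept> .",
        f'{uri} <{SKOS}prefLabel> "{label}"@en .',
    ]

for t in category_triples("mathematics", "Mathematics"):
    print(t)
```

Because podcasts and courses share the same category concepts, a single topic URI links resources of both kinds.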

While representing only a small fraction of the data we will ultimately expose through data.open.ac.uk, the new possibilities created by openly exposing these datasets in RDF, with a SPARQL endpoint and resolvable URIs, are already very exciting. In a blog post, Tony Hirst has shown some initial examples and encouraged others to share their queries over the Open University’s linked data. Richard Cyganiak has also kindly created a CKAN description of our datasets, for others to find and exploit.

The technical aspects

In a previous blog post, we gave an overview of the technical workflow by which data from the original sources ends up being exposed as linked data. The current platform implements parts of this workflow, including updaters and extractors for the two datasets considered. At the centre of the platform is the triple store. After trying several options, including Sesame, Jena TDB and 4Store, we settled on SwiftOWLIM, which is free, scalable and efficient, and includes limited reasoning capabilities that might prove useful in the future.

The current platform also implements the mechanisms by which URIs in the http://data.open.ac.uk namespace are resolved. Very simply, a URI such as http://data.open.ac.uk/course/a330 is redirected either to http://data.open.ac.uk/page/course/a330 or to http://data.open.ac.uk/resource/course/a330, depending on the content type requested by the client. http://data.open.ac.uk/page/course/a330 shows a browsable web page linking the considered resource to related ones, while http://data.open.ac.uk/resource/course/a330 provides the RDF representation of the resource.
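The redirection decision boils down to inspecting the HTTP Accept header. Here is a minimal sketch of that logic, assuming a simplified set of RDF media types; the real server's matching rules (and its handling of quality values in the Accept header) are more complete than this example.

```python
# Media types we treat as a request for RDF (a simplified, assumed list).
RDF_TYPES = ("application/rdf+xml", "text/turtle", "application/x-turtle")

def negotiate(uri: str, accept: str) -> str:
    """Return the redirect target for a data.open.ac.uk URI:
    /resource/... for RDF clients, /page/... for HTML browsers."""
    path = uri.replace("http://data.open.ac.uk/", "", 1)
    if any(media_type in accept for media_type in RDF_TYPES):
        return f"http://data.open.ac.uk/resource/{path}"
    return f"http://data.open.ac.uk/page/{path}"

print(negotiate("http://data.open.ac.uk/course/a330", "application/rdf+xml"))
# -> http://data.open.ac.uk/resource/course/a330
print(negotiate("http://data.open.ac.uk/course/a330", "text/html"))
# -> http://data.open.ac.uk/page/course/a330
```

This is the standard linked data content negotiation pattern: one identifier for the thing, with human- and machine-readable representations reached via a 303 redirect.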

A SPARQL endpoint is also available, which allows querying either the whole set of data or individual datasets through their namespaces, http://data.open.ac.uk/context/oro and http://data.open.ac.uk/context/podcast.
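Restricting a query to one dataset amounts to wrapping the pattern in a GRAPH clause naming its context. The sketch below builds such a request URL with the standard library; note that the /sparql endpoint path is an assumption here, so check the site for the actual address.

```python
from urllib.parse import urlencode

# Query podcast titles only, by scoping the pattern to the
# podcast named graph (context).
query = """
SELECT ?title WHERE {
  GRAPH <http://data.open.ac.uk/context/podcast> {
    ?podcast <http://purl.org/dc/terms/title> ?title .
  }
} LIMIT 10
"""

# A SPARQL GET request is just the query, URL-encoded, in the
# 'query' parameter (per the SPARQL protocol).
url = "http://data.open.ac.uk/sparql?" + urlencode({"query": query})
print(url[:60])
```

Dropping the GRAPH clause queries across all loaded datasets at once, which is where cross-repository questions (e.g., podcasts about a given course) become possible.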

What’s next?

Of course, this first version of data.open.ac.uk is only the beginning of the story. We are currently actively looking at how to represent and extract information about courses and qualifications from the Study at the OU website, as well as information about places on the OU campus and in regional centres (buildings, car parks, etc.).

More ways to access the data will also soon be made available, including faceted search/browsing, and links to external datasets are being investigated. All this will be gradually integrated into the platform while the existing data is constantly updated.

Initial Overview of the LUCERO Workflow

A large part of the technical development of LUCERO will consist of a set of tools to extract RDF from existing OU repositories, load this RDF into a triple store and expose it through the Web. This might sound simple, but the reality is that achieving it with sources that are constantly changing, and that originally work in isolation, requires a workflow that is at the same time efficient, flexible and reusable.

The diagram below gives an initial overview of what such a workflow will look like for the institutional repositories of the Open University considered in the project. It involves a mix of specific components, whose implementation must take into account the particular characteristics of the dataset considered (e.g., an RDF Extractor component depends on the input data), and generic components, which are reusable across all datasets. Our approach to deploying this workflow is that each component, specific or generic, is realised as a REST service. The workflow for a given dataset is then materialised by a scheduling programme, calling the appropriate components/services in the appropriate order.
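The scheduling idea above can be sketched very simply: the workflow for a dataset is an ordered list of component services, and the scheduler calls each in turn, feeding the output of one into the next. The component names and the call mechanism below are illustrative assumptions (in the real system each call would be an HTTP request to a REST service).

```python
def run_workflow(dataset: str, components: list, call) -> list:
    """Call each component service in order, threading the payload
    through, and record each step's result."""
    results = []
    payload = dataset
    for component in components:
        payload = call(component, payload)
        results.append((component, payload))
    return results

# A stub 'call' standing in for an HTTP request to each REST service:
# it just records which service processed what.
log = run_workflow(
    "podcast",
    ["update-detector", "rdf-extractor", "store-loader"],
    lambda component, payload: f"{component}({payload})",
)
print(log[-1][1])
# -> store-loader(rdf-extractor(update-detector(podcast)))
```

Because components are independent services, the same scheduler can materialise a different workflow for each dataset simply by changing the list it is given.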

LUCERO Workflow

One of the points worth noticing in this diagram is the way updates are handled. A set of (mostly) specific components is in charge of detecting, at regular intervals, what is new, what has been removed and what has been modified in a given dataset. These components then generate a list of new items to be extracted into RDF, and a list of obsolete items (either deleted elements of data, or previous versions of updated items). The choice here is to re-create the set of RDF triples corresponding to obsolete items, so that they can be removed from the triple store. This assumes that the RDF extraction process consistently generates the same triples from the same input items over time, but has the advantage that updates only need to be tracked in the early stages of the workflow, making it simpler and more flexible.
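The detection step amounts to diffing two snapshots of a source: items that are new or changed go to the extraction list, while deleted items and previous versions of changed items go to the obsolete list (whose triples are re-generated and removed from the store). A minimal sketch, assuming items can be represented as a dict from identifier to content:

```python
def diff_updates(previous: dict, current: dict):
    """Compare two snapshots of a source.
    Returns (new items to extract, obsolete items whose triples
    must be re-generated and removed from the store)."""
    new = {k: v for k, v in current.items()
           if k not in previous or previous[k] != v}
    obsolete = {k: previous[k] for k in previous
                if k not in current or previous[k] != current.get(k)}
    return new, obsolete

previous = {"ep1": "title A", "ep2": "title B"}
current = {"ep2": "title B2", "ep3": "title C"}
new, obsolete = diff_updates(previous, current)
print(sorted(new))       # -> ['ep2', 'ep3']
print(sorted(obsolete))  # -> ['ep1', 'ep2']
```

Note that a modified item ("ep2") appears in both lists: its old triples are removed and its new ones added, which only works because extraction is deterministic, as the text assumes.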

Another crucial element concerns the way the different datasets connect to each other. Indeed, the workflow is intended to run independently for each dataset. A linking phase is planned right after RDF extraction (currently left out of the workflow), but this is essentially meant as a way to connect local datasets to external ones. Here, we realise the connections between entities of different local datasets through the use of an Entity Name System (ENS). The role of the ENS (inspired by what was done more globally in the Okkam project) is to support the RDF Extractor components in using well-defined, shared URIs for common entities. It implements the rules for generating OU data URIs from particular characteristics of the considered object (e.g., creating the adequate URI for a course using the course code), independently of the dataset where the object is encountered. In practice, implementing such rules and ensuring their use across datasets will remove the barriers between the considered repositories, creating connections based on common objects such as courses, people, places and topics.
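The kind of rule the ENS implements can be sketched in a few lines: mint a shared URI for a common entity from one of its characteristic attributes, here a course code. The normalisation rule below (strip whitespace, lowercase) is an assumption for illustration, not necessarily the one LUCERO uses.

```python
import re

def course_uri(course_code: str) -> str:
    """Mint the shared URI for a course from its code, so that every
    dataset mentioning the course ends up using the same URI."""
    code = re.sub(r"\s+", "", course_code).lower()
    return f"http://data.open.ac.uk/course/{code}"

# Whatever surface form a dataset uses, the same URI comes out,
# which is exactly what links the datasets together.
print(course_uri("MU120"))   # -> http://data.open.ac.uk/course/mu120
print(course_uri("mu 120"))  # -> http://data.open.ac.uk/course/mu120
```

Centralising such rules in one service, rather than duplicating them in every extractor, is what keeps the URIs consistent as new datasets are added.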

Hello World!

And Welcome to the LUCERO project!

As you have probably noticed, this is not exactly the first post on this blog. As you can see from the about page, LUCERO is a JISC-funded project, and JISC encourages us to use the project blog also as a reporting tool. The first 7 posts therefore correspond to the project workplan and, while some of it may be of interest, the rest is mostly administrative material.

To summarize what all that says, LUCERO is a one-year project started last June at The Open University (OU), with the goal of setting up and launching a complete infrastructure, both technical and organizational, for the exposure of educational and research content as linked data. A higher education organization such as the OU typically creates and manages vast amounts of data in different repositories, including library catalogs, publication bases, staff directories, A/V material, course descriptions, etc. Linked data is a set of principles and technologies for publishing data on the Web in a standard, machine-readable format, where every piece of information is “Web addressable”, i.e., identified by a URI, and so linkable. This makes it possible to integrate data seamlessly, creating on the Web a network of data, much like there is currently a network of documents. This is obviously a very brief and general summary (you can have a look at the Linked Data Horizon Scan document for more), but already, the potential of applying this to large University repositories should be evident. What LUCERO intends to do is to create an OU Web of Data, connecting all the repositories with each other and with external datasets. More importantly, we want to use this experience to provide reusable software and guidelines for other similar organizations to take advantage of the linked data approach, by setting up and sustaining the exposure of educational content as linked data.

Of course, we don’t start from scratch. Several organizations have already been publishing data online, such as the BBC, the UK government or, closer to us, the ECS school at the University of Southampton. However, the scale of the task in organizational terms, the non-technological issues involved and the endless possibilities implied by releasing and connecting such data clearly make LUCERO a unique experience. Indeed, one of the goals of LUCERO is also to concretely demonstrate the benefit of linked data, through the development of applications targeting students and researchers, focusing on the domain of Arts. But that should really be the topic of another blog post later in the project, as will be the details of the different datasets we consider, the tools we experiment with and the issues we will need to tackle.

Budget

The forecast project budget for Lucero totals £165,108 (£100,000 JISC funding with the balance – 39% – from institutional contributions). The detailed budget forecast breakdown is given in the table below.

                              06/10 – 03/11   04/11 – 05/11      Total
Directly Incurred Staff             £65,171         £13,064    £78,234
Directly Incurred non-staff          £1,300            £100     £1,400
Total Directly Incurred             £66,471         £13,164    £79,634
Directly Allocated                  £21,405          £4,284    £25,689
Indirect Costs                      £49,820          £9,964    £59,784
Total Project Costs                £137,696         £27,412   £165,108

The Lucero project Directly Incurred staff costs fund a full-time Research Assistant working in the Knowledge Media Institute. This post will carry out the key roles of extracting, converting and building the linked data datasets, creating the prototype search systems and carrying out the key technical work of the project. Overseeing this work will be the Project Director, who has 20% of his overall time allocated to this project. Both have extensive experience in creating and exposing linked data.

The other Directly Incurred costs cover the work required to plan and extract library catalogue data from the Voyager library management system and to provide support to the project from metadata and Information Management experts. The Information Management expertise is expected to be of particular help in looking at embedding linked data in workflows and in information management practice.

The final element of Directly Incurred staff costs is a 0.5 FTE Project Manager. This post is essential in keeping the project on track and to budget. Given the project's short duration and the wide range of stakeholders, it is essential that the project is managed effectively and that all reporting requirements are complied with.

The Directly-Incurred non-staff costs include a small amount to fund attendance at programme and dissemination events.

Directly-allocated costs include stakeholders allocated to Chair the Steering Group and provide strategic advice, liaison and guidance. An important element of the directly-allocated staff costs is the time from the managers of the key Arts Faculty databases that will be targeted as part of Workpackage 4. This time will be spent allowing the Project team to understand the data content within these databases, understand any licence or rights restrictions and consider the best methods of extracting and converting their data.

Finally there are directly allocated costs for Estates and indirect charges.

The project budget will be maintained by the Knowledge Media Institute Administration Manager Jane Whild in conjunction with the KMI Unit Accountant.

The Project Manager will be in regular contact with the KMI finance team to monitor and report on the budget to the Project Team and Steering Group. An initial budget meeting has already taken place (30 June 2010) with the Project Director and Project Manager and the KMI, Library and Arts Finance teams to plan and agree the budget management processes.

Projected Timeline, Workplan & Overall Project Methodology

The project is divided into 7 workpackages. The first includes the necessary management tasks, covering the project’s basic infrastructure, reporting and documentation. (WP1: Project Management including Programme-level engagement)

Timeline for WP1.

The second workpackage is concerned with the first task considered by LUCERO: creating the technical infrastructure necessary to expose University content as linked data and to semi-automatically create meaningful connections within this data and with external resources. (WP2: Exposing University content as linked data)

Timeline for WP2.

Beyond the purely technological aspect of exposing University content as linked data, we will consider new procedures so that the practices of linked data are integrated within the current activities in a sustainable way. This is the goal of the third workpackage of the project. (WP3: Integrating linked data in University practices)

Timeline for WP3.

Exposing data as linked data is only the first step of the process, as it enables a new kind of application that benefits from the meaningful connections established in, or derived from, the data. The benefit brought by such applications constitutes the main incentive for the broader education and research community to adopt the principles of linked data. Therefore, the fourth workpackage of LUCERO concerns building demonstrators showing the value of using linked data to both academics and students. (WP4: Demonstrating the value of linked data to researchers and students)

Timeline for WP4.

LUCERO’s benefit to end-users will be directly demonstrated through prototype applications (see Objective 3). However, the institution-wide benefits of linked data practices for data exposure and connection will only be fully understood in the long term, after the project has ended. Therefore, particular effort should be spent as part of the project to ensure the sustainability and continuity of the activities started during the project, as well as to identify potential issues related to this sustainability. This also includes activities around the evaluation of the project and the dissemination of its results, which are the topics of workpackages 5, 6 and 7. (WP5: Sustainability in the exposure of linked data. WP6: Project Evaluation. WP7: Dissemination)

Timelines for WP5, WP6 and WP7.

Overall, the Lucero project will be managed according to the Open University Prince2 project methodology. There will be weekly project team meetings with action points, regular Project Steering Group meetings, regular project reporting, and risk/issue logs to ensure the project achieves its aims and objectives.


Project Team Relationships and End User Engagement

The core LUCERO team includes researchers from the Knowledge Media Institute of the Open University, as well as a project manager from the Open University’s Library:

Dr. Mathieu d’Aquin (project director) is a researcher at KMi. He obtained a PhD from the University of Nancy, France, where he worked on real-life applications of semantic technologies to knowledge management and decision support in the medical domain. As a member of the EU Integrated Project NeOn, Mathieu has researched large-scale infrastructures for the discovery, indexing and exploitation of semantic data (e.g. the Watson Semantic Web search engine, Cupboard system for managing semantic information spaces), as well as in numerous research prototypes in concrete applications of these developments.

Fouad Zablith is a researcher and PhD candidate at the Knowledge Media Institute of the Open University. Within the LUCERO project, he is working on modelling and deploying data in various university contexts, published within http://data.open.ac.uk. His PhD research is in the Semantic Web area, focusing on ontology evolution from external domain data by reusing various sources of background knowledge. Fouad is also the web consultant of the Open Arts Archive project (http://www.openartsarchive.org), responsible for the implementation and maintenance of the website.

Salman Elahi is a research assistant at KMi. He obtained his Master’s degree in Knowledge Management and Engineering from the University of Edinburgh. Prior to joining KMi, he worked as a Software Engineer on projects related to the use of semantic technologies to enhance search systems in the domain of Freshwater Sciences. At KMi, he is involved in the Watson and Cupboard projects. He has also started a part-time PhD looking at issues related to identity and personal information management.

Prof. Enrico Motta is Professor of Knowledge Technologies at KMi and a leading international scientist in the area of Semantic Technologies, with extensive experience of both fundamental and applied research. Over the years, he has authored more than 200 refereed publications and collaborated with a variety of organizations, including Nokia, Rolls-Royce, Fiat, Philips, and the United Nations, to name just a few, while receiving close to £7M in external research funding.

Owen Stephens

Owen Stephens is the Project Manager for LUCERO. He joined the Open University in 2009 and was previously Project Manager for the JISC-funded TELSTAR (Technology enhanced learning supporting students to achieve academic rigour) project delivered at the Open University. Owen also works as an independent consultant to the library sector. As well as having a strong technical background, he has been on the management team of the library services of two leading UK universities, where he was responsible for a number of innovative projects at both institutional and national levels. Owen was Project Director for the EThOSNet project to launch the national e-theses service based at the British Library, and is the founder of the ‘Mashed Libraries’ events in the UK.

Stuart Brown (@stuartbrown) is Web Developments and Online Communities manager at The Open University. Involved in implementing the OU’s move to Drupal as its default CMS, Stuart is interested both in the reuse of OU linked data within the OU’s web publishing environment and in ensuring that OU web publishing activity plays a role in the OU’s continued publication of linked data. Working with colleagues and systems across the OU, Stuart hopes to help ensure that a linked data approach becomes part of core OU activity.

Richard Nurse is Digital Libraries Programme Manager at the Open University Library.  Richard joined the Open University in 2009 and leads on Digital Library and website initiatives. He has considerable experience of library systems management and extensive experience of managing funded projects from the National Lottery and Wolfson Challenge Fund. Richard has been a key member of the recent collaborative JISC-funded TELSTAR (Technology enhanced learning supporting students to achieve academic rigour) project delivered at the Open University.

In addition to the core team, the project includes two sets of users and practitioners of linked educational and research data. This includes, on the one hand, library specialists involved in managing data, especially about publications, course materials and archives (ORO, the OU Archive, the Library Catalogue). On the other hand, our project plan includes the exposure of data from specific research projects, involving data modelling, requirements analysis and evaluation realised together with academics from 6 different projects at the Faculty of Arts of the Open University.

IPR (Creative Commons Use & Open Source Software License)

Part of the technology employed in the project is developed externally and available as open source software. Technologies developed at the Open University, in particular relating to the semantic management of information, will also be employed. These are to a large extent already available as open source. The final software will therefore be made available as open source software, which can be reused and further developed in other organisations. New software will mostly be distributed under the LGPL and EPL.

Deciding which licence to apply to the data itself is a slightly more complicated issue. One element that needs to be taken into account is the source of the original data. The ORO repository, library catalogue, archive materials, OpenLearn and iTunes U content are available for educational and non-commercial use, mostly under Creative Commons licences. Only the publishable parts of the staff directory will be considered for linked data exposure. For new content and links, we are currently investigating the applicability of a variety of licences such as CC0 and PDDL.

All content produced, including reports, blogs and documentation, will be made available under a Creative Commons Attribution licence.


Risk Analysis and Success Plan

Measuring the success of a project such as LUCERO, which investigates innovative and emerging technologies within a short period of time, is not an easy task. We follow three main directions in evaluating the results of the project:

  1. through the benefit of the developed linked-data-based applications to students and academics, i.e., how the deployed applications have increased access to and usage of data in these communities.
  2. through the successful application and deployment of the devised procedures within the Open University to create sustainable linked data exposure, i.e., having evidence that the rate and quality of linked data exposure will be sustained beyond the project, and that new applications will be developed (possibly by others) on top of this data.
  3. through the adoption of the practices put in place within the project by other education and/or research organizations. Indeed, one of the major goals of the project is to document the experience acquired in setting up such practices within the OU, in order to provide guidelines for others to engage with linked data.

Of course, the realization of such a project does not come without risks. Most obviously, as in many technological projects, the availability of key resources, technologies, skills and staff is essential. More specifically, linked data being a very recent set of technologies and principles, the risk of the necessary underlying infrastructure not being ready and mature enough to support the ambitious goals of the project is non-negligible, even if the project includes some of the recognized experts in the area, with many connections in the research community. Finally, in relation to the relative novelty of linked data, important consideration will have to be given to the non-technical issues of exposing University resources, including the legal, business and ethical aspects, each of which raises its own set of specific issues for which very little experience exists so far.

Such risks and issues will be managed through the use of Risks and Issues logs maintained by the Project Manager, discussed at Project Team meetings and reported to Project Steering Group meetings.

Wider Benefits to Sector & Achievements for Host Institution

Exposing educational and research resources and the corresponding connections as linked data creates a potential for broader reuse of their content, impacting potentially large numbers of students and research communities. It also contributes in terms of gained experience, through articulated and evidenced benefits of exposing content and data to broader audiences. The project will aim to document the business process changes required to achieve successful integrated institutional approaches and the behaviours required to facilitate content and data reuse, alongside documenting the development of policy and recommended standards-based, semantic technology interoperability solutions to support the effective exposure of educational and research data as linked data.

Tremendous effort is currently required for students to obtain, for example, an overview of relevant scholarly material concerning a particular topic of a course, or covering elements of some multimedia content. Answering a simple question such as “What courses are available that relate to this BBC programme I have just seen?” would require manually locating the relevant resources, accessing them through many different systems and integrating the results. The same issues apply to researchers, who rely on ad-hoc data collection, access and curation mechanisms, limiting their ability to flexibly exploit the data, and to interpret it in connection with external information.

In addition, for researchers, the management and sharing of data is becoming a major issue. This is often realised through a database maintained by the project officer, who manually enters and cleans the data, linked to a Web interface developed by an IT team. The main issues arising from such a workflow include the inaccessibility of the data to applications other than its dedicated Web interface, the relative isolation of this data, which is not connected to other information, and the non-trivial relation between the database and its Web exposure, which creates a threat to its sustainability.

By proposing clear procedures and technological support for the exposure of research and educational data as linked data, LUCERO will benefit “Users” and members of the Open University, especially:

  • Course and programme teams, through more effective content collection for course and programme creation, as well as the ability to publish course content enriched with links to relevant (data) resources.
  • Students, by providing multiple access points to educational and research data, as well as the availability of new tools to explore relevant resources.
  • Researchers, through the availability of new data analysis tools for linked data, able to surface connections between previously unrelated elements on the basis of links to external datasets.
  • The communication services, through new processes to realise the Web exposure of University content in an efficient and interlinked manner.

In addition, LUCERO embraces the openness that characterizes the Open University and, to a large extent, the linked data movement, in the exposure of educational and research data. Indeed, data published as part of the project will be made accessible freely and openly to any academic, student or researcher, without restrictions.

Finally, besides the direct benefit of the open exposure of linked data, LUCERO intends to engage with the wider community by providing experience reports, guidelines and reusable components, which can be picked up and employed directly by other organizations. As such, we intend to engage with the community of Semantic Web practitioners through interlinking and sharing of practices, as well as with the community of librarians/information managers regarding procedures to manage and expose linked educational and research data.

LUCERO – Aims, Objectives and Final Outputs of the Project

The goal of LUCERO (Linking University Content for Education and Research Online) is to investigate and prototype the use of linked data technologies and approaches to linking and exposing data for students and researchers. Linked data technologies and principles represent emerging practices to format and interconnect information on the Web. Working with groups of learners, researchers and practitioners based at the Open University, LUCERO will scope, prototype, pilot and evaluate reusable, cost-effective solutions relying on linked data for exposing and connecting educational and research content. LUCERO aims in particular to answer the following questions:

“What are the workflows, business processes, policies and technologies needed to expose the Open University and related digital content as linked data?”

“How can we integrate linked data technology in a sustainable way to support the research and educational activities of a Further or Higher Education organisation?”

LUCERO collaborates closely with the Open University Faculty of Arts to prototype and evaluate specific content exposure and linked data applications for researchers working within the Arts and Arts History domains, providing experience on the exposure and connection of research data outputs, and demonstrating their concrete benefits.

Exposing resources as linked data creates a potential for broader reuse of their content, impacting potentially large numbers of students and research communities. In LUCERO, we aim to document the business process changes required to achieve successful integrated institutional approaches and the behaviours required to facilitate content and data reuse, alongside documenting the development of policy and recommended standards-based, semantic technology interoperability solutions to support the effective exposure of educational and research data as linked data.

More precisely, the planned outputs of the project are:

  • The deployment, testing and documentation of a technical infrastructure, a toolkit, to facilitate the creation, exposure and use of linked data, implemented within the Open University but designed to be reusable in other HE/FE institutions. This includes the realisation of interfaces for data creation, storage, publication and semi-automatic linking suitable for use by library staff and academics, and that integrate with their current working environment.
  • The identification, documentation and validation of the processes necessary to integrate linked data in the University’s practices and workflows, including in particular the business, legal, ethical and organisational aspects.
  • Demonstrators of the benefits of exposing educational and research data as linked data through the realisation of applications improving access to educational and research data in the domain of Arts, for both researchers and students.