As part of the extension of the project, I was talking recently with Sean Bechhofer, who is currently looking at producing linked data for the University of Manchester. Part of the discussion naturally concerned reusing things from LUCERO. One thing Sean would have expected to be able to reuse is the set of tools we employed or developed for extracting data from their original sources into RDF. While many parts of the LUCERO technical workflow are reusable, and the extractors are only a small part of it, it is still quite disappointing that these tools are not based on generic mechanisms that can easily be re-applied to other environments, especially because tools to extract RDF from legacy data exist.

This post is therefore meant as a bit of a survey on such tools, their scope and applicability. There are different types of tools that can be considered, depending in particular on the format of the original source.

Generating RDF from Relational Databases

Triplify is one of the first tools we experimented with, in a pilot project that was based on a relational (MySQL) database. The way to use Triplify, if it is not already integrated as a plugin for whatever you are doing, is to define a set of SQL select queries on the database that also include information about the way the results should be converted into RDF. More precisely, it is assumed that each row of results corresponds to an individual, that the first column is an identifier for that individual, and that the other columns are its properties. A simple example of such a query is: SELECT id,name AS 'foaf:name' FROM users. This works very well for simple, easy to transform structures, but tends to become difficult to manage when the RDF graph has to be significantly different from the naive transformation of the database (with queries spanning many tables, or RDF individuals being contributed to by many tables).
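To make the row-to-individual convention concrete, here is a minimal Python sketch of the idea. It is not Triplify itself (which is a PHP tool) and does not reproduce its configuration format: the base URI, the prefix table, the in-memory SQLite database and the function name rows_to_ntriples are all assumptions made for the illustration.

```python
import sqlite3

# Hypothetical prefix table and base URI, chosen only for this illustration.
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}
BASE_URI = "http://example.org/users/"


def rows_to_ntriples(conn, query, base_uri=BASE_URI, prefixes=PREFIXES):
    """Apply the convention described above: the first column identifies the
    individual, every other column is a property named by its column alias."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        subject = f"<{base_uri}{row[0]}>"
        for label, value in zip(columns[1:], row[1:]):
            if value is None:
                continue  # a NULL cell yields no triple
            # Aliases are expected to look like 'prefix:local'; an unknown
            # prefix raises KeyError in this simplified sketch.
            prefix, _, local = label.partition(":")
            predicate = f"<{prefixes[prefix]}{local}>"
            literal = str(value).replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} {predicate} "{literal}" .')
    return triples


# Stand-in for the MySQL database of the post: a tiny in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

# Same query as in the text, with the alias in double quotes (standard SQL
# identifier quoting) instead of MySQL-style single quotes.
QUERY = 'SELECT id, name AS "foaf:name" FROM users ORDER BY id'

for triple in rows_to_ntriples(conn, QUERY):
    print(triple)
```

The sketch also hints at why this approach gets harder to manage as the target graph diverges from the table layout: everything about the RDF shape has to be encoded in the column aliases of the SQL queries.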

D2RQ takes a slightly different approach to Triplify: instead of trying to create an RDF dump of a relational database, it lets you create a mapping that relates the structure of the database to RDF triples, and it transforms SPARQL queries into SQL queries at run-time using this mapping. The D2RQ mapping language is reasonably simple, as shown in the example below, and can express many intricate relationships from the database. Another advantage is that the D2RQ tool can create a default ‘naive’ mapping from the database schema, which can then be customized (therefore facilitating the first steps of managing the transformation process). If the database evolves, however, the mapping can become quite hard to maintain. Another disadvantage is that the run-time query transformation approach is not very efficient (although it helps keep the data up-to-date). It is worth noting, however, that D2RQ can also create an RDF dump of the content of the database using the same mapping.

map:Conference a d2rq:ClassMap;
    d2rq:dataStorage map:Database1;
    d2rq:class :Conference;
    d2rq:uriPattern "http://conferences.org/comp/confno@@Conferences.ConfID@@";
    .
map:eventTitle a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:Conference;
    d2rq:property :eventTitle;
    d2rq:column "Conferences.Name";
    d2rq:datatype xsd:string;
    .
map:location a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:Conference;
    d2rq:property :location;
    d2rq:column "Conferences.Location";
    d2rq:datatype xsd:string;
    .
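The d2rq:uriPattern line above is what ties a database row to an RDF resource: each @@Table.Column@@ placeholder is filled in with the value of that column. The following Python sketch illustrates only that substitution step. The row values and the function name expand_uri_pattern are made up, and the percent-encoding is a simplification of my own; D2RQ's actual encoding rules may differ.

```python
import re
from urllib.parse import quote

# Matches placeholders of the form @@Table.Column@@.
PLACEHOLDER = re.compile(r"@@([^@]+)@@")


def expand_uri_pattern(pattern, row):
    """Replace each @@Table.Column@@ placeholder with the matching row value."""
    def substitute(match):
        value = row[match.group(1)]
        # Percent-encode the value so that it stays a valid URI component
        # (an assumption of this sketch, not a statement about D2RQ).
        return quote(str(value), safe="")
    return PLACEHOLDER.sub(substitute, pattern)


pattern = "http://conferences.org/comp/confno@@Conferences.ConfID@@"
# Hypothetical row, keyed by qualified column name.
row = {"Conferences.ConfID": 23541, "Conferences.Name": "Example Conference"}

print(expand_uri_pattern(pattern, row))
# prints http://conferences.org/comp/confno23541
```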

ODEMapster is based on the R2O mapping language, which, like that of D2RQ, can establish relations between the structure of a database and the way it can be exported into RDF. ODEMapster, however, focuses more on the creation of OWL ontologies from the content of databases. It is currently available as a plugin of the NeOn Toolkit for Ontology Engineering. Similarly, RDBToOnto tries to semi-automatically extract populated ontologies, relying on both the schema of the database and content patterns in it.

It is also worth noting here that the W3C has set up an RDB2RDF working group, in charge in particular of defining a common language, a set of requirements and test cases for transforming relational databases (RDB) into RDF. The working group has in particular produced a survey of existing approaches in this area.

Generating RDF from XML (including RSS)

There has been less work on converting XML sources into RDF than on converting from relational databases. One of the reasons, paradoxically, is that XML and RDF share a common base, at least in terms of syntax (i.e., RDF/XML uses an XML syntax, XML can be made, somehow, RDF friendly and RSS 1.0 is, in principle, already in RDF). There have therefore been quite a few examples of syntactic conversions of XML to RDF/XML, using in particular XSLT.

The GRDDL language recommended by the W3C is intended to provide a standard and systematic way to achieve such XSLT-based transformations, by making it possible to declare that XML documents include data compatible with RDF. It has been used extensively, for example, for the conversion of microformats.
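To give an idea of what such a syntactic conversion amounts to, here is a small Python sketch that rewrites a made-up XML document into RDF/XML. It uses Python's ElementTree rather than XSLT, so it only illustrates the kind of restructuring an XSLT stylesheet would perform. The input document, the base URI and the vocabulary namespace are all hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
VOCAB_NS = "http://example.org/vocab#"    # hypothetical vocabulary
BASE_URI = "http://example.org/course/"   # hypothetical base URI

# Made-up input document, for illustration only.
XML_SOURCE = """<courses>
  <course id="c1"><title>Intro to RDF</title><level>1</level></course>
  <course id="c2"><title>Linked Data</title><level>2</level></course>
</courses>"""


def xml_to_rdfxml(xml_text):
    """Turn each <course> into an rdf:Description and each child element
    into a property element in the (hypothetical) vocabulary namespace."""
    source = ET.fromstring(xml_text)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    for course in source.findall("course"):
        description = ET.SubElement(
            rdf,
            f"{{{RDF_NS}}}Description",
            {f"{{{RDF_NS}}}about": BASE_URI + course.get("id")},
        )
        for child in course:
            prop = ET.SubElement(description, f"{{{VOCAB_NS}}}{child.tag}")
            prop.text = (child.text or "").strip()
    return rdf


# Readable prefixes in the serialized output instead of ns0, ns1, ...
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("ex", VOCAB_NS)

rdf_root = xml_to_rdfxml(XML_SOURCE)
print(ET.tostring(rdf_root, encoding="unicode"))
```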

Generating RDF from tables and spreadsheets

In many domains, including ours, data simply come in tabular format, through spreadsheets and CSV files. Transforming such formats, intended to make the data easily sharable between people, can be quite a challenge.

Google Refine is a tool meant as an easy way to clean, transform and explore data in a tabular format. It can import from many different sources, including MS Excel, Google Spreadsheet and CSV, and includes a number of useful features to work on the data. While it was not originally developed to support RDF export, it is extensible. The RDF Extension has been created to allow export into RDF (with a graphical definition of the mappings between the table and RDF), as well as to include useful tools for connecting the content of the table to external linked datasets.

Other tools exist, such as Any23 or QUIDICRC, that provide a simple, direct transformation of CSV files into RDF.
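A direct transformation of this kind can be sketched in a few lines of Python. This is not how Any23 or QUIDICRC work internally, only an illustration of the naive mapping (one row per resource, one column per property). The sample data, the base URI, the vocabulary namespace and the function name csv_to_ntriples are assumptions.

```python
import csv
import io

BASE_URI = "http://example.org/person/"   # hypothetical base URI
VOCAB_NS = "http://example.org/vocab#"    # hypothetical vocabulary

# Made-up sample data; the second row has an empty 'city' cell.
CSV_SOURCE = "id,name,city\n1,Alice,Manchester\n2,Bob,\n"


def csv_to_ntriples(csv_text, id_column="id"):
    """Map each row to a resource and each non-empty cell to a literal triple,
    using the column header as the local name of the property."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE_URI}{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the identifier itself and empty cells
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{VOCAB_NS}{column}> "{literal}" .')
    return triples


for triple in csv_to_ntriples(CSV_SOURCE):
    print(triple)
```

The result is the 'generic RDF' one would expect: property names come straight from the column headers, so any alignment with existing vocabularies has to be added afterwards.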

More specific sources and generic frameworks

Many other tools exist that can be used to convert legacy data into RDF, from small specific tools to generic frameworks (see http://www.w3.org/wiki/ConverterToRdf for a more complete list).

For example, SIMILE RDFIzer is a set of specialized converters for a large variety of input formats. Of relevance to the education domain, we can note for example marcmods2rdf, which converts library catalog records to RDF, oai2rdf, which can extract RDF from open archive repositories (OAI-PMH), and ocw2rdf, which can extract RDF from MIT OpenCourseWare metadata.

Even outside RDFIzer, a number of converters can be found that take specialized formats as input and export them into RDF using particular vocabularies. We can mention for example Bibtex2RDF, which converts bibliographical references in the Bibtex format, or the Youtube2RDF tool developed in the LUCERO project, which converts Youtube playlists into RDF using media vocabularies.

Conclusion

As can be seen from above, one can sometimes find many options to convert legacy data into RDF, depending on the original format of the data, and on the particular requirements regarding the transformation process. This list is obviously not complete.

The main issue regarding the use of these tools, however, is not the choice, but rather their integration and adaptation into the right environment. Some tools require effort to create and maintain a mapping between the original source and RDF, which might end up being very time consuming (possibly more than creating dedicated, ad-hoc converters like we did in LUCERO). Other converters do not require such configuration, but produce ‘generic RDF’ that might not fit the considered requirements. Some might say for example that a generic conversion of MARC to RDF is inconceivable. Finally, when having to convert from many different sources with disparate formats, managing the use of multiple tools, their outputs (and especially the overall consistency of the produced RDF) and their scheduling might become a difficult challenge.