PRONOM and linked data « The LUCERO Project

PRONOM and linked data

26 May 2011 at 17:59

Mathieu

PRONOM is The National Archives’ technical registry, and it is currently being ‘transformed’ to be exposed as linked data. We can of course only welcome such an initiative and be very enthusiastic about this potentially valuable resource. Now, because we are the kind of people who like to criticize (or, more seriously, because our programme manager asked us to give feedback), here are a few comments on things that could be done better.

Most of the description and technical specification of the work relates to the definition of a vocabulary. Apart from all the boring low-level issues (such as “it is in PDF”, “it is not really clear”, etc.), there are two major issues in its definition: 1- the modelling is not really good, and 2- it does not reuse other vocabularies enough. Funnily enough, these two criticisms could be applied to many vocabularies that are created ‘ad hoc’ for a particular project.

A nice big example of bad modelling concerns all the classes used to represent file formats. First, their names are quite seriously misleading. Video is not a video, it is a video type of file format. GIS is the type of file format used by a geographic information system, etc. I really don’t understand how these things could be classes. It seems that the intention was that a class such as ‘Video’ would correspond to what should be called ‘VideoFormat’. In this case, for example, <http://reference.data.gov.uk/id/file-format/13>, which corresponds to the PNG image format, should be an instance of <http://reference.data.gov.uk/technical-registry/formatType/Image_(Raster)>. However, it is not. It is connected to it through the triple {<http://reference.data.gov.uk/id/file-format/13> <http://reference.data.gov.uk/technical-registry/formatType> <http://reference.data.gov.uk/technical-registry/formatType/Image_(Raster)>}, in which case <http://reference.data.gov.uk/technical-registry/formatType/Image_(Raster)> should really be an individual (and have another name, e.g., <http://reference.data.gov.uk/technical-registry/formatType/raster-image-format>). Now, if that wasn’t confusing enough, <http://reference.data.gov.uk/id/file-format/13> is also a class. This one, I have no explanation for. Nor do I know why things such as <http://reference.data.gov.uk/technical-registry/Big_endian> are described as properties.
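To make the class-versus-individual confusion concrete, here is a minimal sketch in plain Python, with triples as tuples. It assumes that "class" and "property" in the published data mean `rdfs:Class` and `rdf:Property` (I have not checked which typing is actually used), and the helper names are mine. It flags any term that is declared as a class or property but also used as a plain value, then shows the two consistent alternatives discussed above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Predicates whose objects are legitimately classes or properties.
SCHEMA_PREDICATES = {
    RDF_TYPE,
    RDFS + "subClassOf",
    RDFS + "subPropertyOf",
    RDFS + "domain",
    RDFS + "range",
}

REG = "http://reference.data.gov.uk/technical-registry/"
PNG = "http://reference.data.gov.uk/id/file-format/13"
FORMAT_TYPE = REG + "formatType"
RASTER = REG + "formatType/Image_(Raster)"
RASTER_INDIVIDUAL = REG + "formatType/raster-image-format"  # the renaming suggested above
BIG_ENDIAN = REG + "Big_endian"


def declared_schema_terms(triples):
    """Resources explicitly typed as a class or a property."""
    return {s for s, p, o in triples
            if p == RDF_TYPE and o in (RDFS_CLASS, RDF_PROPERTY)}


def used_as_values(triples):
    """Resources appearing as the object of an ordinary (non-schema) predicate."""
    return {o for s, p, o in triples if p not in SCHEMA_PREDICATES}


def mixed_roles(triples):
    """Terms that are declared as schema terms and also used like individuals."""
    return declared_schema_terms(triples) & used_as_values(triples)


# The situation as described in this post (typing is an assumption, see above).
as_described = [
    (RASTER, RDF_TYPE, RDFS_CLASS),
    (PNG, RDF_TYPE, RDFS_CLASS),
    (PNG, FORMAT_TYPE, RASTER),
    (BIG_ENDIAN, RDF_TYPE, RDF_PROPERTY),
]

# Alternative 1: the format type is a class and PNG is an instance of it.
option_instance = [
    (RASTER, RDF_TYPE, RDFS_CLASS),
    (PNG, RDF_TYPE, RASTER),
]

# Alternative 2: the format type is an individual, linked through formatType.
option_individual = [
    (PNG, FORMAT_TYPE, RASTER_INDIVIDUAL),
]

for label, triples in [("as described", as_described),
                       ("instance of class", option_instance),
                       ("plain individual", option_individual)]:
    print(label, "->", sorted(mixed_roles(triples)))
```

Only the first dataset reports a term playing both roles (the raster format type). This check is deliberately crude: it does not catch PNG being declared a class, which would need knowledge of what the vocabulary intends file formats to be.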

I’m sure there are quite a few other issues (even if the vocabulary itself seems rather simple, though I haven’t found an RDF-S version of it), including underspecified domains, ranges and classes, untyped objects, etc. I might have missed something, but the naming conventions seem to have been made deliberately confusing. The four core classes are not capitalised and use ‘-’ as a separator. The other classes are capitalised and use ‘_’. Some properties are fully in upper case (MIMETYPE), some have the first letter capitalised, and some have only the first letter of the second word capitalised (with no word separator). The file formats are identified by numbers in the namespace ‘http://reference.data.gov.uk/id/file-format/’, while a human-readable ID (e.g. ‘png1.2’) could easily have been created. Other things such as ‘internal signatures’ are also identified by numbers, in namespaces such as ‘http://reference.data.gov.uk/technical-registry/internalSignature/’. I never understand why so many people seem to want ‘id’ in their namespaces, but if it is done for one, it might as well be done for the others. ‘Big_endian’, as mentioned above, has a nice capital letter for the first word but not the second, while it is described as a property and used as an individual.
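A quick way to see the naming problem is to bucket the names by the convention they follow. The sketch below is only an illustration: the convention labels and regular expressions are my own, and the sample is limited to names that appear in this post, not the full vocabulary.

```python
import re
from collections import defaultdict

# Ordered list of (convention label, pattern); the first full match wins.
# Labels and patterns are illustrative choices, not part of the vocabulary.
PATTERNS = [
    ("UPPERCASE", re.compile(r"[A-Z]+")),
    ("kebab-case", re.compile(r"[a-z]+(?:-[a-z]+)+")),
    ("lowerCamelCase", re.compile(r"[a-z]+(?:[A-Z][a-z]*)+")),
    ("Capitalised_with_underscores", re.compile(r"[A-Z][^_]*(?:_[^_]+)+")),
    ("Capitalised", re.compile(r"[A-Z][a-z]+")),
]


def local_name(term):
    """Last path segment of a URI; bare names are returned unchanged."""
    return term.rstrip("/").rsplit("/", 1)[-1]


def convention_of(name):
    """Label of the first convention the whole name matches, else 'other'."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(name):
            return label
    return "other"


def conventions_in_use(terms):
    """Group the local names of the given terms by naming convention."""
    groups = defaultdict(list)
    for term in terms:
        name = local_name(term)
        groups[convention_of(name)].append(name)
    return {label: sorted(names) for label, names in groups.items()}


# Names and URIs taken from the text of this post.
sample = [
    "http://reference.data.gov.uk/id/file-format/",
    "http://reference.data.gov.uk/technical-registry/formatType/Image_(Raster)",
    "http://reference.data.gov.uk/technical-registry/Big_endian",
    "http://reference.data.gov.uk/technical-registry/internalSignature/",
    "http://reference.data.gov.uk/technical-registry/formatType",
    "MIMETYPE",
    "releaseDate",
    "Video",
    "GIS",
]

for label, names in sorted(conventions_in_use(sample).items()):
    print(f"{label}: {', '.join(names)}")
```

Even on this handful of names, five different conventions show up, which is the point being made above.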

Finally, this vocabulary does not reuse. Almost nothing. The example promotes the use of the Dublin Core vocabulary. A tiny bit of SKOS is used for labels (I’m personally not too sure whether SKOS label properties can be used on things other than SKOS concepts, but that is really only a detail). DC could certainly be used more (e.g., dct:issued instead of releaseDate?). I’m also reasonably convinced that this vocabulary should at least be connected to the W3C Ontology for Media Resources.
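As a rough sketch of what reuse could look like, the snippet below either aligns a local property with a standard one (through `rdfs:subPropertyOf`) or rewrites the data to use the standard term directly. Several things here are assumptions: the full URI of `releaseDate` is not given in this post, so I guess it lives in the technical-registry namespace; I pick `dcterms:issued` as the Dublin Core counterpart, since DC Terms has no `published` property; and the date literal is a placeholder, not a real value.

```python
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
DCTERMS_ISSUED = "http://purl.org/dc/terms/issued"

REG = "http://reference.data.gov.uk/technical-registry/"
PNG = "http://reference.data.gov.uk/id/file-format/13"

# Assumption: the real URI of releaseDate is not stated in the post.
RELEASE_DATE = REG + "releaseDate"

# Local property -> standard property it could be replaced by or aligned with.
MAPPING = {RELEASE_DATE: DCTERMS_ISSUED}


def alignment_triples(mapping):
    """Keep the local terms, but declare them sub-properties of standard ones."""
    return [(local, RDFS_SUBPROPERTY_OF, standard)
            for local, standard in sorted(mapping.items())]


def rewrite_predicates(triples, mapping):
    """Replace mapped local predicates with the standard ones outright."""
    return [(s, mapping.get(p, p), o) for s, p, o in triples]


# Placeholder literal: the actual release date is deliberately not filled in.
data = [(PNG, RELEASE_DATE, "<release date>")]

print(alignment_triples(MAPPING))
print(rewrite_predicates(data, MAPPING))
```

The alignment route is the less invasive of the two: existing data stays valid, while consumers who only know Dublin Core can still find the dates through the declared sub-property relation.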

In a nutshell, I like this vocabulary and the data based on it, and I will use them. They provide a great resource illustrating how easy it is to make wrong modelling choices.
