RDF vocabulary design issues — checksums

The SPDX technical team recently encountered an interesting situation while developing our RDF vocabulary.

The scenario was as follows: the information we store about a file is potentially invalidated by every change to the contents of that file. For example, some code might be added that has different licensing requirements. Or all the code that has a particular licensing requirement might be removed. To make sure that the information in an SPDX file is valid for a particular file, it must be possible to verify that its contents are identical to the contents of the file that was analyzed.

In SPDX we support verification of file contents by providing message digests[1] of every file. However, message digest (hash) functions come and go. MD5 used to be the norm but has fallen out of favor. SHA1 is currently very popular but is steadily being replaced with SHA256 and SHA512. We definitely need to support more than one digest algorithm.
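As a rough illustration of the verification step, the following Python sketch computes digests of a file's contents with more than one algorithm and compares them with previously recorded values. The function names and the sample bytes are hypothetical and are not part of any SPDX tool; only the standard hashlib module is assumed.

```python
import hashlib

def compute_digests(data, algorithms=("sha1", "sha256")):
    """Return a mapping of algorithm name to the hex digest of data."""
    return {name: hashlib.new(name, data).hexdigest() for name in algorithms}

def contents_match(data, recorded):
    """Return True if every recorded digest matches the digest of data."""
    actual = compute_digests(data, tuple(recorded))
    return all(actual[name] == value.lower() for name, value in recorded.items())

# Hypothetical file contents at the time the file was analyzed.
analyzed = b"all: zlib\n"
recorded = compute_digests(analyzed)

print(contents_match(analyzed, recorded))                # contents unchanged
print(contents_match(analyzed + b"# edit\n", recorded))  # contents changed
```

Because the recorded values are keyed by algorithm name, a consumer can check whichever algorithms it supports and ignore the rest.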

We had three basic choices:

  • have separate properties for each digest algorithm, each of which would be a sub-property of the checksum property
  • define datatypes for each digest algorithm and have a single checksum property
  • define a class for digests that encapsulates all the data and have a single checksum property

For the first option the graph would look like

<http://zlib.net/zlib-1.2.5.tar.gz#Makefile> a spdx:file;
  spdx:sha1 "1fac389…"^^xs:hexBinary;
  spdx:sha256 "4aa8223…"^^xs:hexBinary.

The resulting graphs are simple and self-explanatory. Support for a new digest algorithm is achieved by adding one new property for that algorithm.

One potential downside is that some tools might not be able to pass through digests for algorithms they do not understand. For example, SPDX provides a tool to translate SPDX RDF data to and from spreadsheets. The tool inserts digests into the spreadsheet by looking for the known digest properties and putting that data into the appropriate column. This means that novel digest types would be lost in translation. This could be avoided if the tool supported OWL inferencing, but it is unlikely we will implement that in the near future. I think requiring OWL inferencing to work properly is a design smell.
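The pass-through problem can be sketched in a few lines of Python. This is not the actual SPDX spreadsheet tool: the function, the column names, and the spdx:sha512 property are hypothetical, and the properties of a file are modeled as plain (property, value) pairs rather than a real RDF graph.

```python
# Digest properties this hypothetical tool was written to recognize.
KNOWN_DIGEST_COLUMNS = {"spdx:sha1": "SHA1", "spdx:sha256": "SHA256"}

def to_spreadsheet_row(file_uri, properties):
    """Copy only the recognized digest properties into spreadsheet columns."""
    row = {"File": file_uri}
    dropped = []
    for prop, value in properties:
        column = KNOWN_DIGEST_COLUMNS.get(prop)
        if column is None:
            dropped.append(prop)  # no column exists, so the value is lost
        else:
            row[column] = value
    return row, dropped

properties = [
    ("spdx:sha1", "1fac389…"),
    ("spdx:sha256", "4aa8223…"),
    ("spdx:sha512", "…"),  # hypothetical property added after the tool shipped
]
row, dropped = to_spreadsheet_row(
    "http://zlib.net/zlib-1.2.5.tar.gz#Makefile", properties
)
print(row)
print(dropped)
```

Without inferencing, the tool has no way to know that an unfamiliar property is a kind of checksum, so the only safe thing it can do with it is leave it out.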

A graph for the second option would look like

<http://zlib.net/zlib-1.2.5.tar.gz#Makefile> a spdx:file;
  spdx:checksum "1fac389…"^^spdx:sha1Hex;
  spdx:checksum "4aa8223…"^^spdx:sha256Hex.

This moves the digest algorithm into the datatype specification of the literals. This approach seems quite elegant. There is a single property, so it is easy for tools to deal with. It is extensible: anyone could define a new datatype. It is relatively compact.

However, there are very few ontologies that use custom datatypes in this way. This could be because there are subtle problems with this approach. Or it could be that it is just uncommon. This approach would break down for algorithms that have any secondary parameters. In that case you could combine it with the third option, though.

A graph for the third option would look like

<http://zlib.net/zlib-1.2.5.tar.gz#Makefile> a spdx:file;
  spdx:checksum [spdx:algorithm <spdx:sha1>;
                 spdx:checksumValue "1fac389…"^^xs:hexBinary];
  spdx:checksum [spdx:algorithm <spdx:sha256>;
                 spdx:checksumValue "4aa8223…"^^xs:hexBinary].

In many ways this approach is very similar to the datatype-based one. The introduction of anonymous resources does, potentially, allow additional parameters to be attached to the digest algorithm. However, tools that do not understand the algorithm would probably not pass that information through correctly. One practical upside is that the label for a particular algorithm can be stored in the RDF. The <spdx:sha1> resource can have a dc:title property with the value “SHA1”. This would mean that tools don’t necessarily have to understand a digest type in order to display it to humans.
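A minimal sketch of that last point, in Python: a tool can display every checksum using the label carried in the data, without a built-in list of algorithms. The data structures are hypothetical stand-ins for an RDF graph, and the spdx:sha512 resource and the titles other than “SHA1” are assumptions made for illustration.

```python
# Each checksum is modeled as (algorithm resource, hex value), mirroring the
# anonymous resources in the graph for the third option.
checksums = [
    ("spdx:sha1", "1fac389…"),
    ("spdx:sha256", "4aa8223…"),
    ("spdx:sha512", "…"),  # hypothetical algorithm the tool has never seen
]

# Labels as they might be read from dc:title statements in the RDF itself.
titles = {"spdx:sha1": "SHA1", "spdx:sha256": "SHA256", "spdx:sha512": "SHA512"}

def display_lines(checksums, titles):
    """Format every checksum for display, using the label carried by the data."""
    lines = []
    for algorithm, value in checksums:
        label = titles.get(algorithm, algorithm)  # fall back to the resource name
        lines.append(f"{label}: {value}")
    return lines

for line in display_lines(checksums, titles):
    print(line)
```

Nothing in display_lines names a specific algorithm, so a checksum for a new algorithm is displayed as long as the data supplies a title for it.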

In the end we decided to use the third approach in SPDX. The additional flexibility was generally found appealing. Having to define a new property for each new digest algorithm was generally viewed as a bit of a kludge. When I presented this issue to the semantic web mailing list there was only one response, which preferred the first option but found the third option acceptable. Most of the people involved in the SPDX effort are not highly experienced RDF modelers. I am not sure if the distaste for defining new properties reflects our relative lack of experience with RDF or if it is more fundamental.

Feedback on this choice is welcome, but this post is more of an exploration of the possibilities and implications of those approaches.

  1. We decided to call this property spdx:checksum. While this is technically a misnomer it does effectively convey the intent of the field.

RDFa as interchange format

The tension between human and machine readability is never greater than when developing interchange formats. Formats that are easy and efficient for computers to read tend to be rather difficult for people to understand. When developing an interchange format you know that there will be few tools supporting it when it is released, so it needs to be useful even with limited tooling. However, the format must support the development of sophisticated tools if it is to succeed in the long run.

A large part of the appeal of XML based languages is that XML provides a reasonable balance between those two factors – for programmers. It can be read easily by computers and understood reasonably well by a programmer with very limited tooling.

For the non-programmer the story is somewhat different. Most XML based languages are complete gibberish to people without significant technical expertise. A business person will need non-trivial tools to allow them to consume the information locked up in an XML file.

Data interchange formats implicitly value the wider distribution of information. Why else would you be exchanging data? It is disappointing that so many of these formats are based on technology that excludes all but those with sophisticated tools or deep technical knowledge. Data interchange formats should be designed first for people, and second for computers. A properly designed data interchange format should be consumable, using commonly available tools, by any person who is familiar with the domain.

This means that XML is pretty much right out.

Fortunately there is HTML+RDFa. RDFa allows RDF data sets to be serialized into HTML documents. The information can then be consumed by humans using any web browser. The raw data can be readily extracted by tools.
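To give a feel for how a tool might pull data back out of such a document, here is a small Python sketch built on the standard html.parser module. It is not a conforming RDFa processor: it only follows an about attribute on an enclosing element and property attributes on elements whose text is the value, and it assumes every element has a closing tag. The markup and the property names (spdx:fileName, spdx:fileType, spdx:license) are hypothetical.

```python
from html.parser import HTMLParser

class PropertyExtractor(HTMLParser):
    """Collect (subject, property, text) triples from simple RDFa-style markup."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._stack = []  # one (subject, property) entry per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An element inherits its subject from its parent unless it sets `about`.
        parent_subject = self._stack[-1][0] if self._stack else None
        subject = attrs.get("about", parent_subject)
        self._stack.append((subject, attrs.get("property")))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return
        subject, prop = self._stack[-1]
        text = data.strip()
        if prop and text:
            self.triples.append((subject, prop, text))

# Hypothetical HTML+RDFa fragment: a human sees a table row, a tool sees data.
SAMPLE = """
<table>
  <tr about="#ChangeLog">
    <td property="spdx:fileName">ChangeLog</td>
    <td property="spdx:fileType">other</td>
    <td property="spdx:license">Zlib</td>
  </tr>
</table>
"""

parser = PropertyExtractor()
parser.feed(SAMPLE)
parser.close()
for triple in parser.triples:
    print(triple)
```

A real consumer would use a full RDFa parser, but the sketch shows that the same markup a browser renders as a table also carries the machine-readable statements.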

Consider the following two examples. Each is part of an SPDX[1] file. In the first example, HTML+RDFa, both groups are easily supported. The information is displayed in a way that can be understood by most humans, and the data is machine readable. In the second, the information is machine readable but quite difficult for humans to interpret.

Files in zlib 1.2.5

Name                    Type    License  Checksum                                     Copyright
                        source  Zlib     gWAPnq8fV6sVKdiYkgJQ1nFoTaXXSqoVfJbMCr9Kzd0  unknown
amiga/Makefile.pup      other   Zlib     plyzzUCxuOx34oiXTdncU9ke14u.SV6UzMhN3UI.3×8  unknown
ChangeLog               other   Zlib     rxKcRCSHu8.4tHMdiJRINMatY4efBCMz.PmHpo.gj9s  unknown
contrib/ada/readme.txt  other   GPL-2.0  j.nlMD8ujot0bHglDnS3xK63zmIS_c51H8Ogzlakf.I  unknown

The following is a similar amount of information expressed in RDF/XML:

<spdx:File rdf:about="https://olex.openlogic.com/package_versions/download/9423?path=openlogic/zlib/1.2.5/openlogic-zlib-1.2.5-all-src-1.zip=3690#CMakeLists.txt">
  <spdx:License rdf:resource="http://spdx.org/licenses/Zlib"/>
<spdx:File rdf:about="https://olex.openlogic.com/package_versions/download/9423?path=openlogic/zlib/1.2.5/openlogic-zlib-1.2.5-all-src-1.zip=3690#ChangeLog">
  <spdx:License rdf:resource="http://spdx.org/licenses/Zlib"/>
<spdx:File rdf:about="https://olex.openlogic.com/package_versions/download/9423?path=openlogic/zlib/1.2.5/openlogic-zlib-1.2.5-all-src-1.zip=3690#FAQ">
  <spdx:License rdf:resource="http://spdx.org/licenses/Zlib"/>
<spdx:File rdf:about="https://olex.openlogic.com/package_versions/download/9423?path=openlogic/zlib/1.2.5/openlogic-zlib-1.2.5-all-src-1.zip=3690#INDEX">
  <spdx:License rdf:resource="http://spdx.org/licenses/Zlib"/>

The extreme accessibility of HTML+RDFa for both humans and machines makes it an obviously superior choice for data interchange formats. HTML+RDFa is a relatively new entry into the arena. Hopefully we will see more data formats use this superb technology.

  1. The Software Package Data Exchange project is designing a way to exchange licensing information for software packages. The current phase of development is primarily focused on simple manifest information and on copyright and licensing-related information.