Dataset (and other DCAT) versioning

David Browning edited this page Nov 7, 2018 · 4 revisions

Currently a work in progress

This page is for discussion and proposals related to the Dataset versioning topic. The requirements that need to be addressed include

together with related use cases

Implicit in these requirements is that the lifecycle of data publication on the web creates a need to relate instances of datasets, distributions etc. to one another, typically because they have something in common: they are different versions of some underlying entity. [Note that other relationships between datasets also arise naturally and are alluded to in the requirements, such as "subset". Also, DCAT-rev has introduced some potentially appropriate properties for the dcat:Resource class with the sub-properties of dct:relation.]

Work done during the requirements discovery and discussions can be found here and here. In addition, techniques that use Prov-O are described at Tracking versions with PAV (linked from the DXWG home page).

Also strongly relevant to this (IMHO) is the section on Versioning best practice in Data on the Web Best Practices. Amongst other things, this highlights that what constitutes a versionable change varies widely across different information domains, different publishers and even different datasets from the same publisher. What drives this variation (publisher technical constraints and attempts to make data more easily consumed are only two of the drivers) isn't necessarily strongly relevant to how it can be supported via DCAT and other vocabularies (if only to ensure the appropriate level of semantic commitment). For the purposes of determining how we revise DCAT (and any examples or guidance that we provide), it is important to recognize that DCAT aims to provide a common framework across different domains, so any detailed support is more likely to be in a specialised vocabulary and/or described via a profile than built into DCAT itself.

A closely related topic to versioning has to do with identification, particularly of datasets, but more broadly of dcat:Resource.

Initial Discussion & Background

Looking at the requirements as recorded above (and through them into the relevant use cases as recorded in the UCR), we have five distinct requirements, each of which addresses a distinct aspect of versioning. If we look at these one at a time:

Issue #93 - Version subject

This can be summarised as "What aspects of the DCAT backbone can be subject to versioning?" As the discussion in DWBP makes clear, there are many examples of practice where datasets and distributions can have multiple versions. New versions of datasets can come about simply because more information has been gathered over time (or simply by publisher choice), while new versions of distributions would arise from the same processes but could also arise from correction of an error in generating a distribution. [It's also within the publisher's gift to see these activities as giving rise to different independent instances of datasets or distributions. DCAT can trivially support that kind of process if that's how the publisher wants to do it, but that's not the presumption of the requirements we have been asked to address, nor of DWBP.]

Beyond the obvious candidates for versioning (dcat:Dataset and dcat:Distribution), the other aspects we need to decide upon are probably

  • dcat:Resource (essentially adding dcat:DataService as well)
  • dcat:Catalog - which may be stretching it a bit - Views?
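To make the question of "version subject" concrete, here is a minimal sketch in Python that records version links as plain triples and reports which classes take part in them. The `ex:` IRIs and the helper `versioned_classes` are hypothetical. `dct:isVersionOf` and `pav:previousVersion` are existing DCMI and PAV terms, but using them in this way is an assumption made for illustration, not a decision of this group.

```python
# Hypothetical example IRIs. The property names are existing DCMI / PAV terms,
# but applying them like this is an assumption, not an agreed DCAT pattern.
RDF_TYPE = "rdf:type"
DCT_IS_VERSION_OF = "dct:isVersionOf"
PAV_PREVIOUS_VERSION = "pav:previousVersion"
VERSION_LINKS = (DCT_IS_VERSION_OF, PAV_PREVIOUS_VERSION)

triples = [
    ("ex:bus-stops", RDF_TYPE, "dcat:Dataset"),
    ("ex:bus-stops-v1", RDF_TYPE, "dcat:Dataset"),
    ("ex:bus-stops-v2", RDF_TYPE, "dcat:Dataset"),
    # Two dataset versions of the same underlying entity
    ("ex:bus-stops-v1", DCT_IS_VERSION_OF, "ex:bus-stops"),
    ("ex:bus-stops-v2", DCT_IS_VERSION_OF, "ex:bus-stops"),
    ("ex:bus-stops-v2", PAV_PREVIOUS_VERSION, "ex:bus-stops-v1"),
    # A distribution re-issued to correct an error in generating it
    ("ex:bus-stops-v2-csv", RDF_TYPE, "dcat:Distribution"),
    ("ex:bus-stops-v2-csv-fixed", RDF_TYPE, "dcat:Distribution"),
    ("ex:bus-stops-v2-csv-fixed", PAV_PREVIOUS_VERSION, "ex:bus-stops-v2-csv"),
]


def versioned_classes(triples):
    """Return the classes whose instances are the subject of a version link."""
    types = {s: o for s, p, o in triples if p == RDF_TYPE}
    linked = {s for s, p, _ in triples if p in VERSION_LINKS}
    return {types[s] for s in linked if s in types}


print(sorted(versioned_classes(triples)))
```

The same link shape would apply unchanged if dcat:DataService or dcat:Catalog were also treated as versionable; the sketch takes no position on whether they should be.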

Issue #90 - Version definition

This can be summarised as "Guidance on what conditions (both type and severity) lead to new versions, and why? (e.g. change management for consumers)". Again, DWBP provides a useful story on this, and the implementation report here provides multiple examples of where versioning has been used in different ways to achieve different ends.

We could meet this requirement by providing (links to) some examples. It would be important to show that there are multiple 'right' answers to this, depending on the intentions of publishers, the predilections of consumers and the nature of the information and data being exchanged. We need to ensure that readers understand that this can be designed in different ways for different ends, rather than assuming that our example set is exhaustive.

In terms of a solution, there are a number of choices as to how this could be done. In particular, the discussion in the section of the Qualified Relations wiki page on related datasets is relevant here.

Issue #89 - Version delta

Summarised as "Provide a way to explain what has changed between versions". There are a number of choices as to how this could be done. A key question is to decide what we expect to spell out in the core vocabulary and what can more usefully be left to profiles. (Qualified relations may play a role here too.)
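As one illustration of what a "delta" could contain, the following Python sketch compares two versions of a small tabular dataset and lists the rows added and removed. The data and the `version_delta` helper are hypothetical, and a row-level set difference is only one of many possible notions of change; nothing here is proposed as DCAT vocabulary.

```python
def version_delta(old_rows, new_rows):
    """Describe the change between two versions as added and removed rows.

    A changed row shows up as one removal plus one addition.
    """
    old, new = set(old_rows), set(new_rows)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Hypothetical contents of two versions of the same dataset
v1 = [("stop-1", "High Street"), ("stop-2", "Station Road")]
v2 = [
    ("stop-1", "High Street"),
    ("stop-2", "Station Rd"),      # corrected label
    ("stop-3", "Market Square"),   # newly gathered record
]

delta = version_delta(v1, v2)
print(delta["added"])
print(delta["removed"])
```

Whether such a delta is published as free text, as a structured description, or not at all is exactly the kind of choice that might be left to a profile.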

Issue #91 - Version release date

"Versions should have relevant dates associated with them". To my reading, many (possibly all) of the potentially versionable classes already have some form of date associated with 'release'. We will need to check this once we have decided which core aspects have versioning and have some working examples.
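The check described above could be sketched along these lines, assuming that dct:issued is the property taken to represent 'release' (an assumption for illustration; the choice of property is still open). The `ex:` resources and both helper functions are hypothetical.

```python
from datetime import date

# Hypothetical working examples: each resource maps property names to values.
resources = {
    "ex:bus-stops-v1": {"rdf:type": "dcat:Dataset", "dct:issued": date(2018, 1, 15)},
    "ex:bus-stops-v2": {"rdf:type": "dcat:Dataset", "dct:issued": date(2018, 6, 1)},
    "ex:bus-stops-v2-csv": {"rdf:type": "dcat:Distribution"},  # no release date
}


def missing_release_date(resources):
    """List the resources that carry no dct:issued value."""
    return sorted(iri for iri, props in resources.items() if "dct:issued" not in props)


def in_release_order(resources):
    """Order the dated resources from earliest to latest release."""
    dated = [(props["dct:issued"], iri)
             for iri, props in resources.items() if "dct:issued" in props]
    return [iri for _, iri in sorted(dated)]


print(missing_release_date(resources))
print(in_release_order(resources))
```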

Issue #92 - Version identifier

"Provide a means to identify a version of a dataset". Clearly this links to the requirements for identifiers, particularly those derived from the use case at Modelling identifiers and making them actionable, such as RDID.
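One common pattern, sketched here in Python purely for illustration, is to derive a version-specific IRI by appending a version label to the IRI of the dataset as a whole. The `example.org` IRIs, the date-style label and both helper functions are hypothetical; other identifier schemes would serve the requirement equally well.

```python
from urllib.parse import quote, unquote


def version_iri(dataset_iri, version_label):
    """Build a version-specific IRI by appending a percent-encoded label."""
    return dataset_iri.rstrip("/") + "/" + quote(version_label, safe="")


def split_version_iri(iri):
    """Recover (dataset IRI, version label) from an IRI built by version_iri."""
    base, _, label = iri.rpartition("/")
    return base, unquote(label)


dataset = "http://example.org/dataset/bus-stops"  # hypothetical IRI
v = version_iri(dataset, "2018-11-07")
print(v)
print(split_version_iri(v))
```

Encoding the label keeps the last path segment unambiguous, so the dataset-level IRI and the version label can always be separated again.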