DOIs for Dataset versions #4499

adam3smith · 2018-03-09T18:07:46Z

Following up on the thread on the google group:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/dataverse-community/34E9foKnxQs

User story

As a researcher, I want to be able to cite, using a permanent identifier, a specific version of a dataset to avoid any ambiguity and to make the citation machine-actionable.

How this should probably look

Zenodo would be a good template here.

A dataset has one "generic" DOI that always points to the latest version (this would be used if you e.g. just cite the data generically).
A dataset has one DOI per version, so that a citation can point to a specific version.

Example

https://doi.org/10.5281/zenodo.1041767 is the generic dataset DOI, always pointing to the most recent version, with 10.5281/zenodo.1188752 being the DOI for the 2nd (current) version and 10.5281/zenodo.1041768 the DOI for the first version
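The relationship between the generic DOI and the version DOIs in this example can be pictured as a small lookup structure. The sketch below is illustrative only: `VersionedDataset` and its fields are made-up names, not a Zenodo or Dataverse API, and the DOIs are the ones quoted above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VersionedDataset:
    """Toy model: one generic DOI plus one DOI per published version."""
    concept_doi: str
    version_dois: List[str] = field(default_factory=list)  # oldest first

    def resolve(self, doi: str) -> int:
        """Return the 1-based version number a DOI points to."""
        if doi == self.concept_doi:
            # The generic DOI always follows the most recent version.
            return len(self.version_dois)
        return self.version_dois.index(doi) + 1


ds = VersionedDataset(
    concept_doi="10.5281/zenodo.1041767",
    version_dois=["10.5281/zenodo.1041768", "10.5281/zenodo.1188752"],
)
```

Here `ds.resolve(...)` gives the current version for the generic DOI and a fixed version for each version DOI, which is the property a citation of a specific version needs.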

Relationship to other features

File DOIs:

File DOIs are great (and by themselves necessary), but they are not a replacement for dataset version DOIs. Datasets are often made up of multiple files. An analysis script (which itself may be part of the data and thus versioned) may point to multiple files in a dataset. It's not feasible (or desirable) for a researcher to include the DOI for every file used in a citation. They want to point to one single DOI to reference the exact data (made up of multiple files) they've been using. This is equally true for quantitative and qualitative data, btw.

File DOIs of files that are not changed between versions should remain stable (otherwise there's a potential to massively inflate the number of DOIs issued for no good reason)

UNFs

UNFs (at least for tabular data) help ensure that a cited dataset is the one being used, so they avoid using incorrect versions, but they do not function as identifiers, i.e. UNF:6:edx+kB6SY2N3Zt9OsUbp4A== tells me nothing about where to find that dataset.

philippconzett · 2021-01-22T07:17:56Z

It would be good to prioritize this issue as DOI versioning at dataset level 1) is recommended to support reproducibility; 2) is implemented by other repository applications (Zenodo); and 3) is a feature desired by the Dataverse community. See also a current discussion in the Dataverse Google Group: https://groups.google.com/u/1/g/dataverse-community/c/34E9foKnxQs.

poikilotherm · 2021-01-23T09:13:17Z

Nice find of this long standing issue!

This is also related to my plans on adding research software depositing support.

djbrooke · 2021-05-11T21:26:35Z

Adding content from #7867:

I had a question on DOI and versions within Dataverse. Data in Dataverse gets 1 DOI, even when there are multiple versions present of the data package. Considering that this might hamper machine readability to refer to a specific version of a data package, would a Zenodo implementation be something to consider? That would allow getting people immediately to the correct version of a data package through a specific resolvable DOI (and avoids possible confusion on which version to use).

https://help.zenodo.org/#versioning

I understand that you can refer to a version by for example:

https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/2WZ0S9&version=2.0
https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/2WZ0S9&version=1.0

But that is not really the same as one resolvable DOI (https://doi.org/10.34894/2WZ0S9)
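To make the difference concrete, the sketch below builds the version-specific landing-page URL from the pieces shown above. The helper name `version_url` is hypothetical; only the URL pattern is taken from the two examples.

```python
from urllib.parse import urlencode


def version_url(base: str, pid: str, version: str) -> str:
    """Build a landing-page URL that pins a dataset version via the query string."""
    # Keep ':' and '/' unescaped so the result matches the URLs quoted above.
    query = urlencode({"persistentId": pid, "version": version}, safe=":/")
    return f"{base}/dataset.xhtml?{query}"


url = version_url("https://dataverse.nl", "doi:10.34894/2WZ0S9", "2.0")
```

The version lives in the query string of the repository URL rather than in the identifier itself, so a citation that carries only the DOI does not say which version was used.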

poikilotherm · 2022-02-05T09:54:56Z

While writing the concept paper for HERMES https://arxiv.org/abs/2201.09015, this issue moved inside my head...

It is crucial for Dataverse software to support this to be compliant with FORCE11 Software Citation guidelines.

It should be an optional feature, configurable like File PIDs.

I pretty much dislike how Zenodo did this: from a version DOI it's cumbersome to get to the concept DOI.

How about using the same trick as we did for File PIDs and add a /... after the DOI?

Sth. like https://doi.org/10.11111/FOOBAR/171727/V1?

For software, it would be beneficial to use the software version instead of the dataset version. We could use the software version from the datasets metadata and either use it exclusively or register both as DOI.

When also enabling File PIDs, how does this currently work? When a file has its own DOI, does it change when the file is updated?

pdurbin · 2022-02-07T16:08:24Z

When also enabling File PIDs, how does this currently work? When a file has its own DOI, does it change when the file is updated?

@poikilotherm good question. As far as I can tell, this isn't well explained in the guides. I also looked at pull request #4350 where file PIDs were introduced. For simplicity, let's just talk about DOIs.

DOIs are stored on the dvobject database table, so when a new file/dvobject is created while file-level PIDs are enabled, a new DOI is created for that file. Even if you use "file replace", a new DOI is created. The old file and the new file will have both different database IDs and different DOIs.

Now, if you are only changing the metadata of a file (description, tags, etc.) a new DOI is not created for the file. So you can safely fix typos in file descriptions, for example, without worrying about the DOI of the file changing.

I hope this helps!
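The behaviour described in this comment can be summarised with a toy model. This is not Dataverse code: the class, the function names and the DOI suffix scheme are assumptions made purely to illustrate "new object, new DOI" versus "metadata edit, same DOI".

```python
import itertools
from dataclasses import dataclass

_ids = itertools.count(1)


@dataclass
class DataFile:
    """Toy stand-in for a file row; the DOI is tied to the database id."""
    db_id: int
    doi: str
    description: str = ""


def create_file(dataset_doi: str, description: str = "") -> DataFile:
    """A new file object gets a new database id and therefore a new DOI."""
    db_id = next(_ids)
    # Hypothetical suffix scheme, for illustration only.
    return DataFile(db_id, f"{dataset_doi}/{db_id}", description)


def replace_file(old: DataFile, dataset_doi: str) -> DataFile:
    """'File replace' creates a new object, so the DOI differs from the old one."""
    return create_file(dataset_doi, old.description)


def edit_description(f: DataFile, text: str) -> DataFile:
    """Metadata-only edits keep the same object and the same DOI."""
    f.description = text
    return f


original = create_file("10.11111/FOOBAR", "raw daat")
original_doi = original.doi
fixed = edit_description(original, "raw data")
replaced = replace_file(original, "10.11111/FOOBAR")
```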

poikilotherm · 2022-02-07T16:12:51Z

So there is no need to worry about files for version PIDs, right? The version / DvObject knows which files belong to it and in which version, so nothing will break from this, aye?

WDYT about the idea of a version suffix?

pdurbin · 2022-02-07T21:24:25Z

I'm not worried about anything breaking.

What we have now is version numbers like 3.0 and 4.1. If you supply either of these numbers to Dataverse, it will give back to you the correct files for the version. You can look at the downloadAllFromVersion code in pull request #7086, for example, to see how it works.

With DOIs for versions, it just means 3.0 and 4.1 (examples from above) will still work as before but you'd also be able to supply an alias, if you will, a DOI, for each version, to download files (or whatever operation).
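A minimal sketch of the "alias" idea, assuming a simple lookup table: the version number keeps working, and a version DOI is just a second key for the same version. All file names and identifiers below are invented for illustration, and `files_for` is not an existing Dataverse function.

```python
from typing import Dict, List

# Files per version number, as the existing version lookup would return them.
files_by_version: Dict[str, List[str]] = {
    "3.0": ["data.tab", "readme.txt"],
    "4.1": ["data.tab", "readme.txt", "codebook.pdf"],
}

# Hypothetical version DOIs acting as aliases for version numbers.
version_doi_alias: Dict[str, str] = {
    "doi:10.11111/XYZABC.DEFGHI": "3.0",
    "doi:10.11111/XYZABC.JKLMNO": "4.1",
}


def files_for(version_ref: str) -> List[str]:
    """Accept either a version number or a version DOI and return the files."""
    version = version_doi_alias.get(version_ref, version_ref)
    return files_by_version[version]
```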

I'm not opposed to the version suffix idea but my understanding is that you're generally not supposed to encode anything meaningful into DOIs. They're supposed to just be meaningless strings.

scolapasta · 2022-02-07T21:27:38Z

There are different schools of thought on this. With file DOIs we did it this way, doi:xxx/1, doi:xxx/2, etc., to distinguish two files within the same dataset. For versions we had discussed using "." as the separator, e.g. doi:xxx.1 and doi:xxx.2.

While this is not meaningless, it is not encoding anything about the content, just the connection between the two.

poikilotherm · 2022-02-07T21:56:31Z

As I wrote before - Zenodo uses a DOI for the "concept" or whole dataset and completely unrelated ones for "versions". And it's really hard to tell them apart, as they are of the same length etc.

For the software use case (and presumably others), IMHO it would be nice to make it more obvious we are talking about versions of a dataset. If that means skipping the software version because it's real metadata, @pdurbin, that's fine for me as long as we have an identifier for the dataset version.

Currently there is no way to jump from an identifier right into a specific version of a dataset, as resolvers remove any query parts etc.

@scolapasta using the dots is also a great idea! Maybe it just should be a configurable thing to enable broader use cases.

WDYT would be the best approach to get a discussion started? As this will definitely have an impact on the UI, there are more people involved. And what about community consensus?

Danny-dK · 2022-09-15T10:51:44Z

The Figshare approach may be nicer than Zenodo.

Figshare:
https://help.figshare.com/article/can-i-edit-or-delete-my-research-after-it-has-been-made-public
One base doi and versions are appended to the base doi.
https://doi.org/10.6084/m9.figshare.2066037
https://doi.org/10.6084/m9.figshare.2066037.v16
https://doi.org/10.6084/m9.figshare.2066037.v2
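With this scheme the base DOI can be recovered from a version DOI by string handling alone. The sketch below assumes the suffix always has the form `.v<number>`, as in the three examples; `split_version` is a hypothetical helper, not part of any Figshare or Dataverse API.

```python
import re
from typing import Optional, Tuple

# Matches a trailing ".v<number>" as seen in the Figshare examples above.
_VERSION_SUFFIX = re.compile(r"^(?P<base>.+?)\.v(?P<version>\d+)$")


def split_version(doi: str) -> Tuple[str, Optional[int]]:
    """Split a DOI into (base DOI, version); version is None for the base DOI."""
    match = _VERSION_SUFFIX.match(doi)
    if match is None:
        return doi, None
    return match.group("base"), int(match.group("version"))
```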

poikilotherm · 2022-09-15T11:04:01Z

This is precisely what I was intending to do (see above, the delimiter character maybe should be configurable), as I don't like the Zenodo approach, either.

philippconzett · 2022-09-15T14:23:34Z

However, contrary to the figshare example, I'd suggest we avoid semantic loading of any parts of the DOI, except for the indication of versions belonging to the same dataset/file, thus using opaque strings and not including branding like "figshare", or in our case "dataverse" or whatever. See the recommendations in section 2.2 Syntax of a DOI name in the DOI Handbook.

scolapasta · 2022-09-15T15:19:14Z

As we did for file PIDs, I'd suggest we enable both ways to configure: either sequential numbers or a random string (after a "." to identify that it's a version).
So either:
https://doi.org/10.11111/XYZABC.1
https://doi.org/10.11111/XYZABC.2

or:

https://doi.org/10.11111/XYZABC.DEFGHI
https://doi.org/10.11111/XYZABC.JKLMNO

(I'm more wary of making the . configurable, as I'd prefer the consistency of meaning there, plus not allowing for /, which already means file)
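The two styles, and the fixed meaning of "." versus "/", could be sketched as follows. The function names, the `style` parameter and the six-letter random suffix are assumptions for illustration; they are not existing Dataverse settings.

```python
import secrets
import string


def mint_version_pid(dataset_pid: str, version_index: int, style: str = "sequential") -> str:
    """Append a '.'-separated version part to a dataset PID."""
    if style == "sequential":
        suffix = str(version_index)
    elif style == "random":
        # Opaque six-letter suffix; the length is an arbitrary choice here.
        suffix = "".join(secrets.choice(string.ascii_uppercase) for _ in range(6))
    else:
        raise ValueError(f"unknown style: {style}")
    return f"{dataset_pid}.{suffix}"


def kind_of(pid: str, dataset_pid: str) -> str:
    """Classify a PID relative to its dataset: '.' marks a version, '/' a file."""
    if pid == dataset_pid:
        return "dataset"
    if pid.startswith(dataset_pid + "."):
        return "version"
    if pid.startswith(dataset_pid + "/"):
        return "file"
    return "unrelated"
```

Keeping the separator fixed is what makes `kind_of` possible without a database lookup; a configurable separator would require the classifier to know the configuration as well.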

poikilotherm · 2022-09-15T15:39:19Z

@philippconzett I don't think the proposal by @Danny-dK was about incorporating something like this. Plus Dataverse is already configurable to do this if desired via the shoulder setting.

poikilotherm · 2022-09-27T09:04:15Z

I just found https://doi.org/10.1371/journal.pbio.2001414 and think their statement on including versioning in the PID is valuable for this discussion (see "Lesson 6. Implement a version-management policy"):

Embedding versioning in identifiers is recommended if the prevailing use of an unversioned identifier results in “breaking changes” (e.g., a change in the hypothesized cause of a disease). However, if new information about the entity emerges slowly and the changes are “nonbreaking”, it is reasonable to instead maintain a machine-actionable change history in the entity’s metadata.

Which would mean: we should create version PIDs only for "major" changes, not "minor". Is our change history accessible via API?

I also like their approach with the "version after dot". It might still be worth the "trouble" to make the separation character configurable, though.
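A minimal sketch of the "major only" policy, assuming that version numbers have the form `N.M` (like the 3.0 and 4.1 mentioned earlier in the thread) and that `N.0` marks a major version. Both function names are hypothetical.

```python
def is_major_release(version: str) -> bool:
    """Treat 'N.0' as a major version and anything else as minor."""
    _major, minor = version.split(".")
    return int(minor) == 0


def needs_version_pid(version: str, minor_too: bool = False) -> bool:
    """Register a version PID for major versions, and for minor ones only on request."""
    return minor_too or is_major_release(version)
```

The `minor_too` switch stands in for whatever configuration option would let an installation opt into PIDs for minor versions as well.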

mreekie · 2023-01-18T19:18:45Z

backlog prio meeting:

This overlaps with work that Oliver is doing, and we should coordinate with him on the sizing and on the work itself.

mreekie · 2023-01-23T15:11:05Z

sizing:

PM added to ordered sizing queue

This commit adds a new scope and setting to the JvmSettings, enabling the configuration of different modes for Dataset Version PIDs. These modes are depicted in VersionPidMode. A test ensures the parsability. In addition, VersionPidMode also contains a fine grained option to change the conduct of Dataverse collections and their datasets for these PIDs.

…d conduct IQSS#4499 This commit adds a public method DataverseServiceBean.wantsDatasetVersionPids() that will determine how to deal with a dataset version (which belongs to a dataset that lives within a collection) in terms of "should a PID be registered/updated?". The background is: when a dataset is published, there will be the context of the owning Dataverse collection. It's important to take into account the configured conduct for the collection in the decision how to go ahead with a version's PID.
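The decision described in this commit message can be pictured roughly as below. This is a guess at the shape of the logic, not the code from the pull request: the enum values, the inheritance rule and every name other than the idea of an instance-wide mode plus a per-collection conduct are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    """Hypothetical instance-wide switch (stand-in for VersionPidMode)."""
    OFF = "off"
    MAJOR = "major"
    MINOR = "minor"  # major and minor versions


class Conduct(Enum):
    """Hypothetical per-collection setting."""
    INHERIT = "inherit"
    SKIP = "skip"
    MAJOR = "major"
    MINOR = "minor"


@dataclass
class Collection:
    conduct: Conduct = Conduct.INHERIT
    parent: Optional["Collection"] = None


def effective_conduct(collection: Collection) -> Conduct:
    """Walk up the collection tree until a non-inheriting conduct is found."""
    current = collection
    while current.conduct is Conduct.INHERIT and current.parent is not None:
        current = current.parent
    return current.conduct


def wants_version_pid(mode: Mode, collection: Collection, is_major: bool) -> bool:
    """Decide whether publishing this version should register a PID."""
    if mode is Mode.OFF:
        return False
    conduct = effective_conduct(collection)
    if conduct is Conduct.SKIP:
        return False
    if conduct is Conduct.INHERIT:
        # Nothing configured up to the root: fall back to the instance mode.
        return is_major or mode is Mode.MINOR
    return is_major or conduct is Conduct.MINOR


root = Collection()
lab = Collection(conduct=Conduct.MINOR, parent=root)
archive = Collection(conduct=Conduct.SKIP, parent=root)
project = Collection(parent=lab)  # inherits MINOR from lab
```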

poikilotherm · 2023-03-23T08:47:16Z

Just a heads up for anyone interested - work on this has been started within #9462.

For a first set of attributes (name, alias, description and, most important for the PR about IQSS#4499, the dataset version PID conduct), make an endpoint available that allows changes via simple PUT HTTP commands.

…QSS#4499 This commit adds the methods necessary to be implemented by any provider that supports version PIDs. By default, the interface will ensure that exceptions are thrown about unsupported actions, when the methods are not implemented. Extensive JavaDoc describes details.
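The pattern described here (optional operations that fail loudly unless a provider overrides them) can be sketched in Python as a base class whose methods raise by default. The class and method names are invented for illustration and do not mirror the actual provider interface.

```python
class PidProvider:
    """Hypothetical base class: version-PID operations are unsupported by default."""

    def register_version_pid(self, dataset_pid: str, version: int) -> str:
        raise NotImplementedError("this provider does not support version PIDs")

    def update_version_pid(self, version_pid: str) -> None:
        raise NotImplementedError("this provider does not support version PIDs")


class LegacyProvider(PidProvider):
    """A provider that implements nothing extra and so inherits the failures."""


class DotSuffixProvider(PidProvider):
    """A provider that opts in by overriding the registration method."""

    def register_version_pid(self, dataset_pid: str, version: int) -> str:
        # Illustrative scheme only: dataset PID, a dot, then the version number.
        return f"{dataset_pid}.{version}"
```

Callers can then treat the exception as "feature not available for this provider" instead of checking provider types up front.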

cmbz · 2024-09-16T14:11:27Z

2024/09/16: Keeping, currently in Dataverse backlog.

pdurbin added the Feature: DOI & Handle label Oct 13, 2018

pdurbin mentioned this issue Nov 19, 2020

Implement access to the files in the dataset as a virtual folder tree #7084

Closed

pdurbin mentioned this issue May 11, 2021

Feature request: DOI versioning #7867

Closed

philippconzett mentioned this issue Feb 5, 2022

Adding DataCite or other PIDs at the dataverse collection level #5930

Closed

poikilotherm added the HERMES related to @hermes-hmc work on Dataverse code label Feb 5, 2022

adam3smith mentioned this issue Feb 17, 2022

Add versioning notes to metadata #8431

Open

Danny-dK mentioned this issue Sep 28, 2022

[FEATURE] doi versioning UtrechtUniversity/yoda#140

Closed

scolapasta added this to IQSS Dataverse Project Nov 16, 2022

scolapasta moved this to New in IQSS Dataverse Project Nov 16, 2022

scolapasta moved this from New to Dataverse Team (Gustavo) in IQSS Dataverse Project Nov 16, 2022

pdurbin mentioned this issue Jan 4, 2023

ci: fix changed resolved ref for dataverse test, fix recent github ratelimit PR jupyterhub/binderhub#1615

Merged

mreekie added the Size: Queued PM has called this issue out specifically for sizing label Jan 23, 2023

poikilotherm linked a pull request Mar 22, 2023 that will close this issue

4499 - dataset version pids #9462

Draft

27 tasks

pdurbin added Type: Feature a feature request User Role: Depositor Creates datasets, uploads data, etc. labels Oct 7, 2023

poikilotherm added this to Forschungszentrum Jülich Jul 10, 2024

poikilotherm moved this to WIP in Forschungszentrum Jülich Jul 10, 2024

vkush mentioned this issue Oct 7, 2024

PIDs for dataset and file versions needed nfdi4cat/repo4cat#5

Open
