DOIs for Dataset versions #4499

Open
adam3smith opened this issue Mar 9, 2018 · 19 comments · May be fixed by #9462
Labels
Feature: DOI & Handle HERMES related to @hermes-hmc work on Dataverse code Size: Queued PM has called this issue out specifically for sizing Type: Feature a feature request User Role: Depositor Creates datasets, uploads data, etc.

Comments

@adam3smith
Contributor

Following up on the thread on the google group:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/dataverse-community/34E9foKnxQs

User story

As a researcher, I want to be able to cite a specific version of a dataset using a permanent identifier, to avoid any ambiguity and to make the citation machine-actionable.

How this should probably look

Zenodo would be a good template here.

  1. A dataset has one "generic" DOI that always points to the latest version (this would be used if you e.g. just cite the data generically)
  2. A dataset has one DOI per version, so that a specific version can be cited

Example

https://doi.org/10.5281/zenodo.1041767 is the generic dataset DOI, always pointing to the most recent version; 10.5281/zenodo.1188752 is the DOI for the second (current) version and 10.5281/zenodo.1041768 is the DOI for the first version.
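
To make the two-level scheme concrete, here is a minimal Python sketch. It is not Zenodo's or Dataverse's implementation: the `VersionedDataset` class and its `resolve` method are hypothetical names invented for illustration, and only the three DOIs come from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedDataset:
    """Hypothetical model: one generic DOI plus one DOI per published version."""
    concept_doi: str
    version_dois: list[str] = field(default_factory=list)  # oldest version first

    def resolve(self, doi: str) -> int:
        """Return the 1-based version number that a DOI points to."""
        if doi == self.concept_doi:
            if not self.version_dois:
                raise LookupError("no published versions yet")
            # The generic DOI always follows the most recent version.
            return len(self.version_dois)
        try:
            return self.version_dois.index(doi) + 1
        except ValueError:
            raise LookupError(f"unknown DOI: {doi}") from None


record = VersionedDataset(
    concept_doi="10.5281/zenodo.1041767",
    version_dois=["10.5281/zenodo.1041768", "10.5281/zenodo.1188752"],
)
```

With this model, citing the generic DOI lands on whatever is current, while citing a version DOI keeps pointing at the same version after later releases.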

Relationship to other features

File DOIs:

File DOIs are great (and by themselves necessary), but they are not a replacement for dataset version DOIs. Datasets are often made up of multiple files. An analysis script (which itself may be part of the data and thus versioned) may point to multiple files in a dataset. It's not feasible (or desirable) for a researcher to include the DOI for every file used in a citation. They want to point to one single DOI to reference the exact data (made up of multiple files) they've been using. This is equally true for quantitative and qualitative data, btw.

File DOIs of files that are not changed between versions should remain stable (otherwise there's a potential to massively inflate the number of DOIs issued for no good reason).

UNFs

UNFs (at least for tabular data) help ensure that a cited dataset is the one being used, and so guard against using an incorrect version, but they do not function as identifiers, i.e. UNF:6:edx+kB6SY2N3Zt9OsUbp4A== tells me nothing about where to find that dataset.

@philippconzett
Contributor

It would be good to prioritize this issue as DOI versioning at dataset level 1) is recommended to support reproducibility; 2) is implemented by other repository applications (Zenodo); and 3) is a feature desired by the Dataverse community. See also a current discussion in the Dataverse Google Group: https://groups.google.com/u/1/g/dataverse-community/c/34E9foKnxQs.

@poikilotherm
Contributor

Nice find of this long standing issue!

This is also related to my plans on adding research software depositing support.

@djbrooke
Contributor

Adding content from #7867:

I had a question on DOI and versions within Dataverse. Data in Dataverse gets 1 DOI, even when there are multiple versions present of the data package. Considering that this might hamper machine readability to refer to a specific version of a data package, would a Zenodo implementation be something to consider? That would allow getting people immediately to the correct version of a data package through a specific resolvable DOI (and avoids possible confusion on which version to use).

https://help.zenodo.org/#versioning

I understand that you can refer to a version by for example:

https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/2WZ0S9&version=2.0
https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/2WZ0S9&version=1.0

But that is not really the same as one resolvable DOI (https://doi.org/10.34894/2WZ0S9)
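
The difference can be sketched in a few lines of Python. The helper names `version_landing_url` and `requested_version` are made up for this illustration; only the URL layout (`dataset.xhtml` with `persistentId` and `version` query parameters) is taken from the links above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def version_landing_url(base_url: str, persistent_id: str, version: str) -> str:
    """Build a version-specific landing page URL like the ones quoted above."""
    # Keep ":" and "/" unescaped so the PID stays readable in the query string.
    query = urlencode({"persistentId": persistent_id, "version": version}, safe=":/")
    return f"{base_url}/dataset.xhtml?{query}"


def requested_version(url: str):
    """Return the version named in a URL's query string, or None if absent."""
    return parse_qs(urlsplit(url).query).get("version", [None])[0]


landing = version_landing_url("https://dataverse.nl", "doi:10.34894/2WZ0S9", "2.0")
```

The version only exists as a query parameter on the repository's own URL. The resolvable DOI URL carries no version information at all, so `requested_version("https://doi.org/10.34894/2WZ0S9")` has nothing to return.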

@poikilotherm poikilotherm added the HERMES related to @hermes-hmc work on Dataverse code label Feb 5, 2022
@poikilotherm
Contributor

While writing the concept paper for HERMES https://arxiv.org/abs/2201.09015, this issue moved inside my head...

It is crucial for Dataverse software to support this to be compliant with FORCE11 Software Citation guidelines.

It should be an optional feature, configurable like File PIDs.

I pretty much dislike how Zenodo did this: from a version DOI it's cumbersome to get to the concept DOI.

How about using the same trick as we did for File PIDs and add a /... after the DOI?

Sth. like https://doi.org/10.11111/FOOBAR/171727/V1?

For software, it would be beneficial to use the software version instead of the dataset version. We could take the software version from the dataset's metadata and either use it exclusively or register both as DOIs.

When also enabling File PIDs, how does this currently work? When a file has its own DOI, does it change when the file is updated?

@pdurbin
Member

pdurbin commented Feb 7, 2022

When also enabling File PIDs, how does this currently work? When a file has its own DOI, does it change when the file is updated?

@poikilotherm good question. As far as I can tell, this isn't well explained in the guides. I also looked at pull request #4350 where file PIDs were introduced. For simplicity, let's just talk about DOIs.

DOIs are stored on the dvobject database table, so when a new file/dvobject is created while file-level PIDs are enabled, a new DOI is created for that file. Even if you use "file replace", a new DOI is created. The old file and the new file will have both different database IDs and different DOIs.

Now, if you are only changing the metadata of a file (description, tags, etc.) a new DOI is not created for the file. So you can safely fix typos in file descriptions, for example, without worrying about the DOI of the file changing.
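
As a toy illustration of the behaviour described here (not Dataverse code), the Python sketch below models files as records with a database ID and a DOI. All names (`DataFile`, `create_file`, `replace_file`, `edit_description`) and the way the file DOI is derived from the database ID are assumptions made for the example.

```python
import itertools
from dataclasses import dataclass

_next_db_id = itertools.count(1)  # stands in for the database ID sequence


@dataclass
class DataFile:
    db_id: int
    doi: str
    name: str
    description: str = ""


def create_file(dataset_doi: str, name: str) -> DataFile:
    """A new file record always gets a new database ID and a new DOI."""
    db_id = next(_next_db_id)
    # Hypothetical DOI layout: dataset DOI, a slash, then the database ID.
    return DataFile(db_id=db_id, doi=f"{dataset_doi}/{db_id}", name=name)


def replace_file(old: DataFile, dataset_doi: str, new_name: str) -> DataFile:
    """'File replace' creates a new record, so the DOI changes too."""
    return create_file(dataset_doi, new_name)


def edit_description(data_file: DataFile, description: str) -> None:
    """A metadata-only edit updates the record in place; the DOI is untouched."""
    data_file.description = description


original = create_file("10.11111/FOOBAR", "survey.csv")
replacement = replace_file(original, "10.11111/FOOBAR", "survey_fixed.csv")
doi_before_edit = replacement.doi
edit_description(replacement, "Corrected a typo in the description")
```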

I hope this helps!

@poikilotherm
Contributor

So there is no need to worry about files for version PIDs, right? The version / DvObject knows which files belong to it and in which version, so nothing will break from this, aye?

WDYT about the idea of a version suffix?

@pdurbin
Member

pdurbin commented Feb 7, 2022

I'm not worried about anything breaking.

What we have now is version numbers like 3.0 and 4.1. If you supply either of these numbers to Dataverse, it will give back to you the correct files for the version. You can look at the downloadAllFromVersion code in pull request #7086, for example, to see how it works.

With DOIs for versions, it just means 3.0 and 4.1 (examples from above) will still work as before but you'd also be able to supply an alias, if you will, a DOI, for each version, to download files (or whatever operation).
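
A rough Python sketch of that "alias" idea follows. It is not how Dataverse looks up versions; the `find_version` function, the dictionary layout, and the example DOIs and file names are all placeholders.

```python
def find_version(versions: dict, selector: str) -> dict:
    """Look up a dataset version by version number or by its DOI alias."""
    # A version number such as "3.0" keeps working exactly as before.
    if selector in versions:
        return versions[selector]
    # Otherwise treat the selector as a version DOI.
    for entry in versions.values():
        if entry.get("doi") == selector:
            return entry
    raise KeyError(f"no version matches {selector!r}")


# Placeholder data: version numbers from the comment, made-up DOIs and files.
versions = {
    "3.0": {"doi": "10.11111/XYZABC.DEFGHI", "files": ["data.tab", "README.txt"]},
    "4.1": {"doi": "10.11111/XYZABC.JKLMNO", "files": ["data.tab", "README.txt", "script.R"]},
}
```

Either selector leads to the same entry, so existing links that use version numbers would not need to change.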

I'm not opposed to the version suffix idea but my understanding is that you're generally not supposed to encode anything meaningful into DOIs. They're supposed to just be meaningless strings.

@scolapasta
Contributor

There are different schools of thought on this. With file DOIs we did it this way, doi:xxx/1, doi:xxx/2, etc., to identify two files within the same dataset. For versions we had discussed using "." as the separator, e.g. doi:xxx.1 and doi:xxx.2.

While this is not meaningless, it is not encoding anything about the content, just the connection between the two.

@poikilotherm
Contributor

poikilotherm commented Feb 7, 2022

As I wrote before - Zenodo uses a DOI for the "concept" or whole dataset and completely unrelated ones for "versions". And it's really hard to tell them apart, as they are of the same length etc.

For the software use case (and presumably others), IMHO it would be nice to make it more obvious we are talking about versions of a dataset. If that means skipping the software version because it's real metadata, @pdurbin, that's fine with me as long as we have an identifier for the dataset version.

Currently there is no way to jump from an identifier right into a specific version of a dataset, as resolvers remove any query parts etc.

@scolapasta using the dots is also a great idea! Maybe it just should be a configurable thing to enable broader use cases.

WDYT would be the best approach to get a discussion started? As this will definitely have an impact on the UI, there are more people involved. And what about community consensus?

@Danny-dK

The Figshare approach may be nicer than Zenodo.

Figshare:
https://help.figshare.com/article/can-i-edit-or-delete-my-research-after-it-has-been-made-public
One base doi and versions are appended to the base doi.
https://doi.org/10.6084/m9.figshare.2066037
https://doi.org/10.6084/m9.figshare.2066037.v16
https://doi.org/10.6084/m9.figshare.2066037.v2

@poikilotherm
Contributor

poikilotherm commented Sep 15, 2022

This is precisely what I was intending to do (see above, the delimiter character maybe should be configurable), as I don't like the Zenodo approach, either.

@philippconzett
Contributor

However, contrary to the figshare example, I'd suggest we avoid semantic loading of any parts of the DOI, except for the indication of versions belonging to the same dataset/file, thus using opaque strings and not including branding like "figshare", or in our case "dataverse" or whatever. See the recommendations in section 2.2 Syntax of a DOI name in the DOI Handbook.

@scolapasta
Contributor

As we did for file PIDs, I'd suggest we make both approaches configurable: either a sequential number or a random string, placed after a "." to identify it as a version.
So either:
https://doi.org/10.11111/XYZABC.1
https://doi.org/10.11111/XYZABC.2

or:

https://doi.org/10.11111/XYZABC.DEFGHI
https://doi.org/10.11111/XYZABC.JKLMNO

(I'm more wary of making the . configurable, as I'd prefer the consistency of meaning there, plus not allowing for /, which already means file)
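
The two suffix styles and the "." versus "/" convention could be sketched in Python like this. It is only an illustration of the proposal, not code from Dataverse: `version_pid`, `classify`, the six-character length, and the uppercase-plus-digits alphabet are assumptions.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits  # assumed suffix alphabet


def version_pid(dataset_pid: str, version_number=None, length: int = 6) -> str:
    """Append a version suffix after "." (sequential if a number is given, else random)."""
    if version_number is not None:
        suffix = str(version_number)
    else:
        suffix = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return f"{dataset_pid}.{suffix}"


def classify(pid: str, dataset_pid: str) -> str:
    """Tell dataset, file and version PIDs apart by the separator that follows."""
    if pid == dataset_pid:
        return "dataset"
    if pid.startswith(dataset_pid + "/"):
        return "file"      # "/" already means file
    if pid.startswith(dataset_pid + "."):
        return "version"   # "." would mean version
    return "unrelated"


sequential = version_pid("10.11111/XYZABC", version_number=2)
opaque = version_pid("10.11111/XYZABC")
```

Keeping the separator fixed is what lets `classify` work without any lookup; if the separator were configurable per installation, the meaning of a PID could no longer be read from its shape alone.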

@poikilotherm
Contributor

poikilotherm commented Sep 15, 2022

@philippconzett I don't think the proposal by @Danny-dK was about incorporating something like this. Plus Dataverse is already configurable to do this if desired via the shoulder setting.

@poikilotherm
Contributor

I just found https://doi.org/10.1371/journal.pbio.2001414 and think their statement on including versioning in the PID is valuable for this discussion (see "Lesson 6. Implement a version-management policy"):

Embedding versioning in identifiers is recommended if the prevailing use of an unversioned identifier results in “breaking changes” (e.g., a change in the hypothesized cause of a disease). However, if new information about the entity emerges slowly and the changes are “nonbreaking”, it is reasonable to instead maintain a machine-actionable change history in the entity’s metadata.

Which would mean: we should create version PIDs only for "major" changes, not "minor". Is our change history accessible via API?
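
A small Python sketch of such a policy, assuming version numbers of the form `major.minor` (as in the 3.0 and 4.1 examples earlier in this thread) and assuming that a minor part of 0 marks a major release. The function names are hypothetical.

```python
def is_major_release(version: str) -> bool:
    """Treat "N.0" as a major release and "N.M" with M > 0 as a minor one."""
    _major, minor = version.split(".")
    return int(minor) == 0


def versions_needing_pid(history: list[str]) -> list[str]:
    """Keep only the versions that would get their own PID under this policy."""
    return [version for version in history if is_major_release(version)]


history = ["1.0", "1.1", "2.0", "2.1", "2.2", "3.0"]
with_pid = versions_needing_pid(history)
```

Under this rule minor versions would remain reachable only through the change history in the metadata, which is why the API question matters.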

I also like their approach with the "version after dot". It might still be worth the "trouble" to make the separation character configurable, though.

@mreekie

mreekie commented Jan 18, 2023

backlog prio meeting:

  • This overlaps with work that Oliver is doing, and we should coordinate with him on sizing and the work

@mreekie mreekie added the Size: Queued PM has called this issue out specifically for sizing label Jan 23, 2023
@mreekie

mreekie commented Jan 23, 2023

sizing:

  • PM added to ordered sizing queue

poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Mar 22, 2023
This commit adds a new scope and setting to the JvmSettings,
enabling the configuration of different modes for Dataset Version PIDs.
These modes are depicted in VersionPidMode. A test ensures the
parsability.

In addition, VersionPidMode also contains a fine grained option
to change the conduct of Dataverse collections and their datasets
for these PIDs.
@poikilotherm poikilotherm linked a pull request Mar 22, 2023 that will close this issue
27 tasks
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Mar 22, 2023
…d conduct IQSS#4499

This commit adds a public method DataverseServiceBean.wantsDatasetVersionPids()
that will determine how to deal with a dataset version (which belongs to a dataset
that lives within a collection) in terms of "should a PID be registered/updated?".

The background is: when a dataset is published, there will be the context of the
owning Dataverse collection. It's important to take into account the configured
conduct for the collection in the decision how to go ahead with a version's PID.
@poikilotherm
Contributor

poikilotherm commented Mar 23, 2023

Just a heads up for anyone interested - work on this has been started within #9462.

poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Mar 24, 2023
For a first set of attributes (name, alias, description and, most
important for the PR about IQSS#4499, the dataset version PID conduct),
make an endpoint available that allows changes via simple PUT
HTTP commands.
poikilotherm added a commit to poikilotherm/dataverse that referenced this issue Apr 19, 2023
…QSS#4499

This commit adds the methods necessary to be implemented by any provider
that supports version PIDs. By default, the interface will ensure that
exceptions are thrown about unsupported actions, when the methods are not
implemented. Extensive JavaDoc describes details.
@pdurbin pdurbin added Type: Feature a feature request User Role: Depositor Creates datasets, uploads data, etc. labels Oct 7, 2023
@cmbz

cmbz commented Sep 16, 2024

2024/09/16: Keeping, currently in Dataverse backlog.
