DOIs for Dataset versions #4499
Comments
It would be good to prioritize this issue, as DOI versioning at the dataset level 1) is recommended to support reproducibility; 2) is implemented by other repository applications (Zenodo); and 3) is a feature desired by the Dataverse community. See also a current discussion in the Dataverse Google Group: https://groups.google.com/u/1/g/dataverse-community/c/34E9foKnxQs.
Nice find of this long-standing issue! This is also related to my plans on adding research software depositing support.
Adding content from #7867:
While writing the concept paper for HERMES https://arxiv.org/abs/2201.09015, this issue kept coming back to me. It is crucial for the Dataverse software to support this in order to comply with the FORCE11 Software Citation guidelines. It should be an optional feature, configurable like File PIDs. I rather dislike how Zenodo did this: from a version DOI it's cumbersome to get to the concept DOI. How about using the same trick as we did for File PIDs and add something like

For software, it would be beneficial to use the software version instead of the dataset version. We could take the software version from the dataset's metadata and either use it exclusively or register both as DOIs.

When File PIDs are also enabled, how does this currently work? When a file has its own DOI, does it change when the file is updated?
@poikilotherm good question. As far as I can tell, this isn't well explained in the guides. I also looked at pull request #4350, where file PIDs were introduced. For simplicity, let's just talk about DOIs. DOIs are stored on the

Now, if you are only changing the metadata of a file (description, tags, etc.), a new DOI is not created for the file. So you can safely fix typos in file descriptions, for example, without worrying about the DOI of the file changing. I hope this helps!
So there is no need to worry about files for version PIDs, right? The version / DvObject knows which files belong to it and in which version, so nothing will break from this, aye? WDYT about the idea of a version suffix?
I'm not worried about anything breaking. What we have now is version numbers like 3.0 and 4.1. If you supply either of these numbers to Dataverse, it will give you back the correct files for that version. You can look at the

With DOIs for versions, 3.0 and 4.1 (the examples from above) will still work as before, but you'd also be able to supply an alias, if you will, a DOI, for each version, to download files (or perform whatever operation). I'm not opposed to the version suffix idea, but my understanding is that you're generally not supposed to encode anything meaningful into DOIs. They're supposed to just be meaningless strings.
There are different schools of thought on this. With file DOIs we did it this way: doi:xxx/1, doi:xxx/2, etc., to have two files within the same dataset. For versions we had discussed using "." as the separator, e.g. doi:xxx.1 and doi:xxx.2. While this is not meaningless, it does not encode anything about the content, just the connection between the two.
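To make the two separator conventions concrete, here is a minimal sketch. The function names and the placeholder PID `doi:10.5072/FK2/ABCDEF` are assumptions for illustration only; this is not how Dataverse builds identifiers internally.

```python
from typing import Optional

VERSION_SEPARATOR = "."  # discussed above for dataset versions
FILE_SEPARATOR = "/"     # already used for file PIDs


def version_pid(dataset_pid: str, version_index: int) -> str:
    """Append a version suffix to a dataset PID (hypothetical helper)."""
    return f"{dataset_pid}{VERSION_SEPARATOR}{version_index}"


def file_pid(dataset_pid: str, file_index: int) -> str:
    """Append a file suffix to a dataset PID (hypothetical helper)."""
    return f"{dataset_pid}{FILE_SEPARATOR}{file_index}"


def relation_to_dataset(pid: str, dataset_pid: str) -> Optional[str]:
    """Say how pid relates to dataset_pid, judging by the separator alone."""
    if pid == dataset_pid:
        return "dataset"
    if pid.startswith(dataset_pid + VERSION_SEPARATOR):
        return "version"
    if pid.startswith(dataset_pid + FILE_SEPARATOR):
        return "file"
    return None


if __name__ == "__main__":
    base = "doi:10.5072/FK2/ABCDEF"  # placeholder PID, not a real dataset
    print(version_pid(base, 2))
    print(relation_to_dataset(file_pid(base, 1), base))
```

The suffix only encodes the link back to the parent dataset, which is the property argued for in the comment above.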
As I wrote before, Zenodo uses a DOI for the "concept" or whole dataset and completely unrelated ones for "versions". And it's really hard to tell them apart, as they are of the same length etc. For the software use case (and presumably others), IMHO it would be nice to make it more obvious that we are talking about versions of a dataset. If that means skipping the software version, as it's real metadata, @pdurbin, that's fine with me as long as we have an identifier for the dataset version. Currently there is no way to jump from an identifier right into a specific version of a dataset, as resolvers remove any query parts etc. @scolapasta using the dots is also a great idea! Maybe it should just be configurable to enable broader use cases. WDYT would be the best approach to get a discussion started? As this will definitely have an impact on the UI, more people are involved. And what about community consensus?
The Figshare approach may be nicer than Zenodo. Figshare:
This is precisely what I was intending to do (see above; the delimiter character should maybe be configurable), as I don't like the Zenodo approach, either.
However, contrary to the Figshare example, I'd suggest we avoid semantic loading of any parts of the DOI, except for the indication of versions belonging to the same dataset/file, thus using opaque strings and not including branding like "figshare", or in our case "dataverse" or whatever. See the recommendations in section 2.2, Syntax of a DOI name, in the DOI Handbook.
As we did for file PIDs, I'd suggest we enable both ways to configure this: either sequential numbers or a random string (after a "." to identify that it's a version): or: https://doi.org/10.11111/XYZABC.DEFGHI (I'm more wary of making the "." configurable, as I'd prefer the consistency of meaning there, plus not allowing "/", which already means file.)
@philippconzett I don't think the proposal by @Danny-dK was about incorporating something like this. Plus, Dataverse is already configurable to do this if desired via the shoulder setting.
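The two suffix styles could look roughly like the sketch below. The style names `"sequential"` and `"random"`, the suffix length, and the alphabet are assumptions made for this example; they are not existing Dataverse settings.

```python
import secrets
import string
from typing import Iterable

# Assumed alphabet for opaque suffixes: uppercase letters and digits.
ALPHABET = string.ascii_uppercase + string.digits


def next_version_suffix(existing: Iterable[str], style: str = "sequential",
                        length: int = 6) -> str:
    """Return a new version suffix that is not yet in `existing`."""
    taken = set(existing)
    if style == "sequential":
        # Count upwards until a free number is found.
        number = len(taken) + 1
        while str(number) in taken:
            number += 1
        return str(number)
    if style == "random":
        # Draw opaque strings until one is unused.
        while True:
            candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
            if candidate not in taken:
                return candidate
    raise ValueError(f"unknown suffix style: {style!r}")


def mint_version_pid(dataset_pid: str, existing: Iterable[str],
                     style: str = "sequential") -> str:
    """Join dataset PID and suffix with the fixed '.' version separator."""
    return f"{dataset_pid}.{next_version_suffix(existing, style)}"


if __name__ == "__main__":
    print(mint_version_pid("doi:10.11111/XYZABC", ["1"]))
    print(mint_version_pid("doi:10.11111/XYZABC", [], style="random"))
```

Keeping the "." hard-coded while making only the suffix style selectable mirrors the preference stated above for a consistent separator meaning.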
I just found https://doi.org/10.1371/journal.pbio.2001414 and think their statement on including versioning in the PID is valuable for this discussion (see "Lesson 6. Implement a version-management policy"):
Which would mean: we should create version PIDs only for "major" changes, not for "minor" ones. Is our change history accessible via API? I also like their approach with the "version after dot". It might still be worth the "trouble" to make the separator character configurable, though.
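A major-only policy could be sketched as follows, assuming version numbers of the form `major.minor` as in the 3.0 and 4.1 examples earlier in the thread. The function names and the `include_minor` switch are hypothetical.

```python
from typing import List


def is_major_version(version_number: str) -> bool:
    """True for versions like '3.0', False for versions like '4.1'."""
    _major, minor = version_number.split(".")
    return int(minor) == 0


def versions_needing_pid(version_numbers: List[str],
                         include_minor: bool = False) -> List[str]:
    """Select the versions that would get their own PID under the policy."""
    if include_minor:
        return list(version_numbers)
    return [v for v in version_numbers if is_major_version(v)]


if __name__ == "__main__":
    history = ["1.0", "1.1", "2.0", "2.1", "2.2", "3.0"]
    print(versions_needing_pid(history))
```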
This commit adds a new scope and setting to the JvmSettings, enabling the configuration of different modes for Dataset Version PIDs. These modes are defined in VersionPidMode. A test ensures they can be parsed. In addition, VersionPidMode also contains a fine-grained option to change the conduct of Dataverse collections and their datasets for these PIDs.
…d conduct IQSS#4499 This commit adds a public method DataverseServiceBean.wantsDatasetVersionPids() that will determine how to deal with a dataset version (which belongs to a dataset that lives within a collection) in terms of "should a PID be registered/updated?". The background is: when a dataset is published, there will be the context of the owning Dataverse collection. It's important to take into account the configured conduct for the collection in the decision how to go ahead with a version's PID.
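The kind of decision described in this commit message can be illustrated with a small sketch. It is not the code from the pull request: the mode values (`"off"`, `"global"`, `"collection"`), the conduct values (`"inherit"`, `"skip"`, `"active"`), and the fallback at the root are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Collection:
    alias: str
    conduct: str = "inherit"  # hypothetical values: "inherit", "skip", "active"
    parent: Optional["Collection"] = None


def wants_dataset_version_pids(collection: Collection, mode: str) -> bool:
    """Decide whether a version published in `collection` should get a PID."""
    if mode == "off":
        return False
    if mode == "global":
        return True
    if mode != "collection":
        raise ValueError(f"unknown mode: {mode!r}")
    # Walk up the collection tree until a collection states its own conduct.
    current: Optional[Collection] = collection
    while current is not None:
        if current.conduct != "inherit":
            return current.conduct == "active"
        current = current.parent
    # Assumption: if nothing is configured up to the root, do not register.
    return False


if __name__ == "__main__":
    root = Collection("root", conduct="active")
    lab = Collection("lab", parent=root)
    archive = Collection("archive", conduct="skip", parent=root)
    print(wants_dataset_version_pids(lab, "collection"))
    print(wants_dataset_version_pids(archive, "collection"))
```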
Just a heads-up for anyone interested: work on this has been started within #9462.
For a first set of attributes (name, alias, description and, most important for the PR about IQSS#4499, the dataset version PID conduct), make an endpoint available that allows changes via simple PUT HTTP commands.
…QSS#4499 This commit adds the methods necessary to be implemented by any provider that supports version PIDs. By default, the interface will ensure that exceptions are thrown about unsupported actions, when the methods are not implemented. Extensive JavaDoc describes details.
2024/09/16: Keeping, currently in Dataverse backlog.
Following up on the thread on the google group:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/dataverse-community/34E9foKnxQs
User story
As a researcher, I want to be able to cite, using a permanent identifier, a specific version of a dataset to avoid any ambiguity and to make the citation machine-actionable.
How this should probably look
Zenodo would be a good template here.
Example
https://doi.org/10.5281/zenodo.1041767 is the generic dataset DOI, always pointing to the most recent version, with 10.5281/zenodo.1188752 being the DOI for the second (current) version and 10.5281/zenodo.1041768 the DOI for the first version.
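The relationship in this example can be written down as a small lookup, using the three DOIs quoted above. The data structure and function names are illustrative assumptions and do not reflect Zenodo's or Dataverse's actual APIs.

```python
from typing import Dict, List

# Generic ("concept") DOI mapped to its version DOIs, oldest first.
VERSIONS: Dict[str, List[str]] = {
    "10.5281/zenodo.1041767": [
        "10.5281/zenodo.1041768",  # first version
        "10.5281/zenodo.1188752",  # second (current) version
    ],
}


def resolve(doi: str) -> str:
    """Map a generic DOI to its latest version; version DOIs map to themselves."""
    if doi in VERSIONS:
        return VERSIONS[doi][-1]
    for version_dois in VERSIONS.values():
        if doi in version_dois:
            return doi
    raise KeyError(doi)


def generic_doi_of(version_doi: str) -> str:
    """Find the generic DOI that a version DOI belongs to."""
    for generic, version_dois in VERSIONS.items():
        if version_doi in version_dois:
            return generic
    raise KeyError(version_doi)


if __name__ == "__main__":
    print(resolve("10.5281/zenodo.1041767"))
    print(generic_doi_of("10.5281/zenodo.1041768"))
```

Because the version DOIs share no visible pattern with the generic DOI, the reverse lookup needs the explicit table; a suffix scheme would make that relationship readable from the identifier itself.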
Relationship to other features
File DOIs:
File DOIs are great (and by themselves necessary), but they are not a replacement for dataset version DOIs. Datasets are often made up of multiple files. An analysis script (which itself may be part of the data and thus versioned) may point to multiple files in a dataset. It's not feasible (or desirable) for a researcher to include the DOI for every file used in a citation. They want to point to one single DOI to reference the exact data (made up of multiple files) they've been using. This is equally true for quantitative and qualitative data, btw.
File DOIs of files that are not changed between versions should remain stable (otherwise there's a potential to massively inflate the number of DOIs issued for no good reason).
UNFs
UNFs (at least for tabular data) help ensure that a cited dataset is the one being used, so they avoid using incorrect versions, but they do not function as identifiers, i.e. UNF:6:edx+kB6SY2N3Zt9OsUbp4A== tells me nothing about where to find that dataset.