
Resend metadata to PID providers when metadata schema used to register PIDs is modified #5144

Open
jggautier opened this issue Oct 5, 2018 · 27 comments
Labels
Feature: Metadata · GREI Year 3 (Year 3 GREI task) · GREI 2 Consistent Metadata · Size: 30 (a percentage of a sprint; 21 hours; formerly size:33) · Type: Suggestion (an idea)

Comments

jggautier (Contributor) commented Oct 5, 2018

During discussion of GitHub issue #5060, the team agreed to open a separate issue about resending metadata -- which Dataverse had already sent when registering persistent IDs for published datasets and files -- whenever Dataverse changes the metadata schema it uses to register those PIDs.

Currently (as of Dataverse 4.9.4), Dataverse should be sending new metadata to PID providers when:

  • a new dataset is published
  • a new version of an existing dataset is published

It's not sending updated metadata for already-published datasets unless new versions of those datasets are published. For example, when Dataverse adds related publication information (i.e. the relationship between a dataset and articles) to its DataCite metadata, DataCite will get this new metadata only for newly published datasets and for newly published versions of already-published datasets. But DataCite won't know about the related publications of already-published datasets for which new versions will never be published.

In this example, the metadata that DataCite has for all Dataverse datasets will need to be updated, even for already published datasets that won't be getting a new version.

qqmyers (Member) commented Jan 24, 2019

FWIW, the /modifyRegistrationPIDMetadataAll and {id}/modifyRegistrationMetadata API calls provide a way to do this that could be included in release instructions. As of now, they send an update whether or not one is needed, so DataCite sees a new update date. For QDR, I've modified these calls to check the existing metadata and targetUrl and to submit an update only if there is a difference. Would that change be a useful contribution?
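
For reference, a sketch of what invoking these two calls could look like (endpoint paths as named in the comment above, with the single-dataset call shown in its persistent-identifier form; the server URL, API token, and DOI are placeholders, so confirm the exact paths against your Dataverse version's API guide). The sketch prints the curl commands rather than running them:

```shell
#!/bin/sh
# Placeholders -- adjust for your installation:
SERVER="https://dataverse.example.edu"
API_TOKEN="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
PID="doi:10.7910/DVN/AYXQIJ"

# Resend PID metadata for a single dataset:
single_cmd="curl -X POST -H \"X-Dataverse-key:$API_TOKEN\" \"$SERVER/api/datasets/:persistentId/modifyRegistrationMetadata?persistentId=$PID\""

# Resend for every published dataset (superuser only):
all_cmd="curl -X POST -H \"X-Dataverse-key:$API_TOKEN\" \"$SERVER/api/datasets/modifyRegistrationPIDMetadataAll\""

# Printed rather than executed, so the sketch stays side-effect free:
echo "$single_cmd"
echo "$all_cmd"
```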

pdurbin (Member) commented Jan 25, 2019

@qqmyers to me it sounds like a useful contribution. If it's not too much effort for you to create a pull request, please go ahead.

qqmyers (Member) commented Jan 29, 2019

@pdurbin - looking into it. I've also realized that these API calls are pre-file-PID and don't handle data file updates...

qqmyers (Member) commented Jan 31, 2019

@pdurbin - false alarm w.r.t. file PIDs - the APIs don't check files, but the commands they call do recurse through all files. So I created a separate issue #5505 and will submit a PR. I referenced this issue there.

jggautier (Contributor, Author) commented Feb 14, 2022

A possible consequence of this issue came up last week, when a depositor reported a problem with DataMonitor, an Elsevier product that harvests Dataverse repository metadata from DataCite. DataMonitor is sometimes unable to determine which files are part of which datasets, because some of the metadata that DataCite has about datasets and files in Dataverse repositories doesn't include relationTypes.

DataMonitor somehow uses those relationTypes in the DataCite metadata to allow its users to filter files and datasets when searching for data. (This reminded me of the GitHub issue at #5086.)

In the dataset record that DataCite has at https://search.datacite.org/works/10.7910/dvn/ayxqij and in the records DataCite has for that dataset's files (e.g. https://search.datacite.org/works/10.7910/dvn/ayxqij/6pw7rz), the DataCite XML available on those pages includes relationTypes that indicate which files are part of the dataset. I think that's because that dataset was published on the Harvard Dataverse Repository after the repository started using a Dataverse software update that adds relationTypes to the metadata it sends to DataCite when registering DOIs for datasets and files.

In DataCite's records for the dataset at https://search.datacite.org/works/10.7910/dvn/ai2oxs and for its 105 files (e.g. https://search.datacite.org/works/10.7910/dvn/ai2oxs/pkvu06), the DataCite XML available on those pages doesn't include those relationTypes. I think that's because that dataset was published before the Harvard Dataverse Repository was updated to add relationTypes to the metadata it sends.

It looks like #5505 would also need to be resolved if we're going to use APIs to send updated metadata to DataCite, which would include relationTypes for datasets and files that have DOIs.

pdurbin (Member) commented Feb 14, 2022

It looks like #5505 is about only sending metadata updates to DataCite when there's something new, but the ability to re-send updates at all was added in pull request #5179 and documented at https://guides.dataverse.org/en/5.9/admin/dataverses-datasets.html#send-dataset-metadata-to-pid-provider

So if we want to update a single record like doi:10.7910/DVN/AYXQIJ we could do that now.

jggautier (Contributor, Author) commented

Cool! So if I gathered a list of dataset and file DOIs in a repository, like Harvard's, for which DataCite needed updated metadata, I could use that API endpoint on each DOI? Maybe I could figure out which datasets and files with DOIs in the Harvard repo were published or updated before relationTypes were added to the metadata that's sent to DataCite, then write a script to send the new metadata for those datasets and files to DataCite.
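
Such a script could look like the following dry-run sketch, which assumes the "Send Dataset metadata to PID provider" endpoint mentioned above; the DOI list, server, and token are hypothetical, and the loop writes the curl commands to a file for review rather than executing them:

```shell
#!/bin/sh
SERVER="https://dataverse.example.edu"   # placeholder
API_TOKEN="XXXXXXXX"                     # placeholder superuser API token

# A tiny inline stand-in for the real list of affected persistent IDs:
cat > dois.txt <<'EOF'
doi:10.7910/DVN/AI2OXS
doi:10.7910/DVN/AI2OXS/PKVU06
EOF

# Generate one resend command per DOI; review resend.sh, then run it with sh.
: > resend.sh
while read -r pid; do
  echo "curl -X POST -H \"X-Dataverse-key:$API_TOKEN\" \"$SERVER/api/datasets/:persistentId/modifyRegistrationMetadata?persistentId=$pid\"" >> resend.sh
done < dois.txt
cat resend.sh
```

Adding a short sleep between calls would keep the load on DataCite gentle if the real list runs to thousands of DOIs.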

It sounds like #5505 would do the work of figuring out which dataset and file metadata needs to be updated in DataCite's database and then send the updated metadata. Is that right, too?

pdurbin (Member) commented Feb 15, 2022

Cool! So if I gathered a list of dataset and file DOIs in a repository, like Harvard's, for which DataCite needed updated metadata, I could use that API endpoint on each DOI?

Yep, should work.

It sounds like #5505 would do the work of figuring out which dataset and file metadata needs to be updated in DataCite's database and then send the updated metadata. Is that right, too?

Right, that's the idea.

sbarbosadataverse commented

Ceilyn and Sonia prioritized this issue and moved it to sprint ready. @jggautier @scolapasta

cmbz commented Apr 29, 2024

@pdurbin The final item in this issue is to make certain that the dev guide is updated to indicate that, when the metadata exporter is changed, the release notes should let those updating their Dataverse software know that they need to apply those changes to the exports of datasets that were published before the exporter was changed. Then this issue can be closed.

pdurbin (Member) commented Apr 29, 2024

Sounds fine. It doesn't really fit into https://guides.dataverse.org/en/6.2/developers/making-releases.html#write-release-notes as written, but I'm sure we'll figure something out.

cmbz added the "GREI Year 3" (Year 3 GREI task) label on May 20, 2024
jggautier (Contributor, Author) commented Jul 16, 2024

Kelly Stathis from DataCite let us know this week that the metadata DataCite has for about 77,000 DOIs in Harvard Dataverse is in the Schema 3 version of their metadata standard. The first page of results in this DataCite API call shows some of these DOIs, and we can paginate through the results to see them all. Although at the end of that page I see a count of 74,298, so maybe 77k was an older count?

And when DataCite deprecates Schema 3 on January 1, 2025, Harvard Dataverse won't be able to send DataCite metadata updates for the 74k+ DOIs for which DataCite still has Schema 3 metadata. I've seen only dataset DOIs, but I'm assuming some of those DOIs point to files within datasets.

GitHub issues like #7551 make me think that on January 1, 2025, Harvard Dataverse will prevent the owners of those 74k+ DOIs from creating or publishing new versions, unless Harvard Dataverse sends DataCite the metadata using Schema 4.

The dataset at https://doi.org/10.7910/DVN/BRCBFA was among the 74k+ DOIs in that API call, so apparently the metadata that DataCite had for it was in the Schema 3 version (and not in the Schema 4 version that I saw when I looked at that dataset's "DataCite" export).

I was able to use the "Send Dataset metadata to PID provider" API endpoint to resend that dataset's metadata, and that DOI was removed from the results of that API call. https://api.datacite.org/dois?query=doi:10.7910/DVN/BRCBFA also now shows schemaVersion: "http://datacite.org/schema/kernel-4"
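
A quick way to spot-check what DataCite currently holds for any one DOI is its REST API's per-DOI lookup (the /dois/{doi} route from DataCite's documented API; jq and network access are assumed for the commented step). The sketch only builds and prints the URL:

```shell
#!/bin/sh
# Build the DataCite REST API lookup URL for one DOI:
doi="10.7910/DVN/BRCBFA"
url="https://api.datacite.org/dois/$doi"
echo "$url"

# To actually run the check (network access and jq assumed):
#   curl -s "$url" | jq -r '.data.attributes.schemaVersion'
# A dataset whose metadata has been resent should report
# http://datacite.org/schema/kernel-4 rather than kernel-3.
```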

[Screenshot, 2024-07-16: DataCite API response for the DOI showing schemaVersion "http://datacite.org/schema/kernel-4"]

Resending metadata so that depositors are able to update their data seems more pressing than the other reasons we've talked about in this and other GitHub issues.

@landreev, @pdurbin, @qqmyers and anyone else who knows more about this general issue of updating the metadata that Dataverse installations export, and about the recent development work to address it: does this mean depositors could get stuck after January 1, and should we resend metadata for these 74k+ DOIs?

qqmyers (Member) commented Jul 16, 2024

No one should get stuck. Any edit/publish of a new version will send the latest DataCite schema version. To update past ones, /modifyRegistrationMetadata should work, and it would be better if you run it on all DOIs, since it will only send an update if the new XML differs from the XML at DataCite. But I think /modifyRegistration would be lighter weight if you can just call it for the ones you know are bad and that aren't Drafts (which it skips). As far as I recall, the two are basically the same under the hood except for those differences. (There are ...All variants of these API calls, but I assume it would be bad to do all of them in one go.)

landreev (Contributor) commented

Thanks, Jim. So it sounds like the plan should be to deploy 6.3 in our prod (this should happen within a few days) and then run /modifyRegistrationMetadata on the 74k+ affected datasets. Julian, I can run that batch, if you prefer.

@jggautier @qqmyers

jggautier (Contributor, Author) commented Jul 17, 2024

Thanks @qqmyers and @landreev

@landreev, yeah it would probably be easier for you to do it.

Although, since depositors won't get stuck when they try to update their datasets after Jan 1, 2025, I'm less sure about how urgent this is. Does it even need to be done for these 74k+ DOIs?

cmbz commented Jul 17, 2024

This issue has been Sprint Ready since April. Any reason it can't get picked up for our upcoming sprint? @landreev @jggautier

landreev (Contributor) commented

@cmbz It has a dependency on the prod upgrade to 6.3.
Would you be OK with handling the upgrade in the upcoming sprint (opening a local issue for it), and this DataCite update in the next?

qqmyers (Member) commented Jul 17, 2024

@jggautier FWIW - I thought DataCite was more worried about new v3 registrations being sent. That was happening at some sites because they used their DataCite account with non-Dataverse software. If the Harvard account(s) are used outside of Dataverse, making updates there might be a higher priority. That said, in addition to upping the schema version, I think we are adding more metadata, license info, etc. that wasn't in the originals, so updating older datasets would improve their findability. (You'd definitely want to do that after the proposed DataCite/OpenAIRE changes that are hopefully going into 6.4 - maybe that's a reason to delay updating right now?)

landreev (Contributor) commented

... I'm less sure about how urgent this is. Does it even need to be done for these 74k+ DOIs?

No, it doesn't seem urgent. But also seems like something we should probably do anyway, as a matter of good housekeeping.

cmbz commented Jul 17, 2024

Sounds good @landreev! I think we should just get this done as soon as we can.

jggautier (Contributor, Author) commented Jul 17, 2024

@cmbz, you also wrote in April that the final item is "to make certain that the dev guide is updated to indicate that when the metadata exporter is changed, the release notes should let those updating their Dataverse software know that they need to apply those changes to the exports of the datasets that were published before the exporter was changed. Then, this issue can be closed."

And @pdurbin replied that it "sounds fine. It doesn't really fit into https://guides.dataverse.org/en/6.2/developers/making-releases.html#write-release-notes as written but I'm sure we'll figure something out."

I think we could close this issue after that's figured out, right? How do we say in the dev guides that when a release includes changes to metadata exports, the release notes should encourage folks to update the metadata exports of datasets that were already published in their repositories?

To be honest, I mentioned this "Schema 3" issue here only because it seemed like another example of the need to make sure that when the metadata schema used to register PIDs is modified, repositories resend that metadata to PID providers. But should I or someone else create a new GitHub issue about this in the Harvard Dataverse repo? Then we can record details of that work there and not lose track of the broader goal of this GitHub issue.

pdurbin (Member) commented Jul 18, 2024

How do we say in the dev guides that when a release includes changes to metadata exports, the release notes should encourage folks to update the metadata exports of datasets that were already published in their repositories?

Well, we have an "etc" at https://guides.dataverse.org/en/6.3/developers/version-control.html#writing-release-note-snippets to stand in for any upgrade task that should be mentioned in release note snippets. Perhaps we could add an explicit bullet for "re-export all".
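
For installations acting on such a snippet, the "re-export all" step is the admin API call described in the admin guide's metadata export section (the localhost port shown is the installation default; this sketch prints the command rather than running it):

```shell
#!/bin/sh
# Re-export all dataset metadata in all formats (runs asynchronously server-side):
cmd="curl http://localhost:8080/api/admin/metadata/reExportAll"
echo "$cmd"
```

Note that this refreshes the installation's own metadata exports; resending metadata to DataCite is the separate /modifyRegistrationMetadata step discussed earlier in this thread.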

jggautier (Contributor, Author) commented Jul 18, 2024

I'm also curious how often instructions related to "re-export-all" have been included in previous release notes where a change was made to the metadata sent to PID providers. I don't remember us talking in meetings about how often this has or hasn't happened, and we haven't written about it in this GitHub issue, but I think it'll be useful to know.

@pdurbin or others, do you have a sense off-hand about how often instructions related to "re-export-all" have been included in previous release notes?

Otherwise I could take a look.

If notes for previous releases often or always included instructions, then this issue might not be resolved only by making sure that the release notes include these instructions when relevant, right? Maybe the release notes haven't been clear? Or individual steps get overlooked when different repositories upgrade through multiple Dataverse versions?

pdurbin (Member) commented Jul 19, 2024

@jggautier looks like reexport all was mentioned in 12 recent releases:

% grep -i -l reexport doc/release-notes/*        
doc/release-notes/4.16-release-notes.md
doc/release-notes/4.19-release-notes.md
doc/release-notes/4.20-release-notes.md
doc/release-notes/5.0-release-notes.md
doc/release-notes/5.1-release-notes.md
doc/release-notes/5.10-release-notes.md
doc/release-notes/5.12-release-notes.md
doc/release-notes/5.14-release-notes.md
doc/release-notes/5.6-release-notes.md
doc/release-notes/5.9-release-notes.md
doc/release-notes/6.1-release-notes.md
doc/release-notes/6.3-release-notes.md

jggautier (Contributor, Author) commented Jul 19, 2024

This is awesome! Thanks @pdurbin!

Seeing a list like this makes me think that re-export instructions are already almost always included in release notes 🥳. But maybe that's wrong and there have been releases that include changes to the metadata schema used to send metadata to PID providers and whose release notes don't include re-export instructions.

To feel more confident that changes to the dev guide will result in updated metadata more often being sent to PID providers when the metadata schema is changed, maybe we can:

  • Find Dataverse releases where the metadata schema used to send metadata to PID providers was changed, focusing on DataCite metadata schema changes
  • See if those release's release notes include re-export instructions
  • Look for and share cases where Dataverse installations have not followed those re-export instructions. Over the years I've mostly looked at DataCite metadata sent by Harvard Dataverse, like what I wrote in early 2022 and what I wrote earlier this week. But that's just one data point, so to speak. If folks from other installations updated their installation to a release whose release notes included re-export instructions, but we can tell from the metadata they've sent to DataCite that those instructions weren't followed, we could ask them why.

cmbz commented Aug 20, 2024

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

cmbz closed this as completed Aug 20, 2024
cmbz commented Aug 23, 2024

2024/08/23: Reopened. Connected to GREI and already sized and prioritized.

Projects: SPRINT READY · Done
Development: no branches or pull requests
6 participants