-
Notifications
You must be signed in to change notification settings - Fork 492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
5050 Parse all dc identifier elements and allow identifiers that don'… #7214
5050 Parse all dc identifier elements and allow identifiers that don'… #7214
Conversation
also refactor S3StorageIO to re-use single client per store, use more static methods
Nominally useful if code ever changes the storageIdentifier of the dvObject after the S3AccessIO instance is created, but real code shouldn't do that :-)
and the dataset pid being set multiple times.
for directUpload case
used in DatasetPage and editFilesFragment.xhtml
@pdurbin @jggautier Ready for review! |
Works great from what I can tell! Spun up the branch and just saw that records from sets from Zenodo and Figshare were all harvested successfully. Harvested a set from DANS (EASY) and 7 of 86 records were harvested. The other 79 failed. The set was "D10000:D18000". Harvested ICPSR and about a third of the 10k records failed. Can't tell why, but I think another GitHub issue should be opened for this. |
…t have "doi" or "hdl" in them.
@jggautier Great to hear! In the case of EASY there's forthcoming issues with the controlled vocabulary. The EASY datasets have dc language values of "en", fr", "de", but those are not allowed in Dataverse which supports values like "English", "German". What is the reason for this restriction actually? |
@JingMa87 I think the reason for the restriction is that Dataverse hasn't implemented a method for adding to the metadata exports the 2-3 letter code of the language that the depositor chooses from the Language controlled vocabulary. Dataverse simply adds to the metadata export the value that the depositor sees in the metadata form: I think we agree that it would be preferable if Dataverse shows the depositor the full language, e.g. "English", but uses the corresponding ISO codes in the metadata exports, e.g. Then the other way around, on import, Dataverse will need to know that Does internalization need to be considered? E.g. when dataset metadata is imported into an installation with "Spanish" internalization, I know the community's discussed this before, but I can't find a GitHub issue about it. Could you create a GitHub issue (if you're not already planning to)? |
@jggautier Here's the issue: #7243. It shouldn't have an impact on this PR. |
@jggautier @JingMa87 - is this ready for Code Review? Apologies for the delay. |
I think it is! |
@djbrooke Yes it is! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made one small change request. Looks good otherwise!
Sorry again for the delay with this PR; we ended up ignoring some PRs and issues because of all the v5.0 related tasks.
if (!otherIds.isEmpty()) { | ||
// We prefer doi or hdl identifiers like "doi:10.7910/DVN/1HE30F" | ||
for (String otherId : otherIds) { | ||
if (otherId.contains(GlobalId.DOI_PROTOCOL) || otherId.contains(GlobalId.HDL_PROTOCOL)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be if (otherId.startsWith(GlobalId.DOI_PROTOCOL) ...
instead?
Or maybe even if (otherId.toLowerCase().startsWith(GlobalId.DOI_PROTOCOL) ...
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@landreev The identifier could also be "https://doi.org/10.7910/DVN/1HE30F". I made the identifier into lowercase just in case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JingMa87
Looking at your last commits, it's still looking like your code is doing "if ... contains()"... I still think it should be "if ... startsWith()" instead. Or it will just assume that any identifier that happens to contain the characters "hdl" is a handle, no?
And yes, it's a good idea to check for the "https://doi.org/10.7910/DVN/1HE30F" form as well. Please note that we also have GlobalId.DOI_RESOLVER_URL and GlobalId.HDL_RESOLVER_URL defined. So maybe add .startsWith() for these too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I pushed the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really sorry to be difficult, I know you have other things to work on - but do we really want to permanently convert to lower case?
I only suggested it for the test... to be able to catch both "hdl:..." and "HDL:..." - but I'm not sure even that is necessary...
I don't think we want to save "doi:10.7910/DVN/1HE30F" as "doi:10.7910/dvn/1he30f" - ?
Let's just remove the otherId = otherId.toLowerCase();
line.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries! I made the changes.
Just noticed the ICPSR part. ICPSR would be one archive from which we absolutely want to harvest DDI (and not DC). So we may not necessarily care why their DC records are failing to import (?). |
…t have "doi" or "hdl" in them.
What this PR does / why we need it: Allows dataverse to harvest more datasets
Which issue(s) this PR closes: 5050
Closes #5050
Special notes for your reviewer: Small change, big impact!
Suggestions on how to test this: Harvest the server https://api.figshare.com/v2/oai with prefix oai_dc and set portal_895. Without the fix they would all fail but now they're all harvested.
Does this PR introduce a user interface change? If mockups are available, please link/include them here: No
Is there a release notes update needed for this change?: Yes
Additional documentation: