Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

5050 Parse all dc identifier elements and allow identifiers that don'… #7214

Merged
merged 79 commits into from
Sep 30, 2020
Merged

5050 Parse all dc identifier elements and allow identifiers that don'… #7214

merged 79 commits into from
Sep 30, 2020

Conversation

JingMa87
Copy link
Contributor

…t have "doi" or "hdl" in them.

What this PR does / why we need it: Allows dataverse to harvest more datasets

Which issue(s) this PR closes: 5050

Closes #5050

Special notes for your reviewer: Small change, big impact!

Suggestions on how to test this: Harvest the server https://api.figshare.com/v2/oai with prefix oai_dc and set portal_895. Without the fix they would all fail but now they're all harvested.

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No

Is there a release notes update needed for this change?: Yes

Additional documentation:

qqmyers added 30 commits April 11, 2020 15:02
also refactor S3StorageIO to re-use single client per store, use more
static methods
Nominally useful if code ever changes the storageIdentifier of the
dvObject after the S3AccessIO instance is created, but real code
shouldn't do that :-)
and the dataset pid being set multiple times.
used in DatasetPage and editFilesFragment.xhtml
@JingMa87
Copy link
Contributor Author

@pdurbin @jggautier Ready for review!

@jggautier
Copy link
Contributor

jggautier commented Aug 30, 2020

Works great from what I can tell! Spun up the branch and just saw that records from sets from Zenodo and Figshare were all harvested successfully.

Harvested a set from DANS (EASY) and 7 of 86 records were harvested. The other 79 failed. The set was "D10000:D18000".

Harvested ICPSR and about a third of the 10k records failed. Can't tell why, but I think another GitHub issue should be opened for this.

@JingMa87
Copy link
Contributor Author

@jggautier Great to hear! In the case of EASY there's forthcoming issues with the controlled vocabulary. The EASY datasets have dc language values of "en", fr", "de", but those are not allowed in Dataverse which supports values like "English", "German". What is the reason for this restriction actually?

@jggautier
Copy link
Contributor

@JingMa87 I think the reason for the restriction is that Dataverse hasn't implemented a method for adding to the metadata exports the 2-3 letter code of the language that the depositor chooses from the Language controlled vocabulary. Dataverse simply adds to the metadata export the value that the depositor sees in the metadata form:

Screen Shot 2020-08-31 at 11 43 11 AM

I think we agree that it would be preferable if Dataverse shows the depositor the full language, e.g. "English", but uses the corresponding ISO codes in the metadata exports, e.g. <dc:language>en</dc:language> for the OAI-PMH feed and <dcterms:language>en</dcterms:language> for the DC Terms metadata available through the metadata export dropdown and the API.

Then the other way around, on import, Dataverse will need to know that <dc:language>en</dc:language> means that the Language value it should display on the dataset page UI is "English" (for repositories in English)

Does internalization need to be considered? E.g. when dataset metadata is imported into an installation with "Spanish" internalization, <dcterms:language>en</dcterms:language> should be displayed on the dataset page as the localized value "Inglés"?

I know the community's discussed this before, but I can't find a GitHub issue about it. Could you create a GitHub issue (if you're not already planning to)?

@JingMa87
Copy link
Contributor Author

@jggautier Here's the issue: #7243. It shouldn't have an impact on this PR.

@djbrooke
Copy link
Contributor

@jggautier @JingMa87 - is this ready for Code Review? Apologies for the delay.

@jggautier
Copy link
Contributor

I think it is!

@JingMa87
Copy link
Contributor Author

@djbrooke Yes it is!

Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made one small change request. Looks good otherwise!
Sorry again for the delay with this PR; we ended up ignoring some PRs and issues because of all the v5.0 related tasks.

if (!otherIds.isEmpty()) {
// We prefer doi or hdl identifiers like "doi:10.7910/DVN/1HE30F"
for (String otherId : otherIds) {
if (otherId.contains(GlobalId.DOI_PROTOCOL) || otherId.contains(GlobalId.HDL_PROTOCOL)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be if (otherId.startsWith(GlobalId.DOI_PROTOCOL) ... instead?
Or maybe even if (otherId.toLowerCase().startsWith(GlobalId.DOI_PROTOCOL) ...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@landreev The identifier could also be "https://doi.org/10.7910/DVN/1HE30F". I made the identifier into lowercase just in case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JingMa87
Looking at your last commits, it's still looking like your code is doing "if ... contains()"... I still think it should be "if ... startsWith()" instead. Or it will just assume that any identifier that happens to contain the characters "hdl" is a handle, no?
And yes, it's a good idea to check for the "https://doi.org/10.7910/DVN/1HE30F" form as well. Please note that we also have GlobalId.DOI_RESOLVER_URL and GlobalId.HDL_RESOLVER_URL defined. So maybe add .startsWith() for these too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I pushed the changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really sorry to be difficult, I know you have other things to work on - but do we really want to permanently convert to lower case?
I only suggested it for the test... to be able to catch both "hdl:..." and "HDL:..." - but I'm not sure even that is necessary...
I don't think we want to save "doi:10.7910/DVN/1HE30F" as "doi:10.7910/dvn/1he30f" - ?
Let's just remove the otherId = otherId.toLowerCase(); line.
Thank you!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries! I made the changes.

@landreev landreev self-assigned this Sep 28, 2020
@landreev
Copy link
Contributor

@jggautier

... Spun up the branch
...
Harvested ICPSR and about a third of the 10k records failed. Can't tell why, but I think another GitHub issue should be opened for this.

Just noticed the ICPSR part. ICPSR would be one archive from which we absolutely want to harvest DDI (and not DC). So we may not necessarily care why their DC records are failing to import (?).

@kcondon kcondon self-assigned this Sep 30, 2020
@kcondon kcondon merged commit 4a56ab0 into IQSS:develop Sep 30, 2020
@JingMa87 JingMa87 deleted the 5050-broaden-allowed-dc-identifiers branch September 30, 2020 16:25
@djbrooke djbrooke added this to the 5.1 milestone Sep 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Harvesting zenodo client fail
9 participants