5050 Parse all dc identifier elements and allow identifiers that don'… #7214

JingMa87 · 2020-08-21T12:20:16Z

…t have "doi" or "hdl" in them.

What this PR does / why we need it: Allows Dataverse to harvest more datasets.

Which issue(s) this PR closes: 5050

Special notes for your reviewer: Small change, big impact!

Suggestions on how to test this: Harvest the server https://api.figshare.com/v2/oai with metadata prefix oai_dc and set portal_895. Without the fix, every record fails to import; with it, all of them are harvested.
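To look at the raw records before harvesting, the equivalent OAI-PMH request can be built by hand. This is only a sketch: the class and method names are hypothetical, and it talks to the OAI-PMH endpoint directly rather than going through Dataverse's harvesting client. The `verb`, `metadataPrefix` and `set` parameters come from the OAI-PMH protocol itself.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Dataverse: builds a ListRecords request URL.
final class OaiListRecordsUrlSketch {

    /** Returns baseUrl?verb=ListRecords&metadataPrefix=...&set=... with encoded values. */
    static URI listRecords(String baseUrl, String metadataPrefix, String set) {
        String query = "verb=ListRecords"
                + "&metadataPrefix=" + URLEncoder.encode(metadataPrefix, StandardCharsets.UTF_8)
                + "&set=" + URLEncoder.encode(set, StandardCharsets.UTF_8);
        return URI.create(baseUrl + "?" + query);
    }

    public static void main(String[] args) {
        // Server, prefix and set are the ones named in the test suggestion above.
        System.out.println(listRecords("https://api.figshare.com/v2/oai", "oai_dc", "portal_895"));
    }
}
```

Opening the printed URL in a browser shows the `<dc:identifier>` elements that each record carries, which is what this PR changes the handling of.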

Does this PR introduce a user interface change? If mockups are available, please link/include them here: No

Is there a release notes update needed for this change?: Yes

Additional documentation:


JingMa87 · 2020-08-26T21:19:00Z

@pdurbin @jggautier Ready for review!

jggautier · 2020-08-30T01:34:35Z

Works great from what I can tell! Spun up the branch and just saw that records from sets from Zenodo and Figshare were all harvested successfully.

Harvested a set from DANS (EASY) and 7 of 86 records were harvested. The other 79 failed. The set was "D10000:D18000".

Harvested ICPSR and about a third of the 10k records failed. Can't tell why, but I think another GitHub issue should be opened for this.

JingMa87 · 2020-08-30T14:36:26Z

@jggautier Great to hear! In the case of EASY there are forthcoming issues with the controlled vocabulary. The EASY datasets have dc language values of "en", "fr", "de", but those are not allowed in Dataverse, which supports values like "English", "German". What is the reason for this restriction actually?

jggautier · 2020-08-31T16:02:19Z

@JingMa87 I think the reason for the restriction is that Dataverse hasn't implemented a method for adding to the metadata exports the 2-3 letter code of the language that the depositor chooses from the Language controlled vocabulary. Dataverse simply adds to the metadata export the value that the depositor sees in the metadata form.

I think we agree that it would be preferable if Dataverse shows the depositor the full language, e.g. "English", but uses the corresponding ISO codes in the metadata exports, e.g. <dc:language>en</dc:language> for the OAI-PMH feed and <dcterms:language>en</dcterms:language> for the DC Terms metadata available through the metadata export dropdown and the API.

Then the other way around, on import, Dataverse will need to know that <dc:language>en</dc:language> means that the Language value it should display on the dataset page UI is "English" (for repositories in English)

Does internationalization need to be considered? E.g. when dataset metadata is imported into an installation with "Spanish" internationalization, <dcterms:language>en</dcterms:language> should be displayed on the dataset page as the localized value "Inglés"?

I know the community's discussed this before, but I can't find a GitHub issue about it. Could you create a GitHub issue (if you're not already planning to)?
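The two-way mapping described in this comment could look roughly like the sketch below. It uses the JDK's own language names via `java.util.Locale` as a stand-in; whether those names line up with Dataverse's Language controlled vocabulary is an assumption, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: JDK language names stand in for the controlled vocabulary.
final class LanguageCodeSketch {

    /** Import direction: ISO 639 code -> name to display, in the installation's language. */
    static String displayName(String isoCode, Locale uiLocale) {
        return Locale.forLanguageTag(isoCode).getDisplayLanguage(uiLocale);
    }

    /** Export direction: English display name -> two-letter ISO 639 code. */
    static Map<String, String> englishNameToCode() {
        Map<String, String> nameToCode = new HashMap<>();
        for (String code : Locale.getISOLanguages()) {
            // putIfAbsent keeps the first code if two codes share a display name.
            nameToCode.putIfAbsent(displayName(code, Locale.ENGLISH), code);
        }
        return nameToCode;
    }

    public static void main(String[] args) {
        System.out.println(displayName("en", Locale.ENGLISH));              // name for an English UI
        System.out.println(displayName("en", Locale.forLanguageTag("es"))); // name for a Spanish UI
        System.out.println(englishNameToCode().get("German"));              // code to write on export
    }
}
```

Passing the installation's locale to `displayName` is what would cover the localized case raised above; the export direction only needs the reverse lookup.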

JingMa87 · 2020-08-31T18:01:00Z

@jggautier Here's the issue: #7243. It shouldn't have an impact on this PR.

djbrooke · 2020-09-17T15:21:03Z

@jggautier @JingMa87 - is this ready for Code Review? Apologies for the delay.

jggautier · 2020-09-17T15:25:29Z

I think it is!

JingMa87 · 2020-09-17T16:01:40Z

@djbrooke Yes it is!

landreev

I have made one small change request. Looks good otherwise!
Sorry again for the delay with this PR; we ended up ignoring some PRs and issues because of all the v5.0 related tasks.

landreev · 2020-09-28T20:07:34Z

src/main/java/edu/harvard/iq/dataverse/api/imports/ImportGenericServiceBean.java

+        if (!otherIds.isEmpty()) {
+            // We prefer doi or hdl identifiers like "doi:10.7910/DVN/1HE30F"
+            for (String otherId : otherIds) {
+                if (otherId.contains(GlobalId.DOI_PROTOCOL) || otherId.contains(GlobalId.HDL_PROTOCOL)) {


Should this be if (otherId.startsWith(GlobalId.DOI_PROTOCOL) ... instead?
Or maybe even if (otherId.toLowerCase().startsWith(GlobalId.DOI_PROTOCOL) ...?

@landreev The identifier could also be "https://doi.org/10.7910/DVN/1HE30F". I made the identifier into lowercase just in case.

@JingMa87
Looking at your last commits, it's still looking like your code is doing "if ... contains()"... I still think it should be "if ... startsWith()" instead. Or it will just assume that any identifier that happens to contain the characters "hdl" is a handle, no?
And yes, it's a good idea to check for the "https://doi.org/10.7910/DVN/1HE30F" form as well. Please note that we also have GlobalId.DOI_RESOLVER_URL and GlobalId.HDL_RESOLVER_URL defined. So maybe add .startsWith() for these too?
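A minimal sketch of the check being asked for here, to show why `startsWith()` is safer than `contains()`. The constants are local stand-ins for the `GlobalId` fields named above and their literal values are assumptions; the trailing colon on the protocol prefixes and the fall-back to the first identifier are also assumptions, not something taken from the PR.

```java
import java.util.List;

// Hypothetical stand-ins for the GlobalId constants; values are assumed, not copied from Dataverse.
final class IdentifierPrefixSketch {
    static final String DOI_PROTOCOL = "doi";
    static final String HDL_PROTOCOL = "hdl";
    static final String DOI_RESOLVER_URL = "https://doi.org/";
    static final String HDL_RESOLVER_URL = "https://hdl.handle.net/";

    /** True only when the identifier begins with a known DOI or handle form. */
    static boolean looksLikePersistentId(String otherId) {
        return otherId.startsWith(DOI_PROTOCOL + ":")
                || otherId.startsWith(HDL_PROTOCOL + ":")
                || otherId.startsWith(DOI_RESOLVER_URL)
                || otherId.startsWith(HDL_RESOLVER_URL);
    }

    /** Prefers a DOI or handle; otherwise falls back to the first identifier (assumed policy). */
    static String choosePreferred(List<String> otherIds) {
        for (String otherId : otherIds) {
            if (looksLikePersistentId(otherId)) {
                return otherId;
            }
        }
        return otherIds.isEmpty() ? null : otherIds.get(0);
    }
}
```

With `contains()`, an identifier such as `https://example.org/records/hdl-archive-42` would be treated as a handle simply because the letters "hdl" appear in it; with the prefix check it is only used as a fall-back.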

Agreed, I pushed the changes.

I'm really sorry to be difficult, I know you have other things to work on - but do we really want to permanently convert to lower case?
I only suggested it for the test... to be able to catch both "hdl:..." and "HDL:..." - but I'm not sure even that is necessary...
I don't think we want to save "doi:10.7910/DVN/1HE30F" as "doi:10.7910/dvn/1he30f" - ?
Let's just remove the otherId = otherId.toLowerCase(); line.
Thank you!
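If catching both "hdl:..." and "HDL:..." does turn out to be wanted, the comparison can ignore case without rewriting the stored value. A small sketch using `String.regionMatches`; the helper name is hypothetical and this is not what the PR ended up doing.

```java
// Hypothetical helper: case-insensitive prefix test that leaves the identifier untouched.
final class CaseInsensitivePrefixSketch {

    /** Compares only the first prefix.length() characters, ignoring case. */
    static boolean startsWithIgnoreCase(String value, String prefix) {
        return value.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    public static void main(String[] args) {
        String otherId = "DOI:10.7910/DVN/1HE30F";
        if (startsWithIgnoreCase(otherId, "doi:")) {
            // The identifier is saved exactly as harvested, upper-case suffix included.
            System.out.println(otherId);
        }
    }
}
```

Because the string is never reassigned, "doi:10.7910/DVN/1HE30F" cannot end up saved as "doi:10.7910/dvn/1he30f".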

No worries! I made the changes.

landreev · 2020-09-28T20:33:02Z

@jggautier

... Spun up the branch
...
Harvested ICPSR and about a third of the 10k records failed. Can't tell why, but I think another GitHub issue should be opened for this.

Just noticed the ICPSR part. ICPSR would be one archive from which we absolutely want to harvest DDI (and not DC). So we may not necessarily care why their DC records are failing to import (?).

qqmyers added 30 commits April 11, 2020 15:02

cleanup - add explicit type

a018055

add multipart upload api calls and add to S3StorageIO class

d314431

also refactor S3StorageIO to re-use single client per store, use more static methods

Merge remote-tracking branch 'IQSS/develop' into IQSS/6763

dbbd569

update error msgs

61448a8

try test update

cd7e345

Restore driverId check in getMainKey to pass tests

8b88b39

Nominally useful if code ever changes the storageIdentifier of the dvObject after the S3AccessIO instance is created, but real code shouldn't do that :-)

remove unused imports

7b91c07

Merge remote-tracking branch 'IQSS/develop' into IQSS/6763

6b03884

report the exception message when can't access bucket

e707519

typo on comparison

dab4af6

IQSS/6829 - account for file failures

d322809

remove ~duplicate code

6ca407c

IQSS-6829 - avoid race with 2+ files uploading

b0557aa

and the dataset pid being set multiple times.

IQSS/6829 - create mode uses DatasetPage - initialize identifier there

beb419e

for directUpload case

move directUploadEnabled to systemconfig

8400c20

used in DatasetPage and editFilesFragment.xhtml

typo/fix method calls

6f9baed

set minimum part size

f841d59

add convenience urls

f844914

add dv_status=temp tag in multipart uploads

db00c6f

return no content for delete call

f473fab

add request for upload in parts method

088efb0

handle net:ERR_NETWORK_CHANGED errors w/o 'undefined' error

be27169

remove draft code from other branch

ef6ef13

handle dataverse change - keep GlobalId in direct upload case

a666a7b

IQSS/6881 delete temp files when cancelling dataset create

4bb1b66

IQSS 6881 cleanup temp files on cancel when creating dataset

45245b3

get/set for uploadInProgress

e2d7207

actually initialize uploadInProgress

56bd6a7

limit processing on cancelCreate

40f2cd1

catch all files in direct upload cancel + debug statements

dc3f130

Use handle library to resolve handle.

89cb145

JingMa87 added 2 commits August 30, 2020 16:15

5050 Parse all dc identifier elements and allow identifiers that don'…

c84ca49

…t have "doi" or "hdl" in them.

Use handle library to resolve handle.

ab1375b

landreev requested changes Sep 28, 2020

landreev self-assigned this Sep 28, 2020

JingMa87 added 5 commits September 28, 2020 22:37

Make identifier lower case.

06554b0

Make identifier lower case.

6d5fd09

Remove unused variable.

3d0524b

Change to startsWith() and check for doi and hdl URLs.

20a1a08

Remove toLowerCase().

23a0252

landreev approved these changes Sep 29, 2020

djbrooke unassigned landreev Sep 30, 2020

kcondon self-assigned this Sep 30, 2020

kcondon merged commit 4a56ab0 into IQSS:develop Sep 30, 2020

JingMa87 deleted the 5050-broaden-allowed-dc-identifiers branch September 30, 2020 16:25

djbrooke added this to the 5.1 milestone Sep 30, 2020

jggautier mentioned this pull request Dec 16, 2020

Re-harvesting ICPSR datasets IQSS/dataverse.harvard.edu#63

Open

pdurbin added the Feature: Harvesting label Apr 13, 2022

pdurbin mentioned this pull request Apr 13, 2022

Spike: Inventory and prioritize all existing Harvesting related issues IQSS/dataverse-pm#24

Closed

3 tasks
