Add support for OAI-harvesting from DataCite #10909
Comments
Item 1. is also true for other repositories, and would greatly enhance Dataverse's harvesting capacity 😃
There is an extra issue @scolapasta pointed out in #10937 that I'm adding as task 3. here - in the current scheme of things the set name is stored in the database as a |
Hi Folks, Here's an example: |
…arvested datasets. #10909. (that whole block of extra checks on the harvest "style" may be redundant by now - I'll think about it)
…p, since it's already got a script with .2 in the name. #10909
DataCite maintains an OAI server (https://oai.datacite.org/oai) serving records for every DOI they have registered. There is a lot of interest in being able to harvest from them (since these are all registered DOIs, they will be redirecting to the original archival location of the actual studies/datasets etc.)
There are a couple of issues that must be addressed before our OAI client implementation is able to do that.

Our import code currently relies on the `<dc:identifier>` field of the oai_dc record. DataCite, however, does not include the main DOI in the oai_dc: since they use these DOIs as the OAI identifiers as well, they assume it is enough to include them in the OAI record header, in the `<identifier>` field. Without the `<dc:identifier>`, our code in its current form fails to import such records.

All that needs to be done is to add some logic to use the identifier from the OAI header in situations like this. (We actually used to do that in one of the previous iterations of the harvester.)
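The header fallback described above could look roughly like the following. This is a minimal Python sketch, not the Dataverse implementation (which is Java); the function name `resolve_identifier` and the sample record are made up for illustration, and the `doi:` form of the header identifier is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Standard OAI-PMH, oai_dc and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up record for illustration only; not copied from the DataCite server.
# Note that the metadata section carries no <dc:identifier>.
SAMPLE_RECORD = """\
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>doi:10.7910/DVN/TJCLKP</identifier>
    <datestamp>2024-01-01T00:00:00Z</datestamp>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Example dataset</dc:title>
    </oai_dc:dc>
  </metadata>
</record>
"""


def resolve_identifier(record_xml: str) -> Optional[str]:
    """Prefer <dc:identifier>; fall back to the OAI header <identifier>."""
    record = ET.fromstring(record_xml)
    # First choice: any non-empty <dc:identifier> in the metadata section.
    for element in record.findall(".//dc:identifier", NS):
        text = (element.text or "").strip()
        if text:
            return text
    # Fallback: the identifier from the OAI record header.
    header_id = record.findtext(
        "oai:header/oai:identifier", default="", namespaces=NS
    )
    return header_id.strip() or None


print(resolve_identifier(SAMPLE_RECORD))
```

Whether the fallback should apply to every harvesting client or only to clients explicitly configured for it is a separate design question that the sketch does not settle.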
Example:
Now you can harvest this "set" made up of one dataset above, as in
https://oai.datacite.org/oai?verb=ListRecords&metadataPrefix=oai_dc&set=~ZG9pJTNBMTAuNzkxMC9EVk4vVEpDTEtQCg==
Unfortunately, for whatever reason, the above notation only works in `ListRecords`, but not in `ListIdentifiers`, which is what Dataverse actually uses. From talking to DataCite, they may be able to fix it eventually, but not in an instant, "oh yeah, we just had this one line commented out" way.
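The set name in the URL above decodes (as Base64) to `doi%3A10.7910/DVN/TJCLKP` followed by a newline, so it appears to be a URL-encoded query, Base64-encoded and prefixed with `~`. The Python sketch below rebuilds the same set name and URL under that assumption. The function names are hypothetical, and the trailing newline is reproduced only because the example contains it; whether the server requires it is not something this sketch establishes.

```python
import base64
from urllib.parse import quote, urlencode

OAI_BASE_URL = "https://oai.datacite.org/oai"


def build_query_set(query: str) -> str:
    """Build a '~'-prefixed set name from a query such as 'doi:10.7910/DVN/TJCLKP'.

    Assumption inferred from the example URL: the query is URL-encoded,
    a trailing newline is appended, and the result is Base64-encoded.
    """
    encoded_query = quote(query)  # ':' becomes %3A, '/' is left as-is
    payload = (encoded_query + "\n").encode("ascii")
    return "~" + base64.b64encode(payload).decode("ascii")


def list_records_url(set_spec: str, metadata_prefix: str = "oai_dc") -> str:
    """Assemble a ListRecords request URL for the given set."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    }
    # Keep '~' and '=' literal so the set name reads as in the example above.
    return OAI_BASE_URL + "?" + urlencode(params, safe="~=")


set_spec = build_query_set("doi:10.7910/DVN/TJCLKP")
print(set_spec)
print(list_records_url(set_spec))
```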
We should go ahead and implement support for harvesting using `ListRecords`. It should be faster, if nothing else: we currently handle harvesting via `ListIdentifiers` then `GetRecord`, one record at a time, for various historical reasons. It may also come in handy in other situations to have both modes supported (and configurable, per client maybe?).

Clearly, we don't want to touch the current, JSF-based harvesting clients UI. But making the changes above in the import and harvesting back end code, and then making it possible to set up or configure a client via the `/api/harvest/clients` API to take advantage of these improvements, should be both useful and sufficient.
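To make the difference between the two harvesting modes concrete, here is a rough Python sketch of both, following the standard OAI-PMH paging rules (a follow-up request carries only the verb and the `resumptionToken`). It is not the Dataverse code: `fetch` is a hypothetical callable that takes the request parameters and returns the response XML, and all function names are invented for the example.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical transport: request parameters in, response XML text out.
Fetch = Callable[[Dict[str, str]], str]


def _paged(fetch: Fetch, verb: str, item_tag: str,
           first_params: Dict[str, str]) -> Iterator[ET.Element]:
    """Yield items from a list verb, following resumption tokens."""
    params = {"verb": verb, **first_params}
    while True:
        root = ET.fromstring(fetch(params))
        container = root.find(OAI + verb)
        if container is None:  # e.g. an <error> response
            return
        yield from container.findall(OAI + item_tag)
        token = container.findtext(OAI + "resumptionToken", default="").strip()
        if not token:
            return
        # Follow-up requests carry only the verb and the token.
        params = {"verb": verb, "resumptionToken": token}


def harvest_with_list_records(fetch: Fetch, prefix: str,
                              set_spec: str) -> Iterator[ET.Element]:
    """One request per page; each page already contains full records."""
    yield from _paged(fetch, "ListRecords", "record",
                      {"metadataPrefix": prefix, "set": set_spec})


def harvest_with_get_record(fetch: Fetch, prefix: str,
                            set_spec: str) -> Iterator[ET.Element]:
    """List the identifiers first, then fetch each record individually."""
    headers = _paged(fetch, "ListIdentifiers", "header",
                     {"metadataPrefix": prefix, "set": set_spec})
    for header in headers:
        identifier = header.findtext(OAI + "identifier")
        root = ET.fromstring(fetch({"verb": "GetRecord",
                                    "metadataPrefix": prefix,
                                    "identifier": identifier}))
        record = root.find(OAI + "GetRecord/" + OAI + "record")
        if record is not None:
            yield record
```

With N records spread over P pages, the first mode issues P requests, while the second issues P requests for the identifiers plus N more for the records. Making the mode a per-client setting would let a client fall back to `ListIdentifiers` plus `GetRecord` for servers where `ListRecords` misbehaves.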