Unable to create harvesting client for SRDA repository #7624

Closed
jggautier opened this issue Feb 17, 2021 · 21 comments
Assignees
Labels
Feature: Harvesting pm.GREI-d-2.4.1B NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues Size: 10 A percentage of a sprint. 7 hours.

Comments

@jggautier
Contributor

jggautier commented Feb 17, 2021

I'm unable to create harvesting clients in the Harvard Dataverse Repository and Demo Dataverse repository using SRDA's own OAI-PMH feed. Its base URL is https://srda.sinica.edu.tw/oai_pmh/oai2.php.

Identifying it works (https://srda.sinica.edu.tw/oai_pmh/oai2.php?verb=Identify), and so does listing records (https://srda.sinica.edu.tw/oai_pmh/oai2.php?verb=ListRecords&metadataPrefix=oai_dc).

But when trying to create a client, Harvard Dataverse Repository and Demo Dataverse show errors about the base URL being an "Invalid URL. Failed to establish connection and receive a valid server response."

Harvard Dataverse Repository is harvesting SRDA's records into https://dataverse.harvard.edu/dataverse/srda_harvested, using DataCite's OAI-PMH feed. SRDA's admins created their own feed and emailed the repository's support to ask that Harvard Dataverse Repository use that feed instead of the records that DataCite has.

The SRDA repository's admins are troubleshooting and leaving updates in the support email thread at https://help.hmdc.harvard.edu/Ticket/Display.html?id=287243. They have already changed the base URL and may change it again, so when this is investigated, that email thread should be checked for the latest info.
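For anyone retesting, the two requests mentioned in this comment can be rebuilt from the base URL with a few lines of Python. This is only a sketch: `oai_request_url` is a hypothetical helper name, not part of Dataverse, and the base URL is the one quoted in this comment, which may have changed since.

```python
from urllib.parse import urlencode

# Base URL as quoted in this issue; the SRDA admins may have changed it since.
BASE_URL = "https://srda.sinica.edu.tw/oai_pmh/oai2.php"


def oai_request_url(base_url, verb, **params):
    """Build the GET URL for an OAI-PMH request (hypothetical helper)."""
    query = {"verb": verb}
    query.update(params)
    return f"{base_url}?{urlencode(query)}"


# The two requests that worked when the issue was opened.
identify_url = oai_request_url(BASE_URL, "Identify")
list_records_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
```

Fetching either URL (for example with `urllib.request.urlopen`) and checking for a well-formed XML response is a quick way to confirm whether the endpoint itself is reachable, separately from whatever validation Dataverse performs when a client is created.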

@djbrooke djbrooke added the Small label Mar 10, 2021
@mreekie mreekie added the NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... label Oct 25, 2022
@mreekie mreekie removed the NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... label Nov 2, 2022
@mreekie mreekie removed the sz.Small label Jan 11, 2023
@cmbz cmbz added the pm.GREI-d-2.4.1B NIH AIM:4 YR:2 TASK:1B | 2.4.1B | (started yr1) Resolve OAI-PMH harvesting issues label Dec 18, 2023
@cmbz

cmbz commented Dec 19, 2023

2023/12/19: Prioritized during meeting on 2023/12/18. Added to Needs Sizing.

@cmbz

cmbz commented Dec 19, 2023

2023/12/19: @jggautier and @landreev will retest to see if problem still exists, then determine next steps afterwards. Also, issue should be moved to dataverse.harvard.edu.

@cmbz cmbz added the Size: 3 A percentage of a sprint. 2.1 hours. label Dec 19, 2023
@landreev
Contributor

landreev commented Dec 19, 2023

So, we have a configured SRDA harvesting client in prod. (harvesting one specific set, GESIS.SRDA). [Edit: this is us harvesting SRDA content from DataCite's OAI-PMH feed; this is mentioned in Julian's opening comment; the problem is being able to harvest from them directly]

There appears to be some content successfully harvested via this client relatively recently.

@landreev landreev self-assigned this Jan 17, 2024
@landreev
Contributor

landreev commented Jan 22, 2024

Their current working OAI endpoint appears to be https://srda.sinica.edu.tw/oai_pmh/oai. (not "/oai_pmh/oai2.php").
Creating a client with the url above appears to work. If you choose the single available set from the list (srda) however, an attempt to harvest fails with the noSetHierarchy response from their server. This is a problem on their end, for sure.
A second attempt, to create a client without selecting a set: this appears to work, I'm seeing some records being harvested: https://demo.dataverse.org/dataverse/srda/

However, you will notice that redirects to the remote locations are NOT working. ☹️
The following appears to be a problem: their OAI server is supplying the record identifiers like this:
10.6141/TW-SRDA-AN010012-1 - i.e., without the doi: prefix. This is a valid doi, and resolving it, as in https://doi.org/10.6141/TW-SRDA-AN010012-1, works. However our code appears to default to hdl: (!) - and that doesn't work of course. We just need to make this configurable on the client level, which protocol to default to when the prefix is not supplied.

I will open a new dev. issue for this. But creating a client for this repo is working just fine now, after all these years.
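The per-client default described above could look roughly like the following. This is a Python illustration of the idea only, not the Dataverse code (which is Java); `RESOLVERS`, `to_resolver_url` and the `default_protocol` parameter are hypothetical names, and the two resolver base URLs are the public DOI and Handle proxies.

```python
# Public resolver proxies for the two persistent-identifier protocols discussed here.
RESOLVERS = {
    "doi": "https://doi.org/",
    "hdl": "https://hdl.handle.net/",
}


def to_resolver_url(identifier, default_protocol="doi"):
    """Turn a harvested identifier into a resolvable URL.

    If the identifier carries a known prefix (doi: or hdl:), use it.
    Otherwise fall back to default_protocol, which in this sketch stands
    in for a setting configured per harvesting client.
    """
    protocol, sep, rest = identifier.partition(":")
    if sep and protocol.lower() in RESOLVERS:
        return RESOLVERS[protocol.lower()] + rest
    return RESOLVERS[default_protocol] + identifier


# SRDA's bare identifier resolves correctly only when the default is doi.
srda_url = to_resolver_url("10.6141/TW-SRDA-AN010012-1", default_protocol="doi")
```

With `default_protocol="hdl"` the same identifier would be sent to the Handle proxy instead, which is the failure mode described in this comment.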

@landreev
Contributor

(to be precise, there's not one, but 2 different problems that prevent the redirects from working)

@landreev
Contributor

Opened the dev. issue for the redirect issues (linked above).

@landreev landreev assigned jggautier and unassigned landreev Jan 23, 2024
@landreev
Contributor

(this one itself is a non-dev. issue, no PR associated with it, so I dragged it into "In Review" directly, asking @jggautier to take a look before we close it)

@jggautier
Contributor Author

Should this issue still be moved to the Harvard Dataverse repo?

It sounds like we should let the SRDA folks know that when we try to harvest their srda set, it fails with "the noSetHierarchy response from their server". Is that right? I'd be happy to email them to let them know that this is preventing us from harvesting from that set and ask if they can look into it.

Since we're able to harvest from them when we don't specify a set, I wonder if we can do that instead. I can also ask them if we can do that.

@landreev
Contributor

Since we are closing this issue, idk if it's worth moving it to the local repo - but, up to you.

Yes, we should just harvest from them without specifying the set. That's what their server supports. The only, minor problem on their end is that their server is for whatever reason advertising this unsupported set under ListSets. I mentioned it just to warn you not to select it when configuring the client. You may want to let them know. But no, it isn't preventing us from harvesting from them.
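For reference, an OAI-PMH server reports this kind of failure as an `<error>` element with a `code` attribute inside the normal response envelope. A minimal sketch of detecting it follows, assuming a response shaped like the hand-written sample below; the sample is illustrative, not a captured SRDA response, and `oai_error_codes` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample of an error response; NOT an actual SRDA response.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2024-01-22T00:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc" set="srda">https://srda.sinica.edu.tw/oai_pmh/oai</request>
  <error code="noSetHierarchy">Sets are not supported</error>
</OAI-PMH>
"""


def oai_error_codes(xml_text):
    """Return the list of error codes found in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [err.get("code") for err in root.findall("oai:error", OAI_NS)]


codes = oai_error_codes(SAMPLE_RESPONSE)
set_unsupported = "noSetHierarchy" in codes
```

A check like this, run against a `ListRecords` request with the `set` argument, would show whether a set advertised under `ListSets` can actually be harvested before it is selected in a client.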

@landreev
Contributor

BTW, why was it important to harvest from them directly - as opposed to harvesting their records from Datacite, as set up in prod.? Unfortunately, the records in prod. harvested via that client are not properly redirecting at the moment - but that's because of the bug that I opened #10254 for (and I'm really hoping to fix it asap).

Is the content expected to be different, between what we get from their own OAI vs. Datacite? (it looks like there are different numbers of records served between the two).

@jggautier
Contributor Author

I assumed that they created their own harvesting server and emailed us to avoid the issue(s) that Dataverse used to have with harvesting sets from DataCite. In our emails with them I pointed out that issue, but I didn't ask them explicitly why they want us to harvest from their own OAI. I also wondered if they wanted us to harvest from their own OAI because they wanted more control over what we harvested.

I've been planning to email them again with our progress. Want me to ask them why exactly they'd like us to harvest from their OAI instead of from DataCite?

@landreev
Contributor

I was just curious, really. It doesn't really matter.
We can harvest either way, direct or via Datacite. And, once #10254 is addressed, the harvested records will even be useful/usable (as in, our users will be able to get to their site by clicking on the search records).

It's up to you - but maybe we should wait for it to be fixed before contacting them? So that we could show them harvested and working records, even if it's on one of our test servers - otherwise it just doesn't feel like "progress", when the harvested records are all broken - ?

@jggautier
Contributor Author

Yeah, I agree. I could email them to let them know that Harvard Dataverse isn't able to harvest from the set they asked us to harvest from and ask them why they'd like us to harvest from that set, as opposed to harvesting everything in the repository by not specifying a set, which will work once #10254 is addressed.

@landreev
Contributor

#10254 (comment)

@jggautier
Contributor Author

I emailed the folks at SRDA to let them know that a problem on their side prevents Harvard Dataverse from harvesting from their srda set and that Harvard Dataverse is able to harvest all of their metadata without specifying a set. I also asked whether they would like Harvard Dataverse to harvest from that srda set or to harvest without specifying a set.

@landreev
Contributor

I'd like to close this issue (for accounting purposes; I will also resize to 10, since we have put more work into it this week). Would you mind creating a new issue in the local repo, something like "Harvest metadata from SRDA", to keep track of the remaining effort? (or I can create it there)

@landreev landreev added Size: 10 A percentage of a sprint. 7 hours. and removed Size: 3 A percentage of a sprint. 2.1 hours. labels Jan 26, 2024
@jggautier
Contributor Author

jggautier commented Jan 26, 2024

Great, closing this issue sounds okay to me since SRDA folks let us know yesterday that it's fine for Harvard Dataverse to harvest from them without specifying a set and you wrote that Harvard Dataverse is able to do that.

I'll close this issue with this comment, adjust the harvesting client for SRDA, and start the re-harvesting today.

There's more info in our email thread with the folks at SRDA about what's going on with that srda set and why they're recommending using their OAI instead of harvesting from DataCite.

@landreev
Contributor

Great, thanks.
As for starting a new harvest, I would wait until 6.1 is in prod.

@jggautier
Contributor Author

Ah okay. Although I already edited the harvesting client and told Harvard Dataverse to re-harvest.

Why would you wait until 6.1 is in prod? Is it because until #10254 is addressed, clicking on the dataset titles won't lead users to the dataset, and that fix won't be applied to metadata that was harvested before #10254 is addressed?

@landreev
Contributor

I'm going to include a quick patch into 6.1 as deployed here that will fix the redirects, yes.
It will fix all the existing records with broken redirects, so that's not a problem.
I was just suggesting to wait until the redirects are fixed.

@jggautier
Contributor Author

Ah okay. So I can create a new issue in the Harvard Dataverse repo like you suggested, to keep track of things to do to harvest SRDA's metadata.


5 participants