Error setting up harvesting client for ICPSR on UNC Dataverse and Demo Dataverse #7497

Closed
jggautier opened this issue Jan 4, 2021 · 8 comments


@jggautier
Contributor

jggautier commented Jan 4, 2021

Thu-Mai at Odum/UNC let us know today (see RT support email) that UNC's Dataverse-based repository shows the following error during the first step of creating a harvesting client using the server URL https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies:

https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies: Invalid URL. Failed to establish connection and receive a valid server response.

[Screenshot: Screen Shot 2020-12-16 at 2 23 58 PM]

Demo Dataverse reports the same error when I try to create a harvesting client using either https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies or ICPSR's "citations" Server URL (https://www.icpsr.umich.edu/icpsrweb/neutral/oai/citations).

UNC is running Dataverse version 4.16. Demo Dataverse is running 5.3.

Documentation of the two ICPSR OAI-PMH feeds is at https://www.icpsr.umich.edu/web/pages/membership/or/metdata/oai.html.

Harvard Dataverse, running 5.3, and Dataverse instances I create on AWS, do not show this error. I'm able to get through all four steps for creating a harvesting client for ICPSR.
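One way to narrow down whether the failure is on the ICPSR side or on the harvesting installation's side is to send the standard OAI-PMH `Identify` request from the affected server and check the response by hand. The following is a minimal sketch of that idea; it is not how Dataverse validates the Server URL (the issue does not say how it does), and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def identify_url(base_url):
    """Build the OAI-PMH Identify request URL for a base server URL."""
    return base_url + "?" + urlencode({"verb": "Identify"})


def is_valid_identify_response(xml_text):
    """Return True if the text parses as an OAI-PMH response with an Identify element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return root.tag == OAI_NS + "OAI-PMH" and root.find(OAI_NS + "Identify") is not None


def check_server(base_url, timeout=10):
    """Hypothetical helper: fetch the Identify response and validate it."""
    try:
        with urlopen(identify_url(base_url), timeout=timeout) as response:
            return is_valid_identify_response(response.read())
    except (OSError, ValueError):
        # Connection problems, HTTP errors, and malformed URLs all land here.
        return False
```

Running `check_server("https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies")` from the machine that hosts the affected installation would show whether that machine can reach the endpoint at all, independent of Dataverse.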

I originally reported this bug in Harvard Dataverse's GitHub repo at IQSS/dataverse.harvard.edu#63, but this issue isn't really specific to Harvard Dataverse, so I moved it here.

@donsizemore
Contributor

@jggautier I'm able to set up the client via dataverse5.odum.unc.edu but harvesting doesn't go well, throwing one of two errors for each identifier:

  <message>Exception processing getRecord(), oaiUrl=https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies, identifier=10, javax.ejb.EJBTransactionRolledbackException, Exception thrown from bean: javax.ejb.EJBException: Failed to find a global identifier in the OAI_DC XML record.</message>
  <method>logGetRecordException</method>
  <message>Exception processing getRecord(), oaiUrl=https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies, identifier=11, javax.ejb.EJBTransactionRolledbackException, Exception thrown from bean: javax.ejb.EJBException: Failed to find a global identifier in the OAI_DC XML record.</message>
  <method>logGetRecordException</method>
  <message>Exception processing getRecord(), oaiUrl=https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies, identifier=12, javax.ejb.EJBTransactionRolledbackException, Exception thrown from bean: java.lang.NullPointerException</message>
  <method>logGetRecordException</method>
  <message>Exception processing getRecord(), oaiUrl=https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies, identifier=13, javax.ejb.EJBTransactionRolledbackException, Exception thrown from bean: java.lang.NullPointerException</message>
  <method>logGetRecordException</method>
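To see which identifiers fail with which error across a whole harvest log, the `<message>` lines can be grouped by their underlying cause. This is a rough sketch that assumes every message follows the exact shape of the lines quoted above; the regular expression and the function name are illustrative, not part of Dataverse.

```python
import re
from collections import defaultdict

# Captures the OAI identifier and the text after "Exception thrown from bean: ".
PATTERN = re.compile(
    r"identifier=([^,]+), .*?Exception thrown from bean: (.*?)</message>"
)


def group_by_cause(log_text):
    """Map each underlying exception message to the identifiers that triggered it."""
    groups = defaultdict(list)
    for identifier, cause in PATTERN.findall(log_text):
        groups[cause].append(identifier)
    return dict(groups)


# Two of the log lines quoted in this comment.
LOG = """
  <message>Exception processing getRecord(), oaiUrl=https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies, identifier=10, javax.ejb.EJBTransactionRolledbackException, Exception thrown from bean: javax.ejb.EJBException: Failed to find a global identifier in the OAI_DC XML record.</message>
  <message>Exception processing getRecord(), oaiUrl=https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies, identifier=12, javax.ejb.EJBTransactionRolledbackException, Exception thrown from bean: java.lang.NullPointerException</message>
"""

for cause, identifiers in group_by_cause(LOG).items():
    print(cause, identifiers)
```

Grouping this way makes it easier to compare a record that hits the "global identifier" error with one that hits the NullPointerException.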

@jggautier
Contributor Author

jggautier commented Jan 8, 2021

Identifier confusion

The error "Failed to find a global identifier in the OAI_DC XML record" reminds me of the issue in #5050. In that issue, @JingMa87 found that Dataverse wants only identifiers that are DOIs or HDLs, and when the oai_dc record has two identifiers, Dataverse looks only at the first identifier. ICPSR oai_dc records have two dc:identifier elements, and its first is not the record's DOI.

E.g. https://www.icpsr.umich.edu/icpsrweb/neutral/oai/studies?verb=GetRecord&metadataPrefix=oai_dc&identifier=1

<dc:identifier>1</dc:identifier>
<dc:identifier>http://doi.org/10.3886/ICPSR00001.v3</dc:identifier>

(This issue wasn't resolved in #5050 because the scope was about Zenodo, which doesn't have this problem. The problem that we decided to resolve was about how restrictive Dataverse was when figuring out if the first identifier is a DOI or HDL.)

I would expect Dataverse to throw this "Failed to find a global identifier" error for all of these ICPSR records, since I think they all have two dc:identifier elements and the first element contains just a number. But some records aren't getting this "Failed to find a global identifier" error?

In cases where a record has more than one identifier, is there a way to have Dataverse look for a DOI or HDL first, maybe the first one it comes across, then try to use that when importing the metadata?
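The behavior asked about here, scanning every dc:identifier and using the first one that looks like a DOI or HDL, could be sketched as follows. This is only an illustration in Python, not Dataverse's importer (which is Java); the two regular expressions are simplified assumptions about what counts as a DOI or Handle, and the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Dublin Core element namespace used by oai_dc records.
DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Simplified, assumed patterns; real DOI and Handle syntax is broader than this.
DOI_RE = re.compile(r"^(?:doi:|https?://(?:dx\.)?doi\.org/)(10\.\d{4,9}/\S+)$", re.IGNORECASE)
HDL_RE = re.compile(r"^(?:hdl:|https?://hdl\.handle\.net/)(\S+/\S+)$", re.IGNORECASE)


def pick_global_identifier(record_xml):
    """Return the first dc:identifier that looks like a DOI or Handle, else None."""
    root = ET.fromstring(record_xml)
    for element in root.iter(DC_NS + "identifier"):
        value = (element.text or "").strip()
        doi = DOI_RE.match(value)
        if doi:
            return "doi:" + doi.group(1)
        hdl = HDL_RE.match(value)
        if hdl:
            return "hdl:" + hdl.group(1)
    return None


# The two identifiers from the ICPSR record quoted above.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>1</dc:identifier>
  <dc:identifier>http://doi.org/10.3886/ICPSR00001.v3</dc:identifier>
</oai_dc:dc>"""

print(pick_global_identifier(SAMPLE))  # doi:10.3886/ICPSR00001.v3
```

With this approach the bare "1" is skipped rather than treated as a failure, and a record with no DOI or HDL at all still returns nothing, so the caller can report the existing "Failed to find a global identifier" error.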

java.lang.NullPointerException

Maybe @scolapasta or @landreev can help with this? Not sure what it means and why it happens for some records and not all.

Which metadata format and "Archive Type" to use

Looks like you're using oai_dc. Which "Archive type" are you using?

[Screenshot: Screen Shot 2021-01-08 at 2 52 04 PM]

I know oai_dc should always work, but I've been encouraged to use, and have been testing with, the "ICPSR" Archive Type (and the oai_ddi25 metadata format, since that should yield more metadata from each record). That combination seems to have its own problems, though: see #7498.

@donsizemore
Contributor

@jggautier On your not being able to add the client on demo.dataverse.org - is dataverse.siteUrl set?

@jggautier
Contributor Author

hey @donsizemore. I see dataverse.siteUrl referenced in the installation and configuration guides but still don't know what it means and don't think I can check. @kcondon, would you know?

@kcondon
Contributor

kcondon commented Jan 21, 2021

@jggautier siteUrl is a jvm-option that specifies the URL used to access this Dataverse installation from the outside world, e.g. https://dataverse.harvard.edu rather than a local machine name such as http://machine1.harvard.edu. Over time, some functionality has come to rely on this setting and will not work properly when it is not configured. It can be confusing because there is another, similar option, fqdn, which is just the full hostname accessible from the outside world rather than the URL, e.g. dataverse.harvard.edu. It includes neither the protocol (https) nor the optional port number (443).
https://guides.dataverse.org/en/5.3/installation/config.html#dataverse-siteurl

For convenience but maybe not ease of reading, there is a syntax that allows you to define one in terms of the other: <jvm-options>-Ddataverse.siteUrl=http://${dataverse.fqdn}:8080</jvm-options>

In this case: <jvm-options>-Ddataverse.siteUrl=https://demo.dataverse.org</jvm-options>
<jvm-options>-Ddataverse.fqdn=demo.dataverse.org</jvm-options>
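The relationship between the two options described above (siteUrl is a full URL, fqdn is only its hostname) can be illustrated with a small consistency check. This is a sketch for explanation only; Dataverse does not necessarily perform such a check, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def site_url_matches_fqdn(site_url, fqdn):
    """Return True if site_url is an http(s) URL whose hostname equals fqdn."""
    parts = urlsplit(site_url)
    # urlsplit lowercases the hostname and separates out the scheme and port.
    return parts.scheme in ("http", "https") and parts.hostname == fqdn.lower()


# The Demo Dataverse values quoted above are consistent with each other.
print(site_url_matches_fqdn("https://demo.dataverse.org", "demo.dataverse.org"))

# A siteUrl that points at a local machine name does not match the public hostname.
print(site_url_matches_fqdn("http://machine1.harvard.edu", "dataverse.harvard.edu"))
```

The port in a value such as `http://${dataverse.fqdn}:8080` is ignored by the comparison, since fqdn never includes it.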

@jggautier
Contributor Author

Thanks for explaining!

I found a Dataverse support ticket with a text file attached that lists a Dataverse installation's jvm-options, including its dataverse.siteUrl. How would I see what Demo Dataverse's JVM options are? (I'm assuming I don't have access to them since I've never needed to.)

@kcondon
Contributor

kcondon commented Jan 21, 2021

@jggautier I've added the two options from Demo to my previous comment.

@jggautier
Contributor Author

jggautier commented Jun 3, 2021

Sometime between the last comment in this issue and today (maybe "2021-03"), a notice was added to the top of the ICPSR documentation page stating that ICPSR is retiring its OAI-PMH service. At the end of this month (June 2021), it won't be available. (The notice also reads that they are "exploring an API-focused solution that will involve delivering metadata using the DCAT-US schema", but I think that should be addressed outside of this GitHub issue.)

There are some IQSS grant-funded projects in the planning phases for improving the Dataverse software's harvesting capabilities, but that work will start sometime after this month. I think that means the OAI-PMH harvesting problems described in this and related GitHub issues won't be resolved in time for Dataverse repositories to harvest ICPSR's metadata before the OAI-PMH service is retired.

I think this and the related GitHub issues should be closed and we should follow up with ICPSR. The notice reads that they don't know when the API-focused solution will be completed, but I'm curious why they're retiring the OAI-PMH service.
