
Harvest: JSON format harvest fails with "unknown field" when field exists on client. #7075

Closed
kcondon opened this issue Jul 13, 2020 · 11 comments



kcondon commented Jul 13, 2020

For reference, see PR #7057.

In this instance, the fields do exist on the client side, and surprisingly the failing datasets are not the same ones indicated in that ticket, where the fields did not exist on the client. The failure appears to be the same (unknown field mraCollection), but it seems to be coming from Solr?

grep Error harvest_test_n99_2020-07-13T21-28-05.log
Exception processing getRecord(), oaiUrl=https://dataverse.harvard.edu/oai, identifier=doi:10.7910/DVN/B6OJKG, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.engine.command.exception.CommandException (Command [DatasetCreate dataset:49321] failed: Exception thrown from bean: javax.ejb.EJBTransactionRolledbackException: Exception thrown from bean: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=dataset_49321] unknown field 'mraCollection')
Exception processing getRecord(), oaiUrl=https://dataverse.harvard.edu/oai, identifier=doi:10.7910/DVN/I5O6OS, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.engine.command.exception.CommandException (Command [DatasetCreate dataset:51066] failed: Exception thrown from bean: javax.ejb.EJBTransactionRolledbackException: Exception thrown from bean: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr/collection1: ERROR: [doc=dataset_51066] unknown field 'mraCollection')

The two failing datasets are:
You are now connected to database "thedata_alt" as user "postgres".
select id, identifier, dtype from dvobject where id=49321;
id | identifier | dtype
-------+------------+---------
49321 | DVN/A9VJVR | Dataset
(1 row)

select id, identifier, dtype from dvobject where id=51066;
id | identifier | dtype
-------+------------+---------
51066 | DVN/3XMK0W | Dataset
(1 row)

In the original PR, the failing datasets were:
The controversial datasets are https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/B6OJKG and https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi:10.7910/DVN/I5O6OS.


JingMa87 commented Jul 19, 2020

@kcondon I'm trying to reproduce the error, but I can't seem to add the custom field mraCollection in a way that reproduces the error. Do you have a TSV file for me so I can add this custom field to Dataverse?


kcondon commented Jul 19, 2020

Hi, I believe all I did was run the custom script for Harvard metadata in the dvinstall file: https://raw.githubusercontent.com/IQSS/dataverse/develop/scripts/api/setup-optional-harvard.sh


JingMa87 commented Jul 21, 2020

I ran the script to add the Harvard metadata successfully. When I run the harvest, I get the same error as you, but with less info and a null in the message. However, I decided not to dig deeper into this difference yet and to focus on Solr instead.

I think the problem is that the new fields were added to the database and the UI, but not to Solr. So I checked the new fields that have to be added to Solr using the curl http://localhost:8080/api/admin/index/solr/schema call, which returned the correct fields. Then I checked schema.xml and found that the mraCollection value isn't in there yet. Then I read the documentation in the SearchFields.java class and in metadatacustomization.rst and found out that you have to update schema.xml using a script. So I ended up running updateSchemaMDB.sh to update the schema.xml file.
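A quick way to compare the two sides is to check what Dataverse reports against what Solr actually has. These commands assume the default local dev ports (8080 for Dataverse, 8983 for Solr) and the collection1 core shown in the error above; adjust them for your environment:

```shell
# Fields Dataverse expects Solr to know about (look for mraCollection):
curl -s "http://localhost:8080/api/admin/index/solr/schema" | grep mraCollection

# Ask Solr itself via the Schema API; an undefined field returns an error payload:
curl -s "http://localhost:8983/solr/collection1/schema/fields/mraCollection"
```

If the first command lists the field but the second returns an error, the schema update never reached Solr.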

I received an error: "Dataverse responded with empty file. When running on K8s: did you bootstrap yet?". This failure happens here:

if [[ "`wc -l ${TMPFILE}`" < "3" ]]; then
  echo "Dataverse responded with empty file. When running on K8s: did you bootstrap yet?"
  exit 123
fi

I think the writer of this script intended to check whether the number of lines in the file with the new fields is fewer than 3, but the check doesn't work: the file actually has 455 lines, yet the script still returns the empty-file error. If my assumption is correct, I have a fix using this first line, which I can push in a PR:

if [[ `wc -l < ${TMPFILE}` -lt 3 ]]; then

When I make this change, the update script runs successfully, but schema.xml still doesn't contain the new fields like mraCollection. Not sure where to go next.
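A plausible explanation for the Mac-only failure (an assumption, not confirmed in this thread): BSD/macOS `wc` pads its count with leading spaces and appends the file name, while `<` inside `[[ ]]` compares strings lexicographically, so a padded `     455 /tmp/...` sorts before `3` because a space orders before any digit. GNU `wc` on Linux prints `455 /tmp/...` with no padding, so the same check passes there. A small sketch simulating the macOS output:

```shell
#!/usr/bin/env bash
export LC_ALL=C                       # deterministic string collation

TMPFILE=$(mktemp)
seq 1 455 > "${TMPFILE}"              # stand-in for the 455-line schema file

# Original check, with the padded BSD-style `wc -l FILE` output hardcoded
# to simulate a Mac: `<` sorts strings, and ' ' (0x20) orders before '3'.
if [[ "     455 ${TMPFILE}" < "3" ]]; then
  string_check="misfires"
else
  string_check="passes"
fi

# Proposed fix: `wc -l < FILE` prints only the count (no file name),
# and -lt compares numerically, ignoring any leading whitespace.
if [[ $(wc -l < "${TMPFILE}") -lt 3 ]]; then
  numeric_check="empty"
else
  numeric_check="passes"
fi

echo "string check: ${string_check}, numeric check: ${numeric_check}"
rm -f "${TMPFILE}"
```

This would also explain why the script behaves correctly on Linux but not on a Mac, regardless of which shell runs it.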


kcondon commented Jul 21, 2020

@JingMa87 Thanks for looking into this! I can check with the team tomorrow when I'm back at work, but I wonder whether @poikilotherm has any insight into the problem and your proposed fix, since I think this is an area he is familiar with.


poikilotherm commented Jul 21, 2020

Hi @JingMa87,
where are you running this script? I tested it with GNU bash and zsh on Linux, using it successfully with Docker containers etc. Both string comparison and integer comparison should be perfectly valid and yield the same result.

I'll go ahead and try to reproduce with develop.

@JingMa87

Hi @poikilotherm, I'm running this script on my Mac using zsh. I have a full local dev environment of Dataverse running on Glassfish 4. But even after I adjust the script and run it, schema.xml doesn't update to contain the new fields.

@poikilotherm

Hi @JingMa87, I tried to reproduce this with latest develop (941d17d), running on Payara 5 and deploying customMRA.tsv via curl first. I tried the script with both zsh 5.7.1 (x86_64-redhat-linux-gnu) and bash 5.0.17(1)-release (x86_64-redhat-linux-gnu). I was not able to reproduce your problem, so I suspect your local environment.

Obviously, you can still gather the fields from the API endpoint and put them in the Solr schema files manually. Please reach out on IRC for more talk and help 😄


JingMa87 commented Jul 23, 2020

@poikilotherm My colleague @mderuijter has had a similar problem actually. I decided to add the fields manually for now.

@kcondon Do you run Dataverse on Payara 5? That might explain why my error message is less complete than yours. I added the custom fields like mraCollection to Solr and then ran a harvest on the Princeton set, and everything succeeded. Do you have the mraCollection field in Solr? I presume that's the problem.
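For reference, a missing field can also be added by hand through the Solr Schema API rather than by editing schema.xml; the field type and flags below are illustrative guesses, not the definitions Dataverse actually generates for this block:

```shell
# Define mraCollection in the collection1 core via the Schema API.
# Assumed: default local Solr port and core name; type/flags are placeholders.
curl -X POST -H 'Content-type:application/json' \
  "http://localhost:8983/solr/collection1/schema" \
  --data-binary '{
    "add-field": {
      "name": "mraCollection",
      "type": "string",
      "indexed": true,
      "stored": true,
      "multiValued": true
    }
  }'
```

After adding the field, a reindex of the affected datasets is needed for existing documents to pick it up.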



kcondon commented Jul 23, 2020

@JingMa87 Yes, Payara 5 is now our target platform.

@JingMa87

@kcondon Did you add the custom fields like mraCollection to Solr yet? This is probably what causes the problem.


kcondon commented Aug 24, 2020

@JingMa87 Hi, that was the problem. Apologies for not catching that; this field used to be part of schema.xml, but the behavior has changed to separate those fields out.

@kcondon kcondon closed this as completed Aug 24, 2020