Can't harvest when Dublin core field language is set #8139

Closed
tcoupin opened this issue Oct 12, 2021 · 5 comments · Fixed by #8689
Labels
Feature: Harvesting NIH OTA DC Grant: The Harvard Dataverse repository: A generalist repository integrated with a Data Commons NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... pm.epic.nih_harvesting pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues
Milestone

Comments

@tcoupin
Member

tcoupin commented Oct 12, 2021

I am trying to harvest a record from an OAI-PMH server. The record is formatted in the oai_dc schema and has the language field set to fr (oai_dc specifies that the language must be a 2-letter ISO 639-1 code).

<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
	<responseDate>
		2021-10-12T15:14:19+00:00
	</responseDate>
	<request verb="GetRecord" identifier="https://doi.org/10.23708/herbier-guyane-ird" metadataPrefix="oai_dc">
		http://doi2pmh.ird.fr/oai/
	</request>
	<GetRecord>
		<record>
			<header>
				<identifier>
					https://doi.org/10.23708/herbier-guyane-ird
				</identifier>
				<datestamp>
					2021-10-12T20:21:00+00:00
				</datestamp>
				<setSpec>
					Doi2Pmh
				</setSpec>
				<setSpec>
					UMR-AMAP
				</setSpec>
			</header>
			<metadata>
				<dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
					<identifier>
						https://doi.org/10.23708/herbier-guyane-ird
					</identifier>
					<publisher>
						UMR AMAP. CIRAD, CNRS, INRAE, IRD, Univ. Montpellier (France)
					</publisher>
					<title>
						L'herbier IRD de Guyane
					</title>
					<creator>
						Gonzalez, Sophie
					</creator>
					<creator>
						Bilot-Guérin, Véronique
					</creator>
					<creator>
						Delprete, Piero
					</creator>
					<creator>
						Geniez, Chantal
					</creator>
					<creator>
						Molino, Jean-François
					</creator>
					<creator>
						Smock, Jean-Louis
					</creator>
					<creator>
						Théveny, Frédéric
					</creator>
					<creator>
						IRD
					</creator>
					<creator>
						CIRAD
					</creator>
					<creator>
						INRAE
					</creator>
					<creator>
						Université de Montpellier
					</creator>
					<creator>
						Herbier de Guyane, Cayenne, Guyane française
					</creator>
					<creator>
						CNRS
					</creator>
					<description>
						L’Herbier IRD de Guyane (CAY), joue un rôle central dans l’acquisition et la diffusion des connaissances sur la flore de la Guyane française, et plus largement du Bouclier Guyanais et de l'Amazonie. Il a été créé en 1965 par R.A.A. Oldeman, et abrite aujourd’hui près de 200 000 spécimens collectés pour la plupart en Guyane française, mais aussi au Surinam, au Guyana, au Brésil (notamment dans l’État de l’Amapá) et au Vénézuela (État d'Amazonas).
					</description>
					<subject>
						FOS: Biological sciences
					</subject>
					<language>
						fr
					</language>
					<type>
						article
					</type>
				</dc>
			</metadata>
		</record>
	</GetRecord>
</OAI-PMH>

But the harvest is failing with the following error:

Exception processing getRecord(), oaiUrl=https://doi2pmh.ird.fr/oai/, identifier=https://doi.org/10.23708/herbier-guyane-ird, edu.harvard.iq.dataverse.api.imports.ImportException, Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.ControlledVocabularyException (Value 'fr' does not exist in type 'language')

Language is a controlled vocabulary field and values are human readable: see https://github.com/IQSS/dataverse/blob/develop/scripts/api/data/metadatablocks/citation.tsv#L186

I think the controlled vocabulary should refer to ISO 639-1 codes, and the human-readable display values should be set through translation files.

Removing the language field from the record fixes the harvesting.
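
To check which language values a record carries before harvesting it, something like the following Python sketch could be used. It is only an illustration: `dc_languages` is a hypothetical helper, the sample is a trimmed-down version of the record above, and it assumes the server emits the standard OAI-PMH and Dublin Core namespaces (the XML pasted above is shown without them).

```python
import xml.etree.ElementTree as ET

# Trimmed-down record; the namespace declarations are an assumption,
# since the XML pasted in this issue is shown without them.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>L'herbier IRD de Guyane</dc:title>
      <dc:language>
        fr
      </dc:language>
    </oai_dc:dc>
  </metadata></record></GetRecord>
</OAI-PMH>"""

DC_NS = "{http://purl.org/dc/elements/1.1/}"


def dc_languages(xml_text):
    """Return the whitespace-trimmed text of every dc:language element."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter(DC_NS + "language") if el.text]


print(dc_languages(SAMPLE))  # ['fr']
```

Any value returned here that is not listed for `language` in citation.tsv would be expected to trigger the ControlledVocabularyException shown above.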

@doigl
Contributor

doigl commented Mar 9, 2022

Same problem here with oai_dc and language "en":
edu.harvard.iq.dataverse.api.imports.ImportException: Failed to import harvested dataset: class edu.harvard.iq.dataverse.util.json.ControlledVocabularyException (Value 'en' does not exist in type 'language')

@pdurbin
Member

pdurbin commented Mar 9, 2022

In citation.tsv I see lines like this:

	language	English		40	eng
	language	French		47	fra

I assume that what's needed is some way to map "en" to "eng" and "fr" to "fra".
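
A minimal sketch of that kind of mapping, written in Python purely for illustration (Dataverse itself is Java, and this is not Dataverse code). The `ISO_639_1_TO_3` table and the `normalize_language` function are hypothetical names, and the table only holds the two codes mentioned in this thread.

```python
# Hypothetical lookup table: only the two codes discussed in this issue.
ISO_639_1_TO_3 = {
    "en": "eng",
    "fr": "fra",
}


def normalize_language(code: str) -> str:
    """Return the 3-letter code for a known 2-letter code, else the cleaned input."""
    cleaned = code.strip().lower()
    return ISO_639_1_TO_3.get(cleaned, cleaned)


print(normalize_language("fr"))    # fra
print(normalize_language(" EN "))  # eng
print(normalize_language("eng"))   # eng
```

Unknown values pass through unchanged, so a later vocabulary check would still reject them rather than silently accept a wrong language.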

It looks like the 3 letter ISO-639-3 codes were added in pull request #7690 because of an inability to harvest "eng" datasets from Zenodo in #7638.

This issue seems to be related:

@doigl
Contributor

doigl commented Mar 9, 2022

@pdurbin: Yes, this seems to be the case. I tried to identify a place in the code where such a mapping could take place, but wasn't successful (perhaps by adding and handling a new further-processing column in the foreignmetadatafieldmapping table?). I thought about just removing the language entry from the foreignmetadatafieldmapping table as an ugly hack (language is not really an important field for harvested datasets), but I am unsure about the side effects of this.

@qqmyers
Member

qqmyers commented Mar 9, 2022

#7638 (comment) indicates that we can have multiple alternates - could en be added into the tsv without removing eng, etc?

@landreev
Contributor

Yes, this is just a matter of adding more alternative variants to the list of controlled vocabulary values in citation.tsv.
So yes, to add "fr" as a legitimate value you can change the following line in the citation.tsv that we distribute:

	language	French		47	fra

to

	language	French		47	fra	fr

and update the block (curl http://localhost:8080/api/admin/datasetfield/load -H "Content-type: text/tab-separated-values" -X POST --upload-file citation.tsv)

But yes, we should add all these standard 2-letter codes to the block in the next release.
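
For more than a handful of languages, the edit could be scripted. The following Python sketch appends an alternate value to a matching controlled-vocabulary row. The `add_alternate` helper is hypothetical, and the column layout it assumes (leading empty column, DatasetField, Value, identifier, displayOrder, then alternates) is inferred from the rows quoted in this thread.

```python
def add_alternate(row: str, field: str, value: str, alternate: str) -> str:
    """Append `alternate` to a controlled-vocabulary row that matches and lacks it."""
    line = row.rstrip("\n")
    columns = line.split("\t")
    # Assumed layout: "", DatasetField, Value, identifier, displayOrder, alternates...
    if len(columns) < 5 or columns[1] != field or columns[2] != value:
        return line  # not the row we are looking for
    # Drop trailing empty cells so the new alternate directly follows the last one.
    while len(columns) > 5 and columns[-1] == "":
        columns.pop()
    if alternate in columns[5:]:
        return "\t".join(columns)  # already present, nothing to add
    return "\t".join(columns + [alternate])


before = "\tlanguage\tFrench\t\t47\tfra"
after = add_alternate(before, "language", "French", "fr")
print(repr(after))  # '\tlanguage\tFrench\t\t47\tfra\tfr'
```

The updated file would then still need to be reloaded with the curl command above.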

@mreekie mreekie added pm.epic.nih_harvesting NIH OTA DC Grant: The Harvard Dataverse repository: A generalist repository integrated with a Data Commons labels May 9, 2022
tcoupin added a commit to tcoupin/dataverse that referenced this issue May 11, 2022
kcondon added a commit that referenced this issue May 23, 2022
Fix #8139 : add iso-639-1 code for language as oai_dc specification
@pdurbin pdurbin added this to the 5.11 milestone Jun 2, 2022
@mreekie mreekie added the NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... label Nov 4, 2022
@mreekie mreekie added the pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues label Mar 20, 2023