Harvest: DDI import appears not to include all fields exported as DDI. #3297

kcondon · 2016-08-18T22:00:21Z

Searching on fields in the DDI export of a dataset mostly works, but some fields are apparently not imported.

Fields that appear in the DDI export but are not searchable (apparently not imported) include:


Producer URI	software version	Terms of Use
Producer Logo URI	software name	Confidentiality Declaration
Keyword Vocab URI	data sources	Special Permissions
Topic Class Vocab URI	origin of sources	Restrictions
Bounding box coords	CharacteristicOfSourcesNoted	Citation Requirements
all related publication fields	DocumentationAndAccessToSources	Depositor Requirements
all other id fields	CollectorTraining	Terms of Access
all distributor fields	TargetSampleSizeFormula	Data Access Place
all contact fields	sample size	Conditions
all depositor fields	ControlOperations	Disclaimer
related material	StudyLevelErrorNotes	Original Archive
related datasets	EstimatesOfSamplingError	Availability Status
other references	NotesType	Contact for Access
series name	NotesSubject	Size of Collection
series information		Study Completion

djbrooke · 2016-08-19T15:13:12Z

Needs some further verification about whether or not this import is in scope for Harvesting.

I'll take a look through the documentation and see if I can find this.

djbrooke · 2016-08-19T18:14:10Z

Looking at "DocumentationAndAccessToSources" from above. From http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/field_level_documentation_files/schemas/codebook_xsd/elements/srcDocu.html:

<xs:element name="srcDocu" type="simpleTextType">
  <xs:annotation>
    <xs:documentation>
      <xhtml:div>
        <xhtml:h1 class="element_title">Documentation and Access to Sources</xhtml:h1>
        <xhtml:div>
          <xhtml:h2 class="section_header">Description</xhtml:h2>
          <xhtml:div class="description"> Level of documentation of the original sources. 
May not be relevant to survey data. This element may be repeated to support multiple 
language expressions of the content. </xhtml:div>
        </xhtml:div>
      </xhtml:div>
    </xs:documentation>
  </xs:annotation>
</xs:element>
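As a rough illustration of what an importer would need to do with this element, here is a minimal Python sketch that pulls every srcDocu value out of a DDI Codebook 2.5 document. The sample XML and the function name `extract_src_docu` are made up for this example; it is not the Dataverse import code.

```python
import xml.etree.ElementTree as ET

# Namespace used by DDI Codebook 2.5 documents.
DDI_NS = {"ddi": "ddi:codebook:2_5"}

# Invented sample record, only for illustration.
SAMPLE_DDI = """<codeBook xmlns="ddi:codebook:2_5">
  <stdyDscr>
    <method>
      <dataColl>
        <sources>
          <srcDocu>Original survey instruments are archived with the producer.</srcDocu>
        </sources>
      </dataColl>
    </method>
  </stdyDscr>
</codeBook>"""


def extract_src_docu(ddi_xml: str) -> list[str]:
    """Return the text of every srcDocu element, wherever it appears."""
    root = ET.fromstring(ddi_xml)
    # srcDocu is repeatable (one per language), so collect all occurrences.
    return [el.text.strip() for el in root.findall(".//ddi:srcDocu", DDI_NS) if el.text]


values = extract_src_docu(SAMPLE_DDI)
print(values)
```

If the importer never reads this element, the exported value has nowhere to land on import, which would explain why it is not searchable.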

djbrooke · 2016-08-19T18:50:41Z

Removing from 4.5. We'll address this in the near future when our new metadata librarian joins the team. At that point we'll work with her/him to gather more information regarding the importance of including these fields.

jggautier · 2018-04-05T19:40:00Z

If we'd like people to be able to search for harvested datasets to the same extent that they can search for local datasets (and I think we should), then the same searchable metadata fields should be available for both harvested and non-harvested datasets.

For example, people who know the title of an article and want to know if there are any associated datasets can search on related publication citations of local (non-harvested) datasets, so they should be able to search on related publication citations of harvested datasets. This isn't possible if all or some harvested related publication metadata are not searchable (as @kcondon reported).

Maybe a first pass could be to identify the metadata fields we think are important for searching by looking at what's been added to the advanced search field, plus any weighting added in Solr so that searches favor certain fields over others, and then make sure those fields are being indexed for harvested metadata.
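That first pass amounts to a set comparison. A minimal Python sketch follows; the three field lists are placeholder values for illustration, not pulled from the actual advanced search configuration or Solr schema.

```python
def missing_for_harvested(advanced_search, boosted, indexed_for_harvested):
    """Fields that matter for search but are not indexed for harvested datasets."""
    priority = set(advanced_search) | set(boosted)
    return sorted(priority - set(indexed_for_harvested))


# Placeholder inputs; real values would come from the advanced search
# configuration, the Solr weighting, and the harvested-record index.
advanced_search_fields = {"title", "authorName", "publicationCitation", "keywordValue"}
boosted_fields = {"title", "dsDescriptionValue"}
harvested_indexed_fields = {"title", "authorName", "dsDescriptionValue"}

gaps = missing_for_harvested(advanced_search_fields, boosted_fields, harvested_indexed_fields)
print(gaps)
```

Whatever ends up in `gaps` would be the short list of fields to check in the import code first.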

jggautier · 2020-01-16T20:46:01Z

For this issue and in general, I think it would be helpful to record how Dataverse is mapping DDI Codebook fields to Dataverse fields when dataset metadata is imported. Makes sense to me to do this in the metadata crosswalk.

When Dataverse harvests DDI metadata over OAI-PMH, it's mapping DDI Codebook fields to Dataverse fields.
When people use the API's import DDI endpoint to create datasets by importing DDI xml files, Dataverse is also mapping DDI Codebook fields to Dataverse fields.

Can someone confirm by looking at the code that in both cases (harvesting over OAI-PMH and using the API endpoint) Dataverse is mapping DDI to Dataverse in the same way? Is it using the same code to map the fields? If we know that, it'll make documenting how DDI fields are mapped to Dataverse fields on import a lot easier.
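To make the question concrete, here is a hypothetical Python sketch of what "the same code" would look like: one crosswalk table and one mapping function shared by both entry points. None of these names come from the Dataverse codebase. Only the srcDocu pairing is taken from this thread; the serName pairing is an assumption for illustration.

```python
# Hypothetical crosswalk: DDI Codebook element name -> Dataverse field label.
# srcDocu is taken from this thread; serName is an assumed pairing.
CROSSWALK = {
    "srcDocu": "DocumentationAndAccessToSources",
    "serName": "series name",
}


def map_ddi_fields(ddi_values):
    """Split DDI values into mapped Dataverse fields and unmapped DDI elements."""
    mapped, unmapped = {}, []
    for element, value in ddi_values.items():
        field = CROSSWALK.get(element)
        if field is None:
            unmapped.append(element)  # exported but silently dropped on import
        else:
            mapped[field] = value
    return mapped, sorted(unmapped)


def import_from_harvest(ddi_values):
    """Entry point for records harvested over OAI-PMH (hypothetical)."""
    return map_ddi_fields(ddi_values)


def import_from_api(ddi_values):
    """Entry point for the API's import DDI endpoint (hypothetical)."""
    return map_ddi_fields(ddi_values)


record = {
    "srcDocu": "Fully documented",
    "serName": "Example Series",
    "geoBndBox": "-71.2 42.2 -70.9 42.4",
}
mapped, unmapped = import_from_harvest(record)
print(mapped, unmapped)
```

If both paths really share one table like this, documenting the crosswalk once covers harvesting and the API endpoint alike, and the unmapped list is the set of fields to report.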

jggautier · 2020-02-20T18:56:43Z

@landreev wrote in #4964 that "the only other harvesting format we (theoretically) recognize from a non-Dataverse OAI archive is DDI; in practice, it's extremely unlikely that we'll be able to parse a DDI that's produced by anything other than a Dataverse."

This makes me wonder how much we've prioritized harvesting DDI metadata in general. It makes sense that Dataverse would prioritize support for harvesting dataverse_json metadata from Dataverse repositories, at least for Dataverse 4+ repositories. After lots of testing and recent conversations about harvesting, here are the configurations I'm following when creating harvesting clients in Harvard Dataverse:

Data repository type	"Archive type"	Metadata format
Dataverse 4+ repositories	Dataverse v4+	dataverse_json
Dataverse 2-3.x repositories	DVN, v2-3	ddi
ICPSR	ICPSR	oai_ddi25
Other non-Dataverse repositories	Generic OAI resource (DC)	oai_dc

Of course harvesting DDI metadata from non-Dataverse repositories would be preferred over Dublin Core, but if DC is the best supported option for now, there seem to be only two cases where we would recommend that a Dataverse 4+ repository harvest DDI metadata (over OAI-PMH): When harvesting from Dataverse 2-3.x repositories and from ICPSR.
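For reference, the metadata format in the table is what ends up as the metadataPrefix argument of the OAI-PMH ListRecords request. A small Python sketch of that lookup is below. The dictionary mirrors the table above, while the function name and the example.org base URL are placeholders.

```python
from urllib.parse import urlencode

# Mirrors the table above: "Archive type" -> metadata format (OAI-PMH metadataPrefix).
METADATA_FORMAT_BY_ARCHIVE_TYPE = {
    "Dataverse v4+": "dataverse_json",
    "DVN, v2-3": "ddi",
    "ICPSR": "oai_ddi25",
    "Generic OAI resource (DC)": "oai_dc",
}


def list_records_url(base_url, archive_type, set_spec=None):
    """Build an OAI-PMH ListRecords URL for the given archive type."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": METADATA_FORMAT_BY_ARCHIVE_TYPE[archive_type],
    }
    if set_spec:
        params["set"] = set_spec  # optional OAI set to harvest
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL, not a real endpoint.
url = list_records_url("https://example.org/oai", "ICPSR")
print(url)
```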

@kcondon, when you opened this ticket, were your findings the result of harvesting from Dataverse 2-3.x repositories?

landreev · 2020-03-09T22:52:20Z

Just to reiterate what I said in a comment in #6650: I believe this issue can be considered a duplicate of the above.
I don't think this issue is really specific to OAI and Harvesting.
Harvesting DDI metadata records and not being able to search on everything in them is one of the symptoms of this issue. But at its core, the problem is the mismatch between the export and import rules.
Which is being addressed in #6650, and that should supersede this issue.

landreev · 2020-03-09T23:37:59Z

@jggautier
To clarify what I said about importing DDI, quoted above:

the only other harvesting format we (theoretically) recognize from a non-Dataverse OAI archive is DDI; in practice, it's extremely unlikely that we'll be able to parse a DDI that's produced by anything other than a Dataverse.

I didn't mean we shouldn't try harvesting DDI; we always want to choose more metadata-rich formats whenever possible.
We have indeed successfully harvested DDI from a number of places: ICPSR, Nesstar, and Roper (I think? maybe some other places?)
What I meant specifically was that DDI is a very rich and complex format, with potentially too many ways to encode the same information. That makes it hard, or even impossible, to create a foolproof, universal parser and mapper that could take ANY syntactically legal DDI and successfully translate its contents into our (Dataverse) metadata structure. Whenever we had to import non-Dataverse-generated DDIs (the cases above), we had to tweak the import code to handle the idiosyncrasies of how each archive cooked its DDI. If we encounter yet another archive that offers harvestable DDI records, we'll likely have to add more such tweaks to be able to process them.

Note that what I said above was in the context of harvesting from a "generic archive", defined as a repository that we don't know anything about, aside from the fact that it is OAI-compliant. As opposed to harvesting DDI from the above-mentioned ICPSR, Roper and Nesstar archives, for which we do have pre-defined redirect rules. (This is what appears in the "archive type" pull down menu on the Harvesting Clients page; "Nesstar" isn't shown there, because it's not supported for active harvesting, but it is defined for displaying legacy Nesstar records).

Hope this makes sense. Otherwise please let me know...

jggautier · 2020-03-12T16:14:13Z

That makes sense. Thanks @landreev!

When we spoke today we agreed that this issue can be closed, so I'm closing it. #6650 addresses all of the fields mentioned in this issue's original comment.

Some geospatial metadata won't be mapped on import, possibly because Codebook has no DDI elements for them (e.g. State, City). But I think that could be its own issue (that is, more research or confirming with the DDI technical committee that there's really no way to express geospatial metadata like State and City, or asking the DDI folks to consider adding a way, since they're actually in the middle of updates to Codebook).

kcondon added Feature: Metadata Feature: Harvesting Priority 2: Moderate labels Aug 18, 2016

kcondon added this to the 4.5 - Metadata Export and Harvesting milestone Aug 18, 2016

djbrooke self-assigned this Aug 19, 2016

djbrooke removed this from the 4.5 - Metadata Export and Harvesting milestone Aug 19, 2016

djbrooke removed their assignment Aug 22, 2016

pdurbin added User Role: Superuser and removed zPriority 2: Moderate labels Jul 12, 2017

djbrooke mentioned this issue Feb 19, 2020

Mismatch between export and import DDI functions #6650

Closed

jggautier closed this as completed Mar 12, 2020
