Harvest: DDI import appears not to include all fields exported as DDI. #3297

Closed
kcondon opened this issue Aug 18, 2016 · 9 comments
Labels
Feature: Harvesting Feature: Metadata User Role: Superuser Has access to the superuser dashboard and cares about how the system is configured

Comments

@kcondon
Contributor

kcondon commented Aug 18, 2016

Searching on fields in the DDI export of a dataset mostly works, but some fields are apparently not imported.

Fields that appear in the DDI export but are not searchable (and apparently not imported) include:

| | | |
|---|---|---|
| Producer URI | software version | Terms of Use |
| Producer Logo URI | software name | Confidentiality Declaration |
| Keyword Vocab URI | data sources | Special Permissions |
| Topic Class Vocab URI | origin of sources | Restrictions |
| Bounding box coords | CharacteristicOfSourcesNoted | Citation Requirements |
| all related publication fields | DocumentationAndAccessToSources | Depositor Requirements |
| all other id fields | CollectorTraining | Terms of Access |
| all distributor fields | TargetSampleSizeFormula | Data Access Place |
| all contact fields | sample size | Conditions |
| all depositor fields | ControlOperations | Disclaimer |
| related material | StudyLevelErrorNotes | Original Archive |
| related datasets | EstimatesOfSamplingError | Availability Status |
| other references | NotesType | Contact for Access |
| series name | NotesSubject | Size of Collection |
| series information | | Study Completion |

@djbrooke
Contributor

Needs some further verification about whether or not this import is in scope for Harvesting.

I'll take a look through the documentation and see if I can find this.

@djbrooke
Contributor

djbrooke commented Aug 19, 2016

Looking at "DocumentationAndAccessToSources" from above. From http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/field_level_documentation_files/schemas/codebook_xsd/elements/srcDocu.html:

<xs:element name="srcDocu" type="simpleTextType">
  <xs:annotation>
    <xs:documentation>
      <xhtml:div>
        <xhtml:h1 class="element_title">Documentation and Access to Sources</xhtml:h1>
        <xhtml:div>
          <xhtml:h2 class="section_header">Description</xhtml:h2>
          <xhtml:div class="description"> Level of documentation of the original sources. 
May not be relevant to survey data. This element may be repeated to support multiple 
language expressions of the content. </xhtml:div>
        </xhtml:div>
      </xhtml:div>
    </xs:documentation>
  </xs:annotation>
</xs:element>
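
As a rough illustration of what reading this element on import involves, here is a minimal Python sketch that collects every `srcDocu` value from a DDI Codebook 2.5 document. The sample XML and the function name `extract_src_docu` are made up for this example, and the element's position under `stdyDscr/method/dataColl/sources` is my reading of the Codebook schema. This is not Dataverse's actual import code.

```python
import xml.etree.ElementTree as ET

# Namespace used by DDI Codebook 2.5 documents.
DDI_NS = {"ddi": "ddi:codebook:2_5"}

# Hypothetical sample record, trimmed to the elements needed here.
SAMPLE = """<codeBook xmlns="ddi:codebook:2_5">
  <stdyDscr>
    <method>
      <dataColl>
        <sources>
          <srcDocu>Original records are held by the archive.</srcDocu>
        </sources>
      </dataColl>
    </method>
  </stdyDscr>
</codeBook>"""


def extract_src_docu(ddi_xml: str) -> list[str]:
    """Return the text of every srcDocu element (the element is repeatable)."""
    root = ET.fromstring(ddi_xml)
    return [
        (el.text or "").strip()
        for el in root.iterfind(".//ddi:srcDocu", DDI_NS)
    ]


print(extract_src_docu(SAMPLE))
```

If an importer never looks for this element, the value is silently dropped, which would match the "exported but not searchable" symptom described above.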

@djbrooke
Contributor

Removing from 4.5. We'll address this in the near future when our new metadata librarian joins the team. At that point we'll work with her/him to gather more information regarding the importance of including these fields.

@djbrooke djbrooke removed their assignment Aug 22, 2016
@pdurbin pdurbin added User Role: Superuser Has access to the superuser dashboard and cares about how the system is configured and removed zPriority 2: Moderate labels Jul 12, 2017
@jggautier
Contributor

jggautier commented Apr 5, 2018

If we'd like people to be able to search for harvested datasets to the same extent that they can search for local datasets (and I think we should), then the same searchable metadata fields should be available for both harvested and non-harvested datasets.

For example, people who know the title of an article and want to know if there are any associated datasets can search on related publication citations of local (non-harvested) datasets, so they should be able to search on related publication citations of harvested datasets. This isn't possible if all or some harvested related publication metadata are not searchable (as @kcondon reported).

Maybe a first pass could be to identify the metadata fields we think are important for searching by looking at what's been added to the advanced search field, plus any weighting added in Solr so that searches favor certain fields over others, and then make sure those fields are being indexed for harvested metadata.

@jggautier
Contributor

jggautier commented Jan 16, 2020

For this issue and in general, I think it would be helpful to record how Dataverse is mapping DDI Codebook fields to Dataverse fields when dataset metadata is imported. Makes sense to me to do this in the metadata crosswalk.

  • When Dataverse harvests DDI metadata over OAI-PMH, it's mapping DDI Codebook fields to Dataverse fields.
  • When people use the API's import DDI endpoint to create datasets by importing DDI XML files, Dataverse is also mapping DDI Codebook fields to Dataverse fields.

Can someone confirm by looking at the code that in both cases (harvesting over OAI-PMH and using the API endpoint) Dataverse is mapping DDI to Dataverse in the same way? Is it using the same code to map the fields? If we know that, it'll make documenting how DDI fields are mapped to Dataverse fields on import a lot easier.
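
To make the question concrete, here is a small Python sketch of what "using the same code to map the fields" could look like: one crosswalk table and one mapping function that both entry points call. Everything here is hypothetical. The function names are invented, and only the `srcDocu` pairing comes from this thread; the other three pairings are guesses based on the DDI element names and the field list in the original comment. It says nothing about how Dataverse is actually structured.

```python
# Hypothetical crosswalk: DDI Codebook element -> Dataverse field name.
# Only the srcDocu entry is taken from this thread; the rest are assumptions.
CROSSWALK = {
    "dataSrc": "data sources",
    "srcOrig": "origin of sources",
    "srcChar": "CharacteristicOfSourcesNoted",
    "srcDocu": "DocumentationAndAccessToSources",
}


def map_ddi_fields(ddi_values):
    """Split DDI values into mapped Dataverse fields and unmapped elements."""
    mapped, unmapped = {}, []
    for element, value in ddi_values.items():
        field = CROSSWALK.get(element)
        if field is None:
            unmapped.append(element)  # would be lost on import
        else:
            mapped[field] = value
    return mapped, unmapped


def import_from_harvest(ddi_values):
    """Hypothetical OAI-PMH harvesting entry point."""
    return map_ddi_fields(ddi_values)


def import_from_api(ddi_values):
    """Hypothetical API import entry point."""
    return map_ddi_fields(ddi_values)


record = {"srcDocu": "Fully documented", "collSize": "3 files"}
print(import_from_harvest(record))
print(import_from_api(record))
```

If both paths go through a single function like this, one crosswalk document describes both; if they don't, each path needs its own column in the crosswalk.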

@jggautier
Contributor

jggautier commented Feb 20, 2020

@landreev wrote in #4964 that "the only other harvesting format we (theoretically) recognize from a non-Dataverse OAI archive is DDI; in practice, it's extremely unlikely that we'll be able to parse a DDI that's produced by anything other than a Dataverse."

This makes me wonder how much we've prioritized harvesting DDI metadata in general. It makes sense that Dataverse would prioritize support for harvesting dataverse_json metadata from Dataverse repositories, at least for Dataverse 4+ repositories. After lots of testing and recent conversations about harvesting, here are the configurations I'm following when creating harvesting clients in Harvard Dataverse:

| Data repository type | "Archive type" | Metadata format |
|---|---|---|
| Dataverse 4+ repositories | Dataverse v4+ | dataverse_json |
| Dataverse 2-3.x repositories | DVN, v2-3 | ddi |
| ICPSR | ICPSR | oai_ddi25 |
| Other non-Dataverse repositories | Generic OAI resource (DC) | oai_dc |

Of course harvesting DDI metadata from non-Dataverse repositories would be preferred over Dublin Core, but if DC is the best supported option for now, there seem to be only two cases where we would recommend that a Dataverse 4+ repository harvest DDI metadata (over OAI-PMH): When harvesting from Dataverse 2-3.x repositories and from ICPSR.

@kcondon, when you opened this ticket, were your findings the result of harvesting from Dataverse 2-3.x repositories?

@landreev
Contributor

landreev commented Mar 9, 2020

Just to reiterate what I said in a comment in #6650: I believe this issue can be considered a duplicate of the above.
I don't think this issue is really specific to OAI and Harvesting.
Harvesting DDI metadata records and not being able to search on everything in them is one symptom of this issue, but at its core is the mismatch between the export and import rules.
That mismatch is being addressed in #6650, which should supersede this issue.
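
One simple way to surface an export/import mismatch is a round-trip comparison: list the fields the exporter writes, list the fields the importer reads back, and take the difference. The Python sketch below shows the idea with made-up field sets; the field names are borrowed from the list in the original comment, but which ones survive import here is purely illustrative, and `fields_lost_on_import` is a hypothetical helper.

```python
def fields_lost_on_import(exported, imported):
    """Return the fields present in the export but missing after import."""
    return sorted(set(exported) - set(imported))


# Illustrative data only; not the result of an actual round trip.
exported_fields = {"Title", "Producer URI", "Terms of Use", "series name"}
imported_fields = {"Title"}

for name in fields_lost_on_import(exported_fields, imported_fields):
    print(name)
```

Run against a real export and the re-imported dataset, a check like this would produce the kind of list given at the top of this issue.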

@landreev
Contributor

landreev commented Mar 9, 2020

@jggautier
To clarify what I said about importing DDI, quoted above:

the only other harvesting format we (theoretically) recognize from a non-Dataverse OAI archive is DDI; in practice, it's extremely unlikely that we'll be able to parse a DDI that's produced by anything other than a Dataverse.

I didn't mean we shouldn't try harvesting DDI; we always want to choose more metadata-rich formats whenever possible.
We have indeed successfully harvested DDI from a number of places: ICPSR, Nesstar, and Roper (I think? maybe some other places?)
What I meant specifically was that DDI is a very rich and complex format, with potentially too many ways to encode the same information. That makes it hard, or even impossible, to create a foolproof, universal parser and mapper that could take ANY syntactically legal DDI and successfully translate its contents into our (Dataverse) metadata structure. Whenever we had to import non-Dataverse-generated DDIs (the cases above), we had to tweak the import code to address whatever idiosyncratic traits were specific to how they cooked their DDI. If we encounter yet another archive that offers harvestable DDI records, we'll likely have to add more such tweaks to be able to process them.

Note that what I said above was in the context of harvesting from a "generic archive", defined as a repository that we don't know anything about, aside from the fact that it is OAI-compliant, as opposed to harvesting DDI from the above-mentioned ICPSR, Roper and Nesstar archives, for which we do have pre-defined redirect rules. (This is what appears in the "archive type" pull-down menu on the Harvesting Clients page; "Nesstar" isn't shown there because it's not supported for active harvesting, but it is defined for displaying legacy Nesstar records.)

Hope this makes sense. Otherwise please let me know...

@jggautier
Contributor

That makes sense. Thanks @landreev!

When we spoke today we agreed that this issue can be closed, so I'm closing it. #6650 addresses all of the fields mentioned in this issue's original comment.

Some geospatial metadata won't be mapped on import, possibly because Codebook has no DDI elements for them (e.g. State, City). But I think that could be its own issue (that is, more research or confirming with the DDI technical committee that there's really no way to express geospatial metadata like State and City, or asking the DDI folks to consider adding a way, since they're actually in the middle of updates to Codebook).
