Remap oai_dc fields dc:type and dc:date #10737

Merged on Sep 11, 2024 (9 commits)

pdurbin
Member

@pdurbin pdurbin commented Aug 1, 2024

What this PR does / why we need it:

The oai_dc export and harvesting format has had the following fields remapped:

As these are backward incompatible changes, they have been emphasized in the release note snippet.

Which issue(s) this PR closes:

Special notes for your reviewer:

Should these backward-incompatible changes be hidden behind a feature flag?

Suggestions on how to test this:

See rules above under "what this PR does". Also, below are some examples of before and after.

Before

  • dc:date is mapped to the field "Production Date" when available and otherwise to "Publication Date".
  • We see "survey" under dc:type because it was entered in the "Kind of Data" field. dc:type will be absent if "Kind of Data" isn't filled in.
<oai_dc:dc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:title>Darwin's Finches</dc:title>
  <dc:identifier>https://doi.org/10.5072/FK2/QHIUBQ</dc:identifier>
  <dc:creator>Finch, Fiona</dc:creator>
  <dc:publisher>Root</dc:publisher>
  <dc:description>Darwin's finches (also known as the Galápagos finches) are a group of about fifteen species of passerine birds.</dc:description>
  <dc:subject>Medicine, Health and Life Sciences</dc:subject>
  <dc:date>2024-09-10</dc:date>
  <dc:type>survey</dc:type>
</oai_dc:dc>

After

<oai_dc:dc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:title>Darwin's Finches</dc:title>
  <dc:identifier>https://doi.org/10.5072/FK2/QHIUBQ</dc:identifier>
  <dc:creator>Finch, Fiona</dc:creator>
  <dc:publisher>Root</dc:publisher>
  <dc:description>Darwin's finches (also known as the Galápagos finches) are a group of about fifteen species of passerine birds.</dc:description>
  <dc:subject>Medicine, Health and Life Sciences</dc:subject>
  <dc:date>2024-09-10</dc:date>
  <dc:type>Dataset</dc:type>
</oai_dc:dc>

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No.

Is there a release notes update needed for this change?:

Yes, included.

Additional documentation:

None.

The `oai_dc` export and harvesting format has had the following fields remapped:

- dc:type was mapped to the field "Kind of Data". Now it is hard-coded to the word "Dataset".
- dc:date was mapped to the field "Production Date" when available and otherwise to "Publication Date". Now it is mapped only to the field "Publication Date".
- dc:rights was not mapped to anything. Now it is mapped (when available) to terms of use, restrictions, and license.
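As a rough illustration of the dc:type and dc:date rules listed above, here is a minimal Java sketch. The class and method names are hypothetical and are not taken from Dataverse's exporter code; the sketch only restates the old and new rules as plain functions and leaves dc:rights out.

```java
import java.util.Optional;

/** Hypothetical illustration of the remapping rules; not Dataverse's actual exporter code. */
class OaiDcRemapSketch {

    /** Old rule: dc:type comes from "Kind of Data" and is absent when that field is empty. */
    static Optional<String> oldDcType(String kindOfData) {
        return Optional.ofNullable(kindOfData).filter(s -> !s.isBlank());
    }

    /** New rule: dc:type is always the literal word "Dataset". */
    static String newDcType() {
        return "Dataset";
    }

    /** Old rule: prefer "Production Date", otherwise fall back to "Publication Date". */
    static String oldDcDate(String productionDate, String publicationDate) {
        boolean hasProductionDate = productionDate != null && !productionDate.isBlank();
        return hasProductionDate ? productionDate : publicationDate;
    }

    /** New rule: only "Publication Date" is used; "Production Date" is ignored. */
    static String newDcDate(String productionDate, String publicationDate) {
        return publicationDate;
    }

    public static void main(String[] args) {
        System.out.println(oldDcType("survey").orElse("(absent)"));
        System.out.println(newDcType());
        System.out.println(oldDcDate("1999-01-01", "2024-09-10"));
        System.out.println(newDcDate("1999-01-01", "2024-09-10"));
    }
}
```

The dates passed in `main` are made-up inputs chosen only to show that the two dc:date rules can disagree when a production date is present.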
@pdurbin pdurbin added Feature: Harvesting Size: 10 A percentage of a sprint. 7 hours. GREI 3 Search and Browse FY25 Sprint 3 FY25 Sprint 3 labels Aug 1, 2024
@pdurbin pdurbin requested a review from landreev August 1, 2024 18:02
@pdurbin
Member Author

pdurbin commented Aug 1, 2024

@jggautier heads up that this relates to this issue in that we are now adding "dc:rights" to the oai_dc export/harvesting format:

@pdurbin pdurbin requested a review from tcoupin August 1, 2024 18:07
@pdurbin
Member Author

pdurbin commented Aug 1, 2024

@tcoupin I'm requesting a review from you because you modified the dc:date logic in the following pull request and I changed it (as explained above):

@qqmyers
Member

qqmyers commented Aug 1, 2024

Re: dc:date - should it be mapped to the same field as https://guides.dataverse.org/en/latest/api/native-api.html#set-citation-date-field-type-for-a-dataset ? That is publicationDate by default.

@coveralls

coveralls commented Aug 1, 2024

Coverage Status

coverage: 20.735%. remained the same
when pulling 01e266c on 8129-harvesting
into 4143031 on develop.

@pdurbin
Member Author

pdurbin commented Aug 1, 2024

@qqmyers well, Publication Date is what @philippconzett asked for in the issue (#8129).

@qqmyers
Member

qqmyers commented Aug 1, 2024

@philippconzett's notes also point out that this date is potentially going to be interpreted as the citation date. Since we allow configuring that in the local installation, it seems like it could be confusing to hardcode it for harvesting. If the harvester used the field from that setting, citations would be consistent in the local display and harvesting sites, and it would default to publicationDate as requested in the issue.

@pdurbin
Member Author

pdurbin commented Aug 1, 2024

I don't have a strong opinion about it.

@philippconzett
Contributor

I think @qqmyers's suggestion for dc:date makes sense.

@plecor
Contributor

plecor commented Aug 2, 2024

I've taken @tcoupin's role on Dataverse issues, so I am looking at this for him.

Part of the context for the change he implemented (mapping dc:date to Publication Date if Production Date is empty) was that when Dataverse harvests another OAI-PMH repo, dc:date is mapped to productionDate and this production date is then used in the citation of the harvested dataset. #8733 and #8732 were both part of an effort to guarantee the coherence between citation dates when harvesting another Dataverse.

So I agree with @qqmyers's suggestion on dc:date.

Hardcoding dc:type to Dataset would certainly simplify things. In practice, I know some Dataverse instances allow for kindOfData values that are not synonyms of 'Dataset'. For instance: https://entrepot.recherche.data.gouv.fr/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=doi:10.57745/DZIM2L

There might be an alternative solution where there is always at least one dc:type tag with the value Dataset, followed by the list of kindOfData values (making sure that Dataset occurs only once)? That would, however, mean that not all dc:type values come from a controlled vocabulary.
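The alternative described above could look roughly like the following Java sketch. The class and method names are hypothetical, and treating "dataset" and "Dataset" as the same value (case-insensitive comparison) is an assumption that the comment does not spell out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: always emit "Dataset" first, then any other kindOfData values. */
class DcTypeValuesSketch {

    static List<String> dcTypeValues(List<String> kindOfDataValues) {
        // Lower-cased values already emitted, so each dc:type value appears only once.
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();

        // "Dataset" is always present, whatever was entered in "Kind of Data".
        result.add("Dataset");
        seen.add("dataset");

        for (String value : kindOfDataValues) {
            if (value == null || value.isBlank()) {
                continue; // skip empty entries
            }
            String trimmed = value.trim();
            // Assumption: comparison ignores case, so "dataset" is not repeated.
            if (seen.add(trimmed.toLowerCase(Locale.ROOT))) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(dcTypeValues(List.of("survey", "dataset")));
    }
}
```

Each element of the returned list would become one dc:type element in the export, which is where the concern about values outside a controlled vocabulary comes from.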

@pdurbin pdurbin self-assigned this Aug 2, 2024

@pdurbin
Member Author

pdurbin commented Aug 2, 2024

@plecor thanks. One thing to consider with "dc:type" is that types other than datasets (like software and workflows) are coming...

... so maybe we can revisit "dc:type" once that pull request is merged.

To all, I pushed some tests to exercise export and setting the citation date.

Now I'm trying to see if there's a small change I can make to DublinCoreExportUtil to get the citation date out.

I can get just the year (YYYY) with code like this...

// Requires java.util.regex.Matcher and java.util.regex.Pattern.
String citation = version.getCitation();
// We're looking for ", YYYY, " in a citation like this:
// Finch, Fiona, 1999, "Darwin's Finches", https://doi.org/10.5072/FK2/WSSYBE, Root, V1
Pattern pattern = Pattern.compile(", (\\d{4}), ");
Matcher matcher = pattern.matcher(citation);
// Only write dc:date when a year is found; group(1) throws if find() did not match.
if (matcher.find()) {
    String yearInCitation = matcher.group(1);
    writeFullElement(xmlw, dcFlavor + ":" + "date", yearInCitation);
}

... but I need the full YYYY-MM-DD version to put in the oai_dc output. 🤔 If you have any ideas for me, please let me know.

@pdurbin
Member Author

pdurbin commented Aug 2, 2024

I dug a little more and our citation code is focused on returning just a four-digit year for the date. This would be a change from what we do now (YYYY-MM-DD).

@philippconzett @plecor @qqmyers what do you think? Should we change oai_dc to YYYY for dc:date?

The spec Philipp found seems to say it's ok. Check out the year 1650 as an example at https://www.base-search.net/about/en/faq_oai.php#dc-date

Screenshot 2024-08-02 at 4 22 32 PM

@pdurbin pdurbin removed their assignment Aug 2, 2024
@qqmyers
Member

qqmyers commented Aug 2, 2024

Since the citationDateFieldType is part of the Dataset, I'd think at some point it could/should be part of the DatasetDTO and JSON export, thereby being available to other exporters (will the SPA or other client need this info (in the JSON returned from the dataset api) at some point?).

If that's too much for now, I think the idea of parsing it from the citation as YYYY makes sense, assuming that's sufficient for how people want to use that field. Alternately, I think you could 'go around' the exporter SPI interface and get the full value directly pretty easily as well, e.g. with something like:

        DatasetFieldType citationDataType = jakarta.enterprise.inject.spi.CDI.current().select(DatasetServiceBean.class).get().findByGlobalId(globalId.asString()).getCitationDateDatasetFieldType();
        if(citationDataType!= null) {
            date = dto2Primitive(version, citationDataType.getName());
        } else {
            date = datasetDto.getPublicationDate();
        }

This would not be the only current exporter doing that (e.g. the DDI exporter grabs the ExportInstallationAsDistributorOnlyWhenNotSet Setting it needs).

@pdurbin
Member Author

pdurbin commented Aug 5, 2024

> Alternately, I think you could 'go around' the exporter SPI interface and get the full value directly pretty easily as well, e.g. with something like...

@qqmyers I gave this a try but citationDataType.getName() yields a four digit year (YYYY) so it's no better than getting (just) the year from the citation with the regex I showed above.

@qqmyers
Member

qqmyers commented Aug 5, 2024

Are you thinking of

? citationDataType.getName() should get the name of the field type and then dto2Primitive should get the full value from the field itself in the metadatablock, which I think should always be the full date (unless there's a field that is YYYY only?).

@jggautier
Contributor

Ah, @philippconzett, just to clarify, in my screenshot where I pointed out <dc:rights>Restrictions</dc:rights> in green, "Restrictions" is what I entered in the dataset's "Restrictions" field, which is in that "Dataset Terms" collapsible panel.

Here's a screenshot of that panel and that Restrictions field noted in green:
Screenshot 2024-08-28 at 9 11 59 AM

@philippconzett
Contributor

@jggautier Thanks for clarifying. Is there a way to figure out what information we deliver to harvesters like BASE Bielefeld? See the initial post in #8129.

@jggautier
Contributor

@philippconzett, do you mean if there's a way to figure out what we currently deliver to harvesters like BASE Bielefeld or what we should deliver? Sorry if that sounds like a dumb question lol. I've been focused a lot on how decisions are made as well as what decisions are made, which maybe led me to overthink your question!

If it's currently, you wrote in #8129 about what we currently deliver when it comes to rights metadata, which is that we don't deliver anything since Dataverse doesn't provide the field dc:rights. I can say that nothing's changed about this since you opened #8129 a few years ago. So when I asked if it's appropriate that this PR being merged will close #8129, and that there's discussion in that GitHub issue that isn't addressed, I was mostly thinking about your comments related to dc:rights and degrees of open access.

For what we should start delivering to harvesters like BASE Bielefeld, you mentioned the guidance at https://www.base-search.net/about/en/faq_oai.php#dc-rights, which recommends the two vocabularies you pointed to earlier today: the info-eu-repo-Access-Rights vocabulary and the COAR-Access-Rights vocabulary.

I think it'll be helpful to consider what I wrote in #5920, where I wrote about what we've learned and challenges about how the info-eu-repo-Access-Rights vocabulary is being included in the OpenAIRE exports that Dataverse creates.

This might be a matter of scoping and timing, too, right? We could create a new GitHub issue specifically about the use of the info-eu-repo-Access-Rights and COAR-Access-Rights vocabularies, that mentions what's discussed in #8129. So when #8129 is closed because this PR is merged, the unaddressed goals you mentioned in #8129 aren't lost and there's a place where the community can focus on how to address those goals.

And if we can learn how @pdurbin and others made the decisions in this PR about what goes into dc:rights, it'll be easier to think about how effective those decisions are.

@philippconzett
Contributor

Hi @jggautier! Thanks for disentangling this! I wasn't aware of #5920, but have read up about it now and added some comments there. I don't really know what this means for #8129. Maybe a temporary solution to make dataset metadata from Dataverse more visible in OpenAIRE could be a kind of "inverted" and slightly adapted version of what you described in #5920:

  • openAccess: If any files are set to non-restricted, the metadata export uses "openAccess".
  • restrictedAccess: If all of the files in the dataset are set to restricted and the option to request access is enabled (people are allowed to request access using Dataverse's request access feature), the metadata export uses "restrictedAccess".
  • closedAccess: If all of the files in the dataset are set to restricted and the option to request access is disabled, the metadata export uses "closedAccess".
  • embargoedAccess: If all of the files in the dataset are set to embargoed, the metadata export uses "embargoedAccess".
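
The four rules above could be sketched in Java as follows. This is only an illustration under stated assumptions: the `FileState` enum, the class name, and the method name are hypothetical, each file is assumed to be in exactly one of three states, and cases the list does not cover (no files, or a mix of restricted and embargoed files) return an empty result rather than a guess.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of the proposed access-rights rules; not Dataverse code. */
class AccessRightsSketch {

    /** Assumption: every file is in exactly one of these states. */
    enum FileState { OPEN, RESTRICTED, EMBARGOED }

    static Optional<String> accessRights(List<FileState> files, boolean requestAccessEnabled) {
        if (files.isEmpty()) {
            return Optional.empty(); // not covered by the rules above
        }
        // Any non-restricted file makes the dataset "openAccess".
        if (files.contains(FileState.OPEN)) {
            return Optional.of("openAccess");
        }
        // All files embargoed.
        if (files.stream().allMatch(f -> f == FileState.EMBARGOED)) {
            return Optional.of("embargoedAccess");
        }
        // All files restricted: the request-access option decides the value.
        if (files.stream().allMatch(f -> f == FileState.RESTRICTED)) {
            return Optional.of(requestAccessEnabled ? "restrictedAccess" : "closedAccess");
        }
        // Mixed restricted and embargoed files: not covered by the rules above.
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<FileState> files = List.of(FileState.RESTRICTED, FileState.RESTRICTED);
        System.out.println(accessRights(files, true).orElse("(undefined)"));
    }
}
```

Returning `Optional.empty()` for the uncovered cases keeps the gap in the rules visible instead of hiding it behind a default.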

@pdurbin
Member Author

pdurbin commented Sep 4, 2024

> could you write about how these mapping decisions were made?

@jggautier back in 59850ce these lines were added as part of PR #3308:

writeFullElement(xmlw, "dcterms:license", version.getLicense());
writeFullElement(xmlw, "dcterms:rights", version.getTermsOfUse());
writeFullElement(xmlw, "dcterms:rights", version.getRestrictions());

This was for Dataverse 4.5 ( https://github.com/IQSS/dataverse/releases/tag/v4.5 ) when harvesting was first introduced (in 4.x). It looks like the code was mostly worked on by @landreev and @sekmiller. In short, the logic has been here for 8 years and I can't find any comment on why we do it this way. @jggautier what you're showing in that screenshot is a reflection of the same, unchanged logic.

The code above is for the "dcterms" flavor of Dublin Core. In this current PR, I copied the logic above for the "dc" flavor. I hope this helps!

@jggautier
Contributor

Ah, thanks @pdurbin! I updated the Dataverse crosswalk to reflect this, specifically the part of the crosswalk showing that what's entered in the Restrictions field is included in the DC Terms export, as dcterms:rights like you wrote. The crosswalk used to indicate this, and for some reason I don't remember now, in 2022 I edited it to read that it was "(Not mapped)".

To learn more about why the predefined license, Terms of Use, and Restrictions are included in the DC Terms export, I tried to find the Functional Requirements Document mentioned in that PR you linked to. The Functional Requirements Document folder in our Google Drive has some info, but I haven't seen any FRDs so far that go into enough detail.

I think we should:

@pdurbin pdurbin changed the title Remap oai_dc fields dc:type, dc:date, and dc:rights Remap oai_dc fields dc:type and dc:date Sep 9, 2024
@pdurbin
Member Author

pdurbin commented Sep 9, 2024

As discussed in Slack and elsewhere:

@landreev I'm going to unassign myself but please let me know if you'd like me to jump back on this branch and do any additional coding or testing!

@pdurbin pdurbin removed their assignment Sep 9, 2024

github-actions bot commented Sep 9, 2024

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:8129-harvesting
ghcr.io/gdcc/configbaker:8129-harvesting

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@landreev
Contributor

landreev commented Sep 10, 2024

@pdurbin
Just for clarity, could you please attach an actual example - 2 exported oai_dc fragments, before and after, to illustrate the final result of the changes made in the PR. (For the benefit of somebody reading the PR in the future; the discussion above is quite extensive and potentially confusing)

@pdurbin
Member Author

pdurbin commented Sep 10, 2024

@landreev sure, I added two XML examples, before and after, to the description of this PR.

Contributor

@landreev landreev left a comment

I have confirmed that the change does not affect harvesting. In normal practice, this should not even be a concern, as two Dataverses should never use the oai_dc as the format for harvesting from each other. But it's not entirely impossible that it will be a practical use case for somebody. Plus it would simply feel wrong, for Dataverse not to be able to import its own metadata exports. So, happy to report that 2 Dataverses can still harvest from each other using the format.

@landreev landreev removed their assignment Sep 10, 2024
@stevenwinship stevenwinship self-assigned this Sep 11, 2024
@stevenwinship stevenwinship merged commit 4b96cec into develop Sep 11, 2024
23 checks passed
@stevenwinship stevenwinship removed their assignment Sep 11, 2024
@pdurbin pdurbin added this to the 6.4 milestone Sep 11, 2024
@stevenwinship stevenwinship deleted the 8129-harvesting branch September 17, 2024 14:52
Labels
Feature: Harvesting FY25 Sprint 3 FY25 Sprint 3 FY25 Sprint 4 FY25 Sprint 4 FY25 Sprint 5 FY25 sprint 5 GREI 3 Search and Browse Size: 10 A percentage of a sprint. 7 hours.
Projects
Status: Done 🧹
Development

Successfully merging this pull request may close these issues.

Change Dataverse / Dublin Core mapping to improve OAI-PMH harvesting
9 participants