Change Dataverse / Dublin Core mapping to improve OAI-PMH harvesting #8129

Closed
philippconzett opened this issue Oct 3, 2021 · 8 comments · Fixed by #10737
Labels: Feature: Harvesting; FY25 Sprint 3; GREI 3 (Search and Browse); Size: 10 (a percentage of a sprint; 7 hours); Type: Suggestion (an idea); User Role: Curator (curates and reviews datasets, manages permissions)

Comments

@philippconzett
Contributor

philippconzett commented Oct 3, 2021

Note: dc:rights is being handled in #5920 and #4176 but the original description of this issue has been preserved.

Based on a semi-systematic survey of how DataverseNO metadata is harvested in Bielefeld Academic Search Engine (BASE; https://www.base-search.net/Search/Advanced), a major search engine for research outputs, we have noticed some issues related to the way the Dataverse software provides Dublin Core metadata for OAI-PMH harvesting.

dc:type
BASE harvests multiple types of research output, e.g. publications and datasets. Searching BASE you can filter/limit the search result to only include datasets by selecting Dataset in the Document Type section of advanced search:
[Screenshot: BASE advanced search, Document Type section with Dataset selected]

However, very few of the metadata records harvested directly from DataverseNO are marked as Document Type = Dataset.
It seems that in the oai_dc format, which BASE uses for harvesting, Document Type is based on the dc:type field. According to the Dataverse Metadata Crosswalk, dc:type corresponds to the Dataverse metadata field Kind of Data. But this field may contain very different values, e.g., “survey data”, “survey”, “observations” etc. Dublin Core (see https://www.dublincore.org/specifications/dublin-core/dcmi-terms/#http://purl.org/dc/terms/type) recommends “to use a controlled vocabulary such as the DCMI Type Vocabulary” for dc:type. The DCMI Type Vocabulary has “dataset” as one of its values. I therefore suggest changing the Dataverse / DC Element (oai_dc) mapping, so that dc:type is hard-coded as “dataset” for all dataset metadata in Dataverse.
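
To make the dc:type point concrete, here is a minimal sketch of how one could check the dc:type values in an oai_dc record against the DCMI Type Vocabulary. The sample record and the helper names (`dc_types`, `is_dcmi_type`, `looks_like_dataset`) are hypothetical and only for illustration; how BASE actually derives Document Type is an assumption based on the observations above.

```python
import xml.etree.ElementTree as ET

# Standard namespaces used by the oai_dc metadata format.
NS = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# The twelve terms of the DCMI Type Vocabulary.
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image", "InteractiveResource",
    "MovingImage", "PhysicalObject", "Service", "Software", "Sound",
    "StillImage", "Text",
}

# Hypothetical record, shaped like a record exported with Kind of Data in dc:type.
SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Hypothetical survey dataset</dc:title>
  <dc:type>survey data</dc:type>
</oai_dc:dc>"""


def dc_types(record_xml):
    """Return all non-empty dc:type values of an oai_dc record."""
    root = ET.fromstring(record_xml)
    return [el.text.strip() for el in root.findall("dc:type", NS) if el.text]


def is_dcmi_type(value):
    """Case-insensitive membership test against the DCMI Type Vocabulary."""
    return value.lower() in {t.lower() for t in DCMI_TYPES}


def looks_like_dataset(record_xml):
    """True if any dc:type value is the DCMI term 'Dataset' (assumed harvester rule)."""
    return any(v.lower() == "dataset" for v in dc_types(record_xml))


if __name__ == "__main__":
    print(dc_types(SAMPLE_RECORD))
    print(looks_like_dataset(SAMPLE_RECORD))
```

With a free-text Kind of Data value such as "survey data", a harvester applying a rule like this would not recognise the record as a dataset, which matches what we see in BASE.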

dc:date
The Dataverse metadata field Publication Date is available as dcterms:issued, but it doesn’t seem to be among the oai_dc fields Dataverse exposes for OAI-PMH harvesting. According to the Dataverse Metadata Crosswalk, dc:date corresponds to the Dataverse metadata field Deposit Date, but all the random samples I tested in BASE indicate that dc:date, which BASE uses as input for their metadata field Year of Publication, corresponds to the Dataverse field Date of Production. I suggest changing the Dataverse / DC Element (oai_dc) mapping, so that dc:date is mapped with Publication Date. This is also in line with citation recommendations. The publication date is the preferred date when citing research data; see, e.g., page 12 in The Tromsø Recommendations for Citation of Research Data in Linguistics; https://doi.org/10.15497/rda00040.

dc:rights
For some of the sources included in BASE, there is an indication of the degree of Open Access. Among them are some Dataverse-based repositories. For DataverseNO and other Dataverse-based repositories, however, this information is not available / unknown (“unbekannt”):
[Screenshot: BASE listing showing the Open Access status as unknown]

The Open Access information in BASE is based on the Dublin Core field dc:rights. Dataverse does not provide the field dc:rights. A correct value in this field would enable BASE to indicate the degree of Open Access (see more information at https://www.base-search.net/about/en/faq_oai.php#dc-rights). For datasets without access restriction, the dc:rights field could look like this: info:eu-repo/semantics/openAccess (see more information at https://guidelines.openaire.eu/en/latest/data/field_rights.html#rightsuri-ma).
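
As an illustration of why this field matters, here is a minimal sketch of how a harvester could derive an access level from dc:rights values. The function name `access_level` and the fallback value "unknown" are hypothetical; the four access terms are the `info:eu-repo/semantics/` access-rights terms used in the OpenAIRE guidelines. This is not BASE's actual logic, only an assumption about how such a lookup might work.

```python
ACCESS_PREFIX = "info:eu-repo/semantics/"

# Access-rights terms from the info:eu-repo vocabulary (as used by OpenAIRE).
KNOWN_ACCESS_TERMS = {
    "openAccess",
    "embargoedAccess",
    "restrictedAccess",
    "closedAccess",
}


def access_level(dc_rights_values):
    """Return the first recognised access term, or 'unknown' if none is present.

    dc_rights_values: list of strings taken from the dc:rights elements
    of one harvested record (may be empty if the field is not exposed).
    """
    for value in dc_rights_values:
        value = value.strip()
        if value.startswith(ACCESS_PREFIX):
            term = value[len(ACCESS_PREFIX):]
            if term in KNOWN_ACCESS_TERMS:
                return term
    return "unknown"


if __name__ == "__main__":
    # A record without dc:rights cannot be classified.
    print(access_level([]))
    # A record exposing the suggested value can.
    print(access_level(["info:eu-repo/semantics/openAccess"]))
```

A record that exposes no dc:rights at all ends up in the "unknown" bucket under a rule like this, which is consistent with the “unbekannt” status shown for DataverseNO.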

@poikilotherm
Contributor

poikilotherm commented May 14, 2022

@pdurbin
Member

pdurbin commented Oct 16, 2022

> I suggest changing the Dataverse / DC Element (oai_dc) mapping, so that dc:date is mapped with Publication Date.

I believe that @tcoupin fixed this in the following pull request, which we just merged and which will be available in the next version of Dataverse (5.13 as of this writing):

By the way, thank you @philippconzett for the extensive write-up! It's a lot to go through. Very thorough. 😄

@mreekie mreekie added NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... and removed NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... labels Oct 25, 2022
@mreekie mreekie removed the NIH OTA: 1.4.1 4 | 1.4.1 | Resolve OAI-PMH harvesting issues | 5 prdOwnThis is an item synched from the product ... label Nov 2, 2022
@pdurbin pdurbin added Type: Suggestion an idea User Role: Curator Curates and reviews datasets, manages permissions labels Oct 9, 2023
@jggautier
Contributor

jggautier commented Mar 28, 2024

@cmbz

cmbz commented May 8, 2024

2024/05/08

  • Identify which elements of the request have already been addressed by open or closed issues
  • Split remaining elements into their own issues
  • Then close this issue

@cmbz cmbz added the Size: 10 A percentage of a sprint. 7 hours. label May 8, 2024
@DS-INRAE
Member

@pdurbin pdurbin added Champion: pdurbin Championed by @pdurbin for inclusion in the next release and removed Champion: pdurbin Championed by @pdurbin for inclusion in the next release labels Jul 19, 2024
@pdurbin pdurbin self-assigned this Jul 31, 2024
@cmbz cmbz added FY25 Sprint 3 FY25 Sprint 3 GREI 3 Search and Browse labels Aug 1, 2024
pdurbin added a commit that referenced this issue Aug 1, 2024
The `oai_dc` export and harvesting format has had the following fields remapped:

- dc:type was mapped to the field "Kind of Data". Now it is hard-coded to the word "Dataset".
- dc:date was mapped to the field "Production Date" when available and otherwise to "Publication Date". Now it is mapped only to the field "Publication Date".
- dc:rights was not mapped to anything. Now it is mapped (when available) to terms of use, restrictions, and license.
@pdurbin
Member

pdurbin commented Aug 1, 2024

@philippconzett (and any others watching this issue), I created a pull request to address the points you made above:

Please take a look and feel free to leave comments or a review on the pull request. Thanks.

@pdurbin pdurbin removed their assignment Aug 1, 2024
@philippconzett
Contributor Author

@pdurbin Thanks! I just left a comment on the PR.

stevenwinship pushed a commit that referenced this issue Sep 11, 2024
* Remap oai_dc fields dc:type, dc:date, and dc:rights #8129.

The `oai_dc` export and harvesting format has had the following fields remapped:

- dc:type was mapped to the field "Kind of Data". Now it is hard-coded to the word "Dataset".
- dc:date was mapped to the field "Production Date" when available and otherwise to "Publication Date". Now it is mapped only to the field "Publication Date".
- dc:rights was not mapped to anything. Now it is mapped (when available) to terms of use, restrictions, and license.

* add tests for export and citation date #8129

* map dc:date to pub date or field for citation date  #8129

* back out of any changes to dc:rights #8129

* remove OAI-PMH changes from API changelog (also in release note) #8129

* tweak release note, mention backward incompatibility, reexport #8129
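
The remapping described in this commit message can be sketched roughly as follows. This is an illustrative model, not the Dataverse exporter code: the `Dataset` class, its attribute names, and the assumption that a configured citation-date field takes precedence over the publication date are all hypothetical. dc:rights is left out because those changes were backed out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Dataset:
    """Hypothetical stand-in for the dataset metadata relevant to oai_dc."""
    publication_date: str
    production_date: Optional[str] = None
    kind_of_data: Optional[str] = None
    # Name of the field chosen as citation date, if any (assumed mechanism).
    citation_date_field: Optional[str] = None
    fields: Dict[str, str] = field(default_factory=dict)


def old_mapping(ds):
    """Previous behaviour: Kind of Data, and Production Date when available."""
    return {
        "dc:type": ds.kind_of_data,
        "dc:date": ds.production_date or ds.publication_date,
    }


def new_mapping(ds):
    """New behaviour: fixed type, publication date or the citation-date field."""
    date = ds.publication_date
    if ds.citation_date_field and ds.citation_date_field in ds.fields:
        # Assumption: a configured citation-date field overrides the default.
        date = ds.fields[ds.citation_date_field]
    return {"dc:type": "Dataset", "dc:date": date}


if __name__ == "__main__":
    ds = Dataset(
        publication_date="2021-10-03",
        production_date="2019-05-01",
        kind_of_data="survey data",
    )
    print(old_mapping(ds))
    print(new_mapping(ds))
```

Because the exported values change for existing datasets, harvesters only see the new dc:type and dc:date after the metadata has been re-exported, which is why the release note mentions backward incompatibility and reexport.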
@pdurbin pdurbin added this to the 6.4 milestone Sep 11, 2024
@pdurbin
Member

pdurbin commented Sep 11, 2024

This issue was just closed because we merged the following pull request:

As explained above, changes to dc:rights were not included in the scope of the pull request. Please look instead to these issues:

Projects
Status: Interested
Status: Done
7 participants