Collection: Keep track of list of issues that we want to address as part of 1.4.1 #25

Closed
4 of 20 tasks
mreekie opened this issue May 9, 2022 · 4 comments
Labels
pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues Project: NIH GREI Tasks related to the NIH GREI project

Comments

Collaborator

mreekie commented May 9, 2022

The first step was to figure out what has already been done by the Dataverse team and by the community towards this aim, and what still remains to be done. Leonid's conclusions at the end of Issue 8574 do that.

This Epic tracks the work as proposed by Leonid:

I do believe that the third item under the "definition of done" - "prioritize" - was the actual important part of this spike. I also believe that most of that effort of prioritizing what's important can only be done within the dev. team. I can't think of how anyone outside of it could be more qualified to make these calls. So I'm going to make such an attempt.
(Note that I'm interpreting the word "prioritizing" as assigning some order of importance to these issues and bugs, what makes sense to fix first and/or what's ready to be worked on vs. what needs more discussion; not as scheduling them for specific sprints, etc.!)

The single most important harvesting issue: (ok, maybe not the most important - but seriously, this should be the first step of any meaningful cleanup of our harvesting implementation; should be fairly easy to wrap up too)

The following issues are important in that fixing them will make harvesting more reliable and robust overall (for example, in the current implementation a single missing metadata export that's supposed to be cached is going to break the entire harvesting run). All of the issues on the list below are defined clearly enough that they are ready to be worked on and fixed, without needing to conduct any extra research first. Some of them may be VERY OLD; but they look like something we should fix.
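The failure mode mentioned above (one missing cached export breaking the entire run) can be illustrated with a minimal sketch. This is not Dataverse's harvesting code: `harvest_run`, `fetch_export`, and the identifiers are hypothetical names chosen for illustration. The only point is that each record is handled on its own and failures are collected instead of propagated.

```python
from typing import Callable, Dict, Iterable, Tuple


def harvest_run(
    identifiers: Iterable[str],
    fetch_export: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Fetch each record on its own so one bad record cannot abort the run.

    Both arguments are hypothetical stand-ins: `fetch_export` represents
    whatever returns the cached metadata export for an identifier.
    """
    harvested: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for identifier in identifiers:
        try:
            harvested[identifier] = fetch_export(identifier)
        except Exception as exc:  # deliberately broad: record the failure, move on
            failed[identifier] = repr(exc)
    return harvested, failed


# Illustrative data only: the export for ".../B" is missing from the cache.
cache = {
    "doi:10.5072/FK2/A": "<record>A</record>",
    "doi:10.5072/FK2/C": "<record>C</record>",
}
harvested, failed = harvest_run(
    ["doi:10.5072/FK2/A", "doi:10.5072/FK2/B", "doi:10.5072/FK2/C"],
    lambda identifier: cache[identifier],
)
print(sorted(harvested))  # the two records that were available
print(sorted(failed))     # the one record that was skipped
```

Collecting the failures in a separate mapping keeps them available for a retry or a report after the run finishes.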

The following 3 issues are basically the same thing - people requesting extra ISO language codes to be added as legitimate controlled vocabulary values (this is just a matter of adding extra values to citation.tsv); these are NOT duplicates, since different things are being requested in each of the issues below, but it makes sense to get all 3 out of the way at the same time:

The following issues are about the DDI exporter producing XML that is not valid under the schema.

Similarly, the following issues are requests for changes in how we export DC; I believe these need to be reviewed/discussed, perhaps together?

The following issues are proposed changes to the design of the harvesting framework and/or metadata exports. Meaning this is something we probably need to discuss as a team, before we decide that these are good ideas and proceed to implement them. But IMO they are (I opened all of them 😄):

There is of course this issue that was opened for figuring out what needs to be added specifically for the NIH/GREI grant:

The list above is by no means complete. If an issue is not listed, it does not necessarily mean that it's not important. But the ones that are listed above should be a good subset to start with.


More Background:

This is in support of:

an NIH grant "The Harvard Dataverse repository: A generalist repository integrated with a Data Commons",
Aim 4: Improve harvesting and packaging standards to share metadata and data across repositories,

There is a lot packaged into Aim 4:

Improved harvesting via the OAI-PMH standard
Improved support for BagIt
Improved support for Signposting

The scope for this issue is harvesting via the OAI-PMH standard.

Aim 4:

Improve harvesting and packaging standards to share metadata and data across repositories

Our proposed project will significantly improve the widely-used Harvard Dataverse repository to better support NIH-funded research.

A critical measure of the GREI program’s success is to standardize the discoverability across generalist repositories.

To help with this, **we propose to improve the existing harvesting functionality in the Dataverse software based on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) standard, and coordinate with other repository packaging standards to share or move metadata and data.**

Dataverse already supports Bags as defined by the Research Data Alliance (RDA) Research Data Repository Interoperability Working Group. Here we propose to improve the support for Bags, test it for NIH-funded datasets, and explore and define the appropriate standard to use to move metadata and data across generalist repositories. This will help with sustainability and succession planning: if one repository can no longer support a specific dataset, the dataset can easily be moved to another repository without losing any information about it.

Additionally, we propose to implement Signposting in the Dataverse software. By adding HTTP Link headers throughout the application, we can more easily support automated metadata and data discovery in the repository, and allow other applications and services to represent the content in the Harvard Dataverse repository more accurately and completely.
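To make the Signposting idea concrete, here is a minimal sketch of how a Link header value could be assembled for a dataset landing page. It is not the Dataverse implementation: `format_link_header` is a hypothetical helper, and every URL below is a made-up placeholder. The relation types shown (`cite-as`, `describedby`, `item`) are among those used by Signposting.

```python
from typing import Iterable, Optional, Tuple

# (target URL, relation type, optional media type)
Link = Tuple[str, str, Optional[str]]


def format_link_header(links: Iterable[Link]) -> str:
    """Serialize link triples into a single HTTP Link header value."""
    parts = []
    for target, rel, media_type in links:
        part = f'<{target}>; rel="{rel}"'
        if media_type is not None:
            part += f'; type="{media_type}"'
        parts.append(part)
    return ", ".join(parts)


# Placeholder URLs, for illustration only.
example_links = [
    ("https://doi.org/10.5072/FK2/EXAMPLE", "cite-as", None),
    ("https://repository.example.org/export/EXAMPLE.xml", "describedby", "application/xml"),
    ("https://repository.example.org/files/1", "item", "text/csv"),
]
header_value = format_link_header(example_links)
print("Link: " + header_value)
```

A harvester or crawler that reads this header can find the persistent identifier, a metadata export, and the data files without parsing the landing page HTML.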

Related documents

Notes on Dataverse Deliverables for NIH OTA
NIH OTA Progress Notes
NIH OTA
Exposing and harvesting metadata using the OAI metadata harvesting protocol: A tutorial (2001)
Getting Started with BagIt in 2018
NIH OTA
bagit from Library of Congress video

mreekie self-assigned this on May 9, 2022
mreekie changed the title from "PM.Feature: Harvesting Work for the NIH deliverable" to "PM.Epic: Harvesting Work for the NIH deliverable" on May 9, 2022
mreekie changed the title from "PM.Epic: Harvesting Work for the NIH deliverable" to "PM.Epic: Harvesting Implementation for the NIH deliverable" on May 9, 2022
Collaborator Author

mreekie commented May 9, 2022

Not finished defining. It seems like some of the work may apply to other NIH objectives that Len has mentioned.

Next steps:

  • I left off at the start of this paragraph: "The following issues are about the DDI exporter producing XML ..." All of the issues prior to that now have at a minimum the labels: Feature: Harvesting, NIH OTA DC, pm.epic.nih_harvesting
  • Figure out where these fit. Should they be part of another epic for the NIH grant?

mreekie changed the title from "PM.Epic: Harvesting Implementation for the NIH deliverable" to "Collection: Keep track of list of issues that we want to address as part of 1.4.1" on Jan 9, 2023
Collaborator Author

mreekie commented Jan 9, 2023

Met with Leonid today.

mreekie transferred this issue from IQSS/dataverse on Mar 10, 2023
mreekie added the pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues label on Mar 31, 2023
cmbz added the Project: NIH GREI Tasks related to the NIH GREI project label on Jan 3, 2024
cmbz assigned cmbz and unassigned mreekie on Jan 3, 2024
Contributor

cmbz commented Jan 29, 2024

2024/01/29

cmbz closed this as completed on Jan 29, 2024