Spike: Inventory and prioritize all existing Harvesting related issues #24

Closed
3 tasks
mreekie opened this issue Apr 4, 2022 · 19 comments
Assignees
Labels
pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues pm.GREI-d-1.4.2 NIH, yr1, aim4, task2: Create working group on packaging standards

Comments

@mreekie
Collaborator

mreekie commented Apr 4, 2022

This is in support of:

The first step is to figure out what has already been done by the dataverse team and by the community towards this aim and what still remains to be done.

For example:

And then to prioritize which issues are to be fixed.

Def of done

As completely as is reasonably possible in a 2 week period (sprint):

  • Search out previous related issues that are problems with the current implementation. Take an inventory.
  • Search out previous work done within the dataverse community as well.
  • Prioritize which of the issues/PRs should be moved forward.

We need to keep in mind that harvesting from a particular source requires that source to be bug free. Identify which sources have which bugs so that the bugs for a particular source can be targeted. ICPSR is one example; Zenodo is another.

More information:

There is a lot packaged into Aim 4

  1. Improved Harvesting via the OAI-PMH standard
  2. Improved support for Bagit
  3. Improved support for Signposting

The scope for this issue is Harvesting via the OAI-PMH standard

Aim 4:

Improve harvesting and packaging standards to share metadata and data across repositories

Our proposed project will significantly improve the widely-used Harvard Dataverse repository to better support NIH-funded research.

A critical measure of the GREI program’s success is to standardize the discoverability across generalist repositories.

To help with this, **we propose to improve the existing harvesting functionality in the Dataverse software based on the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) standard, and coordinate with other repository packaging standards to share or move metadata and data.**

Dataverse already supports Bags as defined by the Research Data Alliance (RDA) Research Data Repository Interoperability Working Group. Here we propose to improve the support for Bags, test it for NIH-funded datasets, and explore and define the appropriate standard for moving metadata and data across generalist repositories. This will help with sustainability and succession planning: if one repository can no longer support a specific dataset, the dataset can easily be moved to another repository without losing any information about it.

Additionally, we propose to implement Signposting in the Dataverse software. By adding HTTP Link headers throughout the application, we can more easily support automated metadata and data discovery in the repository, and allow other applications and services to more accurately and completely represent the content in the Harvard Dataverse repository.
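As a rough illustration of what a Signposting-style `Link` header could look like, here is a minimal Python sketch. The relation types `cite-as`, `describedby`, and `item` come from the Signposting conventions; the helper name `build_link_header` and all URLs are hypothetical placeholders, not Dataverse endpoints or code.

```python
def build_link_header(links):
    """Format (target, rel, media_type) tuples as one HTTP Link header value."""
    parts = []
    for target, rel, media_type in links:
        part = f'<{target}>; rel="{rel}"'
        if media_type:
            # The type attribute is optional and hints at the target's media type.
            part += f'; type="{media_type}"'
        parts.append(part)
    return ", ".join(parts)


# Hypothetical links for a dataset landing page; every URL is a placeholder.
landing_page_links = [
    ("https://doi.org/10.5072/FK2/EXAMPLE", "cite-as", None),
    ("https://repo.example.org/metadata/EXAMPLE.json", "describedby", "application/ld+json"),
    ("https://repo.example.org/files/1", "item", "text/csv"),
]

header_value = build_link_header(landing_page_links)
print(f"Link: {header_value}")
```

A crawler that reads this header can find the persistent identifier, a metadata record, and the data files without scraping the HTML page.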

Related documents

@pdurbin
Member

pdurbin commented Apr 13, 2022

The definition of done for this issue includes three items:

  • Search out previous related issues that are problems with the current implementation. Take an inventory.
  • Search out previous work done within the Dataverse community as well.
  • Prioritize which of the issues/PRs should be moved forward.

I'm going to provide lists below for the first two items (open issues and merged pull requests). Please note that PR IQSS/dataverse#7053 is an open pull request that addresses a harvesting issue. I'm not sure which list it belongs in but it closes IQSS/dataverse#7502 which is in the open issues list.

The third item is about prioritizing what to work on. I'm leaving this for the prioritization group (@lenwiz, @scolapasta, @TaniaSchlatter, @sbarbosadataverse, and @mreekie). At standup today Gustavo did say that developers like me are welcome to give opinions on what we think should be the priority. From a quick look I'd say we should work on IQSS/dataverse#5840 and IQSS/dataverse#8267 because they are both small documentation updates. Otherwise, it's hard for me to judge the priority. I'm always for fixing bugs and there are plenty of small bugs.

I did reach out to the community to encourage them to open new issues about harvesting if they know of any:

Here are the lists.

Search out previous related issues that are problems with the current implementation. Take an inventory.

Here's the oneliner I used to get the following list of open issues from the GitHub Search API:

curl -H 'Accept: application/vnd.github.v3.text-match+json' 'https://api.github.com/search/issues?q=is:issue%20is:open%20label:%22Feature:%20Harvesting%22%20repo:IQSS/dataverse&per_page=100' | jq '.items[] | "- #\(.number)"' -r

Search out previous work done within the Dataverse community as well.

Here's the oneliner I used to get the following list of merged pull requests from the GitHub Search API:

curl -H 'Accept: application/vnd.github.v3.text-match+json' 'https://api.github.com/search/issues?q=is:pr%20is:merged%20label:%22Feature:%20Harvesting%22%20repo:IQSS/dataverse&per_page=100' | jq '.items[] | "- #\(.number)"' -r
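For anyone who would rather script this than use curl and jq, the same two queries can be sketched in Python. This is only a sketch: the helper names `build_search_url` and `format_items` are made up for illustration, and the actual HTTP request is left out so the snippet runs offline. The endpoint, the search qualifiers, and `per_page=100` are taken from the oneliners above.

```python
from urllib.parse import quote, urlencode

SEARCH_ENDPOINT = "https://api.github.com/search/issues"


def build_search_url(qualifiers, per_page=100):
    """Join search qualifiers with spaces and percent-encode the query string."""
    query = " ".join(qualifiers)
    params = urlencode({"q": query, "per_page": per_page}, quote_via=quote)
    return f"{SEARCH_ENDPOINT}?{params}"


def format_items(response):
    """Mimic the jq filter: one '- #<number>' line per search result."""
    return [f"- #{item['number']}" for item in response.get("items", [])]


LABEL = 'label:"Feature: Harvesting"'
REPO = "repo:IQSS/dataverse"

open_issues_url = build_search_url(["is:issue", "is:open", LABEL, REPO])
merged_prs_url = build_search_url(["is:pr", "is:merged", LABEL, REPO])

# A hand-written stand-in for the JSON body the API would return.
sample_response = {"items": [{"number": 5840}, {"number": 8267}]}
print("\n".join(format_items(sample_response)))
```

Since a single page holds at most 100 results, a longer list would need additional requests with the `page` parameter.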

@pdurbin pdurbin removed their assignment Apr 13, 2022
@landreev

landreev commented Apr 20, 2022

(the lists below are work in progress, I'm actively working on them!)

I do believe that the third item under the "definition of done" - "prioritize" - was the actual important part of this spike. I also believe that most of that effort of prioritizing what's important can only be done within the dev. team. I can't think of how anyone outside of it could be more qualified to make these calls. So I'm going to make such an attempt.
(Note that I'm interpreting the word "prioritizing" as assigning some order of importance to these issues and bugs, what makes sense to fix first and/or what's ready to be worked on vs. what needs more discussion; not as scheduling them for specific sprints, etc.!)

The single most important harvesting issue: (ok, maybe not the most important - but seriously, this should be the first step of any meaningful cleanup of our harvesting implementation; should be fairly easy to wrap up too)

The following issues are important in that fixing them will make harvesting more reliable and robust overall (for example, in the current implementation a single missing metadata export that's supposed to be cached is going to break the entire harvesting run). All of the issues on the list below are defined clearly enough that they are ready to be worked on and fixed, without needing to conduct any extra research first. Some of them may be VERY OLD; but they look like something we should fix.
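To make the robustness point concrete, here is a minimal Python sketch of the general idea: a record whose cached export is missing gets skipped and reported instead of aborting the whole run. It is not the Dataverse implementation (which is Java); `MissingExportError`, `HarvestReport`, `serve_records`, and the cache lookup are all hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


class MissingExportError(Exception):
    """Raised by the (hypothetical) lookup when no cached export exists."""


@dataclass
class HarvestReport:
    served: list = field(default_factory=list)
    skipped: list = field(default_factory=list)


def serve_records(dataset_ids, load_cached_export):
    """Serve every record that has a cached export; record the rest as skipped."""
    report = HarvestReport()
    for dataset_id in dataset_ids:
        try:
            load_cached_export(dataset_id)
        except MissingExportError:
            # One bad record should not take down the whole harvesting run.
            report.skipped.append(dataset_id)
            continue
        report.served.append(dataset_id)
    return report


# Stand-in cache: the export for the "BBB" dataset is missing.
cache = {"doi:10.5072/FK2/AAA": "<record/>", "doi:10.5072/FK2/CCC": "<record/>"}


def load_from_cache(dataset_id):
    try:
        return cache[dataset_id]
    except KeyError:
        raise MissingExportError(dataset_id) from None


report = serve_records(
    ["doi:10.5072/FK2/AAA", "doi:10.5072/FK2/BBB", "doi:10.5072/FK2/CCC"],
    load_from_cache,
)
print("served:", report.served)
print("skipped:", report.skipped)
```

The skipped list gives an admin something to act on (for example, re-running the export for those datasets) without the harvesting client seeing a failed run.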

The following 3 issues are basically the same thing: people requesting extra ISO language codes to be added as legitimate controlled vocab. values (this is just a matter of adding extra values to citation.tsv). These are NOT duplicates; different things are being requested in each of the issues below, but it makes sense to get all 3 out of the way at the same time:

The following issues are about the DDI exporter producing XML that is not valid under the schema. I would consider creating a wider-scope umbrella issue, something like "Make sure our DDI is valid against the schema" (and maybe add a real-time validation step to the export?).
The first one is ready to be worked on, I believe:

Similarly, the following issues are requests for changes in how we export DC; I believe these need to be reviewed/discussed, perhaps together?

The following issues are proposed changes to the design of the harvesting framework and/or metadata exports. Meaning this is something we probably need to discuss as a team, before we decide that these are good ideas and proceed to implement them. But IMO they are (I opened all of them 😄):

There is of course this issue that was opened for figuring out what needs to be added specifically for the NIH/GREI grant:

The list above is by no means complete. If an issue is not listed, it does not necessarily mean that it's not important. But the ones that are listed above should be a good subset to start with.

@landreev

The list(s) above should allow us to start working on cleaning up and improving harvesting.
There are still issues labeled "Feature: Harvesting" that need to be triaged (there are some that may have been resolved since they were opened; there are some that may not need to be addressed in the first place - but then we should close them). For the next stage of this work however, once the above issues are addressed, it'll probably be more important to figure out what needs to be resolved/what features need to be added specifically for the NIH/GREI project (the spike IQSS/dataverse#8575 on the list above).

@landreev landreev removed their assignment Apr 25, 2022
@scolapasta scolapasta self-assigned this Apr 25, 2022
@scolapasta

I'm going to go ahead and close this issue as the spike work of identifying and prioritizing is done above. We've started adding a few of these issues onto the board (in Next Sprint) and we'll continue to use this as a reference to add more as that work gets completed. But no need to keep the actual spike open.

@mreekie
Collaborator Author

mreekie commented Dec 5, 2022

Grooming note:

  • The content here needs to be looked at in more detail.
  • The gist I've got from this readthru is that there is plenty to fix in harvesting and we need to not just blindly fix what is here.
  • It seems there are items adjacent to harvesting that arguably need to be fixed for harvesting to work well.

@mreekie mreekie reopened this Jan 9, 2023
@mreekie mreekie changed the title Spike: Inventory and prioritize existing Harvesting related issues Collection: Inventory and prioritize all existing Harvesting related issues Jan 9, 2023
@mreekie mreekie changed the title Collection: Inventory and prioritize all existing Harvesting related issues Spike: Inventory and prioritize all existing Harvesting related issues Jan 9, 2023
@mreekie
Collaborator Author

mreekie commented Jan 9, 2023

sizing:

  • This is a time bounded spike.
  • In this issue Leonid proposes to review all of the known harvesting issues and make sense of them.
  • Once he is done we will have an ordered list and a decision on where to stop for the purposes of the NIH grant.

That decision will be reflected in a list in

@mreekie
Collaborator Author

mreekie commented Jan 10, 2023

Priority Review with Stefano:

  • Moved from NIH Deliverables Backlog to Ordered Backlog

@mreekie
Collaborator Author

mreekie commented Jan 11, 2023

Sizing:

  • We have changed (as of now) how we size spikes.
  • The idea on this one is to invest 3 days (33)

@landreev landreev self-assigned this Jan 19, 2023
@mreekie
Collaborator Author

mreekie commented Jan 19, 2023

Daily

  • Phil raised that he's gotten reports that harvesting breaks after upgrades. He's been asking folks to create issues and include specifics on what versions they are using.
  • Leonid mentioned that, in general, folks should recreate their harvesting after upgrades.

@scolapasta scolapasta removed their assignment Jan 19, 2023
@pdurbin
Member

pdurbin commented Jan 19, 2023

Phil raised that he's gotten reports that harvesting breaks after upgrades. He's been asking folks to create issues and include specifics on what versions they are using.

Yeah, I started a thread on Slack about this: https://iqss.slack.com/archives/C010LA04BCG/p1674160762504939

@landreev

landreev commented Jan 25, 2023

Here's the reordered/revisited list as it stands now. (split in parts; work in progress)

1. Progress update.

Harvesting issues that have been addressed/closed since the spike was created:

(Some of the issues above, those with low numbers especially, were closed having been fixed as part of more recent efforts/PRs. For example, IQSS/dataverse#3797 was resolved as part of a general overhaul in IQSS/dataverse#8372)

All of the above are being taken off the remaining prioritized list.

(there's also IQSS/dataverse#8629 that's still listed as open; but we have discussed it during a tech hour and it's waiting for me to update it w/ some info)

@landreev

landreev commented Jan 25, 2023

2. Already estimated and/or prioritized:

("prioritized" in this context means have at least been reviewed and deemed important/necessary to be addressed soon, and have been assigned NIH grant labels; and specifically pm.epic.nih_harvesting)

2a. The following issues have NOT been tagged pm.epic.nih_harvesting yet
but they deal with metadata issues (see labels), and have been proposed for being addressed together with the metadata issues from the list above; so these could be good candidates for being pulled in soon.

Finally, IQSS/dataverse#9309 + the fix PR IQSS/dataverse#9316 (a bug introduced in 5.12.1) hasn't been formally prioritized or sized, but it's been discussed and mostly approved for inclusion in 5.13. It's got the label "Size: Queued" on it, but I proposed 10. Because of this, I'm not including it on the list of remaining issues.

@landreev

landreev commented Jan 25, 2023

3. Finally, some choice candidates for the short-term queuing, in roughly prioritized order:

I would petition to prioritize this one:

This one is a good candidate:
This will ensure that our OAI gateway is valid per the spec to the outside clients (should be a good thing under the grant too?). The actual details/problems as described in the issue may no longer be true (resolved in the major reorganization of IQSS/dataverse#8372). So the issue should be used to re-run the validation and fix whatever issues may remain (or may have been introduced as of late).
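As a starting point for that kind of re-check, a very small structural test on a single OAI-PMH response can be sketched in Python. This is not a replacement for a real validator: it only checks what the OAI-PMH 2.0 spec says about the response envelope (root element `OAI-PMH` in the OAI namespace, `responseDate` and `request` first, then either `error` elements or a single element named after the verb). The function name `check_oai_response` and the sample response are made up for illustration.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def check_oai_response(xml_text):
    """Return a list of structural problems found in one OAI-PMH response."""
    root = ET.fromstring(xml_text)
    if root.tag != f"{{{OAI_NS}}}OAI-PMH":
        return [f"unexpected root element: {root.tag}"]
    # Strip the namespace so child element names are easy to compare.
    names = [child.tag.replace(f"{{{OAI_NS}}}", "") for child in root]
    problems = []
    if names[:2] != ["responseDate", "request"]:
        problems.append("first two children must be responseDate and request")
    payload = [n for n in names if n not in ("responseDate", "request")]
    if not payload:
        problems.append("missing verb or error element")
    elif any(n != "error" for n in payload) and (
        len(payload) != 1 or payload[0] not in VERBS
    ):
        problems.append(f"unexpected payload elements: {payload}")
    return problems


# Hand-written sample; the base URL is a placeholder.
SAMPLE_IDENTIFY = f"""<OAI-PMH xmlns="{OAI_NS}">
  <responseDate>2023-01-25T00:00:00Z</responseDate>
  <request verb="Identify">https://repo.example.org/oai</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
  </Identify>
</OAI-PMH>"""

print(check_oai_response(SAMPLE_IDENTIFY))
```

Running the same check over the responses for each verb would at least catch envelope-level regressions before a full validation pass.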

The following one is a feature request that came in from an external contributor with a fix PR accompanying it; it may have got stuck in review limbo, we owe it to them to process it quickly:

Same as above; small-ish issue, with an accompanying PR from an external dev. From having taken a look, it's not as trivial as they think. But should not be difficult to resolve and we owe it to them, etc.

May already be resolved, but if not, def. a good thing to address promptly:

An interesting feature request, very specific and should be easy to address. Can definitely be useful to other instances. Not sure if explicitly "useful" from the point of view of the NIH grant though.

Change in specific harvesting behavior requested; lots of detailed discussion in the issue. Should be ready to be addressed:

This would be a genuinely useful format to add harvesting support for:

Not strictly harvesting-related - but if we were willing to handle it under this umbrella, I would give it high priority:

This is a very recent problem report from a remote installation. Their problem can be translated into a feature request, asking for more generic OAI_DC records (ones without persistent identifiers) to be importable. Could be a useful thing? But I'm somewhat on the fence about how it should be prioritized.
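To illustrate what "records without persistent identifiers" means in practice, here is a small Python sketch that looks at the `dc:identifier` values of an OAI_DC record and reports which ones look like DOIs or Handles. It does not reflect how the Dataverse importer actually decides this; the prefix list, the function name `persistent_identifiers`, and both sample records are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# Assumption: these prefixes are a rough heuristic, not an official list.
PID_PREFIXES = ("doi:", "hdl:", "https://doi.org/", "https://hdl.handle.net/")


def persistent_identifiers(oai_dc_xml):
    """Return the dc:identifier values that look like DOIs or Handles."""
    root = ET.fromstring(oai_dc_xml)
    values = [
        (el.text or "").strip()
        for el in root.iter(f"{{{DC_NS}}}identifier")
    ]
    return [v for v in values if v.lower().startswith(PID_PREFIXES)]


WITH_PID = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example dataset</dc:title>
  <dc:identifier>https://doi.org/10.5072/FK2/EXAMPLE</dc:identifier>
</oai_dc:dc>"""

WITHOUT_PID = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Generic record</dc:title>
  <dc:identifier>https://repo.example.org/record/42</dc:identifier>
</oai_dc:dc>"""

print(persistent_identifiers(WITH_PID))
print(persistent_identifiers(WITHOUT_PID))
```

The second record is the kind the feature request is about: it has an identifier, just not a persistent one, so an importer would need some fallback for it.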

@mreekie
Collaborator Author

mreekie commented Jan 25, 2023

Sprint review.

  • This issue is going to be closed.
  • The issues can be organized into the backlog with the NIH deliverables.

@mreekie mreekie closed this as completed Jan 25, 2023
@landreev

Oh, this is the one I forgot to mention:
I'm not sure what to do with this huge jumbo issue that was recently opened:

@landreev

... and maybe I should explicitly list the remaining "bulk" queue - these are all labeled with the "Feature: Harvesting" label. Basically, by omitting them, I communicated that these should not be in the next wave of prioritization/scheduling. But I'll post that remaining list, w/ the comments I was adding in the process.

4. Remaining bulk list; I believe these can/should wait to be addressed, but opinions welcome.

And just because I think these can wait, for the purposes of the immediate planning, doesn't mean that I don't want them closed. We should continue looking at this list. As I mention in some of the comments above, some of these could be fixed/resolved by the open issues we've already prioritized, so we'll need to confirm that and close those accordingly.

@mreekie mreekie transferred this issue from IQSS/dataverse Mar 10, 2023
@mreekie
Collaborator Author

mreekie commented Mar 10, 2023

Moved to dataverse-pm.
There's planning work to be done here based on Leonid's work.

@mreekie mreekie added pm.GREI-d-1.1.1 NIH, yr1, aim1, task1: MVP for registering metadata in the repository pm.GREI-d-1.1.2 NIH, yr1, aim1, task2: Define a cost recovery model for large dataset support pm.GREI-d-1.4.1 NIH, yr1, aim4, task1: Resolve OAI-PMH harvesting issues pm.GREI-d-1.4.2 NIH, yr1, aim4, task2: Create working group on packaging standards and removed pm.GREI-d-1.1.1 NIH, yr1, aim1, task1: MVP for registering metadata in the repository pm.GREI-d-1.1.2 NIH, yr1, aim1, task2: Define a cost recovery model for large dataset support labels Mar 20, 2023

5 participants