Update reingestion workflows to load and report data #1502

Closed
1 task
stacimc opened this issue Jul 20, 2022 · 0 comments · Fixed by WordPress/openverse-catalog#618
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix ✨ goal: improvement Improvement to an existing user-facing feature 🟨 priority: medium Not blocking but should be addressed soon


stacimc commented Jul 20, 2022

Problem

As described in #1642, we need to backfill data for our dated DAGs. In the short term, this is being resolved by turning catchup on for those DAGs. The long-term fix is to re-enable our existing ingestion workflows (for example, wikimedia_ingestion_workflow).

Description

Copying context from this comment:

These are daily DAGs that generate a list of days, generally weighted toward more recent days, and then run just the pull_data step for each of those days.

The purpose of these DAGs was to reingest/update old data in a smart way, with an emphasis on updating more recent data first. I believe the intention was to update popularity data and such, but if we could get these DAGs working, they would also naturally backfill data over time.
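To make "weighted toward more recent days" concrete, here is a minimal sketch of one way such a day list could be built. The function name `reingestion_days`, its parameters, and the daily/weekly/monthly spacing are assumptions for illustration only; the actual workflows may pick days differently.

```python
from datetime import date, timedelta


def reingestion_days(today, daily=7, weekly=4, monthly=6):
    """Return past dates to reingest, densest near `today`.

    Hypothetical scheme: step back one day at a time `daily` times,
    then seven days at a time `weekly` times, then thirty days at a
    time `monthly` times.
    """
    offsets = []
    offset = 0
    for count, step in ((daily, 1), (weekly, 7), (monthly, 30)):
        for _ in range(count):
            offset += step
            offsets.append(offset)
    # Most recent day first, oldest day last.
    return [today - timedelta(days=o) for o in offsets]


if __name__ == "__main__":
    for day in reingestion_days(date(2022, 7, 20), daily=3, weekly=2, monthly=1):
        print(day.isoformat())
```

With a scheme like this, recent days are revisited often while older days are still reached at wider intervals, which is how repeated runs would gradually backfill data.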

The workflows only run the pull_data step: TSVs are generated, but the data is not loaded into the DB. We would therefore need to update the workflow to make sure the data is loaded. As an additional benefit, we would be able to run these independently of the normal ingestion process, so they would not interrupt typical ingestion.

We'll need to update the DAG factory for ingestion workflows to include the load_data steps, and ideally also report data at the end of reingestion.
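The intended task order (pull, then load, then report once at the end) can be sketched in plain Python. This is not the DAG factory itself and uses no Airflow APIs. `pull_data`, `load_data`, `report`, and `run_reingestion` are hypothetical stand-ins, and a list plays the role of the database.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DayResult:
    day: date
    pulled: int
    loaded: int


def pull_data(day):
    """Hypothetical stand-in: fetch records for `day` as TSV rows."""
    return [f"{day.isoformat()}\trecord-{i}" for i in range(2)]


def load_data(rows, db):
    """Hypothetical stand-in: load TSV rows into `db`; return the count loaded."""
    db.extend(rows)
    return len(rows)


def report(results):
    """Summarize the whole reingestion run in one message."""
    total = sum(r.loaded for r in results)
    return f"Reingested {len(results)} day(s), loaded {total} record(s)"


def run_reingestion(days, db):
    """For each day: pull, then load. Report once after all days finish."""
    results = []
    for day in days:
        rows = pull_data(day)
        loaded = load_data(rows, db)
        results.append(DayResult(day, len(rows), loaded))
    return report(results)


if __name__ == "__main__":
    fake_db = []
    print(run_reingestion([date(2022, 7, 19), date(2022, 7, 18)], fake_db))
```

Producing a single report after every day has been loaded keeps the reporting to one summary per reingestion run instead of one message per reingested day.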

Implementation

  • 🙋 I would be interested in implementing this feature.
@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon ✨ goal: improvement Improvement to an existing user-facing feature 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository labels Jul 20, 2022
@stacimc stacimc self-assigned this Jul 20, 2022
@obulat obulat transferred this issue from WordPress/openverse-catalog Apr 17, 2023
@github-project-automation github-project-automation bot moved this to 📋 Backlog in Openverse Backlog Apr 17, 2023
@obulat obulat moved this from 📋 Backlog to ✅ Done in Openverse Backlog Apr 24, 2023