Update dated DAGs to allow for backfilling data #1642
Comments
Upon further research, I think we should test this to verify whether it's true. We should keep this open to test and verify at the very least, since it's critical that we're able to backfill that data!
Thanks for drafting this! Admittedly, it's very possible that I've been misinterpreting the current setup and that it's behaving as expected. Either way, it's a good opportunity to verify that and adjust if needed!
Thinking about this again - what would be really cool is to use …
Is this issue the best place to discuss backfill for providers in general? I was just thinking about audio's beta status, and increasing the number of audio results would be a nice way to contribute to promoting audio as fully supported.
I think once we address the mechanism for backfilling data, we should definitely have a tracking ticket for setting up backfills on all providers that support it!
I've been thinking about how to approach this. First, for context, the dated DAGs for consideration are: Europeana, Flickr, Metropolitan, Phylopic, and Wikimedia Commons. Of these, Europeana and Flickr are currently turned off. Just turning …
I think we could tackle this in two steps:
I only recommend going with that first step because it's fairly straightforward and unblocks the backfilling while we work on the second step. I have no idea how hairy or difficult step 2 will be, so having something running while we look into it seems ideal!
My only hesitation was that it will result in pausing ingestion of more current data until the backfill is complete -- but I think that's actually not important, since (a) we'll still be ingesting lots of new data and (b) the data refresh isn't turned on at the moment, so it's a moot point 😄 As for working on the reingestion workflows, I prefer them as the long-term solution, in part because we need to get them working anyway. The simplest approach would be to update them to trigger a DagRun of the normal provider workflow for each reingestion date, complete with the `load_data` step. The big thing to figure out will be concurrency problems 😱 In the meantime, I'll try to come up with some more reasonable …
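The "one DagRun per reingestion date" idea above can be sketched without any Airflow machinery: given a backfill window, enumerate the dates that each need a triggered run. This is a minimal, Airflow-free illustration; the helper name `reingestion_dates` is hypothetical and not part of the actual codebase.

```python
from datetime import date, timedelta

def reingestion_dates(start: date, end: date) -> list[date]:
    """Return each date in [start, end] inclusive, one per DagRun to
    trigger (hypothetical helper, for illustration only)."""
    if start > end:
        raise ValueError("start must not be after end")
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

# e.g. a five-day outage yields five trigger dates
dates = reingestion_dates(date(2023, 5, 1), date(2023, 5, 5))
print(len(dates))  # 5
```

In the real workflow, each of these dates would presumably be passed to something like Airflow's `TriggerDagRunOperator` so the normal provider DAG runs once per missed day; working out how many of those runs may execute concurrently is the open question noted above.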
Problem

A *dated* provider DAG is one whose `main` function for the `pull_data` task accepts a `date` parameter representing the date for which data should be ingested. An example is Wikimedia Commons. Generally, a dated DAG runs one day's worth of data, and therefore it makes sense that these DAGs should be on the `@daily` schedule.\* By default the date passed in is today's date, optionally shifted by a given `day_shift` (so `day_shift=1` would run for yesterday's data).

The problem is that when a dated DAG is turned off or doesn't run for a period of time, there is currently no easy way to backfill the data for the missed days. Airflow catchup allows us to run all of the missed tasks: for example, if a `@daily` DAG was turned off for five days, when we turn it back on it will be run 6 times (once for today and once for each missed run). But with the current setup, all 6 runs will ingest data for today's date.

\* Where this isn't true, we should probably fix it. #1643, for example, will update Wikimedia Commons to run daily.
Description

We should instead use Airflow's `{{ execution_date }}`, which references the date of the scheduled DAG run -- meaning that when the catchup DAG runs kick off, they will ingest data for the correct date!

Implementation
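One possible shape of the fix, sketched without Airflow: derive the ingestion date from the run's own execution date rather than from the wall clock. In a real DAG this value would come from a per-run template such as `{{ ds }}`/`{{ execution_date }}`; the function below is a hypothetical simplification of that idea.

```python
from datetime import date, timedelta

def ingestion_date(execution_date: date, day_shift: int = 0) -> date:
    """Proposed behavior (sketch): each run derives its date from its
    own execution_date, so catchup runs cover distinct days."""
    return execution_date - timedelta(days=day_shift)

# Catchup creates one run per missed day; each run now gets its own date.
missed_runs = [date(2023, 5, 1) + timedelta(days=i) for i in range(6)]
dates = [ingestion_date(d) for d in missed_runs]
assert len(set(dates)) == 6
```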