
2.1.3/4 queued dag runs changes catchup=False behaviour #18487

Closed
1 of 2 tasks
StewartThomson opened this issue Sep 23, 2021 · 13 comments · Fixed by #18897 or #19130
Assignees
Labels
affected_version:2.1 Issues Reported for 2.1 affected_version:2.2 Issues Reported for 2.2 area:core kind:bug This is clearly a bug

Comments

@StewartThomson

Apache Airflow version

2.1.4 (latest released)

Operating System

Amazon Linux 2

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

No response

What happened

Say, for example, you have a DAG with a sensor. The DAG is scheduled to run every minute, with max_active_runs=1 and catchup=False.

The sensor may be satisfied one or more times per day.

Previously, when the sensor was satisfied once in a day, there was one DAG run for that day; when it was satisfied twice, there were two DAG runs for that day.

With the new queued DAG run state, new runs are queued for each minute (up to AIRFLOW__CORE__MAX_QUEUED_RUNS_PER_DAG), which seems to go against the spirit of catchup=False.

This means that if a DAG run waits on a sensor for longer than the schedule_interval, it will still effectively 'catch up' via the queued runs.
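For illustration, a minimal DAG matching the setup described above might look like this (a sketch only; the dag_id, sensor callable, and timings are hypothetical, not taken from the reporter's actual DAG):

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.python import PythonSensor


def check_condition() -> bool:
    # Hypothetical check; the real DAG would poll some external system here.
    return False


with DAG(
    dag_id="example_sensor_dag",    # hypothetical name
    schedule_interval="* * * * *",  # every minute
    start_date=datetime(2021, 9, 1),
    max_active_runs=1,
    catchup=False,
) as dag:
    PythonSensor(
        task_id="wait_for_condition",
        python_callable=check_condition,
        poke_interval=30,
        timeout=60 * 60 * 24,
    )
```

While wait_for_condition pokes for longer than the one-minute schedule_interval, the scheduler queues additional runs behind the active one, which is the behaviour reported here.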

What you expected to happen

No response

How to reproduce

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@StewartThomson StewartThomson added area:core kind:bug This is clearly a bug labels Sep 23, 2021
@uranusjr
Member

uranusjr commented Sep 23, 2021

this seems to be against the spirit of catchup=False

What is your understanding of catchup? This does not match my understanding of it.

https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html?highlight=catchup#catchup

From the documentation’s description, catchup only matters for runs that missed their turn. But since the scheduler was able to schedule those runs at a pace of once a minute, the successfully scheduled runs did not miss their turn, so catchup does not apply to them. The behaviour you described for 2.1.2 and below sounds like a bug to me.

@StewartThomson
Author

Per the documentation:

"When turned off, the scheduler creates a DAG run only for the latest interval."

With the queued state, it creates DAG runs for the latest interval and the next 16 intervals (by default).

@eladkal eladkal added the affected_version:2.1 Issues Reported for 2.1 label Oct 6, 2021
@widewing

widewing commented Oct 7, 2021

I'm having exactly the same issue in the latest version; it breaks many of our DAGs that rely on the catchup setting.

We have DAGs scheduled every minute. Most of them finish within a minute, but sometimes a run takes longer, e.g. due to a network issue or a slow external service. When that happens, unexpected DagRuns accumulate. We already have an internal backfill mechanism.

The documentation says, in the "Catchup" section: "Turning catchup off is great if your DAG performs catchup internally."

@ephraimbuddy
Contributor

Can you try setting AIRFLOW__SCHEDULER__MAX_QUEUED_DAG_RUNS_PER_DAG=1?

@widewing

widewing commented Oct 8, 2021

@ephraimbuddy thanks, I'll try this out. I can accept it as a workaround, but it still conflicts with the idea of catch-up.

@dstandish
Contributor

dstandish commented Oct 14, 2021

@uranusjr
RE

From the documentation’s description, catchup only matters for runs missed their turn. But since the scheduler is able to schedule those runs at the pace of once a minute, those successfully scheduled runs did not miss their turn, so catchup does not matter to them. The behaviour you described for 2.1.2 and below sounds like a bug to me.

Often, catchup=False is used in cases where execution_date does not matter and there should be only one instance of the task running at any given time.

An example would be an incremental process. Each dag run, you process the new data that has arrived since last run.

So if your DAG is scheduled every 10 minutes, say, and a run takes an hour for some reason, you don't want it to create 6 new queued DAG runs while it is waiting.

It should only queue up one at most.

There are many kinds of pipelines that use this pattern.
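The incremental pattern described above can be sketched in plain Python (function names here are hypothetical illustrations, not Airflow APIs): with incremental semantics a single run clears the whole backlog, whereas per-interval scheduling would create one run per missed interval:

```python
from datetime import datetime, timedelta


def runs_needed_incremental(last_processed: datetime, now: datetime) -> int:
    # An incremental pipeline processes everything since the last run,
    # so one run is always enough to clear the backlog.
    return 1 if now > last_processed else 0


def runs_needed_per_interval(last_processed: datetime, now: datetime,
                             interval: timedelta) -> int:
    # Per-interval scheduling (catchup=True style) creates one run for
    # every missed schedule interval.
    return max(0, int((now - last_processed) / interval))


last = datetime(2021, 10, 14, 9, 0)
now = datetime(2021, 10, 14, 10, 0)  # one hour behind schedule
print(runs_needed_incremental(last, now))                          # 1
print(runs_needed_per_interval(last, now, timedelta(minutes=10)))  # 6
```

This is why queuing six runs behind a slow 10-minute DAG is pointless for such pipelines: the first run to execute already covers everything the other five would have.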

But the current behavior (verified in 2.2.0 as well) is that, even when catchup=False, it queues up a bunch of DAG runs (up to max_queued_runs_per_dag). This should not happen.

I encountered this today with a stuck DAG (a task had been stuck in the running state for a day or so for some reason). 16 DAG runs were queued up and waiting to go. But this should not be.

Along these lines... @ephraimbuddy in your PR #18897 you explicitly mention the catchup=True case, but from my understanding, what this issue is about is the catchup=False case... just wanted to bring that to your attention in case it's either a typo or a misunderstanding.

@ephraimbuddy
Contributor

The PR captures both cases, whether catchup is False or True. It's annoying that it broke. For now, limit it with max_queued_dagruns_per_dag; at least you won't have a lot of DagRuns.

@dstandish
Contributor

The PR captures both cases, whether catchup is False or True. It's annoying that it broke. For now, limit it with max_queued_dagruns_per_dag; at least you won't have a lot of DagRuns.

OK, nice! I figured, but just wanted to mention it in case.

Thanks

@dstandish dstandish added the affected_version:2.2 Issues Reported for 2.2 label Oct 14, 2021
@robinedwards
Contributor

So I have also been working on a fix for this, and PR #18897 does not fix the issue.

It's very simple to recreate with a test dag:

import logging
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def some_task():
    time.sleep(5)
    logging.info("Slept for 5")


dag = DAG(
    dag_id="aa_catchup_1",
    schedule_interval='@daily',
    catchup=True,
    start_date=datetime(2021, 9, 9),
    max_active_runs=1,
    max_active_tasks=1,
)


with dag:
    PythonOperator(
        task_id="some_task",
        python_callable=some_task
    )
  1. Launch this DAG.
  2. Let it schedule a couple of DAG runs (to simulate it being behind schedule).
  3. Pause it.
  4. Edit the DAG file to catchup=False.
  5. Restart the scheduler.
  6. Confirm in the UI that the DAG details now read catchup=False.
     (Now we have a DAG that is catchup=False and behind schedule.)
  7. Enable the DAG.
  8. The next DAG run it schedules will still not be the latest; it will simply be the next run from the current state.

@jedcunningham or @ephraimbuddy, would you kindly re-open this issue, and I can provide a PR.

@ephraimbuddy
Contributor

Since this is catchup=True with max_active_runs=1, only one DagRun is created. No extra DagRuns will be created, so it won't queue DagRuns as you think.

@robinedwards
Contributor

If you take a look at the instructions beneath the example DAG in my comment, that setup is only used to get the DAG behind schedule. It should then be amended to catchup=False.

@ephraimbuddy
Contributor

That's true.

@ephraimbuddy ephraimbuddy reopened this Oct 21, 2021
@robinedwards
Contributor

Here is the fix #19130

robinedwards added a commit to robinedwards/airflow that referenced this issue Oct 21, 2021
When a dag has catchup=False we want to select the max value from the
earliest date and the end of the last execution period.

Thus if a dag is behind schedule it will select the latest execution
date.
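The idea in the commit message can be sketched in plain Python (the function and argument names are hypothetical illustrations, not the actual Airflow internals):

```python
from datetime import datetime


def next_run_start(earliest: datetime, last_run_end: datetime,
                   catchup: bool) -> datetime:
    """Pick where the next data interval should start."""
    if catchup:
        # Replay every missed interval in order.
        return last_run_end
    # catchup=False: take the max, so a DAG that has fallen behind
    # resumes at the latest interval instead of working through backlog.
    return max(earliest, last_run_end)


# A DAG whose last run ended in September but whose latest schedulable
# interval starts on Oct 21 jumps straight to Oct 21 when catchup=False:
print(next_run_start(datetime(2021, 10, 21), datetime(2021, 9, 10),
                     catchup=False))
```

With catchup=True the same inputs would instead continue from the September date, working through each missed interval in turn.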
7 participants