Do not queue tasks when the DAG goes missing #16182

ephraimbuddy · 2021-05-31T03:56:23Z

Currently, if a DAG goes missing, the scheduler keeps queuing its task instances until the executor reports them as failed; only then does the scheduler set their state to failed.

This change ensures that tasks are not queued when the DAG goes missing. Instead of waiting for the executor to fail the task without an explicit reason, the scheduler fails the task itself and records why it failed. As a result, the Pool's queued slots are freed for other tasks.

Closes: #15488
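A minimal sketch of the behaviour described above, using toy stand-ins rather than Airflow's real models. `TaskInstance`, `Pool`, and `queue_task_instances` are hypothetical names introduced only for illustration; the real scheduler works against the metadata database.

```python
from dataclasses import dataclass, field

SCHEDULED, QUEUED, FAILED = "scheduled", "queued", "failed"


@dataclass
class TaskInstance:  # hypothetical stand-in, not Airflow's model
    dag_id: str
    task_id: str
    state: str = SCHEDULED
    note: str = ""


@dataclass
class Pool:  # hypothetical stand-in, not Airflow's model
    slots: int
    queued: list = field(default_factory=list)

    def open_slots(self) -> int:
        return self.slots - len(self.queued)


def queue_task_instances(tis, known_dag_ids, pool):
    """Queue scheduled TIs; fail those whose DAG is missing instead of queueing them."""
    for ti in tis:
        if ti.state != SCHEDULED:
            continue
        if ti.dag_id not in known_dag_ids:
            # Fail with an explicit reason; no pool slot is consumed.
            ti.state = FAILED
            ti.note = f"DAG '{ti.dag_id}' is missing"
            continue
        if pool.open_slots() > 0:
            ti.state = QUEUED
            pool.queued.append(ti)


pool = Pool(slots=1)
tis = [TaskInstance("gone_dag", "t1"), TaskInstance("live_dag", "t1")]
queue_task_instances(tis, known_dag_ids={"live_dag"}, pool=pool)
```

In this toy run the task from `gone_dag` is failed with a reason and never occupies the single pool slot, so the task from `live_dag` can be queued.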


airflow/jobs/scheduler_job.py

ashb

Some changes to the current implementation

Additionally, I'm not sure it makes sense to check this on every pass through the loop. Perhaps we could do something like what we do with SchedulerJob._clean_tis_without_dagrun, which we call once every 15s.

What would be the effect/drawback if we only did the tidy up then?
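To make the question concrete, here is a small sketch of gating a cleanup so it runs at most once per interval instead of on every loop iteration. `IntervalGate` is a hypothetical helper written for this example, not the mechanism the scheduler actually uses, and the 15-second value simply mirrors the figure mentioned above.

```python
import time
from typing import Callable


class IntervalGate:
    """Hypothetical helper: run `action` at most once every `interval` seconds."""

    def __init__(
        self,
        interval: float,
        action: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.interval = interval
        self.action = action
        self.clock = clock
        self._next_run = clock()  # eligible on the first call

    def maybe_run(self) -> bool:
        now = self.clock()
        if now < self._next_run:
            return False  # too soon, skip this iteration
        self.action()
        self._next_run = now + self.interval
        return True


# Simulated scheduler loop with a fake clock that advances 1s per iteration.
fake_now = [0.0]
runs = []
gate = IntervalGate(15.0, lambda: runs.append(fake_now[0]), clock=lambda: fake_now[0])
for _ in range(40):
    gate.maybe_run()
    fake_now[0] += 1.0
```

The trade-off this illustrates: the check costs far less per loop, but a task instance whose DAG vanished can wait up to one full interval before it is tidied up.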

airflow/jobs/scheduler_job.py

airflow/config_templates/config.yml

airflow/jobs/scheduler_job.py

kaxil

Minor suggestion but LGTM

airflow/jobs/scheduler_job.py

jhtimmins · 2021-06-09T18:06:17Z

@ashb can you take another look and sign off if your requested changes have been sufficiently addressed?

airflow/config_templates/config.yml

airflow/models/dagbag.py

Currently, if a dag goes missing, the scheduler continues to queue the task instances until the executor reports the tasks as failed and then the scheduler would now set the state properly. This change ensures that tasks are not queued when the dag file goes missing. Instead of waiting on the executor to fail the task without explicit reason, the task fails here with the reason why it failed. Thanks to this, the Pool's queued slots will be freed for other tasks to be queued

fixup! Do not queue tasks when the DAG goes missing
add tests
fixup! fixup! Do not queue tasks when the DAG goes missing
Change implementation to check at regular interval instead of at every loop
fixup! Change implementation to check at regular interval instead of at every loop
fixup! fixup! Change implementation to check at regular interval instead of at every loop
fixup! fixup! fixup! Change implementation to check at regular interval instead of at every loop
change configuration name
apply review suggestions
fixup! apply review suggestions
add has_dag method to DagBag and improve missing dag fail method in scheduler
Update airflow/jobs/scheduler_job.py
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Fix dag_bag.has_dag behaviour
fixup! Fix dag_bag.has_dag behaviour
Apply suggestions from code review
Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
fixup! Apply suggestions from code review
fixup! fixup! Apply suggestions from code review
fixup! Do not queue tasks when the DAG goes missing

airflow/config_templates/config.yml

airflow/jobs/scheduler_job.py

ashb · 2021-07-05T08:56:43Z

airflow/models/dagbag.py

+        sd_last_updated_datetime = SerializedDagModel.get_last_updated_datetime(
+            dag_id=dag_id,
+            session=session,
+        )
+        sd_has_dag = sd_last_updated_datetime is not None
+        if dag_id not in self.dags:
+            return sd_has_dag
+        if dag_id not in self.dags_last_fetched:
+            return sd_has_dag
+        min_serialized_dag_fetch_secs = timedelta(seconds=settings.MIN_SERIALIZED_DAG_FETCH_INTERVAL)
+        if timezone.utcnow() < self.dags_last_fetched[dag_id] + min_serialized_dag_fetch_secs:
+            return sd_has_dag
+        if sd_has_dag:
+            return True


We should refactor this to delay the DB check until we need it. For instance, if we have the DAG locally and it was fetched less than the configured interval ago, we don't need to ask the DB.

So something like this (pseudo-python):

    if dag_id in self.dags and timezone.utcnow() < self.dags_last_fetched[dag_id] + min_serialized_dag_fetch_secs:
        return True
    sd_last_updated_datetime = SerializedDagModel.get_last_updated_datetime(
        dag_id=dag_id,
        session=session,
    )
    # etc ...
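The deferred lookup suggested here can be sketched in a self-contained form. `ToyDagBag`, `MIN_FETCH_INTERVAL`, and the injected `db_last_updated` callable are assumptions made for the example; they stand in for `DagBag`, `settings.MIN_SERIALIZED_DAG_FETCH_INTERVAL`, and the `SerializedDagModel` query, and the slow path is simplified to return the database answer directly.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Optional

MIN_FETCH_INTERVAL = timedelta(seconds=10)  # assumed value, for illustration only


class ToyDagBag:  # hypothetical stand-in for DagBag
    def __init__(self, db_last_updated: Callable[[str], Optional[datetime]]) -> None:
        self.dags: Dict[str, object] = {}
        self.dags_last_fetched: Dict[str, datetime] = {}
        self._db_last_updated = db_last_updated  # stands in for the DB query

    def has_dag(self, dag_id: str, now: Optional[datetime] = None) -> bool:
        now = now or datetime.now(timezone.utc)
        fetched = self.dags_last_fetched.get(dag_id)
        # Fast path: cached locally and fetched recently, so skip the DB.
        if dag_id in self.dags and fetched is not None and now < fetched + MIN_FETCH_INTERVAL:
            return True
        # Slow path: only now pay for the database round trip.
        return self._db_last_updated(dag_id) is not None


calls = []


def fake_db(dag_id: str) -> Optional[datetime]:
    calls.append(dag_id)
    return None  # pretend the serialized DAG row is gone


bag = ToyDagBag(fake_db)
t0 = datetime(2021, 7, 5, tzinfo=timezone.utc)
bag.dags["example"] = object()
bag.dags_last_fetched["example"] = t0

fresh = bag.has_dag("example", now=t0 + timedelta(seconds=1))   # no DB call
stale = bag.has_dag("example", now=t0 + timedelta(seconds=60))  # asks the DB
```

The sketch uses `.get()` for the last-fetched timestamp so that a DAG present in `dags` but absent from `dags_last_fetched` falls through to the database instead of raising `KeyError`.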

ashb · 2021-07-05T09:01:48Z

airflow/jobs/scheduler_job.py

+                continue
+            # Dag no longer in dagbag?
+            if not self.dagbag.has_dag(ti.dag_id, session=session):
+                ti.set_state(State.FAILED, session=session)


I wonder if this (and L823) should be set to State.REMOVED? It might be clearer for debugging for the user than a failure without any logs.

ashb · 2021-07-05T09:03:49Z

tests/models/test_dagbag.py

+
+            dag_bag = DagBag(read_dags_from_db=True)
+            dag_bag.get_dag(dag_id)  # Add dag to self.dags
+            assert dag_bag.has_dag(dag_id)


We should put a query count assertion around this line to ensure it is 0
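One way to picture such an assertion, outside Airflow's own test helpers, is a context manager that counts the SQL statements issued inside a block. The sketch below uses the standard library's `sqlite3` tracing; `assert_query_count`, the `serialized_dag` table, and the toy `has_dag` function are hypothetical and exist only for this example.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def assert_query_count(conn, expected):
    """Hypothetical helper: fail if the block runs a different number of SQL statements."""
    seen = []
    conn.set_trace_callback(seen.append)  # called once per executed statement
    try:
        yield seen
    finally:
        conn.set_trace_callback(None)
    assert len(seen) == expected, f"expected {expected} queries, got {len(seen)}: {seen}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY)")
local_cache = {"example_dag"}


def has_dag(dag_id):
    # Toy lookup: answer from the local cache first, fall back to the database.
    if dag_id in local_cache:
        return True
    row = conn.execute(
        "SELECT 1 FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    return row is not None


# A cached DAG should be answered without touching the database.
with assert_query_count(conn, 0):
    cached_result = has_dag("example_dag")
```

If the cached path ever started hitting the database, the same `with` block would raise `AssertionError`, which is the regression this kind of test is meant to catch.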

tests/models/test_dagbag.py

ashb · 2021-07-05T09:06:17Z

tests/jobs/test_scheduler_job.py

+        with mock.patch.object(settings, "USE_JOB_SCHEDULE", False), conf_vars(
+            {('scheduler', 'clean_tis_without_dag_interval'): '0.001'}
+        ):
+            self.scheduler_job._run_scheduler_loop()


Not sure we need to run the whole scheduler loop here -- we could just call self.scheduler_job._clean_tis_without_dag() directly.

By calling scheduler_loop the only thing extra we check is that we've added this to the timer, but we can see that pretty easily.

Dunno :)

Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>

ephraimbuddy · 2021-07-06T08:32:41Z

Hi @ashb @kaxil, I'm closing this as I can no longer reproduce this scenario manually.
#16368 handles a missing DAG file: any task whose DAG is missing is immediately failed in the executor, and subsequently in the scheduler by #15929.

I tried several times to reproduce this case in manual testing, but the two PRs above catch it first, so I could not reproduce it.

I'm closing this now, as I believe the problem has been solved.

boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label May 31, 2021

ephraimbuddy force-pushed the dont-queue-missing-dag branch from 40859fb to f955403 Compare May 31, 2021 12:07

ephraimbuddy marked this pull request as ready for review May 31, 2021 12:07

ephraimbuddy requested review from ashb, kaxil and XD-DENG as code owners May 31, 2021 12:07

ephraimbuddy force-pushed the dont-queue-missing-dag branch from f955403 to 580d907 Compare May 31, 2021 17:58

uranusjr reviewed May 31, 2021

airflow/jobs/scheduler_job.py (outdated)

ashb requested changes Jun 1, 2021

uranusjr reviewed Jun 1, 2021

airflow/jobs/scheduler_job.py (outdated)

ephraimbuddy force-pushed the dont-queue-missing-dag branch from aeec1c5 to 04b614e Compare June 1, 2021 22:37

ephraimbuddy commented Jun 2, 2021

airflow/config_templates/config.yml (outdated)

ephraimbuddy force-pushed the dont-queue-missing-dag branch 6 times, most recently from ff88c3e to 885df4c Compare June 4, 2021 10:22

kaxil reviewed Jun 4, 2021

airflow/jobs/scheduler_job.py (outdated)

kaxil approved these changes Jun 4, 2021

ephraimbuddy force-pushed the dont-queue-missing-dag branch 3 times, most recently from c9a829a to 9e1c7dc Compare June 5, 2021 11:11

ashb requested changes Jun 7, 2021

airflow/jobs/scheduler_job.py (outdated)

airflow/jobs/scheduler_job.py (outdated)

ephraimbuddy force-pushed the dont-queue-missing-dag branch from 9e1c7dc to 0092269 Compare June 8, 2021 00:03

ephraimbuddy requested a review from turbaszek as a code owner June 8, 2021 00:03

ephraimbuddy force-pushed the dont-queue-missing-dag branch from 0092269 to a6f8c45 Compare June 8, 2021 17:00

kaxil closed this Jun 9, 2021

kaxil reopened this Jun 9, 2021

ephraimbuddy force-pushed the dont-queue-missing-dag branch 2 times, most recently from 20bc4f6 to 93ce02a Compare June 24, 2021 17:46

ephraimbuddy added this to the Airflow 2.1.2 milestone Jun 24, 2021

ephraimbuddy closed this Jun 25, 2021

ephraimbuddy reopened this Jun 25, 2021

ephraimbuddy force-pushed the dont-queue-missing-dag branch 4 times, most recently from df667e1 to adef9b4 Compare June 27, 2021 21:40

uranusjr reviewed Jun 28, 2021

airflow/config_templates/config.yml (outdated)

uranusjr reviewed Jun 28, 2021

airflow/models/dagbag.py (outdated)

ephraimbuddy force-pushed the dont-queue-missing-dag branch from adef9b4 to 06c87ec Compare June 28, 2021 16:04

ephraimbuddy added 2 commits June 28, 2021 21:10

fixup! Do not queue tasks when the DAG goes missing

93ae6bf

ephraimbuddy force-pushed the dont-queue-missing-dag branch from 06c87ec to 93ae6bf Compare June 28, 2021 20:11

ephraimbuddy closed this Jun 28, 2021

ephraimbuddy reopened this Jun 28, 2021

ashb reviewed Jul 5, 2021

airflow/config_templates/config.yml (outdated)

ashb reviewed Jul 5, 2021

airflow/jobs/scheduler_job.py (outdated)

ashb reviewed Jul 5, 2021

tests/models/test_dagbag.py (outdated)

ashb reviewed Jul 5, 2021

Apply suggestions from code review

9c4a894

Co-authored-by: Ash Berlin-Taylor <ash_github@firemirror.com>

ephraimbuddy closed this Jul 6, 2021

ephraimbuddy deleted the dont-queue-missing-dag branch July 6, 2021 08:33

ashb modified the milestones: Airflow 2.1.2, Airflow 2.1.3 Jul 7, 2021
