
Scheduler going down for 1-2 minutes every 10 minutes as completed pods increase in EKS #22612

Closed · 2 tasks done
dviru opened this issue Mar 30, 2022 · 12 comments · Fixed by #40183
Labels: affected_version:2.2 · affected_version:2.3 · area:core · area:performance · area:Scheduler · kind:bug · provider:cncf-kubernetes

Comments

dviru commented Mar 30, 2022

Apache Airflow version

2.2.4 (latest released)

What happened

Hi Team, I am using Airflow 2.2.4 deployed on an AWS EKS cluster. I noticed that a "scheduler down" message appears in the Airflow UI every 5-10 minutes. When I checked the scheduler log, I saw a lot of statements like the one below.

[2022-03-21 08:21:21,640] {kubernetes_executor.py:729} INFO - Attempting to adopt pod sampletask.05b6f567b4a64bd5beb16e526ba94d7a

The statement above is printed for every completed pod that still exists in EKS, but it repeats multiple times and also invokes the PATCH API.

As per my understanding, what is happening is that the code below pulls the details of all completed pods from the EKS cluster every time and invokes the PATCH API on each completed pod. This activity finishes in about 1 minute for 1,000 completed pods, but takes 3-5 minutes for 7,000 completed pods, which is why the scheduler goes down.

(screenshot of the relevant scheduler code attached)
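For illustration, here is a simplified sketch (not the actual Airflow source) of what this adoption path boils down to, using the official kubernetes Python client; the selectors and label names are assumptions:

from kubernetes import client, config

config.load_incluster_config()
v1 = client.CoreV1Api()

def adopt_completed_pods(namespace: str, scheduler_job_id: str) -> None:
    # List every completed worker pod still present in the cluster ...
    pods = v1.list_namespaced_pod(
        namespace=namespace,
        label_selector="kubernetes_executor=True",   # illustrative selector
        field_selector="status.phase=Succeeded",
    )
    # ... and PATCH each one, so the work per adoption cycle grows linearly
    # with the number of completed pods kept around (7,000 pods -> 7,000
    # PATCH calls against the Kubernetes API server).
    for pod in pods.items:
        labels = pod.metadata.labels or {}
        labels["airflow-worker"] = scheduler_job_id
        v1.patch_namespaced_pod(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body={"metadata": {"labels": labels}},
        )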

What you think should happen instead

The scheduler stays healthy when we set "delete_worker_pods = True", but when "delete_worker_pods = False" is set and the completed pod count grows to 7,000-10,000, the scheduler goes down.

The scheduler should remain healthy irrespective of how many completed pods exist in the EKS cluster.

How to reproduce

Deploy Airflow in a Kubernetes cluster and set "delete_worker_pods = False". Once the number of completed pods reaches 7,000 to 10,000, you will be able to see this issue.

Operating System

Debian GNU/Linux 10

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

dviru added the area:core and kind:bug labels on Mar 30, 2022
boring-cyborg bot commented Mar 30, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk (Member) commented Mar 30, 2022

cc: @dstandish -> what we talked about :)

eladkal added the area:Scheduler, provider:cncf-kubernetes, and affected_version:2.2 labels on Jul 4, 2022
github-actions bot commented Jul 5, 2023

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Please recheck the report against the latest Airflow version and let us know if the issue is still reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.

github-actions bot commented Aug 4, 2023

This issue has been closed because it has not received a response from the issue author.

github-actions bot closed this as not planned on Aug 4, 2023
dirrao (Contributor) commented May 2, 2024

We are seeing this issue in Airflow 2.3.3. I strongly believe the issue is also present in the latest version, 2.9.1, based on the latest code; I don't see any improvements in watcher performance between 2.3.3 and 2.9.1.
The primary cause is that the Kubernetes pod watcher is not fast enough to keep up with the rate of Kubernetes events. This leads to watcher failures/restarts, after which adopt_completed_pods takes over the completed pods. adopt_completed_pods can take a couple of minutes, which delays the scheduler heartbeat, causes scheduler liveness failures, and finally restarts the scheduler pod.
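A back-of-the-envelope sketch of why this stalls the heartbeat (the per-call latency here is an assumption, not a measurement):

# Rough timing model: one PATCH per completed pod during adoption.
completed_pods = 7_000
patch_latency_s = 0.03                              # assumed ~30 ms per Kubernetes API PATCH
adoption_time_min = completed_pods * patch_latency_s / 60
print(f"~{adoption_time_min:.1f} minutes of blocking work")   # ~3.5 minutes

# Several minutes without a scheduler heartbeat is enough to trip the liveness
# probe shown later in this thread (period=60s, failure threshold 5) and
# restart the scheduler pod.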

ephraimbuddy (Contributor) commented May 29, 2024

I have had no luck reproducing this because my cluster gets destroyed by OOM when I get to 2,000-3,000 tasks. I pruned my system but still ran into the OOM.

Edit:
I have since seen the scheduler restarting.

ephraimbuddy (Contributor) commented:
[2024-05-29T06:28:38.937+0000] {process_utils.py:132} INFO - Sending Signals.SIGTERM to group 97. PIDs of all processes in the group: [7350, 7363, 97]
[2024-05-29T06:28:38.939+0000] {process_utils.py:87} INFO - Sending the signal Signals.SIGTERM to group 97
[2024-05-29T06:28:42.204+0000] {settings.py:425} DEBUG - Disposing DB connection pool (PID 7350)
[2024-05-29T06:28:42.212+0000] {settings.py:425} DEBUG - Disposing DB connection pool (PID 7363)
[2024-05-29T06:28:42.217+0000] {process_utils.py:263} INFO - Waiting up to 5 seconds for processes to exit...
[2024-05-29T06:28:42.237+0000] {process_utils.py:80} INFO - Process psutil.Process(pid=7363, status='terminated', started='06:28:38') (7363) terminated with exit code None
[2024-05-29T06:28:42.237+0000] {process_utils.py:80} INFO - Process psutil.Process(pid=97, status='terminated', exitcode=0, started='06:18:49') (97) terminated with exit code 0
[2024-05-29T06:28:42.237+0000] {process_utils.py:80} INFO - Process psutil.Process(pid=7350, status='terminated', started='06:28:36') (7350) terminated with exit code None
[2024-05-29T06:28:42.238+0000] {kubernetes_executor.py:740} INFO - Shutting down Kubernetes executor
[2024-05-29T06:28:42.238+0000] {kubernetes_executor.py:742} DEBUG - Flushing task_queue...
[2024-05-29T06:28:42.238+0000] {scheduler_job_runner.py:880} ERROR - Exception when executing Executor.end on KubernetesExecutor(parallelism=32)
Traceback (most recent call last):
  File "/opt/airflow/airflow/jobs/scheduler_job_runner.py", line 856, in _execute
    self._run_scheduler_loop()
  File "/opt/airflow/airflow/jobs/scheduler_job_runner.py", line 989, in _run_scheduler_loop
    num_queued_tis = self._do_scheduling(session)
  File "/opt/airflow/airflow/jobs/scheduler_job_runner.py", line 1071, in _do_scheduling
    callback_tuples = self._schedule_all_dag_runs(guard, dag_runs, session)
  File "/opt/airflow/airflow/utils/retries.py", line 89, in wrapped_function
    for attempt in run_with_db_retries(max_retries=retries, logger=logger, **retry_kwargs):
  File "/home/airflow/.local/lib/python3.8/site-packages/tenacity/__init__.py", line 435, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/home/airflow/.local/lib/python3.8/site-packages/tenacity/__init__.py", line 368, in iter
    result = action(retry_state)
  File "/home/airflow/.local/lib/python3.8/site-packages/tenacity/__init__.py", line 390, in <lambda>
    self._add_action_func(lambda rs: rs.outcome.result())
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/opt/airflow/airflow/utils/retries.py", line 98, in wrapped_function
    return func(*args, **kwargs)
  File "/opt/airflow/airflow/jobs/scheduler_job_runner.py", line 1415, in _schedule_all_dag_runs
    callback_tuples = [(run, self._schedule_dag_run(run, session=session)) for run in dag_runs]
  File "/opt/airflow/airflow/jobs/scheduler_job_runner.py", line 1415, in <listcomp>
    callback_tuples = [(run, self._schedule_dag_run(run, session=session)) for run in dag_runs]
  File "/opt/airflow/airflow/jobs/scheduler_job_runner.py", line 1433, in _schedule_dag_run
    dag_model = DM.get_dagmodel(dag_run.dag_id, session)
  File "/opt/airflow/airflow/utils/session.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/opt/airflow/airflow/models/dag.py", line 3737, in get_dagmodel
    return session.get(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2853, in get
    return self._get_impl(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2975, in _get_impl
    return db_load_fn(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/loading.py", line 530, in load_on_pk_identity
    session.execute(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1720, in execute
    result = compile_state_cls.orm_setup_cursor_result(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/context.py", line 349, in orm_setup_cursor_result
    return loading.instances(result, querycontext)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/loading.py", line 69, in instances
    *[
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/loading.py", line 70, in <listcomp>
    query_entity.row_processor(context, cursor)
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/context.py", line 2631, in row_processor
    _instance = loading._instance_processor(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/loading.py", line 796, in _instance_processor
    prop.create_row_processor(
  File "/home/airflow/.local/lib/python3.8/site-packages/sqlalchemy/orm/interfaces.py", line 658, in create_row_processor
    strat.create_row_processor(
  File "/opt/airflow/airflow/jobs/scheduler_job_runner.py", line 263, in _exit_gracefully
    sys.exit(os.EX_OK)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/airflow/airflow/jobs/scheduler_job_runner.py", line 878, in _execute
    executor.end()
  File "/opt/airflow/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py", line 743, in end
    self._flush_task_queue()
  File "/opt/airflow/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py", line 699, in _flush_task_queue
    self.log.debug("Executor shutting down, task_queue approximate size=%d", self.task_queue.qsize())
  File "<string>", line 2, in qsize
  File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 834, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
[2024-05-29T06:28:42.247+0000] {process_utils.py:132} INFO - Sending Signals.SIGTERM to group 97. PIDs of all processes in the group: []
[2024-05-29T06:28:42.247+0000] {process_utils.py:87} INFO - Sending the signal Signals.SIGTERM to group 97
[2024-05-29T06:28:42.247+0000] {process_utils.py:101} INFO - Sending the signal Signals.SIGTERM to process 97 as process group is missing.
[2024-05-29T06:28:42.247+0000] {scheduler_job_runner.py:886} INFO - Exited execute loop
[2024-05-29T06:28:42.258+0000] {cli_action_loggers.py:94} DEBUG - Calling callbacks: []
[2024-05-29T06:28:42.260+0000] {settings.py:425} DEBUG - Disposing DB connection pool (PID 11)

ephraimbuddy (Contributor) commented May 29, 2024

Liveness:  exec [sh -c CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --local
] delay=10s timeout=20s period=60s #success=1 #failure=5
Startup:   exec [sh -c CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --local
] delay=0s timeout=20s period=10s #success=1 #failure=6
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Warning  Unhealthy  73s (x90 over 7h19m)  kubelet  Liveness probe failed: command "sh -c CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \\\nairflow jobs check --job-type SchedulerJob --local\n" timed out

ephraimbuddy (Contributor) commented:
@dviru, why would you set the pods not to be deleted? This leads to OOM because the pods occupy some space. I'm just trying to understand your needs and to see whether we should also add another config to cap the maximum number of completed pods allowed in the deployment.

sunank200 (Collaborator) commented:
This also seems related: #38968

dviru (Author) commented Jun 7, 2024

> @dviru, why would you set the pods not to be deleted? This leads to OOM because the pods occupy some space. I'm just trying to understand your needs and to see whether we should also add another config to cap the maximum number of completed pods allowed in the deployment.

@ephraimbuddy Sometimes I want the pods to remain so that I can check the pod logs. But I am not seeing this issue in 2.7.3, even though we have 6,000-7,000 completed pods in the cluster.

ephraimbuddy (Contributor) commented:
> @dviru, why would you set the pods not to be deleted? This leads to OOM because the pods occupy some space. I'm just trying to understand your needs and to see whether we should also add another config to cap the maximum number of completed pods allowed in the deployment.
>
> @ephraimbuddy Sometimes I want the pods to remain so that I can check the pod logs. But I am not seeing this issue in 2.7.3, even though we have 6,000-7,000 completed pods in the cluster.

This problem can be solved by using remote logging. It's not right to keep 7000 completed pods in your cluster.
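For reference, a minimal airflow.cfg sketch for shipping task logs to remote storage instead of keeping completed pods around (the bucket path and connection id below are placeholders):

[logging]
remote_logging = True
# placeholder bucket and connection id
remote_base_log_folder = s3://my-airflow-logs
remote_log_conn_id = aws_default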

ephraimbuddy added a commit to astronomer/airflow that referenced this issue Jun 13, 2024
Currently, when a pod completes and is not deleted due to the user's configuration,
the watcher keeps listing these pods and checking their status. We should instead stop
watching the pod once it succeeds. To do that, pods are created with the executor done
label set to False and changed to True when the pod completes. The watcher then watches
only those pods whose executor done label is False

closes: apache#22612
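A minimal sketch of the label-based filtering described in this commit, assuming the kubernetes Python client; the label name and selector below are illustrative, not necessarily the exact ones introduced in #40183:

from kubernetes import client, config, watch

config.load_incluster_config()
v1 = client.CoreV1Api()
DONE_LABEL = "airflow_executor_done"        # illustrative label name

# Worker pods are created with the done-label set to "False" in their metadata,
# so the watcher only streams events for pods that are not finished yet instead
# of re-listing every completed pod kept in the cluster.
w = watch.Watch()
for event in w.stream(
    v1.list_namespaced_pod,
    namespace="airflow",                    # assumed namespace
    label_selector=f"{DONE_LABEL}!=True",
):
    pod = event["object"]
    if pod.status.phase in ("Succeeded", "Failed"):
        # Flip the label once; the pod then falls outside the watch selector.
        v1.patch_namespaced_pod(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            body={"metadata": {"labels": {DONE_LABEL: "True"}}},
        )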
ephraimbuddy added a commit that referenced this issue Jun 13, 2024
Fix Scheduler restarting due to too many completed pods in cluster (#40183)

* Fix Scheduler restarting due to too many completed pods in cluster

Currently, when a pod completes and is not deleted due to the user's configuration,
the watcher keeps listing these pods and checking their status. We should instead stop
watching the pod once it succeeds. To do that, pods are created with the executor done
label set to False and changed to True when the pod completes. The watcher then watches
only those pods whose executor done label is False

closes: #22612

* Update airflow/providers/cncf/kubernetes/pod_generator.py

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* Add back removed section

* Don't add pod key label from get go

* Update airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
romsharon98 pushed a commit to romsharon98/airflow that referenced this issue Jul 26, 2024