Scheduler going down for 1-2 minutes every 10 minutes as completed pods increase in EKS #22612
Comments
Thanks for opening your first issue here! Be sure to follow the issue template!
cc: @dstandish -> what we talked about :)
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly recheck the report against the latest Airflow version and let us know if the issue is still reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
We are seeing this issue in Airflow version 2.3.3. I strongly believe the issue is present in the latest Airflow version, 2.9.1, as well, going by the latest code. I don't see any improvements in watcher performance between 2.3.3 and 2.9.1.
I have had no luck reproducing this because my cluster gets destroyed by OOM when I get to 2000-3000 tasks. Pruned my system but still run into this ORM. Edit:
@dviru, why would you set the pods not to be deleted? This leads to OOM because the pods occupy some space. Just trying to understand your needs and see whether we should also have another config option to cap the maximum number of completed pods allowed in the deployment.
This also seems related: #38968
@ephraimbuddy Sometimes I want the pod to still exist so that I can check the pod logs. But I am not seeing this issue in 2.7.3, even though we have 6000-7000 completed pods in the cluster.
This problem can be solved by using remote logging. It's not right to keep 7000 completed pods in your cluster.
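For reference, a minimal sketch of what enabling remote logging could look like in airflow.cfg (Airflow 2.x); the S3 bucket and connection id below are placeholders, not values from this issue:

```ini
[logging]
# Ship task logs to remote storage so completed worker pods don't need to be kept around
remote_logging = True
# Placeholder bucket/prefix -- point this at your own log store
remote_base_log_folder = s3://my-airflow-logs/
# Placeholder Airflow connection id with write access to the bucket
remote_log_conn_id = aws_default
```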
Currently, when a pod completes and is not deleted due to the user's configuration, the watcher keeps listing these pods and checking their status. We should instead stop watching the pod once it succeeds. To do that, pods are created with the executor done label set to False, and the label is changed to True when the pod completes. The watcher then watches only those pods whose executor done label is False.

closes: apache#22612
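A rough sketch of that labelling idea using the Kubernetes Python client; the label name, namespace, and selector below are illustrative assumptions for this sketch, not the provider's actual identifiers:

```python
# Illustrative sketch only -- not the actual Airflow provider code.
from kubernetes import client, config, watch

config.load_incluster_config()              # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "airflow"                       # placeholder namespace
DONE_LABEL = "airflow_executor_done"        # assumed label name, for illustration

# 1. Worker pods would be created with the "done" label set to False
#    (this dict would be attached to the pod template at creation time).
pod_labels = {DONE_LABEL: "False"}

# 2. The watcher only streams pods whose "done" label is not yet True.
w = watch.Watch()
for event in w.stream(
    v1.list_namespaced_pod,
    namespace=NAMESPACE,
    label_selector=f"{DONE_LABEL}!=True",
):
    pod = event["object"]
    if pod.status.phase in ("Succeeded", "Failed"):
        # 3. Once a pod completes, flip the label to True so it drops out of
        #    every subsequent list/watch cycle instead of being re-processed.
        v1.patch_namespaced_pod(
            name=pod.metadata.name,
            namespace=NAMESPACE,
            body={"metadata": {"labels": {DONE_LABEL: "True"}}},
        )
```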
…40183)

* Fix Scheduler restarting due to too many completed pods in cluster

  Currently, when a pod completes and is not deleted due to the user's configuration, the watcher keeps listing these pods and checking their status. We should instead stop watching the pod once it succeeds. To do that, pods are created with the executor done label set to False, and the label is changed to True when the pod completes. The watcher then watches only those pods whose executor done label is False.

  closes: #22612

* Update airflow/providers/cncf/kubernetes/pod_generator.py

  Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* Add back removed section

* Don't add pod key label from the get-go

* Update airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py

  Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Apache Airflow version
2.2.4 (latest released)
What happened
Hi Team, I am using Airflow 2.2.4 deployed on an AWS EKS cluster. I noticed that every 5-10 minutes a "scheduler down" message appears on the Airflow UI. When I checked the Airflow scheduler log, I saw a lot of statements like the one below.
[2022-03-21 08:21:21,640] {kubernetes_executor.py:729} INFO - Attempting to adopt pod sampletask.05b6f567b4a64bd5beb16e526ba94d7a
The statement above is printed for every completed pod that exists in EKS, but it repeats multiple times and also invokes the PATCH API.
As per my understanding, what is happening is that the pod-adoption code (kubernetes_executor.py, see the log line above) pulls the details of all completed pods from the EKS cluster every time and invokes the PATCH API on each completed pod. This activity finishes in about 1 minute for 1,000 completed pods, but takes 3-5 minutes for 7,000 completed pods, and that is why the scheduler goes down. A rough illustration of the pattern is sketched below.
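The following is only an illustration of that pattern, not the actual kubernetes_executor.py code; the namespace, label selector, and patched label are placeholders. The point is that both the list call and the per-pod PATCH calls scale with the total number of completed pods left in the cluster:

```python
# Illustrative sketch of the adoption pattern described above -- not the
# actual Airflow executor code.
from kubernetes import client, config

config.load_incluster_config()                       # scheduler runs in-cluster
v1 = client.CoreV1Api()

NAMESPACE = "airflow"                                # placeholder namespace
LABEL_SELECTOR = "kubernetes_executor=True"          # placeholder selector

def adopt_pods(scheduler_job_id: str) -> None:
    # Lists *every* matching pod, including the thousands of completed ones
    # that were never deleted because delete_worker_pods = False.
    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    for pod in pods.items:
        # One PATCH request per pod: ~1 minute for 1,000 completed pods,
        # 3-5 minutes for 7,000, which starves the scheduler's health check.
        v1.patch_namespaced_pod(
            name=pod.metadata.name,
            namespace=NAMESPACE,
            body={"metadata": {"labels": {"airflow-worker": scheduler_job_id}}},
        )
```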
What you think should happen instead
The scheduler stays healthy when we set "delete_worker_pods = True", but when delete_worker_pods = False and the completed pod count grows to 7,000-10,000, the scheduler goes down.
The scheduler should be healthy irrespective of how many completed pods exist in the EKS cluster.
How to reproduce
Deploy Airflow in a Kubernetes cluster and set "delete_worker_pods = False" (see the config sketch below). Once the completed pod count reaches 7,000 to 10,000, you will be able to see this issue.
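For illustration, the relevant airflow.cfg fragment would look roughly like this (in Airflow 2.2.x the option lives under [kubernetes]; newer provider versions use [kubernetes_executor]):

```ini
[kubernetes]
# Keep worker pods around after completion (this is what triggers the issue)
delete_worker_pods = False
```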
Operating System
OS: Debian GNU/Linux, VERSION: 10
Versions of Apache Airflow Providers
No response
Deployment
Other Docker-based deployment
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct