My setup is as follows:

- running Google Composer (composer-2.9.11-airflow-2.10.2)
- the DAG starts several (e.g. 10) tasks using EksPodOperator at a certain hour (a minimal sketch follows below)
- Composer runs in a US region, while the tasks are started in an AWS Asia region (so in theory there is some delay/slowness in the communication)
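For reference, here is roughly what the DAG looks like. The cluster name, image, namespace and schedule below are illustrative placeholders, not the real values; only the dag_id / task_id pattern and the `aws_g` connection come from the logs, and the AWS region is taken from the connection:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(
    dag_id="pcap",
    start_date=datetime(2024, 1, 1),
    schedule="48 0 * * *",   # illustrative; the real DAG runs once per day
    catchup=False,
):
    for i in range(10):
        # ~10 of these become runnable at the same moment on the worker
        EksPodOperator(
            task_id=f"parse.m{i}",
            cluster_name="my-eks-cluster",            # illustrative
            aws_conn_id="aws_g",
            namespace="default",                      # illustrative
            name=f"pcap-parse-m{i}",
            image="my-registry/pcap-parser:latest",   # illustrative
            get_logs=True,
        )
```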
The issue I'm observing is that workers get restarted by the liveness probe in the Composer set-up (this may be a Composer-specific configuration; so far I have tried scaling up the environment, but the problem keeps popping up) shortly after they start executing the pod-creation tasks.
The last log lines are:
[2024-11-19, 01:52:42 UTC] {connection_wrapper.py:325} INFO - AWS Connection (conn_id='aws_g', conn_type='aws') credentials retrieved from login and password.
[2024-11-19, 01:52:45 UTC] {baseoperator.py:405} WARNING - EksPodOperator.execute cannot be called outside TaskInstance!
[2024-11-19, 01:52:45 UTC] {pod.py:1139} INFO - Building pod pcap-parse-ciavnb0t with labels: {'dag_id': 'pcap', 'task_id': 'parse.m0', 'run_id': 'scheduled__2024-11-18T0048000000-1fa6cc691', 'kubernetes_pod_operator': 'True', 'try_number': '4'}
after which the worker gets killed and all of its running tasks are set to failed (they manage to re-claim the running pods if there are remaining attempts, and the new attempt then gets to run again).
When the tasks are started more slowly (e.g. one after another with a delay of minutes), it seems to behave more stably, so this is likely just resource exhaustion on the worker. However, I'm puzzled by how quickly it goes bad; I have tried:

- bumping the available workers from 1 to 2
- adding more CPU to the worker
- switching the Composer environment to medium (in case this is related to some other operations being done, e.g. on the database)
It looks like starting more than about 3 EksPodOperator tasks at the same moment on one worker makes those tasks stuck / extremely slow and takes the whole worker down.

I'm looking for suggestions on whether:
- there is a way to limit the concurrent starting of tasks (once the tasks have started successfully, it behaves mostly stably), since I would like to keep the limit on concurrently running tasks high (a pool-based sketch follows after this list)
- this is in fact just a CPU limit issue (when it happens the worker's CPU usage is clearly higher, but still below roughly 60% of its limit) and I should simply keep adding more CPU to the worker(s)
- EksPodOperator is doing something wrong / deadlocking (?) when several of them start on the same worker
- there are some other configuration knobs I could try
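For the first point, the closest mechanisms I know of are an Airflow pool or `max_active_tis_per_dag` on the task, but as far as I understand both cap concurrently running tasks rather than just the startup phase, which is not quite what I want. A minimal sketch, with a made-up pool name and slot count (other values illustrative as above):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(dag_id="pcap", start_date=datetime(2024, 1, 1), schedule=None, catchup=False):
    for i in range(10):
        EksPodOperator(
            task_id=f"parse.m{i}",
            cluster_name="my-eks-cluster",            # illustrative
            aws_conn_id="aws_g",
            namespace="default",                      # illustrative
            name=f"pcap-parse-m{i}",
            image="my-registry/pcap-parser:latest",   # illustrative
            pool="eks_pod_startup",                   # made-up pool, created beforehand with e.g. 3 slots
            # max_active_tis_per_dag=3,               # alternative: per-task cap within the DAG
        )
```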