My setup is as follows:

- running Google Composer (composer-2.9.11-airflow-2.10.2)
- the DAG starts several (e.g. 10) tasks using EksPodOperator at a certain hour (a minimal sketch follows below)
- Composer runs in a US region, while the tasks are started in an AWS Asia region (so in theory there is some delay/slowness in the communication)
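For reference, here is roughly what the DAG looks like. The cluster name, image, namespace and schedule below are illustrative placeholders, not the real values; only the dag_id / task_id pattern and the `aws_g` connection come from the logs, and the AWS region is taken from the connection:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(
    dag_id="pcap",
    start_date=datetime(2024, 1, 1),
    schedule="48 0 * * *",   # illustrative; the real DAG runs once per day
    catchup=False,
):
    for i in range(10):
        # ~10 of these become runnable at the same moment on the worker
        EksPodOperator(
            task_id=f"parse.m{i}",
            cluster_name="my-eks-cluster",            # illustrative
            aws_conn_id="aws_g",
            namespace="default",                      # illustrative
            name=f"pcap-parse-m{i}",
            image="my-registry/pcap-parser:latest",   # illustrative
            get_logs=True,
        )
```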
The issue I'm observing is that workers get restarted by the liveness probe in the Composer set-up (this may be a Composer-specific configuration; so far I have tried scaling up the environment, but the problem keeps popping up) shortly after they start executing the pod-creation tasks.
The last log lines are:
[2024-11-19, 01:52:42 UTC] {connection_wrapper.py:325} INFO - AWS Connection (conn_id='aws_g', conn_type='aws') credentials retrieved from login and password.
[2024-11-19, 01:52:45 UTC] {baseoperator.py:405} WARNING - EksPodOperator.execute cannot be called outside TaskInstance!
[2024-11-19, 01:52:45 UTC] {pod.py:1139} INFO - Building pod pcap-parse-ciavnb0t with labels: {'dag_id': 'pcap', 'task_id': 'parse.m0', 'run_id': 'scheduled__2024-11-18T0048000000-1fa6cc691', 'kubernetes_pod_operator': 'True', 'try_number': '4'}
after which the worker gets killed and all of its running tasks are set to failed (they manage to re-claim the running pods if there are remaining attempts, and the new attempt then gets to run again).
When the tasks are started more slowly (e.g. one after another with a delay of minutes), it seems to behave more stably, so this is likely just resource exhaustion on the worker. However, I'm puzzled by how quickly it goes bad; I have tried:

- bumping the available workers from 1 to 2
- adding more CPU to the worker
- switching the Composer environment to medium (in case this is related to some other operations being done, e.g. on the database)
It looks like starting more than about 3 EksPodOperator tasks at the same moment on one worker makes those tasks stuck / extremely slow and takes the whole worker down.

I'm looking for suggestions on whether:
- there is a way to limit the concurrent starting of tasks (once the tasks have started successfully, it behaves mostly stably), since I would like to keep the limit on concurrently running tasks high (a pool-based sketch follows after this list)
- this is in fact just a CPU limit issue (when it happens the worker's CPU usage is clearly higher, but still below roughly 60% of its limit) and I should simply keep adding more CPU to the worker(s)
- EksPodOperator is doing something wrong / deadlocking (?) when several of them start on the same worker
- there are some other configuration knobs I could try
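For the first point, the closest mechanisms I know of are an Airflow pool or `max_active_tis_per_dag` on the task, but as far as I understand both cap concurrently running tasks rather than just the startup phase, which is not quite what I want. A minimal sketch, with a made-up pool name and slot count (other values illustrative as above):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.eks import EksPodOperator

with DAG(dag_id="pcap", start_date=datetime(2024, 1, 1), schedule=None, catchup=False):
    for i in range(10):
        EksPodOperator(
            task_id=f"parse.m{i}",
            cluster_name="my-eks-cluster",            # illustrative
            aws_conn_id="aws_g",
            namespace="default",                      # illustrative
            name=f"pcap-parse-m{i}",
            image="my-registry/pcap-parser:latest",   # illustrative
            pool="eks_pod_startup",                   # made-up pool, created beforehand with e.g. 3 slots
            # max_active_tis_per_dag=3,               # alternative: per-task cap within the DAG
        )
```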