On failed KPO with deferrable=True, do_xcom_push=True - task never completes due to hanging xcom container #37298
Closed
2 tasks done
Labels
area:core
kind:bug
This is clearly a bug
needs-triage
label for new issues that we didn't triage yet
Apache Airflow version
2.8.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
When KubernetesPodOperator (KPO) and KPO subclasses (like the GKEStartPodOperator) are set to deferrable=True and do_xcom_push=True and the base container fails, the Airflow task hangs forever. This is because the xcom sidecar container stays running and is never killed, so the pod never leaves the Running phase.
Path is as follows:
remote_pod.status.phase in PodPhase.terminal_states
never returns True since the xcom container stays running. See screenshot -
As a result, this is what you see in the logs -
Despite the pod having already failed -
What you think should happen instead?
We should call self.extract_xcom in the failure scenario just as we do in the success scenario to kill the xcom sidecar. There might not be any values to extract but based on the docstring, extract_xcom also has the side-effect of killing the xcom container which will allow the while loop to reach a terminal state.
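To make the proposed behaviour concrete, here is a toy model of the hang and of the suggested fix. It is only a sketch and does not use the provider's code: `ToyPod`, `kill_sidecar`, `await_terminal_phase` and the `stop_sidecar_on_failure` flag are hypothetical names invented for illustration, `TERMINAL_PHASES` stands in for `PodPhase.terminal_states`, and the polling loop is bounded by `max_polls` so that the hang shows up as a `None` result rather than an infinite loop.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for PodPhase.terminal_states (assumption for this sketch).
TERMINAL_PHASES = {"Succeeded", "Failed"}


@dataclass
class ToyPod:
    """Hypothetical pod with one base container and one xcom sidecar."""

    base_exit_code: Optional[int] = None  # None means the base container is still running
    sidecar_running: bool = True

    @property
    def phase(self) -> str:
        # The pod stays "Running" while any container is still running,
        # even if the base container has already exited with an error.
        if self.base_exit_code is None or self.sidecar_running:
            return "Running"
        return "Succeeded" if self.base_exit_code == 0 else "Failed"


def kill_sidecar(pod: ToyPod) -> None:
    """Models the side effect of extract_xcom described in the issue."""
    pod.sidecar_running = False


def await_terminal_phase(
    pod: ToyPod, max_polls: int = 5, stop_sidecar_on_failure: bool = False
) -> Optional[str]:
    """Poll the pod; return its terminal phase, or None if it never got there."""
    for _ in range(max_polls):
        if pod.phase in TERMINAL_PHASES:
            return pod.phase
        base_failed = pod.base_exit_code not in (None, 0)
        if stop_sidecar_on_failure and base_failed:
            # Proposed behaviour: stop the sidecar on failure too.
            kill_sidecar(pod)
    return None


# Current behaviour: the base container failed, the sidecar keeps the pod "Running".
hung = await_terminal_phase(ToyPod(base_exit_code=1))

# Proposed behaviour: the sidecar is stopped, so the pod reaches a terminal phase.
fixed = await_terminal_phase(ToyPod(base_exit_code=1), stop_sidecar_on_failure=True)
```

In this model `hung` is `None` because the loop never observes a terminal phase, while `fixed` is `"Failed"` once the sidecar has been stopped. The real operator would still need to surface the failure to the user after the loop exits.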
How to reproduce
Create a failing DAG with deferrable=True and do_xcom_push=True and observe the task hanging in execute_complete.
Sample dag:
Operating System
debian11
Versions of Apache Airflow Providers
providers-cncf-kubernetes/7.14.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
Anything else?
NOTE: if using ADC credentials, this PR, which breaks the ADC auth flow, needs to be reverted first: #37081
cc @Lee-W @pankajkoti @dirrao @hussein-awala
Are you willing to submit PR?
Code of Conduct