Argo workflows fail to terminate on EKS #133

Open
fakeburst opened this issue Jan 27, 2022 · 5 comments

Comments

@fakeburst

Greetings!

I've encountered an issue while trying to implement Admiralty for Argo workflows. Running parallel steps on multiple clusters works like a charm, but terminating workflows does not work. It seems to boil down to the way the Argo workflow-controller stops running pods. Simplified, the process is (ref):

  • workflow-controller annotates the pod (the proxy pod in our case) with workflows.argoproj.io/execution: '{"deadline":"2022-XX-XXTXX:XX:XXZ"}' and sends SIGUSR2 to the wait sidecar
  • upon receiving the signal, wait checks said annotation and issues a kill command
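
As a rough illustration of the check in the second step, here is a minimal sketch. It assumes the check amounts to "a deadline is present and has passed"; the function names `parse_deadline` and `should_kill` are hypothetical and are not taken from Argo's code. Only the annotation key and its JSON shape come from the description above.

```python
import json
from datetime import datetime, timezone

# Annotation key as quoted in the issue description.
EXECUTION_ANNOTATION = "workflows.argoproj.io/execution"


def parse_deadline(annotations):
    """Return the deadline stored in the execution annotation, or None if absent."""
    raw = annotations.get(EXECUTION_ANNOTATION)
    if raw is None:
        return None
    deadline = json.loads(raw).get("deadline")
    if deadline is None:
        return None
    # Timestamps are assumed to look like 2022-01-27T12:00:00Z (UTC).
    parsed = datetime.strptime(deadline, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def should_kill(annotations, now):
    """Hypothetical stand-in for the check the wait sidecar runs after SIGUSR2."""
    deadline = parse_deadline(annotations)
    return deadline is not None and now >= deadline
```

With this model, a pod whose annotations never receive the execution key always yields `should_kill(...) == False`, which matches the behavior described in the next paragraph.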

The problem is that the proxy pod gets instantly synced with its PodChaperon, which carries the annotations of the delegate pod, so the annotation never reaches the delegate. As a result, wait fails the check and never issues the command to kill the containers in the pod.
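
A toy model of that sequence, assuming the sync simply replaces the proxy's annotations with the delegate's (a simplification for illustration, not Admiralty's actual controller logic). The `example.com/step-output` key and the `sync_delegate_to_proxy` helper are hypothetical.

```python
EXECUTION_ANNOTATION = "workflows.argoproj.io/execution"


def sync_delegate_to_proxy(proxy, delegate):
    """One-way sync: the proxy's annotations are replaced by a copy of the delegate's."""
    proxy["annotations"] = dict(delegate["annotations"])


proxy = {"annotations": {}}
# Hypothetical annotation written by the delegate pod in the target cluster.
delegate = {"annotations": {"example.com/step-output": "done"}}

# 1. The workflow-controller annotates the proxy pod with a deadline.
proxy["annotations"][EXECUTION_ANNOTATION] = '{"deadline":"2022-01-27T12:00:00Z"}'

# 2. The next sync pass overwrites the proxy with the delegate's view.
sync_delegate_to_proxy(proxy, delegate)

# 3. The deadline is now gone from the proxy and was never on the delegate.
deadline_on_proxy = EXECUTION_ANNOTATION in proxy["annotations"]
deadline_on_delegate = EXECUTION_ANNOTATION in delegate["annotations"]
```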

I assume that having the delegate be the only source of truth saves us from possible race conditions over which annotations should be considered "true", but this way we are not able to stop workflows, which is a needed feature when working with Argo workflows.

Please let me know if you need any logs or additional info, or whether I'm misunderstanding how the annotations are synced.

@adrienjt
Contributor

adrienjt commented Feb 1, 2022

Hi!

Admiralty doesn't currently support proxy-to-delegate pod updates, including annotation updates. However, as you noticed, Admiralty does support delegate-to-proxy pod annotation updates (so Argo can read step outputs).

To fully support Argo as it worked before version 3.2, i.e., with execution control through the workflows.argoproj.io/execution pod annotation (especially for stopping/terminating workflows and using daemon steps), we'd need two-way updates. We could likely make them deterministic with a three-way merge algorithm and some priority rules.
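
To make the idea concrete, here is a minimal sketch of a three-way merge over annotation maps. It is not an Admiralty implementation; the `three_way_merge` function, the `proxy_owned` parameter, and the priority rule itself (delegate wins conflicts unless the key is listed as proxy-owned) are assumptions chosen for illustration.

```python
_MISSING = object()  # sentinel meaning "key not present"


def three_way_merge(base, proxy, delegate, proxy_owned=frozenset()):
    """Merge two annotation dicts using their last-synced common state `base`.

    A key changed on only one side takes that side's value (including deletion).
    If both sides changed it differently, keys in `proxy_owned` take the proxy's
    value and all other keys take the delegate's (assumed priority rule).
    """
    merged = {}
    for key in set(base) | set(proxy) | set(delegate):
        b = base.get(key, _MISSING)
        p = proxy.get(key, _MISSING)
        d = delegate.get(key, _MISSING)
        if p == d:
            value = p  # both sides agree
        elif p == b:
            value = d  # only the delegate changed the key
        elif d == b:
            value = p  # only the proxy changed the key
        else:
            value = p if key in proxy_owned else d  # conflict: apply priority
        if value is not _MISSING:
            merged[key] = value
    return merged
```

Under these rules, an execution annotation added only on the proxy would survive the merge and reach the delegate, while outputs written only on the delegate would still flow back to the proxy.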

Luckily, you won't have to wait, because Argo v3.2 doesn't use annotations for execution control anymore and sends TERM signals directly instead: argoproj/argo-workflows#6022

Upgrading Argo should fix your issue.

@fakeburst
Author

Sorry for the delayed answer and thanks a lot for your advice!

We're using Argo workflows as part of Kubeflow Pipelines, so we will need to wait for the next release of Pipelines to have Argo v3.2.3 integrated. Meanwhile, I had a custom wait sidecar image built with annotation check removed, so that SIGUSR2 signal triggers docker kill directly.
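
The workaround can be pictured roughly as follows. This is only a sketch of the control flow, not the actual patched sidecar: the helper names and the injected `run` callable are hypothetical, and the real executor may select containers and signals differently.

```python
import signal


def docker_kill_command(container_id):
    """Build a plain `docker kill` invocation for one container."""
    return ["docker", "kill", container_id]


def make_sigusr2_handler(container_ids, run):
    """Return a signal handler that kills every container with no annotation check.

    `run` is whatever executes a command (for example subprocess.run). It is
    injected so the sketch can be exercised without a Docker daemon.
    """
    def handler(signum, frame):
        for container_id in container_ids:
            run(docker_kill_command(container_id))
    return handler


def install(container_ids, run):
    """Register the handler for SIGUSR2 (available on Unix-like platforms only)."""
    signal.signal(signal.SIGUSR2, make_sigusr2_handler(container_ids, run))
```

Because the handler no longer consults the pod annotations, it does not depend on the proxy-to-delegate annotation sync at all.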

But, as EKS drops support for k8s 1.18 on March 31, 2022, I've upgraded to 1.20 and encountered the issue described in #120: the Argo workflow controller executes kill commands via pods/exec requests.

Could you please tell me whether you have any ETAs for #120?
I've tried bumping the dependencies myself, but failed, having little to no experience with Go 😅

@adrienjt
Contributor

> Meanwhile, I had a custom wait sidecar image built with annotation check removed, so that SIGUSR2 signal triggers docker kill directly.

Good idea.

> Could you please tell me whether you have any ETAs for #120?

This month, hopefully next week.

@fakeburst
Author

fakeburst commented Apr 7, 2022

I'm terribly sorry for such a delayed response once again.

Thanks for the update. I've upgraded my EKS clusters to k8s 1.21 and re-installed Admiralty v0.15.1.
Yet it seems I'm missing something, as I get this error in the controller-manager pods:

main.go:323] timed out waiting for virtual kubelet serving certificate to be signed, pod logs/exec won't be supported

I've checked the CSRs and they get Approved pretty much the second the controller-manager pod launches, but the agent times out with the error anyway. What could be the reason for this error?
UPD: it seems to be related to aws/containers-roadmap#1604 (comment)
Indeed, the CSRs are never issued, only approved. Will a custom signer help in this case?
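
For reference, the difference between "approved" and "issued" is visible in the CSR object itself: approval is recorded as a condition, while the signed certificate appears in status.certificate only once a signer has acted. A small sketch that classifies a CSR given as a dict (for example, parsed from the output of `kubectl get csr NAME -o json`); the `csr_state` function and its return labels are hypothetical.

```python
def csr_state(csr):
    """Classify a CertificateSigningRequest object given as a plain dict."""
    status = csr.get("status", {})
    # Collect condition types that are set; a missing status is treated as "True".
    conditions = {
        c.get("type")
        for c in status.get("conditions", [])
        if c.get("status", "True") == "True"
    }
    if "Denied" in conditions:
        return "denied"
    if "Failed" in conditions:
        return "failed"
    if "Approved" not in conditions:
        return "pending"
    # Approved, but only issued once the signer has filled in the certificate.
    return "issued" if status.get("certificate") else "approved-not-issued"
```

A CSR that stays in the "approved-not-issued" state matches the symptom described here: the approval condition is present, but no certificate is ever written.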

Please let me know if you need any logs or additional info.

@adrienjt
Contributor

Let's continue the conversation about EKS logs/exec support in #120.

@adrienjt changed the title from "Argo workflows fail to terminate" to "Argo workflows fail to terminate on EKS" on Nov 4, 2023