
Agent stuck on drain to move from Idle to StagedUpdate #168

Closed
amille44420 opened this issue Mar 11, 2022 · 14 comments

@amille44420

Hi everyone, all my updates are getting stuck even though draining completes successfully.

Image I'm using: I deployed the latest version of bottlerocket-update-operator.yaml, which runs public.ecr.aws/bottlerocket/bottlerocket-update-operator:v0.2.0

Issue: The node being updated is stuck on draining, with the following facts:

  • The node is empty except for the pods from DaemonSets: aws-node, kube-proxy, brupop-agent, node-problem-detector, and prometheus-node-exporter
  • The logs on the brupop API server show draining events, all of which appear to have finished
  • The latest log entry from the brupop agent on the node being updated is of type CORDON_AND_DRAIN_NODE

I hit the same issue on multiple EKS clusters, though they all have a very similar setup and I installed brupop the same way on each.

Finally, oddly enough, killing the agent responsible for the update seems to solve the issue: the update resumes soon after the pod is back.

@cbgbt
Contributor

cbgbt commented Mar 11, 2022

Hi, thanks for opening this issue.

Can you tell me a bit more about the other workloads on the host? Are there a large number of pods? Are you using PodDisruptionBudgets? Do you have a sense of how long the API server spends draining the node?

During the cordon+drain, the agent synchronously requests that the API server perform the drain. This usually means waiting for the Kubernetes API to report that all of the interrupted pods have successfully terminated. It sounds like there might be a gap in the drain implementation that causes the API server to never complete. This can be correct behavior in the presence of PodDisruptionBudgets, but based on everything you've said, I'm pretty doubtful that's what's happening here.
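As a rough illustration of that flow, here is a minimal sketch of a cordon-then-drain loop using the kube crate. The function names, error handling, and the use of a plain delete (rather than the Eviction subresource) are simplifications for illustration, not the actual brupop implementation.

```rust
use k8s_openapi::api::core::v1::{Node, Pod};
use kube::api::{Api, DeleteParams, ListParams, Patch, PatchParams};
use kube::Client;
use serde_json::json;

// Hypothetical sketch of a cordon-then-drain flow; not the brupop source.
async fn cordon_and_drain(client: Client, node_name: &str) -> anyhow::Result<()> {
    // Cordon: mark the node unschedulable so no new pods are placed on it.
    let nodes: Api<Node> = Api::all(client.clone());
    let cordon = json!({ "spec": { "unschedulable": true } });
    nodes
        .patch(node_name, &PatchParams::default(), &Patch::Merge(&cordon))
        .await?;

    // Drain: remove every pod scheduled on the node (DaemonSet pods are
    // normally skipped), then wait until the API server reports each one gone.
    let all_pods: Api<Pod> = Api::all(client.clone());
    let on_node = all_pods
        .list(&ListParams::default().fields(&format!("spec.nodeName={}", node_name)))
        .await?;
    for pod in on_node {
        let name = pod.metadata.name.clone().unwrap_or_default();
        let ns = pod.metadata.namespace.clone().unwrap_or_default();
        let pods: Api<Pod> = Api::namespaced(client.clone(), &ns);
        // kubectl (and brupop) use the Eviction subresource here so that
        // PodDisruptionBudgets are honored; a plain delete is shown for brevity.
        let _ = pods.delete(&name, &DeleteParams::default()).await;
        // ...then poll the API until the pod is reported deleted.
    }
    Ok(())
}
```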

Brupop emits log messages for spans, so one way to confirm would be to see whether there is both a start and an end message in the APIServer logs for the span responsible for draining. I believe that span should be labelled DRAIN_NODE. Unfortunately, it may have a naming collision with a subsequent subspan; I'm running some tests to double-check.

@amille44420
Author

@cbgbt The workload is not heavy, around ten pods or so; most nodes are t3.small. I do not use PodDisruptionBudgets, and I believe the draining takes less than 5 minutes, usually 1 or 2. When I check, I can see all pods are running on other nodes as expected, and only DaemonSet pods remain on the node to update.

To be frank, I looked at the logs to see if there was anything wrong, but I couldn't find any leads. I have extracted the logs from the moment the update starts for both the apiserver and the agent; you may find them in the following gist.

What is quite surprising is that killing the agent seems to resume the process, which then moves from Idle to StagedUpdate and so on.

@cbgbt
Contributor

cbgbt commented Mar 15, 2022

Thanks so much for the logs! I'm a bit surprised that we didn't hit this in testing, but I think what's happening is that the node being drained also hosts the apiserver pod serving the drain request. There must be some case that causes this to hang, though I would expect the drain request to time out and be retried. We'll dig deeper and introduce a fix (and some tests).

@Mgonand

Mgonand commented Mar 17, 2022

I have the exact same issue in the same scenario. I was able to update all my nodes by killing the update agent, as @amille44420 said.

@cbgbt
Contributor

cbgbt commented Mar 21, 2022

Thanks for the additional data point! We're working on providing a fix for this issue.

@gthao313
Member

gthao313 commented Apr 2, 2022

Hi @amille44420 @Mgonand. Thanks for your patience. According to the logs @amille44420 shared, we found the agent was waiting to evict the pod prjx-mailhog-0 during draining, but for some reason the pod prjx-mailhog-0 couldn't be deleted or needed more time to process the eviction. Therefore, the agent was stuck on StagedUpdate. Killing the update agent restarted the eviction process and successfully drained the node (evicting all pods living on it). We tried to reproduce this issue; however, we only succeeded once. Can you provide more details to help us fix this bug?

  1. How often are you hitting this issue now? Do you still hit it every time?
  2. Can you provide logs from the other times you got stuck? We want to confirm whether waiting for pod deletion is really what stalls the drain.

Appreciate that!

@amille44420
Author

amille44420 commented Apr 2, 2022

@gthao313 Sure thing, here is another gist with logs from another agent in the same situation. I also attached the list of remaining pods (retrieved with kubectl describe node ...); as you may see, the drain was successful, as only DaemonSet pods remain.

So far it seems to have happened on all my clusters, for all nodes. Since the clusters are set up the same way, there may be some constant factor in play. I can't think of anything unusual about the clusters themselves, created through the AWS console; the only unusual setting might be increasing the number of IPs per EC2 instance, but I don't see any errors related to that.

Is there anything else you want to know, or that I could provide, to help you identify the issue?

@gthao313
Member

gthao313 commented Apr 4, 2022

@amille44420 Thanks for your logs. I tried to reproduce the issue but still couldn't get it to hang. Can you help with identifying it? I built a custom amd64 brupop image that logs the Kubernetes responses while waiting for pod deletion. Can you install that brupop image on your cluster and provide the logs to us? Thank you. : )

bottlerocket-update-operator-amd64: public.ecr.aws/w5j4z0i9/bottlerocket-update-operator-amd64:latest

@amille44420
Author

amille44420 commented Apr 4, 2022

@gthao313 I noticed something surprising while running your image. It updated one node successfully before failing on the next update (by getting stuck again). Thanks to your latest image, I could see a line repeating itself, as if the server was waiting for a pod to be deleted.

[WAIT_FOR_DELETION - EVENT]  Pod 'XXXX-master-0' not yet deleted. Waiting 5s.

However, when I check, this pod is now running on another node, and only pods from DaemonSets are still running on the node to update. So I wonder if the StatefulSet providing consistent pod names could be the issue here?

@gthao313
Member

gthao313 commented Apr 4, 2022

@amille44420 This log says 'XXXX-master-0' was still alive, so the eviction may not have worked properly. Can you share the complete log as well, so I can check whether it is stuck on deleting 'XXXX-master-0' forever?

  [WAIT_FOR_DELETION - EVENT]  Pod 'XXXX-master-0' not yet deleted. Waiting 5s.

  However, when I check, this pod is now running on another node, and only pods from DaemonSets are still running on the node to update. So I wonder if the StatefulSet providing consistent pod names could be the issue here?

You mean the name of the evicted pod didn't change after the drain? For example, before eviction the pod's name is abcd-master-0, and after draining, the pod now running on another node is still named abcd-master-0. Can you verify that by comparing the pod name before draining with the pod name after draining (running on another node after eviction)? Thanks

I realized that the names of all pods running in your cluster start with prjx-, and I think that might be the issue. In the wait-for-delete function, we target the pod by searching for the old pod name in the brupop namespace; therefore, if the name doesn't change, the wait-for-delete function will keep waiting for that pod to be deleted, which never happens.
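To illustrate that failure mode, here is a minimal sketch of a name-only wait loop using the kube crate; the function name and polling interval are hypothetical, not the literal brupop implementation. A StatefulSet pod recreated under the same name keeps this loop spinning forever.

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::{Api, Error};
use std::time::Duration;

// Hypothetical sketch of a name-only wait loop (not the brupop source):
// it treats "the name still resolves" as "the evicted pod still exists".
async fn wait_for_deletion_by_name(pods: Api<Pod>, name: &str) -> Result<(), Error> {
    loop {
        match pods.get(name).await {
            // 404 means the pod is gone, so the drain can move on.
            Err(Error::Api(err)) if err.code == 404 => return Ok(()),
            Err(other) => return Err(other),
            // A StatefulSet recreates the pod under the same name (possibly on
            // another node), so this arm keeps matching and the loop never exits.
            Ok(_) => tokio::time::sleep(Duration::from_secs(5)).await,
        }
    }
}
```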

@amille44420
Author

@gthao313 Sure thing, here are the logs.

And yes, I can confirm the names of the pods created by my StatefulSets do not change: the pod did get evicted and was recreated on a new node with the same name, as I would expect for a StatefulSet.

@gthao313
Member

gthao313 commented Apr 4, 2022

@amille44420 Awesome! I think we've found the root cause. We appreciate your help, and we'll provide a fix for this issue : )

@cbgbt
Contributor

cbgbt commented Apr 4, 2022

StatefulSet does seem to be the culprit, since the name of the pod won't change. Nice sleuthing!

The offending code in Brupop is here. We use the name of the evicted pod and query the k8s API, waiting for it to 404 before declaring a pod deleted. "drain" is unfortunately implemented client-side in k8s. This implementation is based on the kubectl drain implementation.

What we're missing is this bit where we also check to see if the UID of the pod with the given name has changed.
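For illustration, here is a minimal sketch of that UID-aware check using the kube crate (hypothetical names; not the actual brupop or kubectl code): record the pod's UID before eviction and treat "same name, different UID" as deleted.

```rust
use k8s_openapi::api::core::v1::Pod;
use kube::{Api, Error};
use std::time::Duration;

// Hypothetical sketch of the UID-aware check: "same name but different UID"
// means the evicted pod is gone and a replacement (e.g. a StatefulSet pod
// reusing the name) has taken its place.
async fn wait_for_deletion(pods: Api<Pod>, name: &str, original_uid: &str) -> Result<(), Error> {
    loop {
        match pods.get(name).await {
            Err(Error::Api(err)) if err.code == 404 => return Ok(()),
            Err(other) => return Err(other),
            Ok(pod) => {
                let current_uid = pod.metadata.uid.as_deref().unwrap_or_default();
                if current_uid != original_uid {
                    // The name was reused by a new pod; the original is deleted.
                    return Ok(());
                }
                tokio::time::sleep(Duration::from_secs(5)).await;
            }
        }
    }
}
```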

@gthao313
Member

gthao313 commented Apr 6, 2022

Hi, we have merged the fix PR, and it will be available in the next release. Thanks
