Agent stuck on drain to move from Idle to StagedUpdate #168
Comments
Hi, thanks for opening this issue. Can you tell me a bit more about the workloads that are otherwise on the host? Are there a large number of pods? Are you using PodDisruptionBudgets? Do you have a sense of how long the API server is spending draining the node? During the cordon+drain, the agent synchronously requests that the API server perform the drain. This usually means waiting for the Kubernetes API to report that all of the interrupted pods have successfully been terminated. It sounds like there might be a gap in the drain implementation that causes the API server to never complete. This can be correct behavior in the presence of PodDisruptionBudgets, but based on everything you've said, I'm pretty doubtful that's what happened here. Brupop emits log messages for spans, so one way to confirm would be to see if there is a start and end message in the APIServer logs for the span responsible for draining. I believe that span should be labelled
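The synchronous wait described above can be sketched as a simple polling loop. This is an illustrative Python sketch, not Brupop's actual Rust implementation; the `pod_exists` lookup is a hypothetical stand-in for querying the Kubernetes API by pod name:

```python
import time

def wait_for_pod_deletion(pod_exists, name, interval=1.0, timeout=300.0):
    """Sketch of the drain wait described above: after requesting eviction,
    poll until looking up the pod by name fails (a 404 from the real API),
    i.e. pod_exists(name) returns False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pod_exists(name):
            return True  # pod is gone; this pod's drain is complete
        time.sleep(interval)
    return False  # timed out; caller should retry or surface an error

# Simulated lookup: the pod disappears after a couple of polls.
calls = {"n": 0}
def pod_exists(name):
    calls["n"] += 1
    return calls["n"] < 3

assert wait_for_pod_deletion(pod_exists, "my-pod", interval=0.01, timeout=1.0)
```

Note that this loop keys only on the pod's *name*, which matters later in the thread.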
@cbgbt The workload is not that much, around ten pods or so; most nodes are. To be frank, I looked at the logs to see if there was anything wrong, but I couldn't see any leads. I have extracted the logs from the moment the update starts for both the. What is quite surprising is that killing the agent seems to resume the process, thereby moving from Idle to StagedUpdate and so on.
Thanks so much for the logs! I'm a bit surprised that we didn't hit this in testing, but I think what's happening is that the drained node is colocated with the apiserver serving the drain request. There must be some case which causes this to hang, though I would expect the drain request to time out and be retried. We'll dig deeper and introduce a fix (and some tests).
I have the exact same issue with the same scenario. I was able to update all my nodes by killing the update agent as @amille44420 said.
Thanks for the additional data point! We're working on providing a fix for this issue. |
Hi @amille44420 @Mgonand. Thanks for your patience. According to the logs @amille44420 shared, we found the agent was waiting on evicting the pod
Appreciate that! |
@gthao313 Sure thing, here is another gist with logs for another agent in the same situation. I also attached the list of remaining pods (retrieved with. So far it seems this has happened on all my clusters, for all nodes. The clusters are set up the same way, so there may be some constant factors in play. I can't think of anything unusual about the cluster itself, created through the AWS console; the only unusual setting might be increasing the number of IPs for the EC2 instances, but I don't see any errors related to this. Is there anything else you want to know, or that I could provide, to help you identify the issue?
@amille44420 Thanks for your logs. I tried to reproduce this but still can't hit the hang. Can you help with identifying the issue? I built a customized bottlerocket-update-operator-amd64:
@gthao313 I noticed something surprising while running your image: it updated one node successfully before getting stuck again on the next update. Thanks to your latest image I could see a line repeating itself, most likely the apiserver waiting for a pod to be deleted.
However, when I check, this pod is now running on another node, and only pods from DaemonSets are still running on the node to update. So I wonder if StatefulSets providing consistent names for their pods could be the issue here?
@amille44420 This log shows that 'XXXX-master-0' was live, which suggests the eviction check wasn't working properly. Can you share the complete log as well, so I can check whether it was stuck waiting on the deletion of 'XXXX-master-0' forever?
You mean the name of the evicted pod didn't change after the drain? For example, before eviction the pod's name is. I realized the names of all pods running in your cluster started with
@gthao313 sure thing, here are the logs. And yes, I can confirm the names of pods issued from my StatefulSets do not change; the pod did get evicted and was created on a new node with the same name, as I would expect for a StatefulSet.
@amille44420 Awesome! I think we found the root cause. We appreciate your help. We'll provide a fix for this issue : )
StatefulSet does seem to be the culprit, since the name of the pod won't change. Nice sleuthing! The offending code in Brupop is here. We use the name of the evicted pod to query the k8s API, waiting for it to 404 before declaring the pod deleted. "drain" is unfortunately implemented client-side in k8s. This implementation is based on the. What we're missing is this bit, where we would also check whether the UID of the pod with the given name has changed.
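The fix described above boils down to a small change in the "is the pod gone yet?" predicate. A minimal Python sketch of that logic (illustrative only, not Brupop's actual Rust code): a pod counts as deleted if the name now 404s *or* the pod currently holding the name has a different UID, which is what happens when a StatefulSet recreates an evicted pod under the same stable name.

```python
from typing import Optional

def pod_is_gone(evicted_uid: str, current_uid: Optional[str]) -> bool:
    """True if the evicted pod no longer exists under its old identity.

    current_uid is the UID of whatever pod currently holds the name,
    or None if the name lookup now returns 404. A name-only check
    (waiting for None) hangs forever for StatefulSet pods, because the
    controller recreates the pod under the same name; comparing UIDs
    distinguishes the old pod from its replacement.
    """
    return current_uid is None or current_uid != evicted_uid

# Deployment pod: recreated under a new random name -> old name 404s.
assert pod_is_gone("uid-aaa", None)
# StatefulSet pod: same name, fresh UID after recreation -> gone.
assert pod_is_gone("uid-aaa", "uid-bbb")
# Same name, same UID: eviction hasn't completed yet -> keep waiting.
assert not pod_is_gone("uid-aaa", "uid-aaa")
```

This mirrors the name+UID comparison the thread points to in kubectl's client-side drain logic.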
Hi, we have already merged the fix PR, and you'll be able to use it in the next release. Thanks!
Hi everyone, I am seeing all my updates get stuck despite the fact that draining is successful.
Image I'm using: I deployed the latest version of bottlerocket-update-operator.yaml, which runs public.ecr.aws/bottlerocket/bottlerocket-update-operator:v0.2.0.
Issue: The node being updated is stuck on draining, with the following facts:
- The only pods still running on the node are `aws-node`, `kube-proxy`, `brupop-agent`, `node-problem-detector`, as well as `prometheus-node-exporter`.
- The agent appears to be in the `CORDON_AND_DRAIN_NODE` step.
I hit the same issue on multiple EKS clusters, though I do have a very similar setup for all of them and installed brupop the same way.
Finally, funnily enough, killing the agent responsible for the update seems to solve the issue, as the update resumes soon after the pod is back.