
Cloudwatch agent pods don't get restarted when doing rollout-restart #124

Open

Veronica4036 opened this issue Mar 4, 2024 · 3 comments

@Veronica4036

Describe the bug

Not all pods of the cloudwatch-agent DaemonSet are restarted when running kubectl rollout restart ds cloudwatch-agent -n amazon-cloudwatch. Only one pod gets restarted.

Steps to reproduce

Created a cluster of version 1.28 and installed the Amazon CloudWatch Observability add-on, version v1.2.2-eksbuild.1.

Initially we have 2 pods:

kubectl get pods  -A -l app.kubernetes.io/component=amazon-cloudwatch-agent -o wide   
NAMESPACE           NAME                     READY   STATUS    RESTARTS   AGE   IP              NODE                           NOMINATED NODE   READINESS GATES
amazon-cloudwatch   cloudwatch-agent-hdbmv   1/1     Running   0          7s    172.31.78.188   ip-172-31-67-14.ec2.internal   <none>           <none>
amazon-cloudwatch   cloudwatch-agent-ttfbd   1/1     Running   0          7s    172.31.1.111    ip-172-31-5-6.ec2.internal     <none>           <none>

1st Restart:

kubectl rollout restart ds cloudwatch-agent -n amazon-cloudwatch                      
daemonset.apps/cloudwatch-agent restarted

We can see that only 1 pod got restarted; the other pod is still running:

kubectl get pods  -A -l app.kubernetes.io/component=amazon-cloudwatch-agent -o wide  -w
NAMESPACE           NAME                     READY   STATUS    RESTARTS   AGE   IP              NODE                           NOMINATED NODE   READINESS GATES
amazon-cloudwatch   cloudwatch-agent-hdbmv   1/1     Running   0          33s   172.31.78.188   ip-172-31-67-14.ec2.internal   <none>           <none>
amazon-cloudwatch   cloudwatch-agent-l2mgm   1/1     Running   0          8s    172.31.0.110    ip-172-31-5-6.ec2.internal     <none>           <none>

Same behaviour every time:

kubectl rollout restart ds cloudwatch-agent -n amazon-cloudwatch                       
daemonset.apps/cloudwatch-agent restarted


kubectl get pods  -A -l app.kubernetes.io/component=amazon-cloudwatch-agent -o wide  
NAMESPACE           NAME                     READY   STATUS    RESTARTS   AGE   IP              NODE                           NOMINATED NODE   READINESS GATES
amazon-cloudwatch   cloudwatch-agent-hdbmv   1/1     Running   0          71s   172.31.78.188   ip-172-31-67-14.ec2.internal   <none>           <none>
amazon-cloudwatch   cloudwatch-agent-st4p9   1/1     Running   0          4s    172.31.1.111    ip-172-31-5-6.ec2.internal     <none>           <none>

What did you expect to see?
I expected all the pods of the DaemonSet to be restarted.

What did you see instead?
Instead, only 1 pod is restarted.

What version did you use?
v1.2.2-eksbuild.1

What config did you use?
NA

Environment
Tried with cluster versions 1.26, 1.27 & 1.28

Additional context

I could observe a difference in how the controllerrevisions are created.

For a sample DaemonSet where rollout restart works fine, 1 new controllerrevision is created each time we perform a rollout restart:

% kubectl get controllerrevision -A
NAMESPACE           NAME                              CONTROLLER                            REVISION   AGE
amazon-cloudwatch   cloudwatch-agent-6ddd78df4        daemonset.apps/cloudwatch-agent       1          34m
amazon-cloudwatch   fluent-bit-57659b7864             daemonset.apps/fluent-bit             1          34m
default             web-79dc58f667                    statefulset.apps/web                  1          46d
kube-system         aws-node-5b47bbc5c8               daemonset.apps/aws-node               2          16d
kube-system         aws-node-5bdc4b45f4               daemonset.apps/aws-node               3          16d
kube-system         aws-node-7845867c85               daemonset.apps/aws-node               1          31d

Whereas for the cloudwatch-agent pods, the 1st controllerrevision is deleted and 2 new controllerrevisions are created; the 3rd is the same as the 1st. Below is the pattern:

$kubectl get controllerrevision -A | grep watch               
amazon-cloudwatch   cloudwatch-agent-5f44485c55       daemonset.apps/cloudwatch-agent       1          20m

$kubectl get controllerrevision -A | grep watch               
amazon-cloudwatch   cloudwatch-agent-5f44485c55       daemonset.apps/cloudwatch-agent       2          36m
amazon-cloudwatch   cloudwatch-agent-746f576ff6       daemonset.apps/cloudwatch-agent       3          47m

$kubectl get controllerrevision -A | grep watch    
amazon-cloudwatch   cloudwatch-agent-5f44485c55       daemonset.apps/cloudwatch-agent       2          40m
amazon-cloudwatch   cloudwatch-agent-746f576ff6       daemonset.apps/cloudwatch-agent       5          51m
amazon-cloudwatch   cloudwatch-agent-cd885487d        daemonset.apps/cloudwatch-agent       4          16s

$kubectl get controllerrevision -A | grep watch                  
amazon-cloudwatch   cloudwatch-agent-5f44485c55       daemonset.apps/cloudwatch-agent       2          42m
amazon-cloudwatch   cloudwatch-agent-746f576ff6       daemonset.apps/cloudwatch-agent       7          53m
amazon-cloudwatch   cloudwatch-agent-779d495df4       daemonset.apps/cloudwatch-agent       6          4s
amazon-cloudwatch   cloudwatch-agent-cd885487d        daemonset.apps/cloudwatch-agent       4          2m2s

$kubectl get controllerrevision -A | grep watch                  
amazon-cloudwatch   cloudwatch-agent-5f44485c55       daemonset.apps/cloudwatch-agent       2          42m
amazon-cloudwatch   cloudwatch-agent-746f576ff6       daemonset.apps/cloudwatch-agent       9          53m
amazon-cloudwatch   cloudwatch-agent-779d495df4       daemonset.apps/cloudwatch-agent       6          21s
amazon-cloudwatch   cloudwatch-agent-84df56d566       daemonset.apps/cloudwatch-agent       8          3s
amazon-cloudwatch   cloudwatch-agent-cd885487d        daemonset.apps/cloudwatch-agent       4          2m19s
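For anyone debugging this, one way to confirm what differs between the alternating revisions is to compare the pod-template annotations serialized in each controllerrevision's .data field. This is a sketch; the revision names below are taken from the output above and will differ on your cluster:

```shell
# Print the pod-template annotations stored in two of the alternating
# controllerrevisions; the diff should reveal which annotation is flip-flopping.
kubectl get controllerrevision cloudwatch-agent-746f576ff6 -n amazon-cloudwatch \
  -o jsonpath='{.data.spec.template.metadata.annotations}'; echo
kubectl get controllerrevision cloudwatch-agent-cd885487d -n amazon-cloudwatch \
  -o jsonpath='{.data.spec.template.metadata.annotations}'; echo
```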
@jefchien jefchien transferred this issue from aws/amazon-cloudwatch-agent Mar 13, 2024
@jefchien
Contributor

I was able to reproduce the issue with a new cluster on 1.29 and v1.3.0-eksbuild.1.

kubectl get pods  -A -l app.kubernetes.io/component=amazon-cloudwatch-agent -o wide
NAMESPACE           NAME                     READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
amazon-cloudwatch   cloudwatch-agent-btzxh   1/1     Running   0          26s   192.168.19.153   ip-192-168-24-168.us-west-1.compute.internal   <none>           <none>
amazon-cloudwatch   cloudwatch-agent-fw5lf   1/1     Running   0          26s   192.168.50.234   ip-192-168-40-36.us-west-1.compute.internal    <none>           <none>
kubectl rollout restart ds cloudwatch-agent -n amazon-cloudwatch
daemonset.apps/cloudwatch-agent restarted
kubectl get pods  -A -l app.kubernetes.io/component=amazon-cloudwatch-agent -o wide
NAMESPACE           NAME                     READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
amazon-cloudwatch   cloudwatch-agent-9hcxp   1/1     Running   0          12s   192.168.5.50     ip-192-168-24-168.us-west-1.compute.internal   <none>           <none>
amazon-cloudwatch   cloudwatch-agent-fw5lf   1/1     Running   0          44s   192.168.50.234   ip-192-168-40-36.us-west-1.compute.internal    <none>           <none>

@jefchien
Contributor

The current workaround is to delete the daemonset. The EKS Addon will recreate it.

kubectl get pods  -A -l app.kubernetes.io/component=amazon-cloudwatch-agent -o wide
NAMESPACE           NAME                     READY   STATUS    RESTARTS   AGE   IP               NODE                                           NOMINATED NODE   READINESS GATES
amazon-cloudwatch   cloudwatch-agent-4j44g   1/1     Running   0          19h   192.168.33.137   ip-192-168-40-36.us-west-1.compute.internal    <none>           <none>
amazon-cloudwatch   cloudwatch-agent-kbddt   1/1     Running   0          19h   192.168.12.41    ip-192-168-24-168.us-west-1.compute.internal   <none>           <none>
kubectl delete ds cloudwatch-agent -n amazon-cloudwatch
daemonset.apps "cloudwatch-agent" deleted
kubectl get pods  -A -l app.kubernetes.io/component=amazon-cloudwatch-agent -o wide
NAMESPACE           NAME                     READY   STATUS    RESTARTS   AGE   IP              NODE                                           NOMINATED NODE   READINESS GATES
amazon-cloudwatch   cloudwatch-agent-7nzlk   1/1     Running   0          11s   192.168.1.37    ip-192-168-24-168.us-west-1.compute.internal   <none>           <none>
amazon-cloudwatch   cloudwatch-agent-bz68l   1/1     Running   0          11s   192.168.39.69   ip-192-168-40-36.us-west-1.compute.internal    <none>           <none>

@jefchien
Contributor

kubectl rollout restart works by adding an annotation to the pod template of the resource it's restarting, which changes the template hash and triggers a rollout. The add-on reconciles the DaemonSet back to its desired state and removes the annotation, which the controller sees as another template change — hence the multiple controllerrevisions.
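Concretely, kubectl rollout restart patches the DaemonSet's pod template with a timestamp annotation along these lines (the timestamp value here is illustrative):

```yaml
spec:
  template:
    metadata:
      annotations:
        # Added by `kubectl rollout restart`; changing the pod template is
        # what triggers the DaemonSet controller to roll new pods.
        kubectl.kubernetes.io/restartedAt: "2024-03-04T00:00:00Z"
```

When the add-on strips this annotation mid-rollout, the controller sees yet another template change, which would explain why only the first node's pod is replaced before the template reverts.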
