
Tries to schedule on fargate nodes #183

Open
wonko opened this issue Jun 7, 2024 · 4 comments

wonko commented Jun 7, 2024

In aws-observability/helm-charts#41 a default "all" toleration was added to all of the daemonsets. This results in a daemonset that can never roll out completely, because its pods will never come up on the Fargate nodes: those nodes carry the taint eks.amazonaws.com/compute-type=fargate:NoSchedule, but the overly liberal default toleration ignores it, so pods are still created for them and stay Pending.

As a result, the addon upgrade never finishes (the daemonset keeps having pending pods).

I believe the operator should set correct tolerations (or the default tolerations in the helm charts should be sensible defaults for a normal EKS cluster; in that case this issue should move to https://github.com/aws-observability/helm-charts).
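
For illustration, this is the shape of the problem and one common way to handle it. This is a sketch only, not the chart's actual manifest, and it assumes Fargate nodes carry the eks.amazonaws.com/compute-type=fargate label:

# Pod spec fragment, illustrative only.
# A toleration with no key and operator Exists matches every taint,
# including eks.amazonaws.com/compute-type=fargate:NoSchedule.
tolerations:
  - operator: Exists
# Because the taint is tolerated, keeping daemonset pods off Fargate needs a
# separate constraint, for example a node-affinity rule on the compute-type
# label (assumed here; not necessarily what the chart or operator uses):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/compute-type
              operator: NotIn
              values:
                - fargate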

wonko (Author) commented Jun 7, 2024

As an example, this is a cluster with 3 regular nodes and 2 Fargate nodes. It will never recover from this situation:

➜ kubectl get pods -n amazon-cloudwatch
NAME                                                              READY   STATUS    RESTARTS   AGE
amazon-cloudwatch-observability-controller-manager-6cf7f6d5cbb2   1/1     Running   0          82m
cloudwatch-agent-56h5t                                            0/1     Pending   0          82m
cloudwatch-agent-bwfqm                                            1/1     Running   0          82m
cloudwatch-agent-hc4xj                                            1/1     Running   0          82m
cloudwatch-agent-qxwg8                                            0/1     Pending   0          82m
cloudwatch-agent-r6zz8                                            1/1     Running   0          82m
fluent-bit-7dzv2                                                  1/1     Running   0          82m
fluent-bit-q2vmz                                                  0/1     Pending   0          82m
fluent-bit-tzdbh                                                  1/1     Running   0          82m
fluent-bit-zpddx                                                  1/1     Running   0          82m
fluent-bit-ztvl4                                                  0/1     Pending   0          82m
➜ kubectl get nodes
NAME                                                   STATUS   ROLES    AGE    VERSION
fargate-ip-10-0-13-234.eu-central-1.compute.internal   Ready    <none>   5h5m   v1.29.0-eks-680e576
fargate-ip-10-0-41-78.eu-central-1.compute.internal    Ready    <none>   5h5m   v1.29.0-eks-680e576
ip-10-0-12-246.eu-central-1.compute.internal           Ready    <none>   15h    v1.29.1-eks-61c0bbb
ip-10-0-30-109.eu-central-1.compute.internal           Ready    <none>   15h    v1.29.1-eks-61c0bbb
ip-10-0-38-61.eu-central-1.compute.internal            Ready    <none>   18h    v1.29.1-eks-61c0bbb
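
The taint that the blanket toleration masks should be visible on the Fargate nodes above (a sketch of the check, not captured output):

# Sketch, not captured output; node name taken from the listing above.
kubectl describe node fargate-ip-10-0-13-234.eu-central-1.compute.internal | grep Taints
# Expected to include: eks.amazonaws.com/compute-type=fargate:NoSchedule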

lisguo (Contributor) commented Jun 11, 2024

I saw you mention in aws-observability/helm-charts#41 that this breaks the add-on upgrade on your cluster:

I believe this resulted in the daemonsets trying to schedule onto fargate nodes, which will never work. This breaks the addon upgrade, as the daemonset never rolls out completely.

Can you elaborate more? Did you see the add-on in a "Degraded" status when you tried upgrading to 1.7.0?

wonko (Author) commented Jun 11, 2024

It actually goes to a "Failed" state on the AWS console (the addon tab on the cluster details page). On the cluster itself, the rollout never completes, as the pods cannot run on the Fargate nodes. See the kubectl output in the comment above: 2 pending and 3 running, for both fluent-bit and cloudwatch-agent.

Removing the addon and re-installing version 1.6.0 fixes this; that is how I have worked around it for now.
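
A sketch of that workaround with the AWS CLI (the cluster name and the exact 1.6.0 build suffix are placeholders; the available version strings can be listed with describe-addon-versions first):

# Sketch only: <my-cluster> and the eksbuild suffix are placeholders.
aws eks describe-addon-versions --addon-name amazon-cloudwatch-observability
aws eks delete-addon --cluster-name <my-cluster> --addon-name amazon-cloudwatch-observability
aws eks create-addon --cluster-name <my-cluster> --addon-name amazon-cloudwatch-observability \
    --addon-version v1.6.0-eksbuild.1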

LeoSpyke commented Jul 2, 2024

We are having the same issue with version 1.7.0 and we partially solved it by setting

tolerations: []

in the addon configuration.
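
One way such a configuration might be applied, sketched with the AWS CLI (the cluster name and the eksbuild suffix are placeholders, and the exact configuration schema should be confirmed with describe-addon-configuration before relying on a top-level tolerations key):

# Sketch only: <my-cluster> and the eksbuild suffix are placeholders.
aws eks describe-addon-configuration --addon-name amazon-cloudwatch-observability \
    --addon-version v1.7.0-eksbuild.1 --query configurationSchema --output text
aws eks update-addon --cluster-name <my-cluster> --addon-name amazon-cloudwatch-observability \
    --configuration-values '{"tolerations": []}'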

However, this only keeps the "fluent-bit" pods off the Fargate nodes, not the "cloudwatch-agent" ones.

The result is the following:

$> kubectl get all -n amazon-cloudwatch
NAME                                                                  READY   STATUS    RESTARTS   AGE
pod/amazon-cloudwatch-observability-controller-manager-7cc96d555x9r   1/1     Running   0          25h
pod/cloudwatch-agent-fk966                                            0/1     Pending   0          24h
pod/cloudwatch-agent-hfrbq                                            0/1     Pending   0          4h22m
pod/cloudwatch-agent-jf2f5                                            1/1     Running   0          29h
pod/cloudwatch-agent-m5vhs                                            0/1     Pending   0          4h27m
pod/cloudwatch-agent-q42tc                                            0/1     Pending   0          4h27m
pod/cloudwatch-agent-qx5m5                                            0/1     Pending   0          4h18m
pod/cloudwatch-agent-ssffq                                            1/1     Running   0          29h
pod/cloudwatch-agent-x52bc                                            0/1     Pending   0          4h25m
pod/fluent-bit-b28vq                                                  1/1     Running   0          25h
pod/fluent-bit-tklzt                                                  1/1     Running   0          25h

EDIT: solved it by uninstalling and reinstalling the addon as suggested by @wonko.
