
Tries to schedule on fargate nodes #183

Open
wonko opened this issue Jun 7, 2024 · 4 comments

wonko commented Jun 7, 2024

In aws-observability/helm-charts#41 a default "all" toleration was added to all of the daemonsets. This results in a daemonset that can never roll out completely, because its pods will never come up on the Fargate nodes: those nodes carry the taint eks.amazonaws.com/compute-type=fargate:NoSchedule, but the overly liberal default toleration ignores it, so pods are still created for them and stay Pending.

As a result, the addon upgrade never finishes (the daemonset keeps having pending pods).

I believe the operator should set correct tolerations (or the default tolerations in the helm charts should be sensible defaults for a normal EKS cluster; in that case this issue should move to https://github.com/aws-observability/helm-charts).
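
For illustration, this is the shape of the problem and one common way to handle it. This is a sketch only, not the chart's actual manifest, and it assumes Fargate nodes carry the eks.amazonaws.com/compute-type=fargate label:

# Pod spec fragment, illustrative only.
# A toleration with no key and operator Exists matches every taint,
# including eks.amazonaws.com/compute-type=fargate:NoSchedule.
tolerations:
  - operator: Exists
# Because the taint is tolerated, keeping daemonset pods off Fargate needs a
# separate constraint, for example a node-affinity rule on the compute-type
# label (assumed here; not necessarily what the chart or operator uses):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/compute-type
              operator: NotIn
              values:
                - fargate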

wonko (Author) commented Jun 7, 2024

As an example, this is a cluster with 3 regular nodes and 2 Fargate nodes. It will never recover from this situation:

➜ kubectl get pods -n amazon-cloudwatch
NAME                                                              READY   STATUS    RESTARTS   AGE
amazon-cloudwatch-observability-controller-manager-6cf7f6d5cbb2   1/1     Running   0          82m
cloudwatch-agent-56h5t                                            0/1     Pending   0          82m
cloudwatch-agent-bwfqm                                            1/1     Running   0          82m
cloudwatch-agent-hc4xj                                            1/1     Running   0          82m
cloudwatch-agent-qxwg8                                            0/1     Pending   0          82m
cloudwatch-agent-r6zz8                                            1/1     Running   0          82m
fluent-bit-7dzv2                                                  1/1     Running   0          82m
fluent-bit-q2vmz                                                  0/1     Pending   0          82m
fluent-bit-tzdbh                                                  1/1     Running   0          82m
fluent-bit-zpddx                                                  1/1     Running   0          82m
fluent-bit-ztvl4                                                  0/1     Pending   0          82m
➜ kubectl get nodes
NAME                                                   STATUS   ROLES    AGE    VERSION
fargate-ip-10-0-13-234.eu-central-1.compute.internal   Ready    <none>   5h5m   v1.29.0-eks-680e576
fargate-ip-10-0-41-78.eu-central-1.compute.internal    Ready    <none>   5h5m   v1.29.0-eks-680e576
ip-10-0-12-246.eu-central-1.compute.internal           Ready    <none>   15h    v1.29.1-eks-61c0bbb
ip-10-0-30-109.eu-central-1.compute.internal           Ready    <none>   15h    v1.29.1-eks-61c0bbb
ip-10-0-38-61.eu-central-1.compute.internal            Ready    <none>   18h    v1.29.1-eks-61c0bbb
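
The taint that the blanket toleration masks should be visible on the Fargate nodes above (a sketch of the check, not captured output):

# Sketch, not captured output; node name taken from the listing above.
kubectl describe node fargate-ip-10-0-13-234.eu-central-1.compute.internal | grep Taints
# Expected to include: eks.amazonaws.com/compute-type=fargate:NoSchedule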

lisguo (Contributor) commented Jun 11, 2024

I saw you mention in aws-observability/helm-charts#41 that this breaks the add-on upgrade on your cluster:

I believe this resulted in the daemonsets trying to schedule onto fargate nodes, which will never work. This breaks the addon upgrade, as the daemonset never rolls out completely.

Can you elaborate more? Did you see the add-on in a "Degraded" status when you tried upgrading to 1.7.0?

wonko (Author) commented Jun 11, 2024

It actually goes to a "Failed" state on the AWS console (the addon tab on the cluster details page). On the cluster itself, the rollout never completes, as the pods cannot run on the Fargate nodes. See the kubectl output in the comment above: 2 pending and 3 running, for both fluent-bit and cloudwatch-agent.

Removing the addon and re-installing version 1.6.0 fixes this; that is how I have worked around it for now.
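
A sketch of that workaround with the AWS CLI (the cluster name and the exact 1.6.0 build suffix are placeholders; the available version strings can be listed with describe-addon-versions first):

# Sketch only: <my-cluster> and the eksbuild suffix are placeholders.
aws eks describe-addon-versions --addon-name amazon-cloudwatch-observability
aws eks delete-addon --cluster-name <my-cluster> --addon-name amazon-cloudwatch-observability
aws eks create-addon --cluster-name <my-cluster> --addon-name amazon-cloudwatch-observability \
    --addon-version v1.6.0-eksbuild.1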

LeoSpyke commented Jul 2, 2024

We are having the same issue with version 1.7.0 and we partially solved it by setting

tolerations: []

in the addon configuration.
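
One way such a configuration might be applied, sketched with the AWS CLI (the cluster name and the eksbuild suffix are placeholders, and the exact configuration schema should be confirmed with describe-addon-configuration before relying on a top-level tolerations key):

# Sketch only: <my-cluster> and the eksbuild suffix are placeholders.
aws eks describe-addon-configuration --addon-name amazon-cloudwatch-observability \
    --addon-version v1.7.0-eksbuild.1 --query configurationSchema --output text
aws eks update-addon --cluster-name <my-cluster> --addon-name amazon-cloudwatch-observability \
    --configuration-values '{"tolerations": []}'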

However, this only keeps the "fluent-bit" pods off the Fargate nodes, not the "cloudwatch-agent" ones.

The result is the following:

$> kubectl get all -n amazon-cloudwatch
NAME                                                                  READY   STATUS    RESTARTS   AGE
pod/amazon-cloudwatch-observability-controller-manager-7cc96d555x9r   1/1     Running   0          25h
pod/cloudwatch-agent-fk966                                            0/1     Pending   0          24h
pod/cloudwatch-agent-hfrbq                                            0/1     Pending   0          4h22m
pod/cloudwatch-agent-jf2f5                                            1/1     Running   0          29h
pod/cloudwatch-agent-m5vhs                                            0/1     Pending   0          4h27m
pod/cloudwatch-agent-q42tc                                            0/1     Pending   0          4h27m
pod/cloudwatch-agent-qx5m5                                            0/1     Pending   0          4h18m
pod/cloudwatch-agent-ssffq                                            1/1     Running   0          29h
pod/cloudwatch-agent-x52bc                                            0/1     Pending   0          4h25m
pod/fluent-bit-b28vq                                                  1/1     Running   0          25h
pod/fluent-bit-tklzt                                                  1/1     Running   0          25h

EDIT: solved it by uninstalling and reinstalling the addon as suggested by @wonko.
