
cluster autoscaler doesn't apply eks-managed-ng's taint #5902

Closed

0xF0D0 opened this issue Jun 28, 2023 · 1 comment
Labels
area/cluster-autoscaler kind/bug Categorizes issue or PR as related to a bug.

Comments


0xF0D0 commented Jun 28, 2023

Which component are you using?: registry.k8s.io/autoscaling/cluster-autoscaler

What version of the component are you using?:

Component version: v1.26.3

What k8s version are you using (kubectl version)?:

kubectl version Output

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.2", GitCommit:"fc04e732bb3e7198d2fa44efa5457c7c6f8c0f5b", GitTreeState:"clean", BuildDate:"2023-02-22T13:32:21Z", GoVersion:"go1.20.1", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.5-eks-c12679a", GitCommit:"c03cecf98904742cce2e1183f87194102cc9dad9", GitTreeState:"clean", BuildDate:"2023-05-22T20:29:55Z", GoVersion:"go1.19.9", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:
An EKS k8s cluster with 5 EKS managed node groups, each of which has taints.
When I deploy a pod without a matching toleration, CA should not scale up any node group.

However, it scales up a node group even though the pod does not tolerate its taints, and when the new node joins, the pod still cannot be scheduled on it, so CA scales that node group up again until it reaches max capacity 🤯
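For illustration, a rough sketch of the kind of pod I mean (pod/container names are hypothetical; the Service=K6 taint matches the CA log output further down):

# Minimal pod with no tolerations; it cannot tolerate
# Service=K6:NoSchedule (or any of the other node groups' taints),
# so CA should find no node group worth scaling up for it.
apiVersion: v1
kind: Pod
metadata:
  name: no-toleration-demo   # hypothetical name
spec:
  containers:
    - name: app
      image: nginx
  # note: no `tolerations` field anywhere in the spec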

What did you expect to happen?:
It should not scale up the node group.

What happened instead?:

How to reproduce it (as minimally and precisely as possible):

Use an EKS managed nodegroup with a taint, but without the taint tag (k8s.io/cluster-autoscaler/node-template/taint) on the ASG. Then try to schedule a pod without a toleration.
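Roughly, a nodegroup like this reproduces it (an eksctl-style sketch; instance type and region are hypothetical), as long as nothing writes a k8s.io/cluster-autoscaler/node-template/taint/* tag onto the underlying ASG:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: xxxx          # cluster name redacted, as below
  region: us-east-1   # hypothetical
managedNodeGroups:
  - name: m5a_large_k6
    instanceType: m5a.large
    taints:
      - key: Service
        value: K6
        effect: NoSchedule
    # the taint exists on the EKS managed nodegroup only;
    # the ASG itself carries no node-template taint tag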

Anything else we need to know?:

This is my current argument setup (cluster-name mangled)

- ./cluster-autoscaler
- --v=5
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/xxxx
- --scale-down-unneeded-time=2m
- --scale-down-delay-after-add=2m

Looking at the logs, CA recognizes that the node groups are EKS-managed and picks up their taints correctly, but misbehaves when simulating scheduling.

I0628 11:49:32.426164       1 managed_nodegroup_cache.go:124] Current ManagedNodegroup cache: [{name:m5a_large_k6 clusterName:xxxx taints:[{Key:Service Value:K6 Effect:NO_SCHEDULE TimeAdded:<nil>}] labels:map[amiType:CUSTOM capacityType:ON_DEMAND eks.amazonaws.com/nodegroup:m5a_large_k6 k8sVersion:1.26]}, ...., ]

One interesting thing: when I add the taint tag (k8s.io/cluster-autoscaler/node-template/taint) on the ASG itself, it behaves as it's supposed to 🤔
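If I have the format right (per the cluster-autoscaler AWS cloud provider docs, the full tag key includes the taint key and the value is <value>:<effect>), the workaround tag on the ASG looks something like:

# Tag on the underlying auto scaling group, not on the EKS nodegroup:
Key: k8s.io/cluster-autoscaler/node-template/taint/Service
Value: "K6:NoSchedule"
PropagateAtLaunch: true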

@abstrask

Not sure why this is closed, but we've encountered the same issue, and believe we have identified the root cause (see #6481).

My colleague, @wcarlsen, and I believe we have a fix for this. Follow PR #6482 if you're interested.
