
Cluster Autoscaler not working on the new AL2023 EKS optimised AMI #6963

Closed
ashishrajora0808 opened this issue Jun 23, 2024 · 9 comments
Labels
area/cluster-autoscaler, kind/bug

Comments


ashishrajora0808 commented Jun 23, 2024

Which component are you using?: registry.k8s.io/autoscaling/cluster-autoscaler:v1.29.0

cluster-autoscaler

What version of the component are you using?:

Component version: v1.29.0

What k8s version are you using (kubectl version)?: 1.30


What environment is this in?: AWS

What did you expect to happen?: On trialling the Amazon Linux 2023 EKS optimised AMI, I expected things to just work, since the EKS worker nodes have all the permissions required for the Cluster Autoscaler to talk to the Auto Scaling groups.

What happened instead?:
I am getting errors on startup of the Cluster Autoscaler that point to some sort of credentials or networking issue.

How to reproduce it (as minimally and precisely as possible):

  • Build the EKS cluster with the AL2023 EKS optimised AMI amazon/amazon-eks-node-al2023-x86_64-standard-1.27-v20240615.
  • Install Cluster Autoscaler v1.29.0 via Helm (a sketch of the install command follows the log output below).
  • The logs should show the following:

I0621 15:22:13.971945 1 aws_manager.go:79] AWS SDK Version: 1.48.7
I0621 15:22:13.972068 1 auto_scaling_groups.go:396] Regenerating instance to ASG map for ASG names: []
I0621 15:22:13.972083 1 auto_scaling_groups.go:403] Regenerating instance to ASG map for ASG tags: map[k8s.io/cluster-autoscaler/enabled: k8s.io/cluster-autoscaler/qa-ore-blue:]
E0621 15:24:14.262752 1 aws_manager.go:128] Failed to regenerate ASG cache: RequestError: send request failed
caused by: Post "https://autoscaling.us-west-2.amazonaws.com/": dial tcp: lookup autoscaling.us-west-2.amazonaws.com: i/o timeout
F0621 15:24:14.262782 1 aws_cloud_provider.go:460] Failed to create AWS Manager: RequestError: send request failed
caused by: Post "https://autoscaling.us-west-2.amazonaws.com/": dial tcp: lookup autoscaling.us-west-2.amazonaws.com: i/o timeout
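
For reference, step 2 was done roughly like this. A minimal sketch using the kubernetes/autoscaler Helm chart; the release name and values are assumptions inferred from the logs above (the qa-ore-blue cluster name and us-west-2 region), not the exact command from my setup:

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
# Cluster name and region below are inferred from the log output above.
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=qa-ore-blue \
  --set awsRegion=us-west-2 \
  --set image.tag=v1.29.0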

Anything else we need to know?: I updated the AWS VPC CNI plugin as part of the investigation, but it did not help:

amazon-k8s-cni-init:v1.18.2

The Cluster Autoscaler service account on the new EKS AL2023 AMI looks like it is not loading any secrets. Not sure if this is the cause:

Name:                cluster-autoscaler-aws-cluster-autoscaler
Namespace:           kube-system
Labels:              app.kubernetes.io/instance=cluster-autoscaler
                     app.kubernetes.io/managed-by=Helm
                     app.kubernetes.io/name=aws-cluster-autoscaler
                     app.kubernetes.io/version=1.29.0
                     helm.sh/chart=cluster-autoscaler-9.35.0
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::123456789987:role/*-eks-worker-role-ore
                     meta.helm.sh/release-name: cluster-autoscaler
                     meta.helm.sh/release-namespace: kube-system
Image pull secrets:
Mountable secrets:
Tokens:
Events:
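
(For reference, that output comes from describing the chart's service account; the command below assumes the default names created by the Helm chart, which match the output above.)

kubectl -n kube-system describe serviceaccount cluster-autoscaler-aws-cluster-autoscaler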

ashishrajora0808 added the kind/bug label on Jun 23, 2024
@adrianmoisey
Member

/area cluster-autoscaler

@Arulaln-AR

Hi @adrianmoisey and @ashishrajora0808,

Do you have any solution for the above issue?

@ashishrajora0808
Author

I haven't found any resolution to this. I am working with AWS support, but no luck yet. Are you facing the same issue @Arulaln-AR?

@Arulaln-AR

@ashishrajora0808, yes, I am working on the same issue with AWS support too.

But I do know another working solution, which requires following the article below. It amounts to attaching an IAM role to the service account via an annotation and referencing that service account from the pod.

https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html

But I would rather not follow that approach either.
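
For anyone who does want that route, a minimal sketch of the IRSA approach from the article. The role name is a placeholder I made up, and the account ID is the (redacted) one from the service account output above:

# ServiceAccount annotated with an IAM role for service accounts (IRSA).
# The role ARN is a placeholder for illustration only.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789987:role/cluster-autoscaler-irsa-role

The pod then points at it via serviceAccountName: cluster-autoscaler, which the Helm chart wires up for you.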

@Arulaln-AR

@ashishrajora0808, if you don't want to follow the other solution provided above, simply change the node group AMI type from "AL2023_x86_64_STANDARD" back to "AL2_x86_64".
When I did that, it worked.

@ashishrajora0808
Author

@Arulaln-AR Thanks, but that just reverts to the AL2 AMI. I want to use the new AL2023 AMI, as AL2 goes end of life next year.

@ashishrajora0808
Author

@Arulaln-AR The issue seems to be due to the NodeConfig; the block


apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: my-cluster
    apiServerEndpoint: https://example.com
    certificateAuthority: Y2VydGlmaWNhdGVBdXRob3JpdHk=
    cidr: 10.100.0.0/16

needs the CIDR 10.100.* (the cluster's service CIDR) hardcoded in my case, as I was passing the VPC CIDR in it instead.
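
For context, on AL2023 this NodeConfig is what nodeadm reads from the launch template user data, embedded as a MIME multi-part document. A sketch of the corrected user data, keeping the placeholder values from above and hardcoding the service CIDR:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: application/node.eks.aws

apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: my-cluster
    apiServerEndpoint: https://example.com
    certificateAuthority: Y2VydGlmaWNhdGVBdXRob3JpdHk=
    cidr: 10.100.0.0/16   # the cluster's Kubernetes service CIDR, not the VPC CIDR

--BOUNDARY--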

@ashishrajora0808
Author

Not autoscaler related, so closing the case.

@Arulaln-AR

@ashishrajora0808, I heard back from AWS support. It is because of the instance metadata (IMDSv2) token hop limit: it is set to 1 on AL2023, whereas on AL2 it was set to 2.

We need to customize the launch template to make it work.
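
For anyone hitting the same thing, a sketch of that fix; the launch template and instance IDs are placeholders. Raising the IMDSv2 hop limit to 2 lets pods on the node reach the instance metadata service through the extra network hop, so the autoscaler can pick up the node role credentials again:

# Add a new launch template version with a hop limit of 2.
aws ec2 create-launch-template-version \
  --launch-template-id lt-0123456789abcdef0 \
  --source-version '$Latest' \
  --launch-template-data '{"MetadataOptions":{"HttpTokens":"required","HttpPutResponseHopLimit":2}}'

# Verify the setting on a running node.
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].MetadataOptions'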
