Node takes 100s to start #3102
Comments
EC2 logs:
Hey @royb-tabit, thanks for all the information! Do you happen to know what you've changed since the initial setup to make your node startup time go from 60s to 100s? Have you changed the AMI, the user data, or added more pod/node constraints? If it has never been at 60s at all, that'd be good to know too.
Hey @njtran, thanks for your question. We never got to see Karpenter
The discrepancy may be that Karpenter creates the Node object right when the instance is launched, while Cluster Autoscaler waits for the kubelet to come online and register the node with the API server, which makes it look like the node came online faster. The actual time between the EC2 instance being launched and the node going Ready should be mostly the same between Karpenter and Cluster Autoscaler.
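To compare the two apples-to-apples as described above, you can measure from the EC2 `LaunchTime` to the node's `Ready` condition transition rather than from node registration. Below is a minimal sketch: the timestamps and identifiers are placeholders (the issue only reports a clock time of 11:27:02), it requires GNU `date`, and the commented cluster queries assume configured `aws` and `kubectl` CLIs.

```shell
# Seconds between two RFC3339/ISO-8601 timestamps (requires GNU date).
seconds_between() {
  echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}

# Example with the rough timings reported in this issue (date is a
# placeholder): a node launched at 11:27:02 that goes Ready 100s later.
seconds_between "2023-01-09T11:27:02Z" "2023-01-09T11:28:42Z"   # prints 100

# Against a live cluster (placeholders; needs aws/kubectl configured):
#   launch=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
#     --query 'Reservations[0].Instances[0].LaunchTime' --output text)
#   ready=$(kubectl get node "$NODE_NAME" \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}')
#   seconds_between "$launch" "$ready"
```

Measuring both Karpenter- and CA-launched nodes this way removes the registration-timing difference from the comparison.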
Hey @bwagner5,
@royb-tabit Would you be able to run this node timing tool that we recently open-sourced on both the node launched by Karpenter and the node launched by Cluster Autoscaler? https://github.com/awslabs/node-latency-for-k8s You can run the binary on the node after the instance is launched, or you can run it as a DaemonSet. If you can post the timing chart that the tool produces on stdout, that may help narrow down the issue.
Hey @bwagner5,
Hey @bwagner5, I ran the DaemonSet, but it had some issues collecting the data. Do you have an idea why? EDIT:
These are the AWS Cluster Autoscaler logs:
@bwagner5 is more familiar, but maybe there's a way to set this?
Looks like it doesn't have permissions to access the K8s API or the AWS APIs. It also looks like I missed a README update when I added those events, so you'd need to install it with an IAM role for those events to work. But most of the data is here. Cluster Autoscaler Launch-to-Ready:
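For the AWS-API permission error above, the standard fix on EKS is IAM Roles for Service Accounts (IRSA): annotate the DaemonSet's service account with a role ARN so the tool's pods can call AWS APIs. A minimal sketch; the name, namespace, and account/role in the ARN are placeholders, and the exact IAM policy the tool needs is described in its README.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-latency-for-k8s    # placeholder name
  namespace: monitoring         # placeholder namespace
  annotations:
    # Hypothetical role ARN -- create the role with the policy from the
    # project's README and an OIDC trust relationship for this cluster.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/node-latency-for-k8s
```

The DaemonSet's pod spec then references this service account via `serviceAccountName`.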
Karpenter Launch-to-Ready:
It looks like different AMIs are being used between the CA and Karpenter nodes. The Karpenter one appears to be a slightly older EKS Optimized AL2 AMI compared to the CA one. One notable difference in latency is between the
It looks like the CNI wasn't picked up by the node latency tool for Cluster Autoscaler. Are the nodes launched with Cluster Autoscaler using the VPC CNI with a default setup? I'll also note that
The latest versions of the EKS AMI and the VPC CNI have some optimizations around startup, and the maintainers appear to be actively tackling startup time.
Labeled for closure due to inactivity in 10 days. |
Version
Karpenter Version: v0.20.0
Kubernetes Version: v1.23
Expected Behavior
Node to be in Ready state in about 60 seconds or less
Actual Behavior
000s - instance registered to the cluster
038s - cloud-init finished
055s - kube-proxy is running
080s-100s - Node is in Ready state

When comparing to "aws-autoscaler" performance with the same launchTemplate, we get about 70s until the Node is in the Ready state.
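The stage-to-stage deltas in the timeline above are worth spelling out: by far the largest single gap (45s) is between kube-proxy running and the node going Ready. A small sketch that computes those deltas from the reported timings:

```shell
# Stage timings reported above, as "name seconds-since-registration" pairs.
stages=("registered 0" "cloud-init 38" "kube-proxy 55" "ready 100")

prev=0
for s in "${stages[@]}"; do
  name=${s% *}   # text before the last space
  t=${s#* }      # seconds after the last space
  echo "$name: +$(( t - prev ))s (t=${t}s)"
  prev=$t
done
```

This prints deltas of +0s, +38s, +17s, and +45s, pointing the investigation at what happens between kube-proxy starting and kubelet reporting Ready (CNI setup is a common suspect in that window).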
Steps to Reproduce the Problem
AMI-ID (used in launchTemplate):
ami-0c64fd5283c4bfc45
USER-DATA:
Resource Specs and Logs
Karpenter Logs (launching node at 11:27:02.675):
kube-proxy: