
Improve Node Startup Latency #1099

Closed · bwagner5 opened this issue Nov 11, 2022 · 15 comments
Labels: enhancement (New feature or request)

@bwagner5
Contributor

Dynamically provisioning nodes quickly is important so that workloads can scale out in response to demand or recover in the event of infrastructure instability. This issue will track node startup latency improvement work on the EKS Optimized AL2 AMI.

Current work:

Cache Common Startup Container Images #938

There are several container images that are commonly needed to bootstrap a node (get it to a Ready state for pods):

  • pause
  • aws-node (VPC CNI) images (init and aws-node)
  • kube-proxy images (minimal and normal)
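For illustration, here is a minimal sketch of what an image-caching step during AMI builds could look like. This is not the exact change in #938; the account ID, region, and tags are example values for us-west-2 and will differ per region and version.

```bash
#!/usr/bin/env bash
# Sketch of an AMI-build step: pre-pull the bootstrap-critical images into
# containerd so the first boot needs no registry round-trips.
set -euo pipefail

REGION="us-west-2"
ECR_HOST="602401143452.dkr.ecr.${REGION}.amazonaws.com"   # EKS image registry for this region
PASSWORD="$(aws ecr get-login-password --region "${REGION}")"

for img in \
  "${ECR_HOST}/eks/pause:3.5" \
  "${ECR_HOST}/amazon-k8s-cni-init:v1.12.0" \
  "${ECR_HOST}/amazon-k8s-cni:v1.12.0" \
  "${ECR_HOST}/eks/kube-proxy:v1.24.7-minimal-eksbuild.2" \
  "${ECR_HOST}/eks/kube-proxy:v1.24.7-eksbuild.2"; do
  # Pull into containerd's k8s.io namespace, where the kubelet's CRI looks images up.
  sudo ctr --namespace k8s.io images pull --user "AWS:${PASSWORD}" "${img}"
done
```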

Disable Startup Yum Update Check #1074

There is currently a yum update run that blocks execution of the eks-bootstrap script. This update check generally finds 0 updates if you are updating AMIs frequently, but it takes 5-8 seconds because it hydrates the yum cache. The check also causes version skew across a cluster: the same AMI ID may run different software versions depending on when it was launched, which could cause problems with rollback and node churn.
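As a rough way to observe that cost on a stock node (a measurement sketch, not the fix itself): even when zero updates apply, hydrating the yum metadata cache dominates the run time.

```bash
# Simulate a cold cache, then time the update check the bootstrap was waiting on.
sudo yum clean all
time sudo yum check-update   # most of the time goes to fetching repo metadata, not applying updates
```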


Remove unnecessary Sleeps in the VPC CNI Initialization aws/amazon-vpc-cni-k8s#2104 (released in v1.12.0)

The VPC CNI had some unnecessary sleeps that resulted in 2-3 seconds of latency starting up. VPC CNI is required to be fully initialized before pods can be created on a node, so the initialization process should be as fast as possible.


Remove init-container from VPC CNI aws/amazon-vpc-cni-k8s#2137

The VPC CNI uses an init container to initialize some networking-related kernel settings that must be applied as a privileged container. The sequencing of the init container was adding latency on startup: the VPC CNI would generally take 9-10 seconds to fully initialize. The PR above removes the init container and runs that work as a regular container in the pod, which allows some parallelization of the initialization and removes the kubelet sequencing latency, bringing the VPC CNI's full initialization time down to 4 seconds. Half of the remaining latency is the container pulls, which is solved by the caching PR above (#938). Integrating the two PRs results in a 2-second full initialization of the VPC CNI.
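A quick way to see how the aws-node DaemonSet is laid out on your own cluster, i.e. whether the privileged setup still runs as an init container or as a regular container, which depends on the VPC CNI version installed:

```bash
# List init containers vs. regular containers in the aws-node DaemonSet spec.
kubectl get daemonset aws-node -n kube-system \
  -o jsonpath='{.spec.template.spec.initContainers[*].name}{"\n"}{.spec.template.spec.containers[*].name}{"\n"}'
```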


Add CLUSTER_ENDPOINT parameter to the VPC CNI to avoid kube-proxy race aws/amazon-vpc-cni-k8s#2138

With all the optimizations listed above, a new concern is a race between the VPC CNI and kube-proxy. The VPC CNI uses the kubernetes service cluster IP to reach the kube-apiserver. This wasn't a concern before because the VPC CNI's higher latency meant kube-proxy almost always won the race to initialize. After the optimizations, the VPC CNI loses the race about half of the time and then hangs on a 5-second timeout reaching the kube-apiserver. The whole race can be avoided by passing the CLUSTER_ENDPOINT (the kube-apiserver load balancer endpoint) to the VPC CNI to use for initialization. The VPC CNI still needs to wait on kube-proxy to finish before completing the CNI plugin initialization, but the work to get to that point can be parallelized.
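For illustration, a sketch of wiring that parameter up by hand. The CLUSTER_ENDPOINT variable name comes from the PR above, whether it is honored depends on the VPC CNI version you run, and `my-cluster` is a placeholder:

```bash
# Fetch the cluster's API server endpoint and hand it to the CNI directly, so its
# startup calls don't depend on kube-proxy having programmed the kubernetes
# service cluster IP yet.
ENDPOINT="$(aws eks describe-cluster --name my-cluster \
  --query 'cluster.endpoint' --output text)"
kubectl set env daemonset aws-node -n kube-system CLUSTER_ENDPOINT="${ENDPOINT}"
```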

@stevehipwell
Contributor

💯 for removing the yum update! If you can't completely remove it, could it at least be configurable?

@bwagner5
Contributor Author

bwagner5 commented Nov 22, 2022

Here's a timing chart that is pretty steady from a K8s 1.24 cluster with all of the enhancements mentioned above:

c6a.4xlarge - i-02ca66ee7f998e438

| Event | Time | Seconds |
| --- | --- | --- |
| Instance Request |  | 0 |
| Instance Created | 15:18:20 | 1 |
| VM Initialized | 15:18:30 | 11 |
| Cloud-Init Initial Starts | 15:18:35 | 16 |
| Network Target Start | 15:18:35 | 16 |
| Cloud-Init Config Starts | 15:18:36 | 17 |
| Cloud-Init Config z-Exits | 15:18:36 | 17 |
| Cloud-Init Final Starts | 15:18:36 | 17 |
| Cloud-Init Initial z-Exits | 15:18:36 | 17 |
| Network Target Ready | 15:18:36 | 17 |
| ContainerD Starts | 15:18:37 | 18 |
| Cloud-Init Final (user-data) z-Exits | 15:18:39 | 20 |
| Kubelet Starts | 15:18:39 | 20 |
| Kubelet Node Registration | 15:18:40 | 21 |
| AWS Node Container Starts | 15:18:41 | 22 |
| kube-proxy Started | 15:18:41 | 22 |
| VPC CNI Init Container Starts | 15:18:41 | 22 |
| VPC CNI Plugin Initialized | 15:18:44 | 25 |
| Node Ready | 15:18:50 | 31 |
| Pod Starts | 15:18:50 | 31 |

@john-zielke-snkeos

How did you measure these startup times? And what were the startup times before implementing the changes? I am using the Terraform Karpenter example to start NVIDIA GPU nodes and am currently seeing startup times of 2-3 minutes for a node to be ready using the default EKS AMI.

@bwagner5
Contributor Author

> How did you measure these startup times? And what were the startup times before implementing the changes? I am using the Terraform Karpenter example to start NVIDIA GPU nodes and am currently seeing startup times of 2-3 minutes for a node to be ready using the default EKS AMI.

The measurements were taken using some custom tooling that we'll be open sourcing very soon. You'll be able to run the tooling as a daemonset to capture timing metrics for your nodes in a standardized format like shown above.

GPU instance types often take much longer to boot within EC2. I have not focused on exotic instance types like baremetal and gpus. The timings above are for c6a.4xlarge, but I have tested on m5 and c6i with similar results. Notably, i instance types and instance types with d (disk) capabilities will take a little longer to boot as well.

I will update this issue once the timing tooling is available.
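Until that tooling is out, a rough approximation of the Kubernetes-side portion (it misses the EC2 provisioning and boot phases) is to diff a node's registration time against its Ready condition transition. The node name below is a placeholder:

```bash
NODE="ip-192-168-1-1.us-west-2.compute.internal"   # example node name
# When the node object was created (kubelet registration)
kubectl get node "${NODE}" -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
# When the Ready condition last transitioned to its current state
kubectl get node "${NODE}" -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'
```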

Here are timings from before the optimizations:

c5.xlarge - i-08f01b534e902dbf7

| Event | Time | Seconds |
| --- | --- | --- |
| Instance Request |  | 0 |
| Instance Created | 16:18:45 | 1 |
| VM Initialized | 16:18:56 | 12 |
| Cloud-Init Config Starts | 16:19:05 | 21 |
| Cloud-Init Initial Starts | 16:19:05 | 21 |
| Cloud-Init Initial z-Exits | 16:19:05 | 21 |
| Network Target Ready | 16:19:05 | 21 |
| Network Target Start | 16:19:05 | 21 |
| Cloud-Init Config z-Exits | 16:19:13 | 29 |
| Cloud-Init Final Starts | 16:19:13 | 29 |
| ContainerD Starts | 16:19:14 | 30 |
| Cloud-Init Final (user-data) z-Exits | 16:19:18 | 34 |
| Kubelet Starts | 16:19:18 | 34 |
| Kubelet Node Registration | 16:19:23 | 39 |
| kube-proxy Started | 16:19:26 | 42 |
| VPC CNI Init Container Starts | 16:19:28 | 44 |
| AWS Node Container Starts | 16:19:35 | 51 |
| VPC CNI Plugin Initialized | 16:19:39 | 55 |
| Node Ready | 16:19:44 | 60 |
| Pod Starts | 16:19:48 | 64 |

@bwagner5
Contributor Author

bwagner5 commented Dec 15, 2022

Here's a gif of startups taking around 25 seconds by cherry-picking a v1.26 kubelet change to v1.24 (along with all the optimizations mentioned above and auto-scaled with Karpenter).

demo

@stevehipwell
Contributor

@bwagner5 how does this compare to Bottlerocket? Also are we likely to see this change backported for EKS?

@FernandoMiguel

Big +1 on getting all these improvements to BR too

@bwagner5
Contributor Author

I still need to do testing on BR, but at least some of the improvements will carry over, like the VPC CNI ones. Some don't apply, like the yum updates. I'll have to see if we could include container caching in BR. I'll get back to you on backporting to EKS.

@kakarotbyte

@bwagner5
Other than the changes mentioned above, do I need to take network considerations (like using VPC endpoints) and bootstrap considerations (hardcoding the API URL, CA, and kube-dns IP) into account to achieve the 31-second results above?

@bwagner5
Contributor Author

I tested it with Karpenter which will automatically hardcode the API URL, CA Bundle, and kube-dns IP as params to the eks bootstrap.sh script. The EKS DescribeCluster call that occurs in the bootstrap.sh script shouldn't take much time though, so I suspect you can get similar results without hardcoding the params, at least for single node launches. You may run into rate limiting on the API call though when doing large node scale outs.
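For reference, hardcoding those values looks roughly like this in user data. The flag names are the standard bootstrap.sh options; the cluster name, endpoint, CA bundle, and DNS IP shown are placeholders:

```bash
# Supplying the cluster details up front lets bootstrap.sh skip its eks:DescribeCluster call.
/etc/eks/bootstrap.sh my-cluster \
  --apiserver-endpoint 'https://ABCDEF1234567890.gr7.us-west-2.eks.amazonaws.com' \
  --b64-cluster-ca 'LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0t...' \
  --dns-cluster-ip '10.100.0.10'
```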

@kakarotbyte

Understood. Won't enabling ECR and S3 VPC endpoints help speed up downloading the aws-node and kube-proxy images?

@bwagner5
Contributor Author

@kakarotbyte You may see some improvement using the endpoints, but I wouldn't expect it to be significant (I haven't tested that, though). In my tests I used the AMI's cached images of aws-node and kube-proxy.

@armenr

armenr commented Dec 30, 2022

@kakarotbyte - I've been lucky enough to have the joy of exchanging ideas and observations with @bwagner5 .

I can tell you - from my extensive testing - that using those VPC endpoints terminated inside the VPC doesn't give you an appreciable improvement in provisioning speed.

But using a super-primitive and minimal HTTP client instead of curl/wget or the imds bash script to fetch and parse your instance metadata from IMDS endpoints can reduce startup latency time by at least 4 seconds (sometimes 6-8 seconds)...in cases where you can't hardcode the input values, and your bootstrap.sh script for EKS needs to go ask IMDS questions.
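For context, the baseline those lookups replace is the standard IMDSv2 pattern below; the savings come from swapping each separate curl process for a single minimal client, not from changing the endpoints themselves (the metadata path shown is just an example):

```bash
# IMDSv2: fetch a session token, then query metadata with it. bootstrap.sh
# performs several lookups like this when values aren't hardcoded.
TOKEN="$(curl -sS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")"
curl -sS -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  "http://169.254.169.254/latest/meta-data/instance-type"
```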

Using a custom-rolled AMI, built on a custom OS, with a custom kernel (we have really specific use-cases and latency requirements at the company I'm with)...+ using @bwagner5 's improvements, I have done some tests.

IF:

  • use my custom AMI + kernel (system is booted and ready in 2.1 seconds total)
  • use all of @bwagner5 's improvements
  • hard-code every possible value that the eks bootstrap.sh script is looking for
  • use a lightweight HTTP client instead of curl/wget/the imds script
  • cache all of the necessary docker images during AMI baking
  • use Karpenter instead of Cluster Autoscaler

EXPERIMENT:

  • Schedule a pod into the cluster with resource requests/limits that require Karpenter to provision a new node
  • Start the timer when you kubectl apply the "un-schedulable" Pod, and stop it when the Pod is scheduled and starts running

RESULT on m6i.large:

| Event | Time |
| --- | --- |
| New Node Ready | 26 seconds |
| New Pod Starts | 32 seconds |

@bwagner5
Contributor Author

bwagner5 commented Jan 4, 2023

FYI I've open sourced the node latency timing tool that I've been using to create the timing charts and emit metrics for my testing here: https://github.com/awslabs/node-latency-for-k8s

Would love feedback on how / if this works well on other OS distributions (I've only been using the eks-optimized AL2).

@armenr

armenr commented Apr 13, 2023

@bwagner5 - With whatever changes have already been implemented and merged/released, I'm seeing consistent "Node Ready" state at (or below) an average of ~26(ish) seconds on vanilla EKS AL2 nodes. I don't even need my own custom AMI anymore, to be honest. 👏👏
