cache pause, vpc-cni, and kube-proxy images in the AMI #938
Conversation
I'm interested in where this is headed, but what kind of speedup are we talking about? These images are fairly small, and they're coming from within the region. I'm seeing <5 seconds total for image pulls in us-west-2:
Maybe this is more of a benefit for Outposts?
This provides a marginal speed-up of around 8-10 seconds. The reason is that not all images are pulled concurrently. Additionally, caching the images on the node helps with large scale-outs, where a container registry may start throttling depending on the number of nodes, ongoing pulls, and how many layers the images have. The images cached in this PR total 10 layers, which means that with ECR's 3,000 TPS GetDownloadUrlForLayer limit, a 300-node scale-up would result in throttling and likely a smaller scale-up, since the workload images that still need to be pulled would also be consuming that TPS. Since we're fairly confident (although not guaranteed, in the case of the vpc-cni) in what images will be used in the AMI, I think it makes sense to cache these directly.
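To make that arithmetic concrete, here's an illustrative back-of-the-envelope calculation. The numbers are the ones quoted above, taken as assumptions rather than measurements:

```bash
# Rough illustration of the ECR throttling math discussed above (assumed numbers,
# not measurements): with ~10 layers across the cached system images, each new
# node issues ~10 GetDownloadUrlForLayer calls just for those images.
layers_per_node=10
ecr_get_download_url_tps=3000   # ECR's per-second limit referenced above

# Nodes that can start pulling in the same second before hitting the limit,
# ignoring the workload images they also need to pull:
echo $(( ecr_get_download_url_tps / layers_per_node ))   # => 300
```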
This limit is per second, though -- it seems unlikely that 300 nodes would attempt to download all 10 of these layers within the same second, right? Anyway, not against the general idea; we just need to make sure we're caching the "right" image(s) for this to benefit most users.
It might be unlikely that they'd all download them in the same second, but it's not impossible. As the scale-outs get larger and faster (which is what we're trying to do with Karpenter), things will start to bottleneck.
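For anyone who wants to sanity-check the per-image layer counts behind that math, a small hedged sketch (it assumes skopeo and jq are installed, and the image reference is just a placeholder, not one of the images cached by this PR):

```bash
# Count the layers of an image without pulling it (sketch; requires skopeo + jq).
# The reference below is a placeholder -- substitute the actual regional ECR
# repositories for pause, kube-proxy, and the VPC CNI to check their layer counts.
IMAGE="registry.k8s.io/pause:3.9"

skopeo inspect "docker://${IMAGE}" | jq '.Layers | length'
```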
Can you add the size of the image cache after this change?
Folks - I stumbled onto this Issue, personally, because I've been working on improving "NodeReady Latency" time on my own personal cluster as well. I want to get as far below

I have an AMI (non-BottleRocket/non-AL2/non-Ubuntu) that boots from "Instance Create" --> kernel --> user space --> fully "booted" in

That includes dynamically bootstrapping the additional configs/limits/info that the

Note: traditional (Python-based) cloud-init is a GIANT time-suck for instance boot times.

Full transparency: I cannibalized the

Using Karpenter, I'm able to schedule a new Pod (

Suggestion

One additional area of improvement is to relocate the

The reason? You lose about ~5-6 seconds JUST waiting for retrieval of an ECR credential/token, BEST-CASE. That is to say, best-case times are around 5-6 seconds, and worst-case times are around 10-11 seconds of waiting, based on my testing...and I've been testing like a mad-man.

Check this out:
I added a few silly

On this run, it looks like we spent almost 10 seconds (~9.5-ish seconds) just waiting for the ECR password/token.
The troublesome bit from the logs above ^^
My Point

Aside from caching (which I attempted in my own setup as well, btw), I'd suggest maybe considering:
This PR mitigates the credential fetching by utilizing the cached pause image. The sandbox-image service still runs, but it is not a blocking operation for containerd.

Also, check out this PR that removes the cloud-init update-packages module, which blocks user-data execution: #1074. The updates were dangerous anyway, since you could have a cluster of instances with the same AMI ID but running different software versions. I experimented with removing cloud-init altogether, but it was difficult to justify removing it from an image distributed this widely. I found that removing the package updates eliminated the majority of the cloud-init latency; there are still a few seconds of overhead with cloud-init, but in my measurements it seems to be only 2-3 seconds.

Using this PR, #1074, and some VPC CNI startup improvements, Karpenter is able to bring a node to "Ready" in 30 seconds at P50, with some nodes going Ready in 25 seconds. These timings also account for the launch latency in the EC2 control plane, which isn't visible when looking at node creation time with Karpenter, since Karpenter creates the node resource after the EC2 Fleet call returns and the instance hostname is looked up via DescribeInstances.

I'd be curious about any other optimizations you implemented where you think we might be able to get the EKS Optimized AL2 launch time lower!
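As a rough, generic sketch of how to see where node boot time actually goes (nothing here is specific to this PR or to this repo's scripts; it assumes shell access to a node and AWS CLI v2):

```bash
# Rough sketch: where to look for node boot-time costs. Generic commands only;
# run these on the node itself.

# How long each cloud-init stage/module took:
cloud-init analyze blame
cloud-init analyze show

# Which systemd units sit on kubelet's critical path at boot:
systemd-analyze critical-chain kubelet.service

# How long a fresh ECR credential fetch takes from this instance
# (the ~5-10 second token-retrieval latency mentioned above; assumes AWS CLI v2):
time aws ecr get-login-password --region us-west-2 > /dev/null
```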
@bwagner5 - Thank you for the reply! I'll experiment with the referenced PR. Let me gather up some notes/things I had done in my repo and see what might be relevant. :) Happy to share.

Additionally, you mentioned VPC CNI startup improvements - I'm kinda intrigued by this. Would you be able to point me in the direction of any info, or maybe share what you mean? Thank you! :)
@bwagner5 - In order to respect the purpose of this thread (which is a PR, and not even an "Issue"), would there be any way for us to connect via Discord or Slack...to keep the exchange of info/ideas + discussion going? I recently spent 2 years @ AWS (US Startups Org) as an SA Manager. As a fellow (former) Amazonian, I'd be down to dig into this a bit with you... this is me: https://www.linkedin.com/in/armenr/
@bwagner5 I'm also interested in this discussion; could the VPC CNI start-up speed be addressed by using an eBPF pattern, either directly or via chaining another CNI such as Cilium?
@stevehipwell @armenr I started a group message with you both on K8s Slack. We can continue the discussion there. If anyone else who stumbles across this issue is interested, ping me on K8s Slack (Slack username:
Startup latency is of major interest to our organization as well. I'm not sure I would bring much to the conversation, but being kept up to date on the different initiatives and "tuning tips" to make node startup faster is definitely something we'd appreciate.
Created this issue to track and get feedback: #1099
Issue #, if available:
#1034
#990
#1099
Description of changes:
kubelet logs with image pull timings before caching (v=4):
The total latency in these serial image pulls is ~7 seconds.
After caching, all of these images are served from the local cache:
Since this PR caches the sandbox (pause) image, we can also enable containerd to start on boot rather than starting it from the bootstrap script. Removing containerd startup from the bootstrap path results in a ~3 second reduction in the latency before kubelet starts (P50).
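For readers who want the general shape of the change, here is a minimal, illustrative sketch of pre-pulling an image into containerd's cache during an AMI build. It is not the actual script in this PR; the image reference and account ID are placeholders, and it assumes containerd and AWS CLI v2 are available on the build host:

```bash
#!/usr/bin/env bash
# Illustrative sketch of caching an image into containerd during an AMI build.
# Not the actual script from this PR; the image reference below is a placeholder.
set -euo pipefail

REGION="us-west-2"
PAUSE_IMAGE="<account>.dkr.ecr.${REGION}.amazonaws.com/eks/pause:3.5"   # placeholder

# Start containerd so we can populate its content store at build time.
sudo systemctl enable --now containerd

# Pull with an ECR token into the k8s.io namespace, where the kubelet/CRI will look.
ECR_PASSWORD="$(aws ecr get-login-password --region "${REGION}")"
sudo ctr --namespace k8s.io images pull \
  --user "AWS:${ECR_PASSWORD}" \
  "${PAUSE_IMAGE}"

# Verify the image is now present locally.
sudo ctr --namespace k8s.io images ls | grep pause
```

The detail that matters is pulling into containerd's k8s.io namespace, so that at boot the kubelet finds the images locally instead of reaching out to the registry.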
This graph shows the combined reduction in the time for a node to reach Ready, starting with kube-proxy & pause cached and then adding the VPC CNI images to the cache (P50):
Size of the containerd image cache:
Size of a running node without any cached images (so the minimal images pulled needed to start the node):
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.