cache pause, vpc-cni, and kube-proxy images in the AMI #938
Conversation
I'm interested in where this is headed, but what kind of speedup are we talking about? These images are fairly small, and they're coming from within the region. I'm seeing <5 seconds total for image pulls in us-west-2:
Maybe this is more of a benefit for Outposts?
This provides a marginal speed-up of around 8-10 seconds. The reason is that not all images are pulled concurrently. Additionally, caching the images on the node helps with large scale-outs, where a container registry may start throttling depending on the number of nodes, ongoing pulls, and how many layers the images have. The images cached in this PR total 10 layers, which means that with ECR's 3,000 TPS GetDownloadUrlForLayer limit, a 300-node scale-up would result in throttling and likely a smaller scale-up, since the workload images that still need to be pulled would also be consuming that TPS. Since we're fairly confident (although not guaranteed, in the case of the vpc-cni) in what images will be used in the AMI, I think it makes sense to cache these directly.
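To make that arithmetic concrete, here's an illustrative back-of-the-envelope calculation. The numbers are the ones quoted above, taken as assumptions rather than measurements:

```bash
# Rough illustration of the ECR throttling math discussed above (assumed numbers,
# not measurements): with ~10 layers across the cached system images, each new
# node issues ~10 GetDownloadUrlForLayer calls just for those images.
layers_per_node=10
ecr_get_download_url_tps=3000   # ECR's per-second limit referenced above

# Nodes that can start pulling in the same second before hitting the limit,
# ignoring the workload images they also need to pull:
echo $(( ecr_get_download_url_tps / layers_per_node ))   # => 300
```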
This limit is per second, though -- it seems unlikely that 300 nodes would attempt to download all 10 of these layers within the same second, right? Anyway, not against the general idea; we just need to make sure we're caching the "right" image(s) for this to benefit most users.
It might be unlikely that they'd all download them in the same second, but it's not impossible. As the scale-outs get larger and faster (which is what we're trying to do with Karpenter), things will start to bottleneck.
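For anyone who wants to sanity-check the per-image layer counts behind that math, a small hedged sketch (it assumes skopeo and jq are installed, and the image reference is just a placeholder, not one of the images cached by this PR):

```bash
# Count the layers of an image without pulling it (sketch; requires skopeo + jq).
# The reference below is a placeholder -- substitute the actual regional ECR
# repositories for pause, kube-proxy, and the VPC CNI to check their layer counts.
IMAGE="registry.k8s.io/pause:3.9"

skopeo inspect "docker://${IMAGE}" | jq '.Layers | length'
```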
Can you add the size of the image cache after this change?
Folks - I stumbled onto this Issue, personally, because I've been working on improving "NodeReady Latency" time on my own personal cluster as well. I want to get as far below

I have an AMI (non-BottleRocket/non-AL2/non-Ubuntu) that boots from "Instance Create" --> kernel --> user space --> fully "booted" in

That includes dynamically bootstrapping the additional configs/limits/info that the

Note: traditional (Python-based) cloud-init is a GIANT time-suck for instance boot times.

Full transparency: I cannibalized the

Using Karpenter, I'm able to schedule a new Pod (

Suggestion

One additional area of improvement is to relocate the

The reason? You lose about ~5-6 seconds JUST waiting for retrieval of an ECR credential/token, BEST-CASE. That is to say, best-case times are around 5-6 seconds, and worst-case times are around 10-11 seconds of waiting, based on my testing...and I've been testing like a mad-man.

Check this out:
I added a few silly

On this run, it looks like we spent almost 10 seconds (~9.5-ish seconds) just waiting for the ECR password/token.
The troublesome bit from the logs above ^^
My Point

Aside from caching (which I attempted in my own setup as well, btw), I'd suggest maybe considering:
This PR mitigates the credential fetching by utilizing the cached pause image. The sandbox-image service still runs, but it is not a blocking operation for containerd.

Also, check out this PR that removes the cloud-init update-packages module, which blocks user-data execution: #1074. The updates were dangerous anyway, since you could have a cluster of instances with the same AMI ID but running different software versions. I experimented with removing cloud-init altogether, but it was difficult to justify removing it from an image distributed this widely. I found that removing the package updates eliminated the majority of the cloud-init latency; there are still a few seconds of overhead with cloud-init, but in my measurements it seems to be only 2-3 seconds.

Using this PR, #1074, and some VPC CNI startup improvements, Karpenter is able to bring a node to "Ready" in 30 seconds at P50, with some nodes going Ready in 25 seconds. These timings also account for the launch latency in the EC2 control plane, which isn't visible when looking at node creation time with Karpenter, since Karpenter creates the node resource after the EC2 Fleet call returns and the instance hostname is looked up via DescribeInstances.

I'd be curious about any other optimizations you implemented where you think we might be able to get the EKS Optimized AL2 launch time lower!
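As a rough, generic sketch of how to see where node boot time actually goes (nothing here is specific to this PR or to this repo's scripts; it assumes shell access to a node and AWS CLI v2):

```bash
# Rough sketch: where to look for node boot-time costs. Generic commands only;
# run these on the node itself.

# How long each cloud-init stage/module took:
cloud-init analyze blame
cloud-init analyze show

# Which systemd units sit on kubelet's critical path at boot:
systemd-analyze critical-chain kubelet.service

# How long a fresh ECR credential fetch takes from this instance
# (the ~5-10 second token-retrieval latency mentioned above; assumes AWS CLI v2):
time aws ecr get-login-password --region us-west-2 > /dev/null
```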
@bwagner5 - Thank you for the reply! I'll experiment with the referenced PR. Let me gather up some notes/things I had done in my repo and see what might be relevant. :) Happy to share.

Additionally, you mentioned VPC CNI startup improvements - I'm kinda intrigued by this. Would you be able to point me in the direction of any info, or maybe share what you mean? Thank you! :)
@bwagner5 - In order to respect the purpose of this thread (which is a PR, and not even an "Issue"), would there be any way for us to connect via Discord or Slack...to keep the exchange of info/ideas + discussion going? I recently spent 2 years @ AWS (US Startups Org) as an SA Manager. As a fellow (former) Amazonian, I'd be down to dig into this a bit with you... this is me: https://www.linkedin.com/in/armenr/
@bwagner5 I'm also interested in this discussion; could the VPC CNI start-up speed be addressed by using an eBPF pattern, either directly or via chaining another CNI such as Cilium?
@stevehipwell @armenr I started a group message with you both on K8s Slack. We can continue the discussion there. If anyone else who stumbles across this issue is interested, ping me on K8s Slack (Slack username:
Startup latency is of major interest to our organization as well. I'm not sure I would bring much to the conversation, but being kept up to date on the different initiatives and "tuning tips" to make node startup faster is definitely something we'd appreciate.
Created this issue to track and get feedback: #1099
Issue #, if available:
#1034
#990
#1099
Description of changes:
kubelet logs with image pull timings before caching (v=4):
The total latency in these serial image pulls is ~7 seconds.
After caching, all of these images are served from the local cache:
Since this PR caches the sandbox (pause) image, we can also enable containerd to start on boot rather than starting it from the bootstrap script. Removing containerd startup from the bootstrap path results in a ~3 second reduction in the latency before kubelet starts (P50).
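For readers who want the general shape of the change, here is a minimal, illustrative sketch of pre-pulling an image into containerd's cache during an AMI build. It is not the actual script in this PR; the image reference and account ID are placeholders, and it assumes containerd and AWS CLI v2 are available on the build host:

```bash
#!/usr/bin/env bash
# Illustrative sketch of caching an image into containerd during an AMI build.
# Not the actual script from this PR; the image reference below is a placeholder.
set -euo pipefail

REGION="us-west-2"
PAUSE_IMAGE="<account>.dkr.ecr.${REGION}.amazonaws.com/eks/pause:3.5"   # placeholder

# Start containerd so we can populate its content store at build time.
sudo systemctl enable --now containerd

# Pull with an ECR token into the k8s.io namespace, where the kubelet/CRI will look.
ECR_PASSWORD="$(aws ecr get-login-password --region "${REGION}")"
sudo ctr --namespace k8s.io images pull \
  --user "AWS:${ECR_PASSWORD}" \
  "${PAUSE_IMAGE}"

# Verify the image is now present locally.
sudo ctr --namespace k8s.io images ls | grep pause
```

The detail that matters is pulling into containerd's k8s.io namespace, so that at boot the kubelet finds the images locally instead of reaching out to the registry.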
This graph shows the combined reduction in the time for a node to reach Ready, starting with kube-proxy & pause cached and then adding the VPC CNI images to the cache (P50):
Size of the containerd image cache:
Size of a running node without any cached images (so the minimal images pulled needed to start the node):
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.