
Sandbox container image being GC'd in 1.29 #1597

Closed
nightmareze1 opened this issue Jan 29, 2024 · 65 comments · Fixed by #1605

Comments

@nightmareze1

nightmareze1 commented Jan 29, 2024

AMI: amazon-eks-node-1.29-v20240117

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.eu-west-2.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized

This started 1 day after upgrading EKS to 1.29.

@cartermckinnon
Member

cartermckinnon commented Jan 29, 2024

It sounds like something deleted your pause container image.

I would check:

  1. Make sure that the --pod-infra-container-image flag passed to kubelet matches the sandbox_image in /etc/containerd/config.toml. This will prevent kubelet from deleting it during its image garbage collection process.
  2. Look for RemoveImage CRI calls in your containerd logs. It's likely that some other CRI client (not kubelet) is deleting the image.
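A quick way to run both checks from a node (a minimal sketch; the paths match the AL2 defaults shown later in this thread, and RemoveImage calls may only show up in the containerd journal if CRI debug logging is enabled):

  # 1. Compare the kubelet flag with containerd's configured sandbox image
  ps -ef | grep -o 'pod-infra-container-image=[^ ]*'
  grep sandbox_image /etc/containerd/config.toml

  # 2. Look for RemoveImage CRI calls in the containerd journal
  journalctl -u containerd --since "1 day ago" | grep -i RemoveImage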

@nightmareze1
Author

[~]# systemctl status kubelet

          └─3729 /usr/bin/kubelet --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime-endpoint unix:///run/containerd/containerd.sock --image-credential-provider-config /etc/eks/image-credential-provider/config.json --image-credential-provider-bin-dir /etc/eks/image-credential-provider --pod-infra-container-image=602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5 --v=2 

[~]# cat /etc/containerd/config.toml |grep 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5

sandbox_image = "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5"

@jrsparks86

We have also noticed this issue after updating to 1.29. If we rotate out the nodes, it recovers for some time, then the problem comes back a day later.

@nightmareze1
Author

I'm using a temporary workaround proposed by someone in the issue created in the aws-node repo (I modified it a little, but it works):

# Install crictl v1.29.0
curl -fsL -o crictl.tar.gz https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-amd64.tar.gz
tar zxf crictl.tar.gz
chmod u+x crictl
mv crictl /usr/bin/crictl


# Write a small script that re-pulls the pause image with fresh ECR credentials.
# @@@ is a placeholder so the command substitution isn't expanded while writing the heredoc;
# the sed below turns it back into "$" in the resulting file.
cat <<EOF > /etc/eks/eks_creds_puller.sh
IMAGE_TOKEN=@@@(aws ecr get-login-password --region eu-west-2)
crictl --runtime-endpoint=unix:///run/containerd/containerd.sock pull --creds "AWS:\$IMAGE_TOKEN" 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5
EOF

sed -i 's/@@@/\$/g' /etc/eks/eks_creds_puller.sh

chmod u+x /etc/eks/eks_creds_puller.sh

# Run the puller every 5 minutes via cron.
echo "*/5 * * * * /etc/eks/eks_creds_puller.sh >> /var/log/eks_creds_puller 2>&1" | crontab -

@nightmareze1 changed the title from "Pods stuck in ContainerCreating due to pull error unauthorized" to "Pods stuck in ContainerCreating due to pull error 401 Unauthorized" on Jan 29, 2024
@ohrab-hacken

I'm experiencing the same issue. The --pod-infra-container-image flag is set on kubelet. I found that the disk on my node really does fill up after some time, and the kubelet garbage collector then deletes the pause image. So instead of deleting other images, it deletes the pause image, and once the pause image is gone the node stops working.
I found the reason for the full disk: in my case I had ttlSecondsAfterFinished: 7200 for Dagster jobs, and they consumed all the disk space. I changed it to ttlSecondsAfterFinished: 120 so jobs are cleaned up more frequently, and we don't have this issue any more.
It's strange because I didn't have this issue on 1.28, and I didn't change any Dagster configuration between version upgrades. My guess is that the kubelet image garbage collector works differently in 1.28 and 1.29.
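If you suspect the same failure mode, a quick way to check from a node and from kubectl (a minimal sketch; ttlSecondsAfterFinished is the standard Kubernetes Job field mentioned above, and 85% is the kubelet's default image GC trigger):

  # How full is the filesystem holding images? Kubelet image GC kicks in around 85% by default
  df -h /var/lib/containerd

  # List finished Jobs and their TTL so long-lived ones that pile up on disk stand out
  kubectl get jobs -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,TTL:.spec.ttlSecondsAfterFinished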

@ghost

ghost commented Jan 30, 2024

We're experiencing the same issue as well.

@wiseelf

wiseelf commented Jan 30, 2024

I'm having that same issue after upgrading to 1.29 on both AL2 and Bottlerocket nodes.

@havilchis

The kubelet flag --pod-infra-container-image is deprecated in 1.27+ [https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/]. In the current implementation, the image GC reads the image properties set by the container runtime.

In the case of containerd, the GC should skip images tagged with the property "pinned": true.

And containerd should flag the sandbox_image as pinned [https://github.com/containerd/containerd/pull/7944].

I believe this issue is related to containerd and the sandbox_image.

Although it is set in config.toml, it is not flagged as "pinned": true.

I don't know whether this is a general issue in containerd, but at least in my EKS cluster on 1.29 the sandbox image appears as "pinned": false:

./crictl images | grep pause | grep us-east-1 | grep pause
602401143452.dkr.ecr-fips.us-east-1.amazonaws.com/eks/pause                    3.5                          6996f8da07bd4       299kB
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause                         3.5                          6996f8da07bd4       299kB

./crictl inspecti 6996f8da07bd4 | grep pinned
    "pinned": false

@cartermckinnon
Member

It definitely seems like image pinning is the problem here. I'm trying to put a fix together 👍

@cartermckinnon
Member

I think the issue here is the version of containerd being used by Amazon Linux does not have pinned image support, which was added in 1.7.3: containerd/containerd@v1.7.2...v1.7.3

I'm verifying that this hasn't been cherry-picked by the AL team. We'll probably have to do a hotfix in the immediate term.
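To check which containerd a node is running (a minimal sketch; pinned-image support landed in containerd 1.7.3):

  # Either of these shows the containerd version installed on an AL2 node
  containerd --version
  rpm -q containerd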

@cartermckinnon
Member

AL intends to push containerd-1.7.11 to the package repositories soon, but I'll go ahead and put together a hotfix on our end.

@cartermckinnon
Member

cartermckinnon commented Jan 30, 2024

I think the best bandaid for now is to periodically pull the sandbox image (if necessary); that's what #1601 does. @mmerkes @suket22 PTAL.
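A rough sketch of that "pull only if necessary" approach, for anyone hand-rolling it before the fix lands (the actual change in #1601 may differ; the region and runtime endpoint follow the earlier examples in this thread):

  # Re-pull the sandbox image only when it has gone missing
  SANDBOX_IMAGE=$(awk -F'"' '/sandbox_image/ {print $2}' /etc/containerd/config.toml)
  if ! crictl --runtime-endpoint unix:///run/containerd/containerd.sock inspecti "$SANDBOX_IMAGE" >/dev/null 2>&1; then
    crictl --runtime-endpoint unix:///run/containerd/containerd.sock pull \
      --creds "AWS:$(aws ecr get-login-password --region eu-west-2)" "$SANDBOX_IMAGE"
  fi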

@cartermckinnon changed the title from "Pods stuck in ContainerCreating due to pull error 401 Unauthorized" to "Sandbox container image being GC'd in 1.29" on Jan 30, 2024
@Idan-Lazar

any updates?

@StefanoMantero

We're experiencing the same issue as well, though it's pretty random. Any updates?

@dekelummanu

+1

@spatelwearpact

None of our applications or jobs are running in the cluster now! This is literally the highest priority issue with 1.29!

@Tenzer

Tenzer commented Jan 31, 2024

A small workaround I've done on our end to help alleviate the issue is to give the nodes in the cluster a bigger disk. This means it will take longer for the nodes to use enough disk space to trigger the garbage collection that deletes the pause image.
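For context on why a bigger disk helps (a minimal sketch; 85/80 are the kubelet's default image GC thresholds, and the config path matches the kubelet invocation shown earlier in this thread):

  # Image GC starts when disk usage crosses imageGCHighThresholdPercent (default 85)
  # and frees space down to imageGCLowThresholdPercent (default 80);
  # if the grep prints nothing, the defaults are in effect
  grep -i imagegc /etc/kubernetes/kubelet/kubelet-config.json
  df -h /var/lib/containerd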

@wiseelf

wiseelf commented Jan 31, 2024

A small workaround I've done on our end to help alleviate the issue is to give the nodes in the cluster a bigger disk. This means it will take longer for the nodes to use enough disk space to trigger the garbage collection that deletes the pause image.

I did the same; it just increases the time before the issue occurs and brings additional expense. I agree that this is a top-priority issue, because it is impossible to downgrade to 1.28 without recreating the cluster.

@cartermckinnon
Member

cartermckinnon commented Jan 31, 2024

The way we pull the image is part of the problem: the pinned label is only applied (with containerd 1.7.3+) at pull time by the CRI server in containerd, so ctr pull won't do the trick.
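In other words, the pull has to go through the CRI server for the label to be applied. A minimal sketch of the difference (region, account, and tag follow the earlier examples; containerd 1.7.3+ is required for the pinned label to appear at all):

  # Pull via CRI (crictl) so containerd can mark the configured sandbox_image as pinned
  crictl --runtime-endpoint unix:///run/containerd/containerd.sock pull \
    --creds "AWS:$(aws ecr get-login-password --region eu-west-2)" \
    602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5
  crictl inspecti 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5 | grep pinned

  # A plain "ctr -n k8s.io images pull ..." bypasses the CRI server, so no pinned label is set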

@dims
Member

dims commented Jan 31, 2024

cc @henry118

@cartermckinnon
Member

cartermckinnon commented Jan 31, 2024

While we work to get a fix out, swapping out the sandbox container image to one that doesn't require ECR credentials is another workaround:

  • registry.k8s.io/pause:3.9
  • public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest

@mlagoma

mlagoma commented Jan 31, 2024

While we work to get a fix out, swapping out the sandbox container image to one that doesn't require ECR credentials is another workaround:

  • registry.k8s.io/pause:3.9
  • public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest

Greetings, does anybody have any guidance on how I can make this modification to my EKS cluster? Is it part of the Dockerfile build of the container image? The kube deployment manifest (which uses my container image)? Somewhere else? Better to just wait it out for the fix?

@dims
Member

dims commented Jan 31, 2024

@mlagoma /etc/containerd/config.toml is the configuration file for containerd; you will see an entry (key/value) for sandbox_image, which usually points to an image in ECR. @cartermckinnon was talking about switching that.

However, it is better to talk to AWS support and get help if you are not comfortable.
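A minimal sketch of that swap on a running node (assumes the EKS AMI layout where containerd runs under systemd; as noted later in the thread, the AMI's init scripts may write the original value back on reboot):

  # Point containerd at a pause image that doesn't require ECR credentials, then restart it
  sed -i 's|^\(\s*\)sandbox_image = .*|\1sandbox_image = "public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest"|' /etc/containerd/config.toml
  systemctl restart containerd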

@dims
Member

dims commented Feb 5, 2024

@doramar97 downgrade workers with image v1.28 and it's better to wait a few weeks with the update, because they don't test anything (i.e. they test it in production with customers)

you are welcome to do what works for you. please bear with us as this was a tricky one.

@dims
Member

dims commented Feb 5, 2024

any updates on bottlerocket ?

@marcin99 if you need a solid ETA for production, it's better to approach via support escalation channels. suffice to say, it's in progress.

@tzneal
Contributor

tzneal commented Feb 5, 2024

@marcin99 I'm not sure that I can downgrade EKS version without replacing the cluster with a new one, It is a production cluster and i'm looking for a reliable fix until they will issue a fix.

The Bottlerocket team confirmed that the DaemonSet prevention solution I posted above works for Bottlerocket as well.

@marcin99

marcin99 commented Feb 5, 2024

@doramar97 you don't need to downgrade the cluster version; you can use the image from the previous version for the workers.

@RamazanBiyik77

This issue should be fixed in AMI release v20240202. We were able to include containerd-1.7.11 which properly reports the sandbox_image as pinned to kubelet, after the changes in #1605.

How can I apply these changes to my existing AMI?

I can confirm that my sandbox image is still 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5.

@RamazanBiyik77

Okay, found it in the AWS EKS Compute section. There was a notification for the new AMI release.

@odellcraig

After reading through the thread, I see that this is fixed with v20240202. To apply this change, do you have to update the launch template to point at the new AMI? I see that a new EKS cluster I created yesterday via Terraform is using the latest AMI (ami-0a5010afd9acfaa26 - amazon-eks-node-1.29-v20240227), but a cluster I created about a month ago, before this change, is still on ami-0c482d7ce1aa0dd44 (amazon-eks-node-1.29-v20240117). Is there a way to tell my existing clusters to use the latest AMI?

@bryantbiggs
Contributor

@odellcraig you do that via the release_version
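For managed node groups outside Terraform, the same update can be done with the AWS CLI (a sketch; the cluster and node group names are placeholders):

  aws eks update-nodegroup-version \
    --cluster-name my-cluster \
    --nodegroup-name my-nodegroup \
    --release-version 1.29.0-20240202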

@odellcraig

@bryantbiggs Thank you.

For anyone using Terraform and eks_managed_node_groups, you can specify it using:

eks_managed_node_groups = {
    initial = {
      ami_release_version = "1.29.0-20240227" # this is the latest version as of this comment
      name           = "..."
      instance_types = [...]
      min_size       = ...
      max_size       = ...
      desired_size   = ...
...

@korncola

korncola commented Jun 3, 2024

can you please fix the damn issue after half a year? Still happens with EKS managed nodegroup and AMI
amazon/amazon-eks-node-1.29-v20240522

Error in kubelet on node:
unexpected status from HEAD request to https://602401143452.dkr.ecr.eu-central-1.amazonaws.com/v2/eks/pause/manifests/3.5: 403 Forbidden"

migration to EKS halted here

@shamallah

Same error with amazon-eks-node-1.29-v20240315
failed" error="failed to pull and unpack image \"602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5\": failed to copy: httpReadSeeker: failed open: unexpected status code https://602401143452.dkr.ecr.eu-central-1.amazonaws.com/v2/eks/pause/blobs/sha256:6996f8da07bd405c6f82a549ef041deda57d1d658ec20a78584f9f436c9a3bb7: 403 Forbidden"

@tzneal
Contributor

tzneal commented Jun 3, 2024

Are the permissions on your node role correct per https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html? Specifically, does it have the AmazonEC2ContainerRegistryReadOnly policy?
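One way to verify from the CLI (a sketch; the role name is a placeholder for your node instance role):

  # Should list AmazonEC2ContainerRegistryReadOnly among the attached policies
  aws iam list-attached-role-policies --role-name my-eks-node-role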

@shamallah

AmazonEC2ContainerRegistryReadOnly policy?

The policy is attached.

@korncola

korncola commented Jun 3, 2024

Policy AmazonEC2ContainerRegistryReadOnly is attached here also.
Can't you just use a REAL public repo instead of this half-baked, half-private/public repo in the configs and init scripts? Hacking the bootstrapping script to use public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest works, but only until reboot, because the init scripts will always put this non-working URL back in /etc/containerd/config.toml.

@cartermckinnon
Member

@korncola can you open a ticket with AWS support so we can look into the specifics of your environment?

@korncola

korncola commented Jun 3, 2024

thanks @cartermckinnon , will do that.
But still: Why no true public repo?!

I created a cluster via Terraform and the GUI and triple-checked the policies. I also disabled all SCPs. Still the same error.
Node groups with the AL2023 image or AL2 had no success either.

@cartermckinnon
Member

ECR Public is only hosted in a few regions, so we still use regional ECR repositories for lower latency and better availability. ECR Public also has a monthly bandwidth limit for anonymous pulls that cannot be increased, so if you're using it in production, make sure you're not sending anonymous requests.
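If you do pull from ECR Public in production, authenticating avoids the anonymous bandwidth limit (a sketch; ECR Public auth tokens are issued from us-east-1, and the docker login step assumes a Docker-compatible client):

  aws ecr-public get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin public.ecr.aws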

@korncola

korncola commented Jun 3, 2024

[...] and better availability. [...]

Yeah, I see the availability in this and the other tickets...

ECR Public also has a monthly bandwidth limit for anonymous pulls that cannot be increased;

As I said above, use a real public service...
And AWS owns that service, so make it worth it...
These are bad excuses for this design decision. Sorry for my rant, but I don't get these decisions; when I look at the scripts with all the hardcoded account IDs used to compose an ECR repo URL, with scripts inside scripts inside scripts, I mean, come on, you can do better at AWS.

But as always, in the end I will have some typo or whatever on my side causing my ECR pull error, and you will all laugh at me :-)

@bryantbiggs
Contributor

@korncola let's keep it professional. The best course of action is to work with the team through the support ticket. There are many factors that go into decisions that users are not usually aware of. The team is very responsive in terms of investigating and getting a fix rolled out (as needed).

@korncola

korncola commented Jun 3, 2024

Yep, you are right 👍 The team here is very helpful and responsive, thank you for the support! I will report back when the issue is resolved so others can use that info.

@mlagoma

mlagoma commented Jun 4, 2024

If I understand correctly, the same or a similar (in that it will definitely occur over time) bug was perhaps reintroduced/introduced? So should it be advised to not upgrade nodes? Or is this a separate issue (e.g. anonymous pulls)?

No sign of the issue on older version (1.29.0-20240202)

@cartermckinnon
Member

No, at this point we don’t have evidence of a new bug or a regression.

I’m going to lock this thread to avoid confusion, please open a new issue for follow-ups.

@awslabs locked this issue as resolved and limited conversation to collaborators on Jun 4, 2024