Failed to pull and unpack sandbox image (i/o timeout) #1633
Comments
I'm also seeing the same thing in
Can you open a case with AWS Support so the ECR team can look into the timeouts? It seems like our retry logic in this script isn't working properly; fixing that should mitigate this in most cases. I'll get a PR out 👍
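For illustration only, a minimal sketch of the kind of retry-with-backoff wrapper being described, assuming crictl and ECR credentials via aws ecr get-login-password; the function name, attempt count, and delays are hypothetical and this is not the actual sandbox-image script:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a retry-with-backoff pull; not the actual EKS AMI script.
set -euo pipefail

SANDBOX_IMAGE="602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"
REGION="us-east-1"  # assumption: the region is known or derived elsewhere

pull_sandbox_image_with_retries() {
  local attempts=5 delay=5 password
  for ((i = 1; i <= attempts; i++)); do
    # Refresh ECR credentials on every attempt, then pull via the CRI.
    if password=$(aws ecr get-login-password --region "${REGION}") &&
       crictl pull --creds "AWS:${password}" "${SANDBOX_IMAGE}"; then
      return 0
    fi
    echo "Pull attempt ${i}/${attempts} failed; retrying in ${delay}s..." >&2
    sleep "${delay}"
    delay=$((delay * 2))  # exponential backoff
  done
  return 1
}

pull_sandbox_image_with_retries
```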
@cartermckinnon Will do. Any improvement to the retry logic here as a mitigation would be great, thank you!
Cluster details: I've had a similar issue where pods were stuck in ContainerCreating with a warning in the Pod description. I checked the releases of the Amazon EKS AMI and noticed that v20240202 included some changes to the sandbox image, so I decided to upgrade to the latest version, which solved the issue for me.
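In case it helps anyone else triage this, a hedged example of surfacing that warning from the cluster side; the pod and namespace names are placeholders:

```bash
# Pods stuck in ContainerCreating are reported under the Pending phase.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# The sandbox image warning shows up under the Events section of the pod description.
kubectl describe pod <pod-name> -n <namespace>
```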
@soutar @cartermckinnon We encountered the same type of errors in v20240202. I also saw "You must specify a region" errors from the sandbox image services: 107df3f#diff-57a6aadbbb1d3df65f4675ae80c562f7e406bcb11e41f6afb974043a2ede0aa0R32
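For context, the AWS CLI prints "You must specify a region" when no region is configured. Below is a hedged sketch of deriving the region on the instance via IMDSv2 instead of relying on CLI configuration; it is not necessarily how the linked commit handles it:

```bash
# Illustrative only: fetch an IMDSv2 token, read the instance's region,
# and pass it explicitly so the CLI never needs a configured default.
TOKEN=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
REGION=$(curl -sf -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  "http://169.254.169.254/latest/meta-data/placement/region")
aws ecr get-login-password --region "${REGION}" > /dev/null && echo "ECR auth OK in ${REGION}"
```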
@cartermckinnon FYI I built an AMI from 976fe67 and it seems to have fixed the retry behaviour. Once that commit makes it into a release I'd be happy to close this issue — what do you think?
I can still see this happening with the latest release.
@vitaly-dt 976fe67 is not in v20240209 as far as I can see, so that makes sense. I was able to deploy it in our infra by checking out the commit directly and using the build scripts in the repo to publish a private AMI.
You are correct, my bad.
Just wanted to confirm that the latest release, v20240209, fixed the issue for me.
#1649 went out in yesterday's release 👍
What happened:
sandbox-image.service failed with an i/o timeout error when pulling the 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5 image, and the instance therefore failed to join the EKS cluster. We first observed this at Feb 3, 2024 02:50:13.167 (UTC) and have seen multiple instances fail this way each day since then. We have observed the same problem on Kubernetes 1.27, 1.28, and 1.29.

What you expected to happen:
sandbox-image.service should pull 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5 successfully and allow the instance to join the EKS cluster.

How to reproduce it (as minimally and precisely as possible):
This is not easily reproducible and has only affected 7 of the 271 instances we launched via Karpenter in the last 24 hours. The other 264 instances successfully joined our cluster. As the instances are automatically terminated by Karpenter after 15 minutes, it is only possible to collect debug information if we catch this happening.
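When we do catch one in time, the following standard commands should capture the relevant state (illustrative; nothing here is specific to this repo's scripts):

```bash
# Run on the affected node before Karpenter terminates it.
systemctl status sandbox-image.service          # did the unit fail?
journalctl -u sandbox-image.service --no-pager  # pull attempts and i/o timeout messages
journalctl -u containerd --no-pager | tail -n 100
cat /etc/eks/release                            # AMI release information for the report
```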
Anything else we need to know?:
Environment:
EKS Platform version (aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
Kubernetes version (aws eks describe-cluster --name <name> --query cluster.version): 1.29
Kernel (uname -a): Linux ip-10-34-46-213.ec2.internal 5.10.205-195.807.amzn2.x86_64 #1 SMP Tue Jan 16 18:28:59 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Release information (cat /etc/eks/release on a node):