Failed to pull and unpack sandbox image (i/o timeout) #1633
Comments
I'm also seeing the same thing in
Can you open a case with AWS Support so the ECR team can look into the timeouts? It seems like our retry logic in this script isn't working properly; fixing that should mitigate this in most cases. I'll get a PR out 👍
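For illustration only, a minimal sketch of the kind of retry-with-backoff wrapper being described, assuming crictl and ECR credentials via aws ecr get-login-password; the function name, attempt count, and delays are hypothetical and this is not the actual sandbox-image script:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a retry-with-backoff pull; not the actual EKS AMI script.
set -euo pipefail

SANDBOX_IMAGE="602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5"
REGION="us-east-1"  # assumption: the region is known or derived elsewhere

pull_sandbox_image_with_retries() {
  local attempts=5 delay=5 password
  for ((i = 1; i <= attempts; i++)); do
    # Refresh ECR credentials on every attempt, then pull via the CRI.
    if password=$(aws ecr get-login-password --region "${REGION}") &&
       crictl pull --creds "AWS:${password}" "${SANDBOX_IMAGE}"; then
      return 0
    fi
    echo "Pull attempt ${i}/${attempts} failed; retrying in ${delay}s..." >&2
    sleep "${delay}"
    delay=$((delay * 2))  # exponential backoff
  done
  return 1
}

pull_sandbox_image_with_retries
```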
@cartermckinnon Will do. Any improvement to the retry logic here as a mitigation would be great, thank you!
Cluster details: I've had a similar issue where pods were stuck in ContainerCreating with a warning in the Pod description. I checked the releases of the Amazon EKS AMI and noticed that v20240202 included some changes to the sandbox image, so I decided to upgrade to the latest version, which solved the issue for me.
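In case it helps anyone else triage this, a hedged example of surfacing that warning from the cluster side; the pod and namespace names are placeholders:

```bash
# Pods stuck in ContainerCreating are reported under the Pending phase.
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
# The sandbox image warning shows up under the Events section of the pod description.
kubectl describe pod <pod-name> -n <namespace>
```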
@soutar @cartermckinnon We encountered the same type of errors in v20240202. I also saw "You must specify a region" errors from the sandbox image services: 107df3f#diff-57a6aadbbb1d3df65f4675ae80c562f7e406bcb11e41f6afb974043a2ede0aa0R32
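For context, the AWS CLI prints "You must specify a region" when no region is configured. Below is a hedged sketch of deriving the region on the instance via IMDSv2 instead of relying on CLI configuration; it is not necessarily how the linked commit handles it:

```bash
# Illustrative only: fetch an IMDSv2 token, read the instance's region,
# and pass it explicitly so the CLI never needs a configured default.
TOKEN=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
REGION=$(curl -sf -H "X-aws-ec2-metadata-token: ${TOKEN}" \
  "http://169.254.169.254/latest/meta-data/placement/region")
aws ecr get-login-password --region "${REGION}" > /dev/null && echo "ECR auth OK in ${REGION}"
```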
@cartermckinnon FYI I built an AMI from 976fe67 and it seems to have fixed the retry behaviour. Once that commit makes it into a release I'd be happy to close this issue — what do you think?
I can still see this happening with the latest release.
@vitaly-dt 976fe67 is not in v20240209 as far as I can see, so that makes sense. I was able to deploy it in our infra by checking out the commit directly and using the build scripts in the repo to publish a private AMI.
You are correct, my bad.
Just wanted to confirm that the latest release, v20240209, fixed the issue for me.
#1649 went out in yesterday's release 👍
What happened:
sandbox-image.service failed with an i/o timeout error when pulling the 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5 image, and the instance therefore failed to join the EKS cluster. We first observed this at Feb 3, 2024 02:50:13.167 (UTC) and have seen multiple instances fail this way each day since then. We have observed the same problem on Kubernetes 1.27, 1.28, and 1.29.

What you expected to happen:
sandbox-image.service should pull 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5 successfully and allow the instance to join the EKS cluster.

How to reproduce it (as minimally and precisely as possible):
This is not easily reproducible and has only affected 7 of the 271 instances we launched via Karpenter in the last 24 hours. The other 264 instances successfully joined our cluster. As the instances are automatically terminated by Karpenter after 15 minutes, it is only possible to collect debug information if we catch this happening.
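When we do catch one in time, the following standard commands should capture the relevant state (illustrative; nothing here is specific to this repo's scripts):

```bash
# Run on the affected node before Karpenter terminates it.
systemctl status sandbox-image.service          # did the unit fail?
journalctl -u sandbox-image.service --no-pager  # pull attempts and i/o timeout messages
journalctl -u containerd --no-pager | tail -n 100
cat /etc/eks/release                            # AMI release information for the report
```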
Anything else we need to know?:
Environment:
EKS Platform version (aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.1
Kubernetes version (aws eks describe-cluster --name <name> --query cluster.version): 1.29
Kernel (uname -a): Linux ip-10-34-46-213.ec2.internal 5.10.205-195.807.amzn2.x86_64 #1 SMP Tue Jan 16 18:28:59 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Release information (cat /etc/eks/release on a node):