There is an increasing number of reports of Kubernetes nodes failing to start because of the sandbox-image service failing (kubernetes-sigs/karpenter#750, aws/karpenter-provider-aws#1917, aws/karpenter-provider-aws#2235 (comment), #990). The latest report is from just 3 hours ago.
The number of reports seems to have increased lately.
We've been encountering this issue at least twice a week on our cluster for the last 2 weeks. It's random and only affects a minority of nodes.
We didn't have the issue before then, and we made no change around that date that could have caused it.
We're using a non-customized version of the AMI with a simple Karpenter provisioner. We're happy to provide more details if needed.
Example of logs for a node stuck in a NotReady state:
```
Sep 22 12:16:57 ip-10-50-85-81.eu-west-1.compute.internal systemd[1]: Starting pull sandbox image defined in containerd config.toml...
Sep 22 12:17:04 ip-10-50-85-81.eu-west-1.compute.internal pull-sandbox-image.sh[3906]: Unable to locate credentials. You can configure credentials by running "aws configure".
Sep 22 12:17:04 ip-10-50-85-81.eu-west-1.compute.internal sudo[4011]: root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/ctr --namespace k8s.io image pull 602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks
Sep 22 12:17:06 ip-10-50-85-81.eu-west-1.compute.internal pull-sandbox-image.sh[3906]: Password: panic: provided file is not a console
Sep 22 12:17:06 ip-10-50-85-81.eu-west-1.compute.internal pull-sandbox-image.sh[3906]: goroutine 1 [running]:
```
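For anyone hitting the same thing, these lines can be pulled straight from the affected node; this assumes the `sandbox-image.service` unit name used by the EKS-optimized AMI:

```bash
# Status of the unit that runs pull-sandbox-image.sh
systemctl status sandbox-image.service

# Full logs for that unit since boot
journalctl -u sandbox-image.service --no-pager
```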
The important line seems to be the following:

```
Unable to locate credentials. You can configure credentials by running "aws configure".
```
Of course, when we run the same `aws ecr get-login-password` command manually on the affected node, everything works fine without anything having changed. The InstanceRole is correctly set up, which means this is a transient failure that is not being retried correctly.
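One way to check whether IMDS actually had instance role credentials at the time would be something along these lines (a minimal sketch using the standard IMDSv2 endpoints; nothing here is specific to this AMI):

```bash
# Request an IMDSv2 session token (5-minute TTL)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")

# List the role(s) attached to the instance; an empty or error response
# here would explain the "Unable to locate credentials" message
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
```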
There seem to be two issues:

1. Some kind of service degradation is affecting the `aws ecr get-login-password` command (an ECR issue? an IMDS issue?), or the increased number of reports could just be a coincidence.
2. The `aws ecr get-login-password` call is not wrapped in the retry logic of the pull-sandbox-image.sh script, but it probably should be; see the sketch below.
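A minimal sketch of what that wrapping could look like. This is not the actual script code: the function name, attempt count, and delay are made up for illustration, and `SANDBOX_IMAGE` stands in for the image defined in containerd's config.toml:

```bash
#!/usr/bin/env bash
# Hypothetical retry wrapper around the ECR login step of
# pull-sandbox-image.sh; a sketch of the proposed fix, not the real script.

ecr_login_with_retries() {
  local region="$1" attempts=5 delay=5 i password
  for ((i = 1; i <= attempts; i++)); do
    # Succeeds once the instance role credentials are available via IMDS
    if password=$(aws ecr get-login-password --region "$region"); then
      printf '%s' "$password"
      return 0
    fi
    echo "ECR login attempt ${i}/${attempts} failed; retrying in ${delay}s..." >&2
    sleep "$delay"
  done
  return 1
}

# Example usage, mirroring the ctr invocation from the logs above
password=$(ecr_login_with_retries "eu-west-1") || exit 1
sudo ctr --namespace k8s.io image pull --user "AWS:${password}" "${SANDBOX_IMAGE}"
```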