sandbox-image service fault-tolerance & reliability #1034

maximethebault · 2022-09-26T23:03:39Z

There is an increasing number of reports of Kubernetes nodes failing to start because the sandbox-image service fails (kubernetes-sigs/karpenter#750, aws/karpenter-provider-aws#1917, aws/karpenter-provider-aws#2235 (comment), #990). The latest report is from just 3 hours ago.

The number of reports seems to have increased lately.
We've been encountering this issue at least twice a week on our cluster for the last 2 weeks. It's random and doesn't affect most nodes.
We didn't have the issue before that.
We made no change around that date that could have caused the issue.
We're using a non-customized version of the AMI with a simple Karpenter provisioner. We're willing to provide more details if needed.

Example logs from a node stuck in a NotReady state:

Sep 22 12:16:57 ip-10-50-85-81.eu-west-1.compute.internal systemd[1]: Starting pull sandbox image defined in containerd config.toml...
Sep 22 12:17:04 ip-10-50-85-81.eu-west-1.compute.internal pull-sandbox-image.sh[3906]: Unable to locate credentials. You can configure credentials by running "aws configure".
Sep 22 12:17:04 ip-10-50-85-81.eu-west-1.compute.internal sudo[4011]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/ctr --namespace k8s.io image pull 602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks
Sep 22 12:17:06 ip-10-50-85-81.eu-west-1.compute.internal pull-sandbox-image.sh[3906]: Password: panic: provided file is not a console
Sep 22 12:17:06 ip-10-50-85-81.eu-west-1.compute.internal pull-sandbox-image.sh[3906]: goroutine 1 [running]:

The important line seems to be the following:

Unable to locate credentials. You can configure credentials by running "aws configure".

Of course, when running the same "aws ecr get-login-password" command manually, everything works fine without having changed anything. The InstanceRole is correctly set up, which points to a transient failure that is not retried correctly.

There seem to be two issues:

1. There is some kind of service degradation affecting the "aws ecr get-login-password" command (ECR issue? IMDS issue?), or the increased number of reports could just be a coincidence.
2. This AWS ECR command is not wrapped in the retry logic of the pull-sandbox-image.sh script, but it probably should be.
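
To illustrate the second point, here is a minimal sketch of what wrapping the credential lookup in a retry could look like. It is not taken from pull-sandbox-image.sh: the `retry` helper, its argument order, and the doubling backoff are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT the actual code from pull-sandbox-image.sh.
#
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# DELAY_SECONDS must be an integer; it doubles after each failed attempt.
# Diagnostics go to stderr so that stdout only carries COMMAND's output.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: '$*' failed after ${attempt} attempts" >&2
      return 1
    fi
    echo "retry: attempt ${attempt} failed, sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Assumed usage (the variable names and region are illustrative only):
#   ecr_password="$(retry 5 1 aws ecr get-login-password --region eu-west-1)"
```

With something like this, a transient "Unable to locate credentials" error would trigger another attempt instead of passing an empty password on to the image pull.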


suket22 · 2022-09-27T17:21:14Z

> This AWS ECR command is not wrapped in the retry logic of the pull-sandbox-image.sh script, but it probably should

That's a good callout. I think we can definitely make that change.

> Unable to locate credentials. You can configure credentials by running "aws configure".

This one's a little surprising to me. Since it does appear to be transient, we'll try and make the first fix and look into any potential issue

maximethebault · 2023-01-04T23:27:03Z

This was fixed by aws/karpenter-provider-aws#938

cartermckinnon mentioned this issue Oct 10, 2022

cache pause, vpc-cni, and kube-proxy images in the AMI #938

Merged

maximethebault closed this as completed Jan 4, 2023
