sandbox-image service fault-tolerance & reliability #1034

Closed
maximethebault opened this issue Sep 26, 2022 · 2 comments


@maximethebault

There is an increasing number of reports around Kubernetes nodes failing to start because of the sandbox-image service failing (kubernetes-sigs/karpenter#750, aws/karpenter-provider-aws#1917, aws/karpenter-provider-aws#2235 (comment), #990). Latest report is from just 3 hours ago.

The number of reports seems to have increased lately.
We've been encountering this issue at least twice a week on our cluster for the last 2 weeks. It's random and doesn't affect most nodes.
We didn't have the issue before that.
No change around that date that could have caused the issue.
We're using a non-customized version of the AMI, with a simple Karpenter provisioner. We're willing to provide more details if needed.

Example logs from a node stuck in a NotReady state:

Sep 22 12:16:57 ip-10-50-85-81.eu-west-1.compute.internal systemd[1]: Starting pull sandbox image defined in containerd config.toml...
Sep 22 12:17:04 ip-10-50-85-81.eu-west-1.compute.internal pull-sandbox-image.sh[3906]: Unable to locate credentials. You can configure credentials by running "aws configure".
Sep 22 12:17:04 ip-10-50-85-81.eu-west-1.compute.internal sudo[4011]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/ctr --namespace k8s.io image pull 602401143452.dkr.ecr.eu-west-1.amazonaws.com/eks
Sep 22 12:17:06 ip-10-50-85-81.eu-west-1.compute.internal pull-sandbox-image.sh[3906]: Password: panic: provided file is not a console
Sep 22 12:17:06 ip-10-50-85-81.eu-west-1.compute.internal pull-sandbox-image.sh[3906]: goroutine 1 [running]:

The important line seems to be the following:

Unable to locate credentials. You can configure credentials by running "aws configure".

Of course, when running the same "aws ecr get-login-password" command manually, everything works fine without any changes. The InstanceRole is correctly set up, which means this is a transient failure that is not retried correctly.

There seem to be two issues:

  1. There is some kind of service degradation affecting the "aws ecr get-login-password" command (ECR issue? IMDS issue?) - or the increased number of reports could just be a coincidence.
  2. This AWS ECR command is not wrapped in the retry logic of the pull-sandbox-image.sh script, but it probably should be.
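
To illustrate the second point, here is a minimal sketch of what wrapping the credentials call in a retry could look like. This is not the actual pull-sandbox-image.sh code: the `retry` helper and the `RETRY_ATTEMPTS` and `RETRY_DELAY` variables are hypothetical names made up for this example, and the delay is assumed to be a whole number of seconds.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from pull-sandbox-image.sh.
# Runs the given command until it succeeds, up to RETRY_ATTEMPTS times
# (default 5), sleeping RETRY_DELAY seconds (default 1) after the first
# failure and doubling the sleep after each further failure.
retry() {
  local attempts="${RETRY_ATTEMPTS:-5}"
  local delay="${RETRY_DELAY:-1}"
  local n=1
  while true; do
    # Run the command exactly as passed; stop on the first success.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is used.
    if [ "$n" -ge "$attempts" ]; then
      echo "retry: '$*' failed after ${n} attempts" >&2
      return 1
    fi
    echo "retry: attempt ${n} failed, retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    n=$(( n + 1 ))
  done
}

# Example use (assumes the AWS CLI is installed and $region is set):
#   password="$(retry aws ecr get-login-password --region "$region")"
```

The helper's own messages go to stderr, so the command substitution in the example would only capture what the wrapped command writes to stdout. A short backoff of this kind would presumably cover the case where instance credentials are not yet available when the service starts, though the root cause of that delay is still unknown.
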
@suket22
Member

suket22 commented Sep 27, 2022

This AWS ECR command is not wrapped in the retry logic of the pull-sandbox-image.sh script, but it probably should

That's a good callout. I think we can definitely make that change.

Unable to locate credentials. You can configure credentials by running "aws configure".

This one's a little surprising to me. Since it does appear to be transient, we'll try and make the first fix and look into any potential issue

@maximethebault
Author

This was fixed by aws/karpenter-provider-aws#938
