
Error when retrieving credentials from iam-role: Credential refresh failed, response did not contain: access_key, secret_key, token, expiry_time #1617

Closed
kylegalbraith opened this issue Nov 28, 2018 · 19 comments

Comments

@kylegalbraith

We are seeing a strange issue relating to boto3 and botocore. The following error is being thrown sporadically when we try to read from S3 or utilize an SQS client.

Error when retrieving credentials from iam-role: Credential refresh failed, response did not contain: access_key, secret_key, token, expiry_time

It appears that the credentials are not being refreshed correctly via the assumed IAM role. This is a Python application running inside a Docker container within EKS. An example piece of code is below.

def fetch_message(s3, bucket, key):
    # This call intermittently fails with the credential refresh error above.
    response = s3.get_object(Bucket=bucket, Key=key)
    return response

Does anybody have any ideas why this is happening and whether or not this is a known issue with boto?

@JordonPhillips
Contributor

It looks like you're sourcing credentials from the EC2 Instance Metadata and the request to fetch them failed. By default we don't retry those requests, but you can add retries with metadata_service_num_attempts and metadata_service_timeout in the config file (docs).
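For reference, a rough sketch of what that configuration can look like. The config-file keys are exactly the two named above; the equivalent environment variables (AWS_METADATA_SERVICE_NUM_ATTEMPTS and AWS_METADATA_SERVICE_TIMEOUT) come up later in this thread. The values below are placeholders, not tuned recommendations.

import os

# Equivalent shared-config settings (~/.aws/config, under the active profile):
#   metadata_service_num_attempts = 5
#   metadata_service_timeout = 10
os.environ["AWS_METADATA_SERVICE_NUM_ATTEMPTS"] = "5"
os.environ["AWS_METADATA_SERVICE_TIMEOUT"] = "10"

import boto3

# Set the variables before the first client/session is created so botocore
# picks them up when resolving credentials from the instance metadata service.
s3 = boto3.client("s3")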

@kylegalbraith
Author

@JordonPhillips Thank you for the response. So would this be a matter of adding those environment variables to the container where this is happening?

@TattiQ

TattiQ commented Jan 17, 2019

In my case I get the same error, but according to the kube2iam logs the request for creds does not fail.

@TattiQ

TattiQ commented Jan 17, 2019

@JordonPhillips What if we are sourcing creds from a k8s pod that runs on a k8s worker node (an EC2 instance), so not directly on the EC2 instance? Do we set AWS_METADATA_SERVICE_NUM_ATTEMPTS as an env var on the pod? Does that still apply then? Thanks!

@shshe

shshe commented May 6, 2019

I'm also using kube2iam to have a pod assume an IAM role and am seeing this error sporadically. It sometimes happens at the start of the container, but we've also seen it happen after the container's been running for a while. Any suggestions on workarounds? We've set AWS_METADATA_SERVICE_NUM_ATTEMPTS but it seems to have no effect.

@TattiQ

TattiQ commented May 7, 2019

@shshe What does the botocore debug log say? Also, do you use celery?
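For anyone needing to capture that output, a minimal sketch of enabling botocore debug logging (boto3.set_stream_logger is part of boto3; everything else is the standard logging module):

import logging

import boto3

# Stream botocore's DEBUG output, including credential-provider activity and
# the requests made to the instance metadata service, to stderr.
boto3.set_stream_logger("botocore", logging.DEBUG)

s3 = boto3.client("s3")
s3.list_buckets()  # credential resolution shows up in the log on the first call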

@shshe

shshe commented May 7, 2019

Hi @TattiQ, unfortunately we didn't have DEBUG-level logging turned on. I think our issue lies in kube2iam introducing latency when querying the EC2 metadata URI. Here's the related issue:

jtblin/kube2iam#31

We're currently trying the workaround of setting AWS_METADATA_SERVICE_NUM_ATTEMPTS and increasing AWS_METADATA_SERVICE_TIMEOUT. So far, the issue hasn't cropped up again.

Edit: Yes, we do use celery and saw this in our celery app. But we've also seen this issue crop up in a Kubernetes job that used multiprocessing + boto.

@spinus

spinus commented Mar 5, 2020

Hey guys, did you figure out this issue by any chance?

@swetashre
Contributor

Following up on this issue. The solution provided by @JordonPhillips here should fix it. Is anyone still getting the error? If so, please open a new issue; I would be happy to help.

@no-response

no-response bot commented Mar 31, 2020

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

no-response bot closed this as completed Mar 31, 2020
@chrisbrownnwyc

chrisbrownnwyc commented Jul 20, 2020

We're having this issue specifically with K8S as well. Did setting the Environment variables work?

@kylegalbraith
Author

We're having this issue specifically with K8S as well. Did setting the Environment variables work?

Yes, setting the environment variables (specifically the retry attempts) seems to have mostly resolved the issue for us. We are also in an EKS K8s environment.

@martimors

Hi, I was facing this issue running Python in a pod in an EKS cluster, and at first glance the retries/timeout solution seems to have worked. Did anyone figure out why these requests fail? I've seen pods restart hundreds of times because of this, and I'm curious if there is something in the EKS setup that could mitigate it.

@sethatron

Bump on this, I am also seeing this issue in kube2iam/EKS

@martimors

Yep, still seeing this one year later.

@ypicard

ypicard commented Oct 21, 2021

Just noticed this too. Hundreds of restarts in one night, when it never happened in the last 6 months. No configuration changes or anything.

@alizdavoodi

This is also happening for us (python ---> kube2iam ---> AWS).
We switched to a service account (the native EKS way of authenticating) because of this.
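For context, a sketch of what that switch looks like from the application side, assuming IAM roles for service accounts (IRSA) and a reasonably recent boto3/botocore: the EKS pod identity webhook injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE into the pod, and botocore's web-identity credential provider uses them instead of the instance metadata service, so no application code changes are needed.

import os

import boto3

# Injected by the EKS pod identity webhook when the service account is
# annotated with an IAM role; both should be non-empty inside the pod.
print(os.environ.get("AWS_ROLE_ARN"))
print(os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE"))

s3 = boto3.client("s3")  # no explicit credential handling needed
s3.list_buckets()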

@arvindsree

This seems to be the error returned while handling boto/boto3#1751. The workaround when we hit this issue was to re-attach the instance metadata; increasing the retry attempts did not do anything.

@madina-iz

I ran into this issue while working in a Jupyter notebook on an EC2 instance. When it first started happening this month, all I had to do was rerun the code and it would work again on the second or third try. However, additional attempts stopped working for me this week. After struggling with a few different options that didn't work in my case, I finally decided to upgrade my Python (from 3.6 to 3.7) and dask (from 2021.11.01 to 2022.2.0), and that fixed the issue for me completely.
