Apparently random NoCredentialsError after running for a while #1006

Open · jsoucheiron opened this issue Apr 18, 2023 · 20 comments

Labels: bug (Something isn't working)

Comments

jsoucheiron commented Apr 18, 2023

Describe the bug
We have an aiohttp server that sends SQS messages as a result of certain actions. After running for a while, we'll get:

Traceback (most recent call last):
  File "/usr/local/skyscanner/app/services/sqs.py", line 21, in send_message
    await client.send_message(
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/client.py", line 141, in _make_api_call
    http, parsed_response = await self._make_request(
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/client.py", line 161, in _make_request
    return await self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 77, in _send_request
    request = await self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/endpoint.py", line 70, in create_request
    await self._event_emitter.emit(event_name, request=request,
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/hooks.py", line 27, in _emit
    response = await handler(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/signers.py", line 16, in handler
    return await self.sign(operation_name, request)
  File "/usr/local/lib/python3.10/site-packages/aiobotocore/signers.py", line 63, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.10/site-packages/botocore/auth.py", line 378, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

Our code that triggers the issue in production, where we use IAM roles:

from typing import Any, Dict, Optional

from aiobotocore.session import get_session

# json_dumps_extended and logger are application-level helpers (not shown here).


class SQSService:
    def __init__(self, sqs_region: str, sqs_url: str):
        self.default_source = "unknown"
        self.sqs_region = sqs_region
        self.sqs_url = sqs_url

    async def send_message(self, pushed_data: Dict[str, Any], data_type: str, source: Optional[str]):
        try:
            # A fresh session and client are created on every call.
            session = get_session()
            async with session.create_client("sqs", region_name=self.sqs_region) as client:
                await client.send_message(
                    QueueUrl=self.sqs_url,
                    MessageBody=json_dumps_extended(
                        {"pushed_data": pushed_data, "data_type": data_type, "source": source or self.default_source}
                    ),
                )
        except Exception:
            logger.exception(f"Something went wrong in SQS upload of {pushed_data}")

We've tried multiple versions, including 2.0.0 and 2.5.0.

After many, many tests trying to find a way to reproduce the issue locally, we've managed to mitigate it using backoff. When we do, this is what we get:
[screenshot of the resulting logs]

This leads me to believe there's a race condition somewhere that only triggers after running for a while, where you might temporarily end up with missing credentials.

Checklist

  • I have reproduced this in an environment where pip check passes without errors
  • I have provided pip freeze results
  • I have provided sample code or a detailed way to reproduce
  • I have tried the same code in botocore to ensure this is an aiobotocore-specific issue
  • I have tried similar code in aiohttp to ensure this is an aiobotocore-specific issue
  • I have checked the latest and older versions of aiobotocore/aiohttp/python to see if this is a regression / injection

pip freeze results

$ pip freeze
aiobotocore==2.0.0
aiocache==0.12.0
aiodns==3.0.0
aiohttp==3.8.1
aioitertools==0.11.0
aiosignal==1.3.1
aiotask-context==0.6.1
alembic==1.0.11
async-timeout==4.0.2
asyncpg==0.27.0
asyncpgsa==0.27.1
attrs==23.1.0
backoff==2.2.1
basictracer==3.2.0
black==23.3.0
boto3==1.19.8
botocore==1.22.8
Brotli==1.0.9
build==0.10.0
cachetools==5.3.0
cchardet==2.1.7
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==2.1.1
click==8.1.3
coverage==7.2.3
cryptography==3.4.8
Deprecated==1.2.13
exceptiongroup==1.1.1
flake8==6.0.0
frozenlist==1.3.3
googleapis-common-protos==1.59.0
grpcio==1.53.0
gunicorn==20.1.0
idna==3.4
importlib-metadata==6.4.1
iniconfig==2.0.0
isort==5.12.0
Jinja2==3.1.2
jmespath==0.10.0
jq==1.4.1
jsonpickle==3.0.1
lightstep==4.4.8
Mako==1.2.4
markdown-it-py==2.2.0
MarkupSafe==2.1.2
mccabe==0.7.0
mdurl==0.1.2
moto==4.1.7
multidict==6.0.4
mypy-extensions==1.0.0
object-pool==0.2
opentelemetry-api==1.15.0
opentelemetry-exporter-otlp==1.15.0
opentelemetry-exporter-otlp-proto-grpc==1.15.0
opentelemetry-exporter-otlp-proto-http==1.15.0
opentelemetry-instrumentation==0.36b0
opentelemetry-instrumentation-aiohttp-client==0.36b0
opentelemetry-instrumentation-logging==0.36b0
opentelemetry-opentracing-shim==0.36b0
opentelemetry-propagator-ot-trace==0.36b0
opentelemetry-proto==1.15.0
opentelemetry-sdk==1.15.0
opentelemetry-semantic-conventions==0.36b0
opentelemetry-util-http==0.36b0
opentracing==2.4.0
packaging==23.1
pathspec==0.11.1
pbr==5.11.1
pip-tools==6.13.0
platformdirs==3.2.0
pluggy==1.0.0
pprintpp==0.4.0
protobuf==3.20.3
psycopg2-binary==2.9.6
pycares==4.3.0
pycodestyle==2.10.0
pycparser==2.21
pydantic==1.10.7
pyflakes==3.0.1
pyformance==0.4
Pygments==2.15.0
pyproject_hooks==1.0.0
pytest==7.3.1
pytest-aiohttp==1.0.4
pytest-asyncio==0.21.0
pytest-clarity==1.0.1
pytest-cov==2.12.1
pytest-env==0.6.2
pytest-mock==1.12.1
python-dateutil==2.8.2
python-editor==1.0.4
PyYAML==6.0
requests==2.28.2
responses==0.23.1
rich==13.3.4
s3transfer==0.5.2
six==1.16.0
SQLAlchemy==1.3.24
statsd==3.3.0
thrift==0.16.0
toml==0.10.2
tomli==2.0.1
types-PyYAML==6.0.12.9
types-requests==2.28.11.17
types-urllib3==1.26.25.10
typing_extensions==4.5.0
urllib3==1.26.15
uvloop==0.17.0
Werkzeug==2.2.3
wrapt==1.15.0
xmltodict==0.13.0
yarl==1.8.2
zipp==3.15.0

Environment:

  • Python Version: [e.g. 3.9, 3.10, 3.11]
  • OS name and version: [e.g. linux(python-slim docker)]
  • We haven't been able to reproduce this with Python 3.9 and aiobotocore==1.2.2

Additional context
Happy to provide any further context to help resolve this.

@thehesiod (Collaborator)

there's definitely something funky with creds based on all the recent issues logged. we need a reliable test case where we can compare debug botocore and aiobotocore logs

@jsoucheiron (Author)

I wish I could provide it but I haven't managed to reproduce this locally yet, just in production after running for a while.

@DavidRomanovizc added the bug label on May 2, 2023

hbikki commented May 8, 2023

I noticed a similar issue when reading/writing to S3 with a process count > 5 on version 2.4.2.

@thehesiod (Collaborator)

any interesting info with debug level logging?
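
For anyone trying this: botocore and aiobotocore both log through the standard logging module, so debug output can be turned on with a few lines (a sketch):

import logging

# Surface botocore's credential-resolution/signing output and aiobotocore's
# request handling in the application logs.
logging.basicConfig(level=logging.INFO)
logging.getLogger("botocore").setLevel(logging.DEBUG)
logging.getLogger("aiobotocore").setLevel(logging.DEBUG)

The botocore.credentials logger in particular records which provider supplied credentials and when they are refreshed, which is the interesting part for this issue.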

@jsoucheiron (Author)

To add some additional context on this that might help untangle the issue:

  • SQSService is long lived
  • session = get_session() will be called multiple times
  • get_session() calls might happen during token rotation

Would it be a better approach to have a long-lived session instantiated in the class instead of creating a new one every time send_message() is called?

@thehesiod (Collaborator)

long lived session/client always preferred. botocore should take care of refreshing credentials.

@jsoucheiron (Author)

jsoucheiron commented May 9, 2023 via email

@thehesiod (Collaborator)

Can you try once the release with #1022 is available?

@thehesiod (Collaborator)

could be related to #1025, I'd try once that release is available (later today)

@thehesiod (Collaborator)

Actually, the important part isn't the session, it's the client: you should keep your client around for as long as possible. A client is tied to a connection pool, so it's expensive to keep re-creating them.

@thehesiod (Collaborator)

To debug this I really need a reproducible test case. I have my own AWS account, so if you can create a fully encapsulated test case I can try to debug this; otherwise there just isn't enough for me to go on here and I'll have to close it. Another option is to create a test case using moto.
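
A minimal, self-contained skeleton along those lines might look like this (a sketch only: it uses moto's server mode with static placeholder credentials, so it exercises repeated send_message calls against a local endpoint but not the IAM-role credential refresh that production relies on, and therefore may not trigger the original error):

import asyncio

from aiobotocore.session import get_session
from moto.server import ThreadedMotoServer


async def main():
    # Run moto in server mode so the async client can talk to it over HTTP.
    server = ThreadedMotoServer(port=5001)
    server.start()
    try:
        session = get_session()
        async with session.create_client(
            "sqs",
            region_name="us-east-1",
            endpoint_url="http://127.0.0.1:5001",
            aws_access_key_id="testing",       # placeholder credentials for moto
            aws_secret_access_key="testing",
        ) as client:
            queue = await client.create_queue(QueueName="repro-queue")
            # Hammer send_message to look for sporadic failures.
            for _ in range(1000):
                await client.send_message(
                    QueueUrl=queue["QueueUrl"], MessageBody="ping"
                )
    finally:
        server.stop()


asyncio.run(main())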

@jsoucheiron (Author)

The problem is that, since the client is an async context manager, there's no nice/elegant way to have a long-lived client. You'd need to enter it manually and create some teardown hook to exit.

@thehesiod (Collaborator)

sure there is, we do this all the time:

import contextlib

from aiobotocore.session import get_session


class SQSService:
    def __init__(self, sqs_region: str, sqs_url: str):
        self.default_source = "unknown"
        self.sqs_region = sqs_region
        self.sqs_url = sqs_url
        self._session = get_session()
        self._exit_stack = contextlib.AsyncExitStack()

    async def __aenter__(self):
        # Enter the client's context manager once and keep it for the lifetime
        # of the service; the exit stack owns the teardown.
        self._client = await self._exit_stack.enter_async_context(
            self._session.create_client("sqs", region_name=self.sqs_region)
        )
        return self

    async def __aexit__(self, *args):
        await self._exit_stack.__aexit__(*args)
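
For completeness, here is one way that pattern could be wired into the aiohttp server described at the top of the issue (a usage sketch only: the region, queue URL, and the assumption that send_message is reworked to reuse self._client are placeholders, not taken from this thread):

from aiohttp import web

SQS_URL = "https://sqs.eu-west-1.amazonaws.com/..."  # placeholder


async def sqs_service_ctx(app: web.Application):
    # Enter the service once at startup; cleanup_ctx resumes the generator at
    # shutdown, which exits the context manager and closes the client.
    async with SQSService(sqs_region="eu-west-1", sqs_url=SQS_URL) as sqs:
        app["sqs_service"] = sqs
        yield


app = web.Application()
app.cleanup_ctx.append(sqs_service_ctx)

Handlers can then call methods on request.app["sqs_service"], so a single client (and its connection pool) is reused across requests instead of being created per message.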

@jsoucheiron (Author)

This is the kind of pattern I'd love to see documented.
If there are certain ways of using the library that minimize load or are generally best practices given how it operates internally, we should make this explicit in the docs so people can adopt these patterns.

@thehesiod (Collaborator)

I think we assumed it was common knowledge but open to PRs / issues to add to docs

@thehesiod (Collaborator)

I'd like to be able to get to the bottom of what's causing this issue as well though. Unfortunately we'll need some sort of way to reproduce

@nishantkumar1292

Do we have a solution for this yet? I'm still experiencing this. However, I thought the issue was not random, and occurred with almost every call. But it could be because I only looked at later logs.

I wonder if explicitly passing access_key and secret_key would resolve this?
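
For reference, create_client does accept explicit credentials, though hard-coding static keys bypasses the IAM-role provider (and its rotation) rather than fixing any refresh race (a sketch; ACCESS_KEY, SECRET_KEY, and QUEUE_URL are placeholders):

import asyncio

from aiobotocore.session import get_session

# Placeholders -- in practice these would come from configuration or a secret
# store, never from source code.
ACCESS_KEY = "AKIA..."
SECRET_KEY = "..."
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/..."


async def send_with_static_keys(body: str) -> None:
    session = get_session()
    async with session.create_client(
        "sqs",
        region_name="eu-west-1",
        aws_access_key_id=ACCESS_KEY,
        aws_secret_access_key=SECRET_KEY,
    ) as client:
        await client.send_message(QueueUrl=QUEUE_URL, MessageBody=body)


asyncio.run(send_with_static_keys("ping"))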

@thehesiod (Collaborator)

we need a way to repro or a detailed analysis from someone who can repro

github-actions (bot)

This issue has been marked as stale because it has been inactive for more than 60 days. Please update this pull request or it will be automatically closed in 7 days.

github-actions bot added the Stale label on Sep 21, 2024
@thehesiod (Collaborator)

What's funny is that we're hitting something like this as well on AWS. I'm guessing the AWS call periodically fails for some reason.

github-actions bot removed the Stale label on Sep 22, 2024