
ReadTimeoutErrors on 169.254.169.254 /latest/api/token #880

Closed
ivasic opened this issue Jul 3, 2020 · 10 comments · Fixed by #884

ivasic (Contributor) commented Jul 3, 2020

After updating the agent to version 5.8.0, all of my Django manage.py commands seem to hang indefinitely and never finish. The first thing I'm seeing is:

WARNING urllib3.connectionpool Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=3.0)")': /latest/api/token connectionpool.py:749

and then (after some time):

ERROR elasticapm.transport Closing the transport connection timed out. base.py:261

After this last error message the process just hangs indefinitely and has to be killed manually.
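For context, the /latest/api/token path in that warning is the AWS IMDSv2 token endpoint, which the agent's new cloud metadata collection queries on startup. A minimal sketch of the kind of request involved (illustration only, using urllib3 directly; not the agent's actual code):

import urllib3

# Illustration only -- not the agent's actual code. The warning above comes from
# an IMDSv2 token fetch: a PUT to the EC2 instance metadata service. When the
# PUT gets no response (as in this EB Docker setup), the call blocks for the
# full read timeout and urllib3's default retries repeat it.
http = urllib3.PoolManager()
resp = http.request(
    "PUT",
    "http://169.254.169.254/latest/api/token",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    timeout=urllib3.Timeout(read=3.0),
)
token = resp.data.decode("utf-8")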

To Reproduce

I'm not able to reproduce this locally; it seems to happen exclusively in my staging/production environments on AWS.
There, running any Django management command is enough to make it hang. For example:

$ python manage.py migrate
Operations to perform:
  Apply all migrations: admin, auth, contenttypes, sessions
Running migrations:
  No migrations to apply.
2020-07-03 13:17:19,256 WARNING  urllib3.connectionpool Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=3.0)")': /latest/api/token connectionpool.py:749
2020-07-03 13:17:27,164 ERROR    elasticapm.transport Closing the transport connection timed out. base.py:261

Environment

  • OS: Linux 6094537dd0b9 4.14.181-108.257.amzn1.x86_64 #1 SMP Wed May 27 02:43:03 UTC 2020 x86_64 GNU/Linux (Docker container - python:3.8 image, docker running on Amazon Linux/2.15.2)
  • Python version: 3.8.3
  • Framework and version: Django 3.0.8 (same thing happens with 3.0.7)
  • APM Server version: 7.5.2
  • Agent version: 5.8.0 (I can confirm this doesn't happen with v5.7.0)

Additional context

The app runs in an AWS Elastic Beanstalk Docker environment with the python:3.8 image, pretty straightforward.

  • requirements.txt:

    appdirs==1.4.4
    asgiref==3.2.10; python_version >= '3.5'
    boto3==1.14.16
    botocore==1.17.16
    certifi==2020.6.20
    cffi==1.14.0
    chardet==3.0.4
    cryptography==2.9.2; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'
    django-migration-linter==2.3.0
    django-redis==4.12.1
    django-sortedm2m==3.0.0
    django-storages==1.9.1
    django-widget-tweaks==1.4.8
    django==3.0.8
    djangorestframework-simplejwt==4.4.0
    djangorestframework==3.11.0
    docutils==0.15.2; python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'
    elastic-apm==5.8.0
    factory-boy==2.12.0
    faker==4.1.1; python_version >= '3.4'
    future==0.18.2; python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'
    googlemaps==4.4.1
    gunicorn==20.0.4
    idna==2.10; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    jmespath==0.10.0; python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'
    jwcrypto==0.7
    markdown==3.2.2
    oauthlib==3.1.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    pillow==7.2.0
    psycopg2-binary==2.8.5
    pycparser==2.20; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    pyjwt==1.7.1
    python-dateutil==2.8.1; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    python-twitter==3.5
    pytz==2020.1
    pyyaml==5.3.1
    redis==3.5.3; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'
    requests-oauthlib==1.3.0
    requests==2.24.0
    s3transfer==0.3.3
    sentry-sdk==0.16.0
    six==1.15.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    sqlparse==0.3.1; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    text-unidecode==1.3
    twilio==6.43.0
    ua-parser==0.10.0
    uritemplate==3.0.1
    urllib3==1.25.9; python_version != '3.4'
    user-agents==2.1
    
michaelhelmick commented Jul 6, 2020

Can confirm this happens. Currently attempting to set CLOUD_PROVIDER to False in settings.py while sticking with 5.8.0.

Update:

I set the variable as shown below and am still seeing the read timeout:

ELASTIC_APM = {
    'SERVICE_NAME': 'project',
    'CLOUD_PROVIDER': False,
    ...
}

basepi (Contributor) commented Jul 6, 2020

Thanks for the report! This is certainly related to the new cloud metadata we collect, but I'll need to investigate more to figure out what's going wrong, especially with CLOUD_PROVIDER=False.

basepi self-assigned this Jul 6, 2020
basepi added the bug label Jul 6, 2020
basepi (Contributor) commented Jul 7, 2020

@michaelhelmick Can you show me the ReadTimeouts that you're getting? I cannot see a way that we should be hitting any of the cloud metadata collection with 'CLOUD_PROVIDER': False, so I want to make sure that your timeout is actually the same as @ivasic's.

basepi (Contributor) commented Jul 7, 2020

@michaelhelmick Also are you in AWS or can you reproduce this outside of a cloud provider?

@michaelhelmick

> @michaelhelmick Also are you in AWS or can you reproduce this outside of a cloud provider?

It's on the same /latest/api/token endpoint. I curl'd inside my container and it wasn't available there either, but other metadata endpoints were.

I am on AWS and don't have time to try and reproduce outside of a cloud provider, sorry!

basepi (Contributor) commented Jul 7, 2020

@michaelhelmick That's fine. I haven't been able to reproduce the hang, though I can reproduce the "Closing the transport connection timed out" error, and I know how to fix the ReadTimeout warning logs. I'm pretty sure all of these issues are due to the metadata generation moving into the transport thread as part of adding the cloud metadata. I'm still digging.
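As a rough mental model of why that can show up as both the close timeout and a hang (a simplified sketch, not the agent's actual code): if the transport thread is still blocked in a slow metadata request when the client shuts down, the join times out, and a non-daemon thread also keeps the interpreter alive.

import threading
import time

# Simplified model, not the agent's code: a worker thread stuck in a slow
# blocking call (standing in for the metadata request) while close() tries
# to join it with a timeout.
def worker():
    time.sleep(10)  # stands in for a metadata HTTP call that blocks/retries

t = threading.Thread(target=worker)
t.start()

t.join(timeout=5)  # close() waits, then gives up
if t.is_alive():
    print("Closing the transport connection timed out.")
# Because the thread is non-daemon, the interpreter also cannot exit until
# worker() returns -- which is one way a short management command could
# appear to hang.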

basepi (Contributor) commented Jul 7, 2020

@ivasic I've never been able to reproduce your full hang. However, I fixed the other errors you were seeing in #884. It was a combination of a missing retries=False on the token call for the AWS metadata, and the fact that in Docker containers on EB, the PUT to the token endpoint fails, and fails slowly. I fixed the timeouts, removed the retries, and put error handling around the token request:

root@44f0a8d1a404:/app# time python manage.py migrate
Operations to perform:
  Apply all migrations: admin, auth, contenttypes, sessions
Running migrations:
  No migrations to apply.

real	0m1.445s
user	0m0.484s
sys	0m0.052s

It's still slower than I'd like, since it spends the whole 1.0-second timeout on the token call, but I think it's reasonable now.
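For the curious, the shape of that fix looks roughly like this (a sketch only; the fetch_aws_token helper name is hypothetical and this is not necessarily the exact code in #884):

import urllib3

def fetch_aws_token(http, timeout=1.0):
    # Sketch of the fix described above: a short fixed timeout, retries
    # disabled, and error handling so an unreachable token endpoint (as on
    # EB Docker) just means "no IMDSv2 token" instead of a hang.
    try:
        resp = http.request(
            "PUT",
            "http://169.254.169.254/latest/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
            timeout=urllib3.Timeout(total=timeout),
            retries=False,
        )
        if resp.status == 200:
            return resp.data.decode("utf-8")
    except Exception:
        pass
    return None

With retries disabled, the worst case is a single one-second wait instead of several three-second attempts.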

I would love it if you could test this for me, to see if you are still hanging. You can drop this into your requirements.txt file (in place of elastic-apm==5.7.0) to test against my branch:

git+git://github.com/basepi/apm-agent-python@cloudmetadata#egg=elastic-apm

@michaelhelmick

If it helps, mine appeared in Celery logs on ECS. It did not appear in migrate or any web logs.

ivasic (Contributor, Author) commented Jul 8, 2020

It seems to have helped! 🎉
I'm not seeing the error log anymore, and it's no longer stuck. Thanks a bunch, @basepi!

I'm not sure how your change stopped the hangs, but I'm glad it worked. I wonder if it had something to do with the retries, maybe?

basepi (Contributor) commented Jul 8, 2020

That's my best guess. But it still shouldn't have hung more than about 10-12 seconds, so I'm definitely confused...

In any case, I'm glad the fix is working!
