
ReadTimeoutErrors on 169.254.169.254 /latest/api/token #880

Closed
ivasic opened this issue Jul 3, 2020 · 10 comments · Fixed by #884

ivasic (Contributor) commented Jul 3, 2020

After updating the agent to version 5.8.0, all of my Django manage.py commands seem to hang indefinitely and never finish. The first thing I'm seeing is:

WARNING urllib3.connectionpool Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=3.0)")': /latest/api/token connectionpool.py:749

and then (after some time):

ERROR elasticapm.transport Closing the transport connection timed out. base.py:261

After this last error message the process just hangs indefinitely and has to be killed manually.
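For context, the /latest/api/token path in that warning is the AWS IMDSv2 token endpoint, which the agent's new cloud metadata collection queries on startup. A minimal sketch of the kind of request involved (illustration only, using urllib3 directly; not the agent's actual code):

import urllib3

# Illustration only -- not the agent's actual code. The warning above comes from
# an IMDSv2 token fetch: a PUT to the EC2 instance metadata service. When the
# PUT gets no response (as in this EB Docker setup), the call blocks for the
# full read timeout and urllib3's default retries repeat it.
http = urllib3.PoolManager()
resp = http.request(
    "PUT",
    "http://169.254.169.254/latest/api/token",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    timeout=urllib3.Timeout(read=3.0),
)
token = resp.data.decode("utf-8")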

To Reproduce

I'm not able to reproduce this locally; it seems to happen exclusively in my staging/production environments on AWS.
There, running any Django management command is enough to make it hang. For example:

$ python manage.py migrate
Operations to perform:
  Apply all migrations: admin, auth, contenttypes, sessions
Running migrations:
  No migrations to apply.
2020-07-03 13:17:19,256 WARNING  urllib3.connectionpool Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='169.254.169.254', port=80): Read timed out. (read timeout=3.0)")': /latest/api/token connectionpool.py:749
2020-07-03 13:17:27,164 ERROR    elasticapm.transport Closing the transport connection timed out. base.py:261

Environment

  • OS: Linux 6094537dd0b9 4.14.181-108.257.amzn1.x86_64 #1 SMP Wed May 27 02:43:03 UTC 2020 x86_64 GNU/Linux (Docker container - python:3.8 image, docker running on Amazon Linux/2.15.2)
  • Python version: 3.8.3
  • Framework and version: Django 3.0.8 (same thing happens with 3.0.7)
  • APM Server version: 7.5.2
  • Agent version: 5.8.0 (I can confirm this doesn't happen with v5.7.0)

Additional context

The app runs in an AWS Elastic Beanstalk Docker environment with the python:3.8 image, pretty straightforward.

  • requirements.txt:

    appdirs==1.4.4
    asgiref==3.2.10; python_version >= '3.5'
    boto3==1.14.16
    botocore==1.17.16
    certifi==2020.6.20
    cffi==1.14.0
    chardet==3.0.4
    cryptography==2.9.2; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'
    django-migration-linter==2.3.0
    django-redis==4.12.1
    django-sortedm2m==3.0.0
    django-storages==1.9.1
    django-widget-tweaks==1.4.8
    django==3.0.8
    djangorestframework-simplejwt==4.4.0
    djangorestframework==3.11.0
    docutils==0.15.2; python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'
    elastic-apm==5.8.0
    factory-boy==2.12.0
    faker==4.1.1; python_version >= '3.4'
    future==0.18.2; python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'
    googlemaps==4.4.1
    gunicorn==20.0.4
    idna==2.10; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    jmespath==0.10.0; python_version >= '2.6' and python_version not in '3.0, 3.1, 3.2, 3.3'
    jwcrypto==0.7
    markdown==3.2.2
    oauthlib==3.1.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    pillow==7.2.0
    psycopg2-binary==2.8.5
    pycparser==2.20; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    pyjwt==1.7.1
    python-dateutil==2.8.1; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    python-twitter==3.5
    pytz==2020.1
    pyyaml==5.3.1
    redis==3.5.3; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3, 3.4'
    requests-oauthlib==1.3.0
    requests==2.24.0
    s3transfer==0.3.3
    sentry-sdk==0.16.0
    six==1.15.0; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    sqlparse==0.3.1; python_version >= '2.7' and python_version not in '3.0, 3.1, 3.2, 3.3'
    text-unidecode==1.3
    twilio==6.43.0
    ua-parser==0.10.0
    uritemplate==3.0.1
    urllib3==1.25.9; python_version != '3.4'
    user-agents==2.1
    
michaelhelmick commented Jul 6, 2020

Can confirm this happens. Currently attempting to set CLOUD_PROVIDER to False in settings.py while sticking with 5.8.0.

Update:

I set the variable as shown below and am still seeing the read timeout:

ELASTIC_APM = {
    'SERVICE_NAME': 'project',
    'CLOUD_PROVIDER': False,
    ...
}

basepi (Contributor) commented Jul 6, 2020

Thanks for the report! This is certainly related to the new cloud metadata we collect, but I'll need to investigate more to figure out what's going wrong, especially with CLOUD_PROVIDER=False.

basepi self-assigned this Jul 6, 2020
basepi added the bug label Jul 6, 2020
basepi (Contributor) commented Jul 7, 2020

@michaelhelmick Can you show me the ReadTimeouts that you're getting? I cannot see a way that we should be hitting any of the cloud metadata collection with 'CLOUD_PROVIDER': False, so I want to make sure that your timeout is actually the same as @ivasic's.

basepi (Contributor) commented Jul 7, 2020

@michaelhelmick Also are you in AWS or can you reproduce this outside of a cloud provider?

@michaelhelmick

> @michaelhelmick Also are you in AWS or can you reproduce this outside of a cloud provider?

It's on the same /latest/api/token endpoint. I curl'd inside my container and it wasn't available there either, but other metadata endpoints were.

I am on AWS and don't have time to try and reproduce outside of a cloud provider, sorry!

basepi (Contributor) commented Jul 7, 2020

@michaelhelmick That's fine. I haven't been able to reproduce the hang, though I can reproduce the "Closing the transport connection timed out" error, and I know how to fix the ReadTimeout warning logs. I'm pretty sure all of these issues are due to the metadata generation moving into the transport thread as part of adding the cloud metadata. I'm still digging.
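As a rough mental model of why that can show up as both the close timeout and a hang (a simplified sketch, not the agent's actual code): if the transport thread is still blocked in a slow metadata request when the client shuts down, the join times out, and a non-daemon thread also keeps the interpreter alive.

import threading
import time

# Simplified model, not the agent's code: a worker thread stuck in a slow
# blocking call (standing in for the metadata request) while close() tries
# to join it with a timeout.
def worker():
    time.sleep(10)  # stands in for a metadata HTTP call that blocks/retries

t = threading.Thread(target=worker)
t.start()

t.join(timeout=5)  # close() waits, then gives up
if t.is_alive():
    print("Closing the transport connection timed out.")
# Because the thread is non-daemon, the interpreter also cannot exit until
# worker() returns -- which is one way a short management command could
# appear to hang.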

basepi (Contributor) commented Jul 7, 2020

@ivasic I've never been able to reproduce your full hang. However, I fixed the other errors you were seeing in #884. It was a combination of a missing retries=False on the token call for the AWS metadata, and the fact that in Docker containers on EB, the PUT to the token endpoint fails, and fails slowly. I fixed the timeouts, removed the retries, and put error handling around the token request:

root@44f0a8d1a404:/app# time python manage.py migrate
Operations to perform:
  Apply all migrations: admin, auth, contenttypes, sessions
Running migrations:
  No migrations to apply.

real	0m1.445s
user	0m0.484s
sys	0m0.052s

It's still slower than I'd like, since it spends the whole 1.0-second timeout on the token call, but I think it's reasonable now.
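For the curious, the shape of that fix looks roughly like this (a sketch only; the fetch_aws_token helper name is hypothetical and this is not necessarily the exact code in #884):

import urllib3

def fetch_aws_token(http, timeout=1.0):
    # Sketch of the fix described above: a short fixed timeout, retries
    # disabled, and error handling so an unreachable token endpoint (as on
    # EB Docker) just means "no IMDSv2 token" instead of a hang.
    try:
        resp = http.request(
            "PUT",
            "http://169.254.169.254/latest/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
            timeout=urllib3.Timeout(total=timeout),
            retries=False,
        )
        if resp.status == 200:
            return resp.data.decode("utf-8")
    except Exception:
        pass
    return None

With retries disabled, the worst case is a single one-second wait instead of several three-second attempts.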

I would love it if you could test this for me, to see if you are still hanging. You can drop this into your requirements.txt file (in place of elastic-apm==5.7.0) to test against my branch:

git+git://github.com/basepi/apm-agent-python@cloudmetadata#egg=elastic-apm

@michaelhelmick

If it helps, mine appeared in Celery logs on ECS. It did not appear in migrate or any web logs.

ivasic (Contributor, Author) commented Jul 8, 2020

It seems to have helped! 🎉
I'm not seeing the error log anymore, and it's no longer stuck. Thanks a bunch, @basepi!

I'm not sure how your change stopped the hangs, but I'm glad it worked. I wonder if it had something to do with the retries, maybe?

basepi (Contributor) commented Jul 8, 2020

That's my best guess. But it still shouldn't have hung more than about 10-12 seconds, so I'm definitely confused...

In any case, I'm glad the fix is working!
