Intermittent DefaultCredentialsError on GCE #211

dhermes · 2017-11-09T18:33:02Z

Original issue: googleapis/google-cloud-python#4358

After credentials are successfully obtained with credentials, _ = google.auth.default(), a later call in the same application crashes when credentials cannot be detected:

...
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/client.py", line 212, in __init__
    Client.__init__(self, credentials=credentials, _http=_http)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/client.py", line 125, in __init__
    credentials, _ = google.auth.default()
  File "/usr/local/lib/python2.7/dist-packages/google/auth/_default.py", line 286, in default
    raise exceptions.DefaultCredentialsError(_HELP_MESSAGE)
DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or
explicitly create credential and re-run the application. For more
information, please see
https://developers.google.com/accounts/docs/application-default-credentials.

/cc @dmho418


dhermes · 2017-11-09T18:35:39Z

Also from @dmho418 (may be unrelated to this issue, but is probably the root cause):

...
  File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/upload.py", line 97, in transmit
    retry_strategy=self._retry_strategy)
  File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/requests/_helpers.py", line 101, in http_request
    func, RequestsMixin._get_status_code, retry_strategy)
  File "/usr/local/lib/python2.7/dist-packages/google/resumable_media/_helpers.py", line 146, in wait_and_retry
    response = func()
  File "/usr/local/lib/python2.7/dist-packages/google/auth/transport/requests.py", line 176, in request
    self._auth_request, method, url, request_headers)
  File "/usr/local/lib/python2.7/dist-packages/google/auth/credentials.py", line 121, in before_request
    self.refresh(request)
  File "/usr/local/lib/python2.7/dist-packages/google/auth/compute_engine/credentials.py", line 93, in refresh
    raise exceptions.RefreshError(exc)
RuntimeError: RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true

dhermes · 2017-11-09T18:37:10Z

@jonparrott Would there be a way to access the number of retries that were exceeded?

theacodes · 2017-11-09T18:37:43Z

It'll be the requests / urllib3 default

dhermes · 2017-11-09T19:39:43Z

- Failing method (invoked by / raised from refresh())
- Function throwing exception (actual line where thrown)
- Transport used
- Exception being raised (from urllib3)

Default retry strategy for the requests transport:

>>> import requests
>>> session = requests.Session()
>>> adapter1, adapter2 = session.adapters.values()
>>> adapter1.max_retries
Retry(total=0, connect=None, read=False, redirect=None, status=None)
>>> adapter2.max_retries
Retry(total=0, connect=None, read=False, redirect=None, status=None)

From the docstring:

By default, Requests does not retry failed connections.

dhermes · 2017-11-09T20:58:13Z

I'm trying to reproduce on GCE with:

$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install -y python-dev python-pip
$ sudo -H pip install --upgrade google-auth requests
$ cat << EOF > repro.py
> import pdb
> import sys
> import time
>
> import google.auth
>
>
> def main():
>     if len(sys.argv) < 2:
>         num_attempts = 100
>     else:
>         num_attempts = int(sys.argv[1])
>     for n in range(num_attempts):
>         print(n)
>         try:
>             credentials, project = google.auth.default()
>         except:
>             pdb.set_trace()
>             raise
>         time.sleep(0.5)
>
>
> if __name__ == '__main__':
>     main()
> EOF
$
$ python repro.py 10
$ python repro.py 100
$ python repro.py 1000

and am not having any luck.

If I had to guess, I'd say the only way to reproduce would be to stress out the instance (e.g. high CPU usage) so that the process handling the metadata server fails.

antoineazar · 2017-11-16T20:14:03Z

We see this issue in our setup as well: a Python script connects to BQ and runs hundreds of queries in rapid-fire sequence. We'll occasionally see:
A INFO:google.auth.compute_engine._metadata:Compute Engine Metadata server unavailable.

followed by a crash:
google.auth.exceptions.DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credential and re-run the application.

dhermes · 2017-11-16T20:24:51Z

Awesome data @antoineazar! Thanks for the confirmation.

Any code you could share so we could try to reproduce / strategize?

antoineazar · 2017-12-01T04:06:03Z

@dhermes any code that runs multiple queries (dozens usually suffice) on BQ in rapid sequence. Here is some simple sample code (pre-0.28 library; I didn't test with 0.28) that you can wrap in a loop:

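import uuid

from google.cloud import bigquery  # pre-0.28 google-cloud-bigquery API

# Placeholder values (hypothetical); substitute your own query and destination.
query = 'SELECT 1'
dest_dataset_id = 'my_dataset'
dest_table_id = 'my_table'

client = bigquery.Client()
query_job = client.run_async_query(str(uuid.uuid4()), query)

# Use standard SQL syntax.
query_job.use_legacy_sql = False

# Set a destination table.
dest_dataset = client.dataset(dest_dataset_id)
dest_table = dest_dataset.table(dest_table_id)
query_job.destination = dest_table

# Allow the results table to be overwritten.
query_job.write_disposition = 'WRITE_TRUNCATE'

query_job.begin()
query_job.result()  # Wait for query to finish.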

GEverding · 2018-03-01T17:15:20Z

I'm seeing this issue in production. Is there a workaround/fix?

theacodes · 2018-03-01T17:23:51Z

@GEverding the recommendation right now is to use a service account keyfile instead of relying on the GCE metadata service.

It's possible that we could retry failed connections to the metadata service, but I'm unsure about that at the moment.

GEverding · 2018-03-01T17:29:40Z

Thanks. Is that documented somewhere?


-- Garrett

theacodes · 2018-03-01T18:10:41Z

https://cloud.google.com/docs/authentication/getting-started

vanpelt · 2018-04-26T16:43:09Z

I'm fairly regularly getting "[Errno 111] Connection refused", which triggers the same exceptions @dhermes mentioned, on my App Engine flex deployment. Is the best solution getting a keyfile into my flex deployment? I'm going to try passing in a requests adapter that has retry logic configured, but it's frustrating that I've already spent this much time on such a fundamental feature of this library.

theacodes · 2018-04-26T17:49:05Z

@vanpelt yep, that is a completely acceptable approach.
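A minimal sketch of the adapter approach, assuming requests and urllib3 are installed: mount an HTTPAdapter configured with a urllib3 Retry policy so that refused connections are retried with backoff instead of failing on the first attempt (as noted earlier, plain requests does not retry at all). The helper name make_retrying_session and the retry numbers are illustrative choices, not values recommended in this thread.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total=5, backoff_factor=0.5):
    """Return a requests.Session whose adapters retry failed connections.

    Hypothetical helper; the defaults are illustrative, not tuned values.
    """
    retry = Retry(
        total=total,
        connect=total,  # retry connection errors such as [Errno 111] Connection refused
        read=total,
        backoff_factor=backoff_factor,  # exponentially growing sleep between attempts
        status_forcelist=(500, 502, 503, 504),  # also retry transient 5xx responses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # The metadata server is plain HTTP, so mount the adapter on both schemes.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

The session can then be wrapped in google-auth's requests transport, e.g. request = google.auth.transport.requests.Request(session=make_retrying_session()), and used for credentials.refresh(request). Whether every code path in your client library uses that request object is worth checking for the version you run.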

KevinTydlacka · 2018-07-09T22:23:58Z

I'm also seeing these errors. Stack trace below:

DefaultCredentialsError: Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://developers.google.com/accounts/docs/application-default-credentials.
at default (/env/local/lib/python2.7/site-packages/google/auth/_default.py:306)
at create_channel (/env/local/lib/python2.7/site-packages/google/api_core/grpc_helpers.py:170)
at __init__ (/env/local/lib/python2.7/site-packages/google/cloud/pubsub_v1/publisher/client.py:81)
at handle_asc_sub_notif (/home/vmagent/app/main.py:101)
at dispatch_request (/env/local/lib/python2.7/site-packages/flask/app.py:1598)
at full_dispatch_request (/env/local/lib/python2.7/site-packages/flask/app.py:1612)
at handle_user_exception (/env/local/lib/python2.7/site-packages/flask/app.py:1517)
at full_dispatch_request (/env/local/lib/python2.7/site-packages/flask/app.py:1614)
at wsgi_app (/env/local/lib/python2.7/site-packages/flask/app.py:1982)

I'm running a very, very simple App Engine app in the Flexible Python 2.7 environment. It really doesn't do anything more than the sample found here, using the exact code: https://github.com/GoogleCloudPlatform/python-docs-samples/tree/master/appengine/flexible/pubsub

I ingest POST data within a Flask app, then publish it to a topic using the pubsub_v1 client. There are no other errors in the logs before or after, and the app/method successfully processes other requests within seconds before the failed attempt and within a minute after.

This is a little frustrating/concerning since I'm basically seeing the issue using the provided GAE sample code. I'm updating my app and pushing a service account secrets json with my code, and will see what happens if I manually create the credentials for the client instead, but would love to see this resolved.

ocervell · 2018-09-03T01:18:28Z

Still seeing this today. Is there any workaround if storing the credentials file on the server is not possible (it's riskier than default service account credentials)?
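For deployments that cannot ship a keyfile, one application-level workaround (a sketch, not something proposed by the maintainers here) is to retry the credential lookup itself with exponential backoff and jitter. The helper call_with_backoff is hypothetical; you would pass it google.auth.default and the exception types to retry, such as google.auth.exceptions.DefaultCredentialsError. The sleep parameter exists only so the delays can be observed in tests.

```python
import random
import time


def call_with_backoff(func, retry_on, attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep):
    """Call func(), retrying on the given exception types with exponential backoff.

    Hypothetical helper, e.g.:
        call_with_backoff(google.auth.default,
                          retry_on=(google.auth.exceptions.DefaultCredentialsError,))
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter keeps many workers from hitting the metadata server in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

This only covers detection inside google.auth.default(). A RefreshError raised later while making a request (as in the second traceback in this thread) would need the retry around that call instead.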

joshnewlinatclearobject · 2018-09-12T13:51:59Z

Will I need to push up a keyfile with my flex deployment (as @vanpelt mentioned), or will this be fixed in the PR that was just linked?

dhendry · 2018-10-01T17:50:20Z

I would love to see a fix for this issue

cpavon · 2019-02-08T11:47:24Z

Is this still open 1 year later?

mike-seekwell · 2019-06-30T13:27:25Z

@JustinBeckwith @theacodes I see #323 was merged. Should this issue be fixed now? A lot of people will likely land here when they run into this issue, so you should confirm the fix here and, if it really is a fix, tell people to upgrade (google-auth==1.6.3).

busunkim96 · 2019-08-06T20:27:35Z

@mike-seekwell Thanks for the call out! #323 merged a fix to retry the ping to the metadata server.

If you're seeing this error, please upgrade to version 1.6.3 or greater.

ghost · 2019-10-26T21:15:45Z

I'm still seeing this error fairly regularly running on GAE flex for Python 3.6 with google-auth==1.6.3. Here's the full stack trace:

  ...
  File "/env/lib/python3.6/site-packages/google/auth/transport/requests.py", line 205, in request
    self._auth_request, method, url, request_headers)
  File "/env/lib/python3.6/site-packages/google/auth/credentials.py", line 122, in before_request
    self.refresh(request)
  File "/env/lib/python3.6/site-packages/google/auth/compute_engine/credentials.py", line 102, in refresh
    six.raise_from(new_exc, caught_exc)
  File "<string>", line 3, in raise_from
google.auth.exceptions.RefreshError: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/myst-ai-crystal@appspot.gserviceaccount.com/?recursive=true (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9964d74080>: Failed to establish a new connection: [Errno 111] Connection refused',))

JustinBeckwith · 2019-10-26T21:21:42Z

Greetings! Would you mind opening a new issue? It makes tracking these discussions much easier.


…mpute_engine._metadata.get() (#398): The initial fix of issue #211 was done in CL #323, but only for .ping(). This one adds the same behaviour and tests for the .get() method, as the problem still occurs. See the issue for details. Refs: #323 Resolves: #211
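The fixes in #323 (for ping()) and #398 (for get()) amount to retrying the metadata-server request a few times before giving up. Below is a standard-library sketch of that idea, not google-auth's actual code: get_metadata, retry_count, and the injectable urlopen/sleep parameters are hypothetical. The /computeMetadata/v1/ root matches the URL in the tracebacks above, and the Metadata-Flavor: Google header is required by the metadata server.

```python
import time
import urllib.error
import urllib.request

# Hypothetical constants/helper for illustration; not google-auth's implementation.
METADATA_ROOT = "http://metadata.google.internal/computeMetadata/v1/"
METADATA_HEADERS = {"Metadata-Flavor": "Google"}  # required by the metadata server


def get_metadata(path, retry_count=5, delay=0.5,
                 urlopen=urllib.request.urlopen, sleep=time.sleep):
    """GET a metadata-server path, retrying transport-level failures.

    Connection errors (e.g. [Errno 111] Connection refused) are retried, up to
    retry_count attempts in total; an HTTP error response is raised at once.
    """
    if retry_count < 1:
        raise ValueError("retry_count must be >= 1")
    request = urllib.request.Request(METADATA_ROOT + path, headers=METADATA_HEADERS)
    last_exc = None
    for attempt in range(retry_count):
        try:
            with urlopen(request, timeout=3) as response:
                return response.read().decode("utf-8")
        except urllib.error.HTTPError:
            raise  # the server answered, so retrying the connection won't help
        except OSError as exc:  # URLError, ConnectionRefusedError, timeouts
            last_exc = exc
            if attempt < retry_count - 1:
                sleep(delay)
    raise last_exc
```

On a GCE or App Engine flex instance, get_metadata("instance/service-accounts/default/?recursive=true") would request the same resource that fails in the tracebacks in this thread. Retrying only transport errors is a design choice of this sketch.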

dhermes mentioned this issue Nov 9, 2017

DefaultCredentialsError intermittently in google-cloud-storage running in Cloud Dataflow googleapis/google-cloud-python#4358

Closed

dhermes mentioned this issue Nov 9, 2017

Using six.raise_from wherever possible. #212

Merged

theacodes mentioned this issue Feb 9, 2018

Auth error: ETIMEDOUT googleapis/nodejs-pubsub#30

Closed

stephenplusplus mentioned this issue Feb 13, 2018

Auth error: ETIMEDOUT googleapis/google-auth-library-nodejs#283

Closed

asford mentioned this issue Feb 14, 2018

google.auth.exceptions.RefreshError with excessive concurrent requests. fsspec/gcsfs#71

Open

theacodes self-assigned this Mar 1, 2018

theacodes added the bug label Mar 1, 2018

JustinBeckwith added the triage me and 🚨 labels Jun 8, 2018

tseaver added the priority: p2 and type: bug labels and removed the bug and triage me labels Jun 11, 2018

ocervell mentioned this issue Sep 8, 2018

Handle exceptions from Exporters census-instrumentation/opencensus-python#270

Closed

This was referenced Sep 8, 2018

Add OpenCensus instrumentation for tracing forseti-security/forseti-security#1926

Merged

StackdriverExporter with application default credentials crashes census-instrumentation/opencensus-python#304

Closed

JustinBeckwith added the priority: p1 label and removed the priority: p2 label Feb 11, 2019

rmbarron mentioned this issue Feb 13, 2019

Add retry to _metadata.ping() #323

Merged

busunkim96 closed this as completed Aug 6, 2019

ryandawsonuk mentioned this issue Dec 3, 2019

tensorflow sample fails to run kserve/kserve#562

Closed

DaniilAnichin mentioned this issue Dec 5, 2019

fix: add retry to google.auth.compute_engine._metadata.get() #398

Merged

bruce3557 mentioned this issue Dec 26, 2019

Cannot run pipeline samples in GCP IAP Deployment kubeflow/pipelines#2773

Closed

Nishith95 mentioned this issue Jul 22, 2021

DefaultCredentialsError after Compute Engine Metadata server failures #814

Open
