
GRPC will keep retrying (until RPC timeout) if a user uses a bad service account key #6808

Closed
dzou opened this issue Mar 5, 2020 · 32 comments · Fixed by #9102 or #10909

Comments

@dzou
Contributor

dzou commented Mar 5, 2020

What version of gRPC-Java are you using?

1.27.2

What is your environment?

Linux, JDK8

What do you see?

When a user uses an invalid service account key (like one that was deleted), it is treated as an UNAVAILABLE error. Unavailable errors are interpreted as ones that should be retried by client libraries; consequently the application will attempt to retry this operation with invalid credentials until the RPC timeout is reached (usually 10 minutes).

What do you expect to see instead?

When a user uses an invalid service account key to authenticate with GRPC, it should yield an UNAUTHENTICATED error which indicates the operation should not be retried.

Steps to reproduce the bug

This is most easily reproduced through experimenting with the client libraries.

  1. Create a service account and service account key, download it.
  2. Delete the key entry in Cloud Console (so the downloaded key is no longer valid).
  3. Authenticate with some Google service using that key (for example, Pub/Sub).

Additional details:

Suggested fix:

The code that controls this behavior is in GoogleAuthLibraryCallCredentials. I don't think all IOExceptions should be retried though.

In this case, one can see that the exception has a .getCause() which is HttpResponseException and it has .getStatusCode() == 400 which indicates a bad request. This is the error thrown if the user provides an invalid service account key.

Would it be possible to modify it so that, when the failure is an IOException, it examines the exception's getCause() and fails with UNAUTHENTICATED if the cause is an HttpResponseException with status code 400?
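
The shape of that check can be sketched as follows. This is a hypothetical illustration, not the actual grpc-java code: FakeHttpResponseException stands in for com.google.api.client.http.HttpResponseException, and the returned string stands in for the io.grpc.Status that GoogleAuthLibraryCallCredentials would really use.

```java
public class CredentialErrorMapper {
  /** Stand-in for com.google.api.client.http.HttpResponseException. */
  static class FakeHttpResponseException extends java.io.IOException {
    private final int statusCode;
    FakeHttpResponseException(int statusCode, String msg) {
      super(msg);
      this.statusCode = statusCode;
    }
    int getStatusCode() {
      return statusCode;
    }
  }

  /** Maps a credentials-refresh IOException to a gRPC status code name. */
  static String mapAuthFailure(java.io.IOException e) {
    Throwable cause = e.getCause();
    if (cause instanceof FakeHttpResponseException
        && ((FakeHttpResponseException) cause).getStatusCode() == 400) {
      // A 400 from the token endpoint (e.g. an invalid service account key)
      // is not transient, so signal a non-retryable status.
      return "UNAUTHENTICATED";
    }
    return "UNAVAILABLE"; // other I/O failures may be transient
  }
}
```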

Example of exception that you see:

th [] threw exception [Request processing failed; nested exception is com.google.api.gax.rpc.UnavailableException: io.grpc.StatusRuntimeException: UNAVAILABLE: Credentials failed to obtain metadata] with root cause

com.google.api.client.http.HttpResponseException: 400 Bad Request
{"error":"invalid_grant","error_description":"Invalid JWT Signature."}
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1113) ~[google-http-client-1.34.2.jar:1.34.2]
	at com.google.auth.oauth2.ServiceAccountCredentials.refreshAccessToken(ServiceAccountCredentials.java:441) ~[google-auth-library-oauth2-http-0.20.0.jar:na]
	at com.google.auth.oauth2.OAuth2Credentials.refresh(OAuth2Credentials.java:157) ~[google-auth-library-oauth2-http-0.20.0.jar:na]
	at com.google.auth.oauth2.OAuth2Credentials.getRequestMetadata(OAuth2Credentials.java:145) ~[google-auth-library-oauth2-http-0.20.0.jar:na]
	at com.google.auth.oauth2.ServiceAccountCredentials.getRequestMetadata(ServiceAccountCredentials.java:603) ~[google-auth-library-oauth2-http-0.20.0.jar:na]
	at com.google.auth.Credentials.blockingGetToCallback(Credentials.java:112) ~[google-auth-library-credentials-0.20.0.jar:na]
	at com.google.auth.Credentials$1.run(Credentials.java:98) ~[google-auth-library-credentials-0.20.0.jar:na]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_181-google-v7]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_181-google-v7]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[na:1.8.0_181-google-v7]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:295) ~[na:1.8.0_181-google-v7]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181-google-v7]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_181-google-v7]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181-google-v7]
@sanjaypujare
Contributor

@dzou Based on getCause() being 400/Bad Request, I think it is a stretch to infer UNAUTHENTICATED as the root cause. However, the JSON snippet in the status text does indicate an authentication issue, and that could potentially be used to infer UNAUTHENTICATED if that is what we get in all such cases.

@dzou
Contributor Author

dzou commented Mar 5, 2020

I see; forget about UNAUTHENTICATED then. However, I think 400/Bad Request should be an indication that the request should not be retried, no?

@sanjaypujare
Contributor

That's debatable. Without more information I can see a "bad request" succeeding after retrying (some kind of a race condition or simply 400/Bad Request being the incorrect code returned from the server). So one could argue that it is safe to retry a bad request.

@dzou
Contributor Author

dzou commented Mar 5, 2020

Ah, so I am interpreting 400 error codes as errors made on the caller's side. See this for what I had in mind: https://stackoverflow.com/questions/47680711/which-http-errors-should-never-trigger-an-automatic-retry

I was thinking basically that a 400-code would mean the user's request is incorrect (i.e. in this case making a request with a service account key that does not exist), and if we know the user's request is invalid then we need not retry it.

@sanjaypujare
Contributor

Consider gRPC's retry feature as part of an advanced framework wrapping an existing protocol like HTTP (HTTP2 to be more precise). Some users would expect the framework to retry even in cases of 400/Bad Request just because the error code sounds like a catch-all generic error code. But I do see where you are coming from and if many others feel the same way I guess the change you want can be made. But it will have to be more than just mapping 400/Bad Request to UNAUTHENTICATED since UNAUTHENTICATED doesn't sound right.

@sanjaypujare
Contributor

I agree that the logic of onFailure inside GoogleAuthLibraryCallCredentials can be tweaked as follows: in the case of an IOException, check if it is a com.google.api.client.http.HttpResponseException and, if the status code is 401 (UNAUTHORIZED), call applier.fail with Status.UNAUTHENTICATED, similar to the current else clause. However, that still requires a modification in google-auth-library to generate 401 instead of 400 in this case. Once that is done, we can reopen this issue and someone can open a PR for GoogleAuthLibraryCallCredentials.
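
A rough sketch of that onFailure tweak, under assumed names (FailureHandler stands in for the applier callback, FakeHttpResponseException for com.google.api.client.http.HttpResponseException, and plain strings for io.grpc.Status codes):

```java
public class OnFailureSketch {
  /** Stand-in for the applier callback in GoogleAuthLibraryCallCredentials. */
  interface FailureHandler {
    void fail(String statusName, Throwable cause);
  }

  /** Stand-in for com.google.api.client.http.HttpResponseException. */
  static class FakeHttpResponseException extends java.io.IOException {
    final int statusCode;
    FakeHttpResponseException(int statusCode) {
      super("HTTP " + statusCode);
      this.statusCode = statusCode;
    }
  }

  static void onFailure(Throwable e, FailureHandler applier) {
    if (e instanceof FakeHttpResponseException
        && ((FakeHttpResponseException) e).statusCode == 401) {
      // A 401 from the token endpoint means the credentials themselves
      // are bad, so fail with a non-retryable status.
      applier.fail("UNAUTHENTICATED", e);
    } else if (e instanceof java.io.IOException) {
      applier.fail("UNAVAILABLE", e); // possibly a transient I/O problem
    } else {
      applier.fail("UNAUTHENTICATED", e);
    }
  }
}
```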

@ejona86
Member

ejona86 commented Mar 26, 2020

@jiangtaoli2016, how do you think this should be fixed? It probably impacts all languages.

I think there are two things to discuss here

  1. Normally we auto-promote ServiceAccountCredentials to ServiceAccountJwtAccessCredentials, which avoids exchanging a JWT for an OAuth token (via plain HTTP). It appears your system was not being auto-promoted, probably because you specified scopes. If grpc-java ignored the scopes and auto-promoted in this case as well, it would have avoided your issue, because the server would have responded with UNAUTHENTICATED. C core made this change earlier, but it was not made in grpc-java.

  2. The status code to use when there is a failure getting an OAuth token. That used to be UNAUTHENTICATED but was changed to UNAVAILABLE because the OAuth exchange can fail due to normal I/O problems. The HttpResponseException exception is very awkward for gRPC to dig into, as it is part of google-http-java-client, which is an implementation detail of google-auth-library-java. It is unclear how this would be resolved; it would probably need enhancements to the google-auth-library-java API.

A hacky "fix" for (2) is easy, but I don't think it puts us on solid ground and would just bite us in the future as an implementation change in google-auth-library-java could change the status code. And we'd need to figure out an equivalent solution in each language. Spending our time on (1) seems better, and while it does side-step the issue it also provides other benefits.

@jiangtaoli2016
Contributor

The token exchange from JWT service account creds to an OAuth token is via
https://accounts.google.com/o/oauth2/token.

@ejona86 Could you point to the gRPC core change that you refer to? I think it makes sense to have all languages behave consistently.

@ejona86
Member

ejona86 commented Mar 26, 2020

@jiangtaoli2016, no, I can't point to the grpc core change. It was a long time ago and was just mentioned to me "in person." I can't even point to the current behavior in C core.

@ejona86 ejona86 added enhancement and removed bug labels Mar 27, 2020
@creamsoup creamsoup modified the milestone: Next Apr 17, 2020
@jiangtaoli2016
Contributor

Sorry for delay. I took some time to research cloud auth APIs and related gRPC codes.

When using a service account key, one can either obtain an OAuth2 access token from the Google token service (https://accounts.google.com/o/oauth2/token) and use that access token for Cloud API access, or use the service account key to create a JWT and access Cloud APIs directly. The first method is ServiceAccountCredentials in Java or GoogleRefreshTokenCredentials in C++. The second method is ServiceAccountJwtAccessCredentials in Java or ServiceAccountJWTAccessCredentials in C++. Unlike grpc-java, gRPC core and the wrapped languages do not auto-promote ServiceAccountCredentials into ServiceAccountJwtAccessCredentials. Both methods are valid and supported as of today. I don't think we could or should deprecate ServiceAccountCredentials.

In gRPC C++, GoogleRefreshTokenCredentials, AccessTokenCredentials, and StsCredentials all fetch an access token remotely and attach it to call metadata. In other words, it is quite a common scenario we have to deal with. @ejona86: I think it is better to deal with it rather than dodge the issue, as your suggestion (1) above does.

In the gRPC core implementation, the access token is fetched remotely with grpc_httpcli_post. Looking at the grpc httpcli implementation, it does not seem to retry; the token is fetched only once. gRPC core does not parse the body of the HTTP response: any status code that is not 200 is treated as an error. So unless the fetch succeeds, an I/O error, auth error, or any other token-fetch failure results in the same error message, "Error occurred when fetching oauth2 token."

Back to (2) on @ejona86 suggestion, I agree it is hacky for gRPC code to parse into HttpResponseException exception from google-auth-library-java.

My suggestion would be to do something similar to C core: if fetching the OAuth2 token fails (for whatever reason), throw a generic error such as "failed to fetch oauth2 token" and do not retry. Let the application deal with it.

@ejona86
Member

ejona86 commented Apr 21, 2020

@jiangtaoli2016, grpc-java is not retrying. It is delivering an UNAVAILABLE error to the client. But the client (properly) has retry logic that retries UNAVAILABLE errors.

@jiangtaoli2016, how do you provide scopes for the oauth exchange in the C++ API? I thought scopes are required to obtain an oauth token.

@jiangtaoli2016
Contributor

@ejona86 I checked all the C core code. The only place that uses scopes is StsTokenFetcherCredentials.

@ejona86
Member

ejona86 commented Jul 21, 2020

@jiangtaoli2016, given that gRPC is not retrying and instead the application is retrying because gRPC returned UNAVAILABLE, are you suggesting gRPC continue returning UNAVAILABLE and just accept that the application will retry? The application will not have a good way of determining that the HTTP failure was a 400 Bad Request.

@jiangtaoli2016
Contributor

jiangtaoli2016 commented Jul 21, 2020

gRPC returning UNAVAILABLE is a fine choice for this particular use case. Downloading a service account key is a bad choice, posing security risks to customers and to Google. Customers should try to use alternative approaches, such as Application Default Credentials (ADC) or service account impersonation.

@dzou
Contributor Author

dzou commented Aug 31, 2020

Hello, so to clarify, the position here is that this is working as intended?

I can accept this outcome because I don't feel strongly about it, but I would say there are still many legitimate use cases for downloading service account keys, such as using them in CI for integration tests against GCP services. From my experience with user reports, this issue still comes up relatively often (though it is not very common).

@meltsufin

@jiangtaoli2016 @ejona86 My understanding is that the client libraries (GAPIC-based) retry automatically based on whether the code gRPC returns is "retryable". UNAVAILABLE is clearly an error that should be retried. However, invalid credentials should not be retried.

Regarding the use of service account keys, agreed, it's not the best choice from a security point of view when other options like ADC or impersonation are available. However, there are certain APIs like Cloud Translation that do not work with ADC and explicitly recommend downloading and using service account keys.

@ejona86
Member

ejona86 commented Jan 13, 2021

@jiangtaoli2016, what was the verdict here? Is there something to be changed, or is this working as intended and appropriate?

@jiangtaoli2016
Contributor

I would treat it as working as intended.

@meltsufin Have you tried using the Secure Token Service? That is a better alternative than a service account key.

@meltsufin

@jiangtaoli2016 I haven't used Secure Token Service, and I haven't seen it documented on product pages of any of the libraries. The issue is really not for any particular use case I might have, but for the users of our client libraries. The use of service account keys is recommended all over our documentation. Is this changing?

@ejona86
Member

ejona86 commented Jan 14, 2021

@jiangtaoli2016, STS appears to be beta.

@ejona86
Member

ejona86 commented Mar 25, 2021

I just marked #8003 as a duplicate, but it nicely laid out how the components fit together. It also referenced googleapis/gax-java#965.

@ejona86 ejona86 added this to the Next milestone Sep 7, 2021
@ejona86
Member

ejona86 commented Sep 7, 2021

A design is in progress in google-auth-library-java to give gRPC the information it needs to choose the Status code appropriately (retriable vs. non-retriable). The grpc-java changes are expected to be small. I'm very glad there's been recent movement here.

@Gwen2

Gwen2 commented Mar 31, 2022

Is there any information about progress?

@ejona86
Member

ejona86 commented Mar 31, 2022

googleapis/google-auth-library-java#750 provides a method to distinguish between the two cases. gRPC will need to be updated to consume the new Retryable interface:

  • Handle the failure as UNAVAILABLE when the auth library returns an exception that implements the Retryable interface and Retryable.isRetryable() returns true.
  • Handle it as UNAUTHENTICATED otherwise.
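
The mapping described above can be sketched minimally as follows. This is an illustration only: the nested Retryable interface mirrors the shape of com.google.auth.Retryable, AuthException is a made-up example exception, and status-code names stand in for io.grpc.Status.

```java
public class RetryableMapping {
  /** Mirrors the shape of com.google.auth.Retryable. */
  interface Retryable {
    boolean isRetryable();
  }

  /** Example auth exception carrying a retryability flag. */
  static class AuthException extends RuntimeException implements Retryable {
    private final boolean retryable;
    AuthException(boolean retryable) {
      this.retryable = retryable;
    }
    @Override
    public boolean isRetryable() {
      return retryable;
    }
  }

  /** Chooses the gRPC status name for a failed credentials fetch. */
  static String statusFor(Throwable t) {
    if (t instanceof Retryable && ((Retryable) t).isRetryable()) {
      return "UNAVAILABLE"; // transient failure: clients may retry
    }
    return "UNAUTHENTICATED"; // bad credentials: clients should not retry
  }
}
```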

ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 22, 2022
Retryable was added in google-auth-library 1.5.3 to address situations
like grpc#6808. Bump from that version and swap away from the imprecise
IOException heuristic. go/auth-correct-retry

Fixes grpc#6808
ejona86 added a commit to ejona86/grpc-java that referenced this issue Apr 22, 2022
Retryable was added in google-auth-library 1.5.3 to make clear the
situations that deserve a retry of the RPC. Bump to that version and
swap away from the imprecise IOException heuristic.
go/auth-correct-retry

Fixes grpc#6808
ejona86 added a commit that referenced this issue Apr 26, 2022
Retryable was added in google-auth-library 1.5.3 to make clear the
situations that deserve a retry of the RPC. Bump to that version and
swap away from the imprecise IOException heuristic.
go/auth-correct-retry

Fixes #6808
@ejona86 ejona86 reopened this May 16, 2022
@ejona86
Member

ejona86 commented May 16, 2022

#9102 was reverted in #9118

@ejona86 ejona86 self-assigned this Sep 9, 2022
ejona86 added a commit to ejona86/grpc-java that referenced this issue Feb 9, 2024
Retryable was added in google-auth-library 1.5.3 to make clear the
situations that deserve a retry of the RPC. Upgrading to that caused
problems because of transitive dependency issues syncing into Google so
it was reverted in 369f87b. google-auth-library 1.11.0 changed the
approach to avoid the transitive dependency updates. cl/601545581
upgraded to 1.22.0 inside Google. Bump to that version and swap away
from the imprecise IOException heuristic. go/auth-correct-retry

Fixes grpc#6808
ejona86 added a commit that referenced this issue Feb 9, 2024
Retryable was added in google-auth-library 1.5.3 to make clear the
situations that deserve a retry of the RPC. Upgrading to that caused
problems because of transitive dependency issues syncing into Google so
it was reverted in 369f87b. google-auth-library 1.11.0 changed the
approach to avoid the transitive dependency updates. cl/601545581
upgraded to 1.22.0 inside Google. Bump to that version and swap away
from the imprecise IOException heuristic. go/auth-correct-retry

Fixes #6808
@ejona86 ejona86 modified the milestones: Next, 1.63 Feb 9, 2024
@jacob-tolar-bridg

jacob-tolar-bridg commented Apr 16, 2024

@ejona86, the commit 372a535 from #10909 on master is tagged 1.63 and this issue is closed in the 1.63 milestone.

But looking at the 1.63 tarball or the released jar, the code changes from that commit aren't there.

I see now that this commit was partially reverted on the 1.63.x branch: #11056

Will this issue be reopened?

@ejona86 ejona86 reopened this Apr 17, 2024
@ejona86 ejona86 modified the milestones: 1.63, 1.64 Apr 17, 2024
@ejona86
Member

ejona86 commented Apr 17, 2024

This is still on master, so it sorta could be considered fixed, but I think it is still unsettled whether we can make the change now.

@temawi temawi modified the milestones: 1.64, 1.65 May 2, 2024
@ejona86 ejona86 modified the milestones: 1.65, 1.64 Jun 6, 2024
@ejona86 ejona86 closed this as completed Jun 6, 2024
@ejona86
Member

ejona86 commented Jun 6, 2024

It seems the fix went out in 1.64.
