When GCSFuse uses gRPC under heavy load, the OAuth token fetched to start an RPC can expire before the RPC reaches the server, even with the default early-expiry window of 10 seconds. 10 seconds is a pretty long time! But apparently it is not always long enough when lots of CPU work is in progress or lots of RPCs are in flight.
We've only reproduced on GCE using the default service account's OAuth token, with gRPC. I think it's fine to scope a fix to that path.
I think the practical thing to do is extend the hardcoded early-expiry window from the 10s default to 1 minute (or make it a flag). I'm inclined to just extend it without a flag: these specific OAuth tokens appear to be valid for 1 hour, so the difference between a refresh every ~60 minutes and every ~59 minutes isn't a big deal.
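To make the trade-off concrete, here is a minimal sketch of the early-expiry check, assuming a token is refreshed once its remaining lifetime drops below the window. The constant and function names are illustrative and not taken from the GCSFuse source; the 1 hour lifetime is the value observed above, not a guarantee.

```go
package main

import "time"

// Values from the discussion above. The names are hypothetical.
const (
	tokenLifetime       = time.Hour        // observed lifetime of these OAuth tokens
	defaultEarlyExpiry  = 10 * time.Second // current hardcoded early-expiry window
	proposedEarlyExpiry = time.Minute      // proposed replacement
)

// needsRefresh reports whether a token with the given remaining lifetime
// should be treated as expired under the given early-expiry window.
func needsRefresh(remaining, earlyExpiry time.Duration) bool {
	return remaining < earlyExpiry
}

// usableFor returns how long a freshly fetched token is used before the
// early-expiry window forces a refresh.
func usableFor(lifetime, earlyExpiry time.Duration) time.Duration {
	return lifetime - earlyExpiry
}
```

With these values, a token that has 30 seconds left is still attached to new RPCs under the 10s window but is refreshed under the 1 minute window, while the refresh interval only shrinks from 59m50s to 59m.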
Any early-expiry window can be too short under extreme load. It'd be nice to retry with a refreshed token if we see an expired-token error. Unfortunately, the error for an expired OAuth token is the same as the error for an invalid OAuth token, so we'd have to retry ~every OAuth error at least once. That seems a little chattier than necessary but might be worth exploring.
Ah, it looks like in #909 we decided that retrying 401 is ok for GCSFuse over HTTP. We can retry UNAUTHENTICATED too, since that is ~the equivalent for gRPC.
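A rough sketch of what a single retry on UNAUTHENTICATED could look like, assuming the token source can be asked to force a refresh. Everything here is hypothetical: `errUnauthenticated` stands in for the gRPC UNAUTHENTICATED status (or HTTP 401), and `fetch` and `call` are placeholders rather than GCSFuse or gRPC APIs.

```go
package main

import "errors"

// errUnauthenticated is a stand-in for gRPC UNAUTHENTICATED / HTTP 401.
var errUnauthenticated = errors.New("unauthenticated")

// callWithTokenRetry runs call with a token from fetch. If the call fails
// with an authentication error, it forces a token refresh and retries
// exactly once, since an expired token cannot be told apart from an
// invalid one.
func callWithTokenRetry(
	fetch func(forceRefresh bool) (string, error),
	call func(token string) error,
) error {
	token, err := fetch(false)
	if err != nil {
		return err
	}
	err = call(token)
	if !errors.Is(err, errUnauthenticated) {
		// Success, or an error unrelated to authentication.
		return err
	}
	// The token may have expired in flight: refresh and try once more.
	token, err = fetch(true)
	if err != nil {
		return err
	}
	return call(token)
}
```

Capping the retry at one attempt keeps the extra traffic bounded: a genuinely invalid token costs one additional RPC before the error is surfaced.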
@gargnitingoogle FYI.