fix: truncate RPC timeouts to time remaining in totalTimeout #1191

noahdietz · 2020-09-23T22:22:40Z

The RetrySettings documentation states that an RPC timeout must not allow an in-flight RPC to exceed the totalTimeout. That is, if a calculated RPC timeout is greater than the time remaining in the totalTimeout (measured from the start of the first attempt), the RPC timeout should be set to that time remaining. This prevents an RPC from being sent with a timeout that would allow it to execute beyond the time allotted by the totalTimeout.

This was not being done in gax-java, even though the documentation stated it and several other language implementations of GAX already follow that same documentation.

This PR corrects that.
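
To make the rule above concrete, here is a minimal sketch of the truncation. The class `RpcTimeoutTruncation` and the method `truncate` are hypothetical names chosen for illustration, not the gax-java API. Subtracting the upcoming retry delay from the time remaining follows the diff hunk quoted in this PR's review.

```java
import java.time.Duration;

/** Hypothetical helper illustrating the truncation rule; not the gax-java implementation. */
final class RpcTimeoutTruncation {
  private RpcTimeoutTruncation() {}

  /**
   * Caps a calculated RPC timeout at the time remaining in the totalTimeout.
   *
   * @param calculatedRpcTimeout timeout produced by the backoff calculation
   * @param totalTimeout overall budget, measured from the start of the first attempt
   * @param elapsedSinceFirstAttempt time already spent since the first attempt started
   * @param retryDelay delay that will pass before the next attempt is sent
   */
  static Duration truncate(
      Duration calculatedRpcTimeout,
      Duration totalTimeout,
      Duration elapsedSinceFirstAttempt,
      Duration retryDelay) {
    // Time left once the elapsed time and the upcoming delay are accounted for.
    Duration timeLeft = totalTimeout.minus(elapsedSinceFirstAttempt).minus(retryDelay);
    // Use the smaller value; it can be zero or negative when the budget is already spent.
    return calculatedRpcTimeout.compareTo(timeLeft) <= 0 ? calculatedRpcTimeout : timeLeft;
  }
}
```

For example, with a 5 s totalTimeout, 4.5 s elapsed, and a 100 ms delay, a calculated timeout of 800 ms would be truncated to 400 ms in this sketch, while a 200 ms timeout calculated 1 s into the budget would be left unchanged.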

codecov · 2020-09-23T22:30:19Z

Codecov Report

Merging #1191 into master will increase coverage by 20.14%.
The diff coverage is 100.00%.

@@              Coverage Diff              @@
##             master    #1191       +/-   ##
=============================================
+ Coverage     58.91%   79.06%   +20.14%     
- Complexity      115     1197     +1082     
=============================================
  Files            20      205      +185     
  Lines           589     5268     +4679     
  Branches         60      435      +375     
=============================================
+ Hits            347     4165     +3818     
- Misses          213      930      +717     
- Partials         29      173      +144

Impacted Files	Coverage Δ	Complexity Δ
...le/api/gax/retrying/ExponentialRetryAlgorithm.java	`96.36% <100.00%> (ø)`	`14.00 <0.00> (?)`
...ava/com/google/api/gax/rpc/AsyncTaskException.java	`100.00% <0.00%> (ø)`	`1.00% <0.00%> (?%)`
...m/google/api/gax/retrying/NoopRetryingContext.java	`0.00% <0.00%> (ø)`	`0.00% <0.00%> (?%)`
...java/com/google/api/gax/rpc/PagedCallSettings.java	`90.00% <0.00%> (ø)`	`4.00% <0.00%> (?%)`
...ax/longrunning/OperationResponsePollAlgorithm.java	`66.66% <0.00%> (ø)`	`5.00% <0.00%> (?%)`
.../java/com/google/api/gax/rpc/RetryingCallable.java	`91.66% <0.00%> (ø)`	`2.00% <0.00%> (?%)`
.../java/com/google/api/gax/grpc/GrpcCallContext.java	`85.60% <0.00%> (ø)`	`48.00% <0.00%> (?%)`
...m/google/api/gax/tracing/NoopApiTracerFactory.java	`66.66% <0.00%> (ø)`	`2.00% <0.00%> (?%)`
...oogle/api/gax/retrying/DirectRetryingExecutor.java	`78.94% <0.00%> (ø)`	`6.00% <0.00%> (?%)`
...e/api/gax/rpc/RetryingServerStreamingCallable.java	`0.00% <0.00%> (ø)`	`0.00% <0.00%> (?%)`
... and 176 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 287cada...dcbf828. Read the comment docs.

vam-google

It looks good, but I'm concerned about the possibility of breaking the LRO portion.

vam-google · 2020-09-23T23:23:44Z

gax/src/main/java/com/google/api/gax/retrying/ExponentialRetryAlgorithm.java

+          Duration.ofNanos(clock.nanoTime())
+              .minus(Duration.ofNanos(prevSettings.getFirstAttemptStartTimeNanos()));
+      Duration timeLeft = globalSettings.getTotalTimeout().minus(timeElapsed).minus(randomDelay);
+      newRpcTimeout = Math.min(newRpcTimeout, timeLeft.toMillis());


What would happen if newRpcTimeout <= 0?

If newRpcTimeout <= 0, the subsequent check by shouldRetry would prevent another attempt from being made, because it performs a similar check to see whether the current time plus the proposed random delay would exceed the totalTimeout:

    long totalTimeSpentNanos =
        clock.nanoTime()
            - nextAttemptSettings.getFirstAttemptStartTimeNanos()
            + nextAttemptSettings.getRandomizedRetryDelay().toNanos();

    // If totalTimeout limit is defined, check that it hasn't been crossed
    if (totalTimeout > 0 && totalTimeSpentNanos > totalTimeout) {
      return false;
    }

We could add some logic to handle if (timeLeft.isNegative() || timeLeft.isZero()), but I'm not sure what we'd set it to...maybe allow newRpcTimeout to remain unchanged by timeLeft?

After our offline conversation, I've decided to just document the likely behavior. LMK if you feel strongly otherwise.
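
The relationship discussed in this thread can be sketched in isolation. The class `RetryBudgetSketch` and its two methods are hypothetical stand-ins written for illustration; they mirror the quoted condition and the `timeLeft` calculation from the diff, but they are not the gax-java code and take the elapsed time as a plain argument instead of reading a clock.

```java
import java.time.Duration;

/** Hypothetical stand-in for the two checks discussed above; not gax-java code. */
final class RetryBudgetSketch {
  private RetryBudgetSketch() {}

  /** Time left in the total timeout after the elapsed time and the next retry delay. */
  static Duration timeLeft(
      Duration totalTimeout, long elapsedNanos, Duration randomizedRetryDelay) {
    return totalTimeout.minusNanos(elapsedNanos).minus(randomizedRetryDelay);
  }

  /** Mirrors the quoted condition: returns false once the total timeout has been crossed. */
  static boolean shouldRetry(
      Duration totalTimeout, long elapsedNanos, Duration randomizedRetryDelay) {
    long totalTimeoutNanos = totalTimeout.toNanos();
    long totalTimeSpentNanos = elapsedNanos + randomizedRetryDelay.toNanos();
    // A totalTimeout of zero is treated as "no limit defined", as in the quoted condition.
    if (totalTimeoutNanos > 0 && totalTimeSpentNanos > totalTimeoutNanos) {
      return false;
    }
    return true;
  }
}
```

In this sketch, whenever `timeLeft` is negative the same inputs make `shouldRetry` return false, so a negative RPC timeout would never be used for an attempt. Because the comparison is strict, a time left of exactly zero is not rejected here; that boundary is an observation about the sketch only.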

vam-google · 2020-09-23T23:29:33Z

gax/src/test/java/com/google/api/gax/rpc/OperationCallableImplTest.java

                .setInitialRpcTimeout(Duration.ofMillis(100))
                .setMaxRpcTimeout(Duration.ofSeconds(1))
                .setRpcTimeoutMultiplier(2)
+                .setTotalTimeout(Duration.ofSeconds(5L))


What is the meaning of totalTimeout in the case of LRO? Note that the exponential retry algorithm is shared between retries and LRO polling logic. Whatever we do for retries, we must ensure that it does not break LRO. Ideally we want the new logic to be completely disabled for LRO. I.e., the need to modify LRO tests indicates that there is a breaking behavioral change for LRO, which we should avoid.

I had the same thought and I was confused by what this test was doing.

The test setup actually sets a default totalTimeout of 5 ms, but this test then increases the initialRpcTimeout and maxRpcTimeout to values much greater than that existing totalTimeout of 5 ms. Per comments in a generated client, the initialRpcTimeout and maxRpcTimeout should be ignored and set to 0 for LROs? Not really sure what the test is doing here.

FWIW we had no GAPIC config for LRO polling initialRpcTimeout or maxRpcTimeout, so they would never be generated and only ever set by users...and they'd have the same behavior as non-LRO RPCs where a poll could run over the totalTimeout

I guess this test was testing for the existing incorrect behavior, where the RPC timeout didn't care if it exceeded the totalTimeout

The totalTimeout in the context of LRO is the "total polling timeout" (gapic config). So, the duration a "synchronous" LRO call should poll before giving up.

This test was set up weirdly, as I described. PTAL.
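
To illustrate why the totalTimeout matters for these test settings, here is a rough simulation of the RPC timeouts that would be assigned under truncation. The class `PollTimeoutSchedule` is hypothetical. The sketch assumes each attempt consumes its full timeout, that there is no delay between attempts, and that the cap applies to every attempt including the first; none of these assumptions is confirmed by this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of truncated RPC timeouts; not the gax-java algorithm. */
final class PollTimeoutSchedule {
  private PollTimeoutSchedule() {}

  /** Returns the timeout (in ms) given to each attempt until the total budget is used up. */
  static List<Long> rpcTimeoutsMillis(
      long initialRpcTimeout, double multiplier, long maxRpcTimeout, long totalTimeout) {
    if (initialRpcTimeout <= 0 || maxRpcTimeout <= 0 || multiplier < 1.0) {
      throw new IllegalArgumentException("timeouts must be positive and multiplier >= 1");
    }
    List<Long> timeouts = new ArrayList<>();
    long elapsed = 0; // assumption: every attempt runs for its entire timeout
    long next = initialRpcTimeout;
    while (totalTimeout - elapsed > 0) {
      long timeLeft = totalTimeout - elapsed;
      // Truncate the calculated timeout to the time remaining in the total budget.
      long truncated = Math.min(next, timeLeft);
      timeouts.add(truncated);
      elapsed += truncated;
      // Exponential growth, capped at the maximum RPC timeout.
      next = Math.min((long) (next * multiplier), maxRpcTimeout);
    }
    return timeouts;
  }
}
```

Under these assumptions, the settings in the diff (100 ms initial, multiplier 2, 1 s max, 5 s total) yield timeouts of 100, 200, 400, 800, 1000, 1000, 1000, and finally 500 ms. With the 5 ms default totalTimeout mentioned above, the very first timeout would already be cut to 5 ms.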

miraleung · 2020-09-25T00:05:09Z

Would like others to approve this as well.

vam-google

LGTM

gax/src/test/java/com/google/api/gax/retrying/ExponentialRetryAlgorithmTest.java

noahdietz · 2020-09-28T21:18:00Z

For the record: this change will cause this Java Showcase test in gapic-generator I added recently to fail. The test verifies some of the emergent behavior we are fixing with this PR, so I will fix that test with the dependency update this rolls out in.

fix: truncate RPC timeouts to time remaining

42a113f

noahdietz requested a review from a team as a code owner September 23, 2020 22:22

google-cla bot added the cla: yes label Sep 23, 2020

noahdietz requested review from vam-google, igorbernstein2, chingor13 and miraleung and removed request for a team September 23, 2020 22:23

vam-google reviewed Sep 23, 2020

improve rpc timeout calc comments

9aea1ba

stephaniewang526 added the kokoro:force-run label Sep 24, 2020

miraleung approved these changes Sep 25, 2020

noahdietz requested review from miraleung and vam-google September 25, 2020 15:56

igorbernstein2 approved these changes Sep 25, 2020

miraleung approved these changes Sep 25, 2020

noahdietz added 2 commits September 25, 2020 13:12

comment time left <= 0 rpcTImeout behavior

10f9f84

comment shouldRetry polling relationship

cb8a7d5

noahdietz added kokoro:force-run and removed kokoro:force-run labels Sep 28, 2020

vam-google approved these changes Sep 28, 2020

gax/src/test/java/com/google/api/gax/retrying/ExponentialRetryAlgorithmTest.java (outdated)

add more commentary to polling test for clarity

dcbf828

chingor13 approved these changes Sep 28, 2020

noahdietz merged commit 1d0c940 into googleapis:master Sep 28, 2020

noahdietz deleted the truncate-timeouts branch September 28, 2020 21:20

noahdietz mentioned this pull request Sep 28, 2020

chore: fix showcase timeout-backoff test googleapis/gapic-generator#3283

Merged

igorbernstein2 mentioned this pull request Jun 18, 2021

feat: make stream wait timeout a first class citizen [WIP] #1409

Closed
