This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

fix: truncate RPC timeouts to time remaining in totalTimeout #1191

Merged (5 commits, Sep 28, 2020)

Conversation

noahdietz
Contributor

The RetrySettings documentation indicates that an RPC timeout should never allow an in-flight RPC to exceed the totalTimeout. That is, if a calculated RPC timeout is greater than the time remaining in the totalTimeout (measured from the start of the first attempt), the RPC timeout should be truncated to that remaining time. This prevents an RPC from being sent with a timeout that would let it execute beyond the time allotted by the totalTimeout.
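The truncation rule described above can be sketched as follows. This is an illustrative, self-contained example; the class and method names are hypothetical and not the actual gax-java API.

```java
import java.time.Duration;

/** Illustrative sketch of RPC-timeout truncation; not the real gax-java implementation. */
public class TimeoutTruncation {
    /**
     * Truncate a calculated RPC timeout so that an in-flight attempt
     * cannot run past the overall totalTimeout budget.
     */
    static Duration truncate(Duration calculatedRpcTimeout, Duration totalTimeout, Duration elapsed) {
        Duration remaining = totalTimeout.minus(elapsed);
        if (remaining.isNegative()) {
            remaining = Duration.ZERO; // budget already exhausted
        }
        // Use the smaller of the calculated timeout and the remaining budget.
        return calculatedRpcTimeout.compareTo(remaining) <= 0 ? calculatedRpcTimeout : remaining;
    }

    public static void main(String[] args) {
        // 4s of a 5s totalTimeout have elapsed; a calculated 2s timeout is cut to 1s.
        System.out.println(truncate(Duration.ofSeconds(2), Duration.ofSeconds(5), Duration.ofSeconds(4)));
    }
}
```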

gax-java was not doing this, even though the documentation states it and several other language implementations of GAX already match that documented behavior.

This PR corrects that.

@noahdietz noahdietz requested a review from a team as a code owner September 23, 2020 22:22
@google-cla google-cla bot added the cla: yes This human has signed the Contributor License Agreement. label Sep 23, 2020
@noahdietz noahdietz requested review from vam-google, igorbernstein2, chingor13 and miraleung and removed request for a team September 23, 2020 22:23

codecov bot commented Sep 23, 2020

Codecov Report

Merging #1191 into master will increase coverage by 20.14%.
The diff coverage is 100.00%.


@@              Coverage Diff              @@
##             master    #1191       +/-   ##
=============================================
+ Coverage     58.91%   79.06%   +20.14%     
- Complexity      115     1197     +1082     
=============================================
  Files            20      205      +185     
  Lines           589     5268     +4679     
  Branches         60      435      +375     
=============================================
+ Hits            347     4165     +3818     
- Misses          213      930      +717     
- Partials         29      173      +144     
Impacted Files | Coverage Δ | Complexity Δ
...le/api/gax/retrying/ExponentialRetryAlgorithm.java 96.36% <100.00%> (ø) 14.00 <0.00> (?)
...ava/com/google/api/gax/rpc/AsyncTaskException.java 100.00% <0.00%> (ø) 1.00% <0.00%> (?%)
...m/google/api/gax/retrying/NoopRetryingContext.java 0.00% <0.00%> (ø) 0.00% <0.00%> (?%)
...java/com/google/api/gax/rpc/PagedCallSettings.java 90.00% <0.00%> (ø) 4.00% <0.00%> (?%)
...ax/longrunning/OperationResponsePollAlgorithm.java 66.66% <0.00%> (ø) 5.00% <0.00%> (?%)
.../java/com/google/api/gax/rpc/RetryingCallable.java 91.66% <0.00%> (ø) 2.00% <0.00%> (?%)
.../java/com/google/api/gax/grpc/GrpcCallContext.java 85.60% <0.00%> (ø) 48.00% <0.00%> (?%)
...m/google/api/gax/tracing/NoopApiTracerFactory.java 66.66% <0.00%> (ø) 2.00% <0.00%> (?%)
...oogle/api/gax/retrying/DirectRetryingExecutor.java 78.94% <0.00%> (ø) 6.00% <0.00%> (?%)
...e/api/gax/rpc/RetryingServerStreamingCallable.java 0.00% <0.00%> (ø) 0.00% <0.00%> (?%)
... and 176 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 287cada...dcbf828.

Contributor

@vam-google vam-google left a comment


It looks good, but I'm concerned by the possibility of breaking LRO portion.

Duration timeElapsed =
    Duration.ofNanos(clock.nanoTime())
        .minus(Duration.ofNanos(prevSettings.getFirstAttemptStartTimeNanos()));
Duration timeLeft = globalSettings.getTotalTimeout().minus(timeElapsed).minus(randomDelay);
newRpcTimeout = Math.min(newRpcTimeout, timeLeft.toMillis());
Contributor


What would happen if newRpcTimeout <= 0?

Contributor Author

@noahdietz noahdietz Sep 23, 2020


If newRpcTimeout <= 0, the subsequent check in shouldRetry would prevent another attempt from being made, because it does a similar check to see whether the current time plus the proposed random delay would exceed the totalTimeout:

long totalTimeSpentNanos =
        clock.nanoTime()
            - nextAttemptSettings.getFirstAttemptStartTimeNanos()
            + nextAttemptSettings.getRandomizedRetryDelay().toNanos();

// If totalTimeout limit is defined, check that it hasn't been crossed
if (totalTimeout > 0 && totalTimeSpentNanos > totalTimeout) {
   return false;
}

We could add some logic to handle if (timeLeft.isNegative() || timeLeft.isZero()), but I'm not sure what we'd set it to...maybe allow newRpcTimeout to remain unchanged by timeLeft?
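For reference, the guard in the quoted shouldRetry snippet can be made runnable roughly as below. The class and parameter names are illustrative, not the actual gax-java code; it shows why a non-positive newRpcTimeout implies the next attempt is rejected (the time spent already exceeds the budget).

```java
/** Illustrative sketch of the totalTimeout guard in shouldRetry; not the real gax-java code. */
public class RetryGuard {
    static boolean shouldRetry(long nowNanos, long firstAttemptStartNanos,
                               long randomizedRetryDelayNanos, long totalTimeoutNanos) {
        // Time spent so far plus the proposed delay before the next attempt.
        long totalTimeSpentNanos = nowNanos - firstAttemptStartNanos + randomizedRetryDelayNanos;

        // If a totalTimeout limit is defined, stop once it would be crossed.
        if (totalTimeoutNanos > 0 && totalTimeSpentNanos > totalTimeoutNanos) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // 6s spent against a 5s totalTimeout: no further attempt.
        System.out.println(RetryGuard.shouldRetry(6_000_000_000L, 0L, 0L, 5_000_000_000L));
    }
}
```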

Contributor Author


After our offline conversation, I've decided to just document the likely behavior. LMK if you feel strongly otherwise.

.setInitialRpcTimeout(Duration.ofMillis(100))
.setMaxRpcTimeout(Duration.ofSeconds(1))
.setRpcTimeoutMultiplier(2)
.setTotalTimeout(Duration.ofSeconds(5L))
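With the settings quoted above (initial 100 ms, multiplier 2, max 1 s, total 5 s), the per-attempt timeouts grow as 100, 200, 400, 800 ms and then cap at 1 s; under this fix each one is additionally truncated to the time remaining in the 5 s budget. A rough sketch of that schedule (illustrative code, pessimistically assuming each attempt uses its full timeout; not the ExponentialRetryAlgorithm implementation):

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative timeout schedule; not the real ExponentialRetryAlgorithm. */
public class TimeoutSchedule {
    static List<Long> scheduleMillis(long initialMs, double multiplier, long maxMs, long totalMs) {
        List<Long> timeouts = new ArrayList<>();
        long elapsed = 0;
        long next = initialMs;
        while (elapsed < totalMs) {
            // Cap by maxRpcTimeout, then truncate to the remaining total budget.
            long capped = Math.min(next, maxMs);
            long truncated = Math.min(capped, totalMs - elapsed);
            timeouts.add(truncated);
            // Pessimistic assumption: the attempt consumes its whole timeout.
            elapsed += truncated;
            next = (long) (next * multiplier);
        }
        return timeouts;
    }

    public static void main(String[] args) {
        // For the settings above, the final attempt's timeout is truncated to 500 ms.
        System.out.println(scheduleMillis(100, 2, 1000, 5000));
    }
}
```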
Contributor


What is the meaning of totalTimeout in the case of LRO? Note that the exponential retry algorithm is shared between retries and LRO polling logic. Whatever we do for retries, we must ensure it does not break LRO. Ideally, we want the new logic to be completely disabled for LRO. That is, the need to modify LRO tests indicates there is a breaking behavioral change for LRO, which we should avoid.

Contributor Author


I had the same thought and I was confused by what this test was doing.

The test setup actually defaults totalTimeout to 5 ms, but this test then increases initialRpcTimeout and maxRpcTimeout to values much greater than that 5 ms totalTimeout. Per comments in a generated client, shouldn't initialRpcTimeout and maxRpcTimeout be ignored and set to 0 for LROs? Not really sure what the test is doing here.

Contributor Author


FWIW, we had no GAPIC config for LRO polling initialRpcTimeout or maxRpcTimeout, so they would never be generated and were only ever set by users...and they'd have the same behavior as non-LRO RPCs, where a poll could run over the totalTimeout.

Contributor Author


I guess this test was testing for the existing incorrect behavior, where the RPC timeout didn't care whether it exceeded the totalTimeout.

Contributor Author


The totalTimeout in the context of LRO is the "total polling timeout" (gapic config). So, the duration a "synchronous" LRO call should poll before giving up.

This test was set up weirdly, as I described. PTAL.

@stephaniewang526 stephaniewang526 added the kokoro:force-run Add this label to force Kokoro to re-run the tests. label Sep 24, 2020
@miraleung
Contributor

Would like others to approve this as well.

@noahdietz noahdietz added kokoro:force-run Add this label to force Kokoro to re-run the tests. and removed kokoro:force-run Add this label to force Kokoro to re-run the tests. labels Sep 28, 2020
Contributor

@vam-google vam-google left a comment


LGTM

@noahdietz
Contributor Author

noahdietz commented Sep 28, 2020

For the record: this change will cause this Java Showcase test in gapic-generator I added recently to fail. The test verifies some of the emergent behavior we are fixing with this PR, so I will fix that test with the dependency update this rolls out in.

6 participants