feat: add Batcher#close(timeout) and Batcher#cancelOutstanding #3141

Merged
merged 13 commits into from
Sep 9, 2024


igorbernstein2
Contributor

@igorbernstein2 igorbernstein2 commented Aug 29, 2024

There have been reports of batcher.close() hanging every once in a while. Currently this is impossible to debug because we don't expose any internal state to analyze.

This PR adds 2 additional methods that should help in diagnosing issues:

  1. close(timeout) will try to close the batcher, but if the underlying batch operations do not finish in time, the exception message will describe the state of those operations in detail, as provided by feat: add toString to futures returned by operations #3140
  2. cancelOutstanding allows for remediation when close(timeout) throws an exception.

The intended use case is the Dataflow connector's FinishBundle:

try {
  batcher.close(Duration.ofMinutes(1));
} catch (TimeoutException e) {
  // log details on why the batch failed to close, with the help of #3140
  logger.error(e);
  batcher.cancelOutstanding();
  batcher.close(Duration.ofMinutes(1));
}
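
To make the control flow of the snippet above concrete, here is a minimal, self-contained sketch. It does not use gax: `TimedCloseable` is a hypothetical stand-in that only mirrors the two methods named in this PR, `StuckBatcher` is a made-up fake, and `java.time.Duration` is used in place of the `org.threeten.bp.Duration` that appears in the compatibility check output, purely so the example has no external dependencies.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for the two methods added in this PR; NOT the gax Batcher interface. */
interface TimedCloseable {
  void close(Duration timeout) throws InterruptedException, TimeoutException;

  void cancelOutstanding();
}

/** Made-up fake: close() keeps timing out until cancelOutstanding() has been called. */
final class StuckBatcher implements TimedCloseable {
  private boolean cancelled = false;
  int closeCalls = 0;

  @Override
  public void close(Duration timeout) throws TimeoutException {
    closeCalls++;
    if (!cancelled) {
      throw new TimeoutException("Timed out trying to close batcher after " + timeout + ".");
    }
  }

  @Override
  public void cancelOutstanding() {
    cancelled = true;
  }
}

final class CloseWithRemediation {
  /**
   * Tries a bounded close; on timeout, logs the diagnostic message, cancels outstanding work and
   * closes again. Returns true if the first close succeeded, false if remediation was needed.
   */
  static boolean closeOrCancel(TimedCloseable batcher, Duration timeout)
      throws InterruptedException, TimeoutException {
    try {
      batcher.close(timeout);
      return true;
    } catch (TimeoutException e) {
      // The message is where the state of the outstanding operations would show up.
      System.err.println("close timed out: " + e.getMessage());
      batcher.cancelOutstanding();
      batcher.close(timeout); // a second timeout propagates to the caller
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    StuckBatcher batcher = new StuckBatcher();
    boolean cleanClose = closeOrCancel(batcher, Duration.ofSeconds(1));
    System.out.println("clean close: " + cleanClose + ", close calls: " + batcher.closeCalls);
  }
}
```

The second close() is deliberately left unguarded: if it also times out, the caller sees the exception rather than having it swallowed.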

Example exception message:

Exception in thread "main" com.google.api.gax.batching.BatchingException: Timed out trying to close batcher after PT1S. Batch request prototype: com.google.cloud.bigtable.data.v2.models.BulkMutation@2bac9ba. Outstanding batches: Batch{operation=CallbackChainRetryingFuture{super=null, latestCompletedAttemptResult=ImmediateFailedFuture@6a9d5dff[status=FAILURE, cause=[com.google.cloud.bigtable.data.v2.models.MutateRowsException: Some mutations failed to apply]], attemptResult=null, attemptSettings=TimedAttemptSettings{globalSettings=RetrySettings{totalTimeout=PT10M, initialRetryDelay=PT0.01S, retryDelayMultiplier=2.0, maxRetryDelay=PT1M, maxAttempts=0, jittered=true, initialRpcTimeout=PT1M, rpcTimeoutMultiplier=1.0, maxRpcTimeout=PT1M}, retryDelay=PT1.28S, rpcTimeout=PT1M, randomizedRetryDelay=PT0.877S, attemptCount=8, overallAttemptCount=8, firstAttemptStartTimeNanos=646922035424541}}, elements=com.google.cloud.bigtable.data.v2.models.RowMutationEntry@7a344b65}
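
As a reading aid for the retry fields in that message, the sketch below recomputes retryDelay under an assumed exponential backoff rule: the first retry waits initialRetryDelay, each later one multiplies the previous delay by retryDelayMultiplier, capped at maxRetryDelay. The rule and the `BackoffSketch` class are assumptions for illustration, not taken from the gax source, but with the values shown (initialRetryDelay=PT0.01S, retryDelayMultiplier=2.0, attemptCount=8) the rule gives 10 ms × 2^7 = 1280 ms, which matches retryDelay=PT1.28S. The reported randomizedRetryDelay=PT0.877S is below that value, which is consistent with jittered=true.

```java
import java.time.Duration;

/** Hypothetical helper illustrating an assumed exponential backoff rule. */
final class BackoffSketch {
  /** Un-jittered delay after attemptCount attempts: initial * multiplier^(attemptCount - 1), capped at max. */
  static Duration retryDelay(Duration initial, double multiplier, Duration max, int attemptCount) {
    if (attemptCount <= 0) {
      return Duration.ZERO; // nothing has been attempted yet, so there is no delay
    }
    double millis = initial.toMillis() * Math.pow(multiplier, attemptCount - 1);
    long capped = (long) Math.min(millis, (double) max.toMillis());
    return Duration.ofMillis(capped);
  }

  public static void main(String[] args) {
    // Values copied from the RetrySettings shown in the exception message.
    Duration initial = Duration.parse("PT0.01S");
    Duration max = Duration.parse("PT1M");
    System.out.println(retryDelay(initial, 2.0, max, 8)); // prints PT1.28S
  }
}
```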


@igorbernstein2 igorbernstein2 added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Aug 29, 2024
@product-auto-label product-auto-label bot added the size: m Pull request size is medium. label Aug 29, 2024

conventional-commit-lint-gcf bot commented Aug 29, 2024

🤖 I detect that the PR title and the commit message differ and there's only one commit. To use the PR title for the commit history, you can use Github's automerge feature with squashing, or use automerge label. Good luck human!

-- conventional-commit-lint bot
https://conventionalcommits.org/

@igorbernstein2 igorbernstein2 changed the title There have been reports of batcher.close() hanging every once in awhile feat: add Batcher#close(timeout) and Batcher#cancelOutstanding Aug 29, 2024
@igorbernstein2 igorbernstein2 changed the title feat: add Batcher#close(timeout) and Batcher#cancelOutstanding feat: add Batcher#close(timeout) and Batcher#cancelOutstanding WIP Aug 29, 2024
@igorbernstein2 igorbernstein2 marked this pull request as ready for review August 29, 2024 20:36
@igorbernstein2 igorbernstein2 changed the title feat: add Batcher#close(timeout) and Batcher#cancelOutstanding WIP feat: add Batcher#close(timeout) and Batcher#cancelOutstanding Aug 29, 2024
@igorbernstein2
Contributor Author

The API compatibility check is failing with:

Run REPOS_UNDER_TEST="java-datastore" ./.kokoro/presubmit/downstream-compatibility.sh
Setup maven mirror
Installing this repo's modules to local maven.
Error: 7012: com.google.api.gax.batching.Batcher: Method 'public void cancelOutstanding()' has been added to an interface
Error: 7012: com.google.api.gax.batching.Batcher: Method 'public void close(org.threeten.bp.Duration)' has been added to an interface
Error: Failed to execute goal org.codehaus.mojo:clirr-maven-plugin:2.8:check (default) on project gax: There were 2 errors. -> [Help 1]

But this is a false positive: the Batcher interface is marked as @InternalExtensionOnly, so adding new methods should be allowed. I tried suppressing it with a clirr ignore, but that doesn't seem to work. What's the correct way to get the check to pass?

@igorbernstein2 igorbernstein2 removed the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Sep 1, 2024
@igorbernstein2
Contributor Author

Ok, everything should be ready

…sh thread and read by the user thread during cancel()

sonarcloud bot commented Sep 9, 2024

Quality Gate failed for 'gapic-generator-java-root'

Failed conditions
B Reliability Rating on New Code (required ≥ A)


sonarcloud bot commented Sep 9, 2024

Quality Gate failed for 'java_showcase_integration_tests'

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)
B Reliability Rating on New Code (required ≥ A)


@blakeli0 blakeli0 merged commit b5a92e4 into googleapis:main Sep 9, 2024
44 of 47 checks passed
ldetmer added a commit that referenced this pull request Sep 9, 2024
🤖 I have created a release *beep* *boop*
---


<details><summary>2.45.0</summary>

##
[2.45.0](v2.44.0...v2.45.0)
(2024-09-09)


### Features

* add Batcher#close(timeout) and Batcher#cancelOutstanding
([#3141](#3141))
([b5a92e4](b5a92e4))
* add full RetrySettings sample code to Settings classes
([#3056](#3056))
([8fe3a2d](8fe3a2d))
* add toString to futures returned by operations
([#3140](#3140))
([afecb8c](afecb8c))
* bake gapic-generator-java into the hermetic build docker image
([#3067](#3067))
([a372e82](a372e82))


### Bug Fixes

* **gax:** prevent truncation/overflow when converting time values
([#3095](#3095))
([699074e](699074e))


### Dependencies

* add opentelemetry exporter-metrics and shared-resoucemapping to shared
dependencies
([#3078](#3078))
([fc8d80d](fc8d80d))
* update dependency certifi to v2024.8.30
([#3150](#3150))
([c18b705](c18b705))
* update dependency com.google.api-client:google-api-client-bom to
v2.7.0
([#3151](#3151))
([5f43e43](5f43e43))
* update dependency com.google.errorprone:error_prone_annotations to
v2.31.0
([#3153](#3153))
([3071509](3071509))
* update dependency com.google.errorprone:error_prone_annotations to
v2.31.0
([#3154](#3154))
([335ee63](335ee63))
* update dependency com.google.guava:guava to v33.3.0-jre
([#3119](#3119))
([41174b0](41174b0))
* update dependency dev.cel:cel to v0.7.1
([#3155](#3155))
([b1ddd16](b1ddd16))
* update dependency filelock to v3.16.0
([#3175](#3175))
([6681113](6681113))
* update dependency idna to v3.8
([#3156](#3156))
([82f5326](82f5326))
* update dependency io.netty:netty-tcnative-boringssl-static to
v2.0.66.final
([#3148](#3148))
([a7efaa8](a7efaa8))
* update dependency net.bytebuddy:byte-buddy to v1.15.1
([#3115](#3115))
([0e06c5f](0e06c5f))
* update dependency org.apache.commons:commons-lang3 to v3.17.0
([#3157](#3157))
([8d3b9fd](8d3b9fd))
* update dependency org.checkerframework:checker-qual to v3.47.0
([#3166](#3166))
([365674d](365674d))
* update dependency org.yaml:snakeyaml to v2.3
([#3158](#3158))
([e67ea9a](e67ea9a))
* update dependency platformdirs to v4.3.2
([#3176](#3176))
([4f2f9e0](4f2f9e0))
* update dependency virtualenv to v20.26.4
([#3177](#3177))
([080e078](080e078))
* update google api dependencies
([#3118](#3118))
([67342ea](67342ea))
* update google auth library dependencies to v1.25.0
([#3168](#3168))
([715884a](715884a))
* update google http client dependencies to v1.45.0
([#3159](#3159))
([a3fe612](a3fe612))
* update googleapis/java-cloud-bom digest to 6626f91
([#3147](#3147))
([658e40e](658e40e))
* update junit5 monorepo to v5.11.0
([#3111](#3111))
([6bf84c8](6bf84c8))
* update netty dependencies to v4.1.113.final
([#3165](#3165))
([9b5957d](9b5957d))
* update opentelemetry-java monorepo to v1.42.0
([#3172](#3172))
([413c44e](413c44e))


### Documentation

* Update DEVELOPMENT.md
([#3126](#3126))
([92bdf4e](92bdf4e))
</details>

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: release-please[bot] <55107282+release-please[bot]@users.noreply.github.com>
Co-authored-by: ldetmer <1771267+ldetmer@users.noreply.github.com>
ldetmer pushed a commit that referenced this pull request Sep 17, 2024
Co-authored-by: Blake Li <blakeli@google.com>
ldetmer added a commit that referenced this pull request Sep 17, 2024