Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not reuse gRPC connections that fail with native Netty errors #23150

Closed
wants to merge 1 commit into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Jul 30, 2024

Such connections are usually not in a recoverable state and should not be used for retries, which would otherwise likely fail in the same way.

Fixes #20868

Such connections are usually not in a recoverable state and should not be used for retries, which would otherwise likely fail in the same way.
@fmeum fmeum requested a review from a team as a code owner July 30, 2024 10:44
@github-actions github-actions bot added team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Jul 30, 2024
var call = mock(ClientCall.class);
doAnswer(
invocationOnMock -> {
((ClientCall.Listener) invocationOnMock.getArgument(0))
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of a whitebox test than I would have liked, but I didn't find another way to simulate a connection failure. Open to suggestions here to make this more realistic.

private static boolean isFatalError(@Nullable Throwable t) {
// A low-level netty error indicates that the connection is fundamentally broken
// and should not be reused for retries.
return t instanceof Errors.NativeIoException;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could extend this to cover certain status codes, but I wanted to start with something limited enough to not introduce performance regressions.

@werkt
Copy link
Contributor

werkt commented Jul 30, 2024

This may also fix #10159 and some other related ssl issues.

@fmeum
Copy link
Collaborator Author

fmeum commented Jul 30, 2024

This may also fix #10159 and some other related ssl issues.

Only if we extend the predicate for resetting the connection though. I'm currently only doing that for client-side low-level netty failures. @werkt Do you have other classes of exceptions in mind?

@werkt
Copy link
Contributor

werkt commented Jul 30, 2024

#18764 was another example. You're probably right, the detection here wouldn't see through the UNAVAILABLE to the handshake timeout - what I found was that the SSLException was sticky, and I couldn't manipulate the channel persistence management levels enough to get it reliably fixed or reproducing before I started diagnosing the underlying network issues.

Copy link
Member

@coeuvre coeuvre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! LGTM.

@coeuvre coeuvre added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels Aug 12, 2024
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 12, 2024

@bazel-io fork 7.4.0

@coeuvre
Copy link
Member

coeuvre commented Aug 12, 2024

FYI: Errors.NativeIoException is final and cannot be mocked. I also cannot instantiate a new Errors.NativeIoException because it causes linkage error on windows (and it not worth it to play with the complexity of linkage only for test). So I added a Predicate to the factory and let test override it.

@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Aug 12, 2024
@fmeum fmeum deleted the 20868-fatal-error-no-reuse branch August 12, 2024 20:21
@keith
Copy link
Member

keith commented Aug 19, 2024

can we cherry pick this into a release that's coming sooner than October? maybe a 7.3.2?

bazel-io pushed a commit to bazel-io/bazel that referenced this pull request Aug 19, 2024
Such connections are usually not in a recoverable state and should not be used for retries, which would otherwise likely fail in the same way.

Fixes bazelbuild#20868

Closes bazelbuild#23150.

PiperOrigin-RevId: 662091153
Change-Id: Iaf160b11a13af013b9969c7fdaa966bca8ab6be2
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 19, 2024

@Wyverald

@keith
Copy link
Member

keith commented Aug 19, 2024

idk when the 7.4.0 RCs will be, that would be fine with me too, but currently we can't use HEAD commits because of bazelbuild/continuous-integration#1402

github-merge-queue bot pushed a commit that referenced this pull request Aug 21, 2024
…ors (#23343)

Such connections are usually not in a recoverable state and should not
be used for retries, which would otherwise likely fail in the same way.

Fixes #20868

Closes #23150.

PiperOrigin-RevId: 662091153
Change-Id: Iaf160b11a13af013b9969c7fdaa966bca8ab6be2

Commit
06691b3

Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bazel throw io exception on outputs download connection reset
4 participants