-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-22.1: util/tracing: fix crash in StreamClientInterceptor #80911
release-22.1: util/tracing: fix crash in StreamClientInterceptor #80911
Conversation
Before this patch, our client-side tracing interceptor for streaming rpc calls was exposed to gRPC bugs being currently fixed in github.com/grpc/grpc-go/pull/5323. This had to do with calls to clientStream.Context() panicking with an NPE under certain races with RPCs failing. We've recently gotten two crashes seemingly because of this. It's unclear why this hasn't shown up before, as nothing seems new (either on our side or on the grpc side). In 22.2 we do use more streaming RPCs than before (for example for span configs), so maybe that does it. This patch eliminates the problem by eliminating the problematic call into ClientStream.Context(). The background is that our interceptors needs to watch for ctx cancelation and consider the RPC done at that point. But, there was no reason for that call; we can more simply use the RPC caller's ctx for the purposes of figuring out if the caller cancels the RPC. In fact, calling ClientStream.Context() is bad for other reasons, beyond exposing us to the bug: 1) ClientStream.Context() pins the RPC attempt to a lower-level connection, and inhibits gRPC's ability to sometimes transparently retry failed calls. In fact, there's a comment on ClientStream.Context() that tells you not to call it before using the stream through other methods like Recv(), which imply that the RPC is already "pinned" and transparent retries are no longer possible anyway. We were breaking this. 2) One of the grpc-go maintainers suggested that, due to the bugs reference above, this call could actually sometimes give us "the wrong context", although how wrong exactly is not clear to me (i.e. could we have gotten a ctx that doesn't inherit from the caller's ctx? Or a ctx that's canceled independently from the caller?) Release note: A rare crash indicating a nil-pointer deference in google.golang.org/grpc/internal/transport.(*Stream).Context(...) was fixed.
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @dhartunian)
pkg/util/tracing/grpc_interceptor_test.go
line 140 at r1 (raw file):
c := grpcutils.NewGRPCTestClient(conn) unusedAny, err := types.MarshalAny(&types.Empty{}) require.NoError(t, err)
I noticed that the "make tests independent" part of the diff is not here. Is that related to the finalizer since it's checking for leaked spans?
pkg/util/tracing/grpc_interceptor_test.go
line 230 at r1 (raw file):
// the span. Nothing else closes that span in this test. See // newTracingClientStream(). runtime.GC()
If the finalizer is still in place, does the GC need to be triggered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
pkg/util/tracing/grpc_interceptor_test.go
line 140 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
I noticed that the "make tests independent" part of the diff is not here. Is that related to the finalizer since it's checking for leaked spans?
I simply haven't backported that commit, to keep the backport more focused. That test change is independent of the rest. It makes the test fail better in cases where we have span leaks.
pkg/util/tracing/grpc_interceptor_test.go
line 230 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
If the finalizer is still in place, does the GC need to be triggered?
no; the finalizer stays in place in case we have leaks of the gRPC streams and their spans (which I don't think we have). But this test no longer needs the finalizer because the test no longer leaks streams. Before this patch, the test was violating the gRPC contract in the UnaryStream
and StreamStream
case because it was not fully consuming the stream - we were just Recv()
ing once and then abandoning the stream, but you're supposed to read until you get an error, or cancel the call's ctx. This patch brings the test to the RPC spec.
This is a backport of parts of #80878. Compared to that patch, this one leaves the finalizer in place, to not rock the boat before the 22.1 release.
Before this patch, our client-side tracing interceptor for streaming rpc
calls was exposed to gRPC bugs being currently fixed in
github.com/grpc/grpc-go/pull/5323. This had to do with calls to
clientStream.Context() panicking with an NPE under certain races with
RPCs failing. We've recently gotten two crashes seemingly because of
this. It's unclear why this hasn't shown up before, as nothing seems new
(either on our side or on the grpc side). In 22.2 we do use more
streaming RPCs than before (for example for span configs), so maybe that
does it.
This patch eliminates the problem by eliminating the
problematic call into ClientStream.Context(). The background is that
our interceptors needs to watch for ctx cancelation and consider the RPC
done at that point. But, there was no reason for that call; we can more
simply use the RPC caller's ctx for the purposes of figuring out if the
caller cancels the RPC. In fact, calling ClientStream.Context() is bad
for other reasons, beyond exposing us to the bug:
connection, and inhibits gRPC's ability to sometimes transparently
retry failed calls. In fact, there's a comment on ClientStream.Context()
that tells you not to call it before using the stream through other
methods like Recv(), which imply that the RPC is already "pinned" and
transparent retries are no longer possible anyway. We were breaking
this.
reference above, this call could actually sometimes give us "the
wrong context", although how wrong exactly is not clear to me (i.e.
could we have gotten a ctx that doesn't inherit from the caller's ctx?
Or a ctx that's canceled independently from the caller?)
Release note: A rare crash indicating a nil-pointer deference in
google.golang.org/grpc/internal/transport.(*Stream).Context(...)
was fixed.
Release justification: release blocker