-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix "deadline exceeded" issues discovered in conformance tests #643
Conversation
…r 'deadline exceeded' code instead of 'unknown' or 'unavailable'
@@ -111,8 +115,12 @@ func (c *compressionPool) Compress(dst *bytes.Buffer, src *bytes.Buffer) *Error | |||
if err != nil { | |||
return errorf(CodeUnknown, "get compressor: %w", err) | |||
} | |||
if _, err := io.Copy(compressor, src); err != nil { | |||
if _, err := src.WriteTo(compressor); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was just for consistency/symmetry with the decompress code above, which directly calls dst.ReadFrom
instead of using io.Copy
.
@@ -252,7 +261,7 @@ func (r *envelopeReader) Read(env *envelope) *Error { | |||
if r.readMaxBytes > 0 && size > int64(r.readMaxBytes) { | |||
_, err := io.CopyN(io.Discard, r.reader, size) | |||
if err != nil && !errors.Is(err, io.EOF) { | |||
return errorf(CodeUnknown, "read enveloped message: %w", err) | |||
return errorf(CodeResourceExhausted, "message is larger than configured max %d - unable to determine message size: %w", r.readMaxBytes, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was made consistent with the exact same logic that lives in the decompression code. Its handling (always returning "resource exhausted" since we know the message is too big) seemed better, and its error message more helpful, so I copy+pasted it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Before merging, please do respond to my question about os.ErrDeadlineExceeded
.
Also, if the best way to flush this error out is to run the conformance tests repeatedly...should we do that on one platform in CI? If it's too painful to do on every PR, we could at least do a cron job and run it nightly.
// https://github.com/golang/go/issues/64449 | ||
if errors.Is(err, os.ErrDeadlineExceeded) { | ||
return NewError(CodeDeadlineExceeded, err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirming: whenever we hit this case, the actual context has been cancelled - correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It cannot be guaranteed without also checking the context. However, the same is true of the check above: errors.Is(err, context.DeadlineExceeded)
could also return true even if the context is not cancelled:
- A network timeout that is not caused by a context cancellation will also return true with
errors.Is(err. context.DeadlineExceeded)
: https://github.com/golang/go/blob/master/src/net/net.go#L600-L602 - Handler code could create a child context with some other, arbitrary deadline, and that result in a
context.DeadlineExceeded
that does not correlate with the lifetime of the parent RPC context.
So this check does not make the situation any worse IMO. And, if the referenced Go bug is fixed, it would behave exactly this way, even with only the errors.Is(err, context.DeadlineExceeded)
check above.
The only way to guarantee that the context has been cancelled is to also pass in the context and check ctx.Err()
, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if the best way to flush this error out is to run the conformance tests repeatedly...should we do that on one platform in CI?
While trying to create a repro test for the Go bug, I think I may have found a test formulation that can reveal this issue more consistently. It isn't pretty, since it's timing related, and different execution environments will have different timing that is likely to tickle it (10-core MBP vs. VM/container in CI infra). Let me put that together and verify it can repro this issue on main and then see what you think of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just pushed a commit with a repro test. And... sadly... it fails on this branch. So there is some other place a deadline exceeded error is miscategorized as "unknown" 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I think I've now squashed all of the bugs.
The upside is that the test seemed to fail pretty consistently without all of these fixes, so I think it will be a reasonably good way to identify issues in CI. But there are a couple of downsides:
- It never repro'ed the issues with the race detector enabled. So I have the test currently running only when the race detector is disabled. And so there is now an extra command run for
make test
that just runs this one test, without-race
. - When there are no bugs, it slows down CI by a full 20 seconds, hoping to find a non-existent bug.
See the last three commits for details:
- d5965e8 adds the test
- 39a162c updates the test so the output is more useful in tracking down the source of the error; this also makes sure to use the gRPC protocol, too, in case there protocol-specific places that fail to classify the error correctly. (It just uses gRPC-Web right now, since that is the same protocol handler implementation and basically a superset of activity, since it includes add'l body writes for the trailers.)
- d42edba fixes the three add'l places found by the test that were incorrectly classifying errors. One of them was wrapping a
*connect.Error
with another*connect.Error
and changing the code to "unknown". Oops!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of them was wrapping a *connect.Error with another *connect.Error and changing the code to "unknown". Oops!
😬 Yikes!
…ror; also use gRPC protocol for broader coverage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially we could move these new test cases that relies on thrashing to a test on the writers and readers by controlling when errors are injected? This wouldn't help in uncovering new issues but we could build a suite of known error cases and protect against them. This would ensure the testcases are able to reproduce the issues under -race
.
🤷. While the test would certainly be faster, it wouldn't be nearly as robust or confidence-inspiring. Also, I worry that adding the ability inject arbitrary errors into all of the different places could make the non-test code much more complicated. |
@emcfarlane, @akshayjshah, since the test that reproduces the issue is slow, I've disabled it for now. It's actually enabled, but only when the race detector is off. And since the Makefile and CI always run the tests with Seem good enough to merge as is? |
… have some coverage in CI
After a chat w/ Akshay, I added it as a separate CI job, so it runs in parallel with other tests, and only for latest Go and linux. Looking at the timing above, it is faster than the main |
# only the latest | ||
go-version: 1.21.x | ||
- name: Run Slow Tests | ||
run: make slowtest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked to Josh live about this, but I'd love for this to be carved up a little differently. Locally, I'd really like make
and make test
to run these slower smoke tests. In CI, the highly-matrixed tests should run all the tests except the smoke tests. The new job (slowtest
) should run the slow tests, but it would be okay with me if it also ran the faster tests.
I don't care that much whether we carve up the tests with an environment variable, testing.Short()
, or with build tags.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make test
now runs both short and slow tests. CI "Test" job only runs short tests, and there's a separate "Slow Tests" job for the others. Did this with go test -short
and a check in the test (if testing.Short() { t.Skip(...) }
).
Also, while factoring out slow tests to a parallel job, I did the same for conformance tests, to shorten the critical path through the CI workflow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the main CI job must re-build the code (to instrument with race detector) and also does linting, the new slowtest job is actually faster than that other job (for now at least). So this new job does not increase the total duration. The longest job is still Windows, by a good margin.
When adding timeout tests to the conformance suite in connectrpc/conformance#715 and connectrpc/conformance#716, @smaye81 encountered several issues where the responses from the connect-go client would sometimes have a code of
unavailable
orunknown
.The
unavailable
code was happening here. ThewrapIfContextError
above it was not successful because, sadly, a timeout error during dialing can get returned asos.ErrDeadlineExceeded
instead ofcontext.DeadlineExceeded
😭. (I filed a Go bug for this and added a work-around towrapIfContextError
.)The
unknown
codes were when a deadline exceeded error was returned from a call to write a message. The code was extremely inconsistent about handling of context errors and connect errors that could be returned from a call toWrite
. I've updated it so that all code that writes or reads now checks for context and connect errors, though many places (which read and write from a buffer or a pipe) can't actually trigger it. I figured it was safer (and more consistent) to have the checks everywhere instead of trying to be selective where it could or couldn't happen (which requires non-trivial analysis and could also be brittle if this code is changed in the future).I tried creating a repro test case that reconstructed what was happening in the failing conformance test case. But I wasn't able to find a formulation that could actually do so consistently. So what I did to repro/test this was put the conformance tests in a loop, running them until they failed. Before these fixes, they semi-regularly failed (at least once every five runs). With this fix, I let it run over 500 times and still zero failures.