Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix "deadline exceeded" issues discovered in conformance tests #643

Merged
merged 12 commits into from
Dec 7, 2023

Conversation

jhump
Copy link
Member

@jhump jhump commented Nov 29, 2023

When adding timeout tests to the conformance suite in connectrpc/conformance#715 and connectrpc/conformance#716, @smaye81 encountered several issues where the responses from the connect-go client would sometimes have a code of unavailable or unknown.

The unavailable code was happening here. The wrapIfContextError above it was not successful because, sadly, a timeout error during dialing can get returned as os.ErrDeadlineExceeded instead of context.DeadlineExceeded 😭. (I filed a Go bug for this and added a work-around to wrapIfContextError.)

The unknown codes were when a deadline exceeded error was returned from a call to write a message. The code was extremely inconsistent about handling of context errors and connect errors that could be returned from a call to Write. I've updated it so that all code that writes or reads now checks for context and connect errors, though many places (which read and write from a buffer or a pipe) can't actually trigger it. I figured it was safer (and more consistent) to have the checks everywhere instead of trying to be selective where it could or couldn't happen (which requires non-trivial analysis and could also be brittle if this code is changed in the future).

I tried creating a repro test case that reconstructed what was happening in the failing conformance test case. But I wasn't able to find a formulation that could actually do so consistently. So what I did to repro/test this was put the conformance tests in a loop, running them until they failed. Before these fixes, they semi-regularly failed (at least once every five runs). With this fix, I let it run over 500 times and still zero failures.

…r 'deadline exceeded' code instead of 'unknown' or 'unavailable'
@@ -111,8 +115,12 @@ func (c *compressionPool) Compress(dst *bytes.Buffer, src *bytes.Buffer) *Error
if err != nil {
return errorf(CodeUnknown, "get compressor: %w", err)
}
if _, err := io.Copy(compressor, src); err != nil {
if _, err := src.WriteTo(compressor); err != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just for consistency/symmetry with the decompress code above, which directly calls dst.ReadFrom instead of using io.Copy.

@@ -252,7 +261,7 @@ func (r *envelopeReader) Read(env *envelope) *Error {
if r.readMaxBytes > 0 && size > int64(r.readMaxBytes) {
_, err := io.CopyN(io.Discard, r.reader, size)
if err != nil && !errors.Is(err, io.EOF) {
return errorf(CodeUnknown, "read enveloped message: %w", err)
return errorf(CodeResourceExhausted, "message is larger than configured max %d - unable to determine message size: %w", r.readMaxBytes, err)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was made consistent with the exact same logic that lives in the decompression code. Its handling (always returning "resource exhausted" since we know the message is too big) seemed better, and its error message more helpful, so I copy+pasted it here.

@jhump jhump requested a review from akshayjshah November 29, 2023 21:29
Copy link
Member

@akshayjshah akshayjshah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Before merging, please do respond to my question about os.ErrDeadlineExceeded.

Also, if the best way to flush this error out is to run the conformance tests repeatedly...should we do that on one platform in CI? If it's too painful to do on every PR, we could at least do a cron job and run it nightly.

// https://github.com/golang/go/issues/64449
if errors.Is(err, os.ErrDeadlineExceeded) {
return NewError(CodeDeadlineExceeded, err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming: whenever we hit this case, the actual context has been cancelled - correct?

Copy link
Member Author

@jhump jhump Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It cannot be guaranteed without also checking the context. However, the same is true of the check above: errors.Is(err, context.DeadlineExceeded) could also return true even if the context is not cancelled:

  1. A network timeout that is not caused by a context cancellation will also return true with errors.Is(err. context.DeadlineExceeded): https://github.com/golang/go/blob/master/src/net/net.go#L600-L602
  2. Handler code could create a child context with some other, arbitrary deadline, and that result in a context.DeadlineExceeded that does not correlate with the lifetime of the parent RPC context.

So this check does not make the situation any worse IMO. And, if the referenced Go bug is fixed, it would behave exactly this way, even with only the errors.Is(err, context.DeadlineExceeded) check above.

The only way to guarantee that the context has been cancelled is to also pass in the context and check ctx.Err(), too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if the best way to flush this error out is to run the conformance tests repeatedly...should we do that on one platform in CI?

While trying to create a repro test for the Go bug, I think I may have found a test formulation that can reveal this issue more consistently. It isn't pretty, since it's timing related, and different execution environments will have different timing that is likely to tickle it (10-core MBP vs. VM/container in CI infra). Let me put that together and verify it can repro this issue on main and then see what you think of it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pushed a commit with a repro test. And... sadly... it fails on this branch. So there is some other place a deadline exceeded error is miscategorized as "unknown" 😢

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think I've now squashed all of the bugs.

The upside is that the test seemed to fail pretty consistently without all of these fixes, so I think it will be a reasonably good way to identify issues in CI. But there are a couple of downsides:

  1. It never repro'ed the issues with the race detector enabled. So I have the test currently running only when the race detector is disabled. And so there is now an extra command run for make test that just runs this one test, without -race.
  2. When there are no bugs, it slows down CI by a full 20 seconds, hoping to find a non-existent bug.

See the last three commits for details:

  1. d5965e8 adds the test
  2. 39a162c updates the test so the output is more useful in tracking down the source of the error; this also makes sure to use the gRPC protocol, too, in case there protocol-specific places that fail to classify the error correctly. (It just uses gRPC-Web right now, since that is the same protocol handler implementation and basically a superset of activity, since it includes add'l body writes for the trailers.)
  3. d42edba fixes the three add'l places found by the test that were incorrectly classifying errors. One of them was wrapping a *connect.Error with another *connect.Error and changing the code to "unknown". Oops!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of them was wrapping a *connect.Error with another *connect.Error and changing the code to "unknown". Oops!

😬 Yikes!

Copy link
Contributor

@emcfarlane emcfarlane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially we could move these new test cases that relies on thrashing to a test on the writers and readers by controlling when errors are injected? This wouldn't help in uncovering new issues but we could build a suite of known error cases and protect against them. This would ensure the testcases are able to reproduce the issues under -race.

client_ext_norace_test.go Outdated Show resolved Hide resolved
client_ext_norace_test.go Outdated Show resolved Hide resolved
client_ext_norace_test.go Outdated Show resolved Hide resolved
client_ext_norace_test.go Outdated Show resolved Hide resolved
@jhump
Copy link
Member Author

jhump commented Dec 5, 2023

This wouldn't help in uncovering new issues but we could build a suite of known error cases and protect against them.

🤷. While the test would certainly be faster, it wouldn't be nearly as robust or confidence-inspiring. Also, I worry that adding the ability inject arbitrary errors into all of the different places could make the non-test code much more complicated.

@jhump
Copy link
Member Author

jhump commented Dec 6, 2023

@emcfarlane, @akshayjshah, since the test that reproduces the issue is slow, I've disabled it for now. It's actually enabled, but only when the race detector is off. And since the Makefile and CI always run the tests with -race, it's effectively a manual-only test.

Seem good enough to merge as is?

@jhump
Copy link
Member Author

jhump commented Dec 7, 2023

After a chat w/ Akshay, I added it as a separate CI job, so it runs in parallel with other tests, and only for latest Go and linux. Looking at the timing above, it is faster than the main ci job, so isn't making CI jobs otherwise take longer (so time to merge is unaffected), they just use a little more resources.

# only the latest
go-version: 1.21.x
- name: Run Slow Tests
run: make slowtest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talked to Josh live about this, but I'd love for this to be carved up a little differently. Locally, I'd really like make and make test to run these slower smoke tests. In CI, the highly-matrixed tests should run all the tests except the smoke tests. The new job (slowtest) should run the slow tests, but it would be okay with me if it also ran the faster tests.

I don't care that much whether we carve up the tests with an environment variable, testing.Short(), or with build tags.

Copy link
Member Author

@jhump jhump Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make test now runs both short and slow tests. CI "Test" job only runs short tests, and there's a separate "Slow Tests" job for the others. Did this with go test -short and a check in the test (if testing.Short() { t.Skip(...) }).

Also, while factoring out slow tests to a parallel job, I did the same for conformance tests, to shorten the critical path through the CI workflow.

Copy link
Member Author

@jhump jhump Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the main CI job must re-build the code (to instrument with race detector) and also does linting, the new slowtest job is actually faster than that other job (for now at least). So this new job does not increase the total duration. The longest job is still Windows, by a good margin.

@jhump jhump merged commit c9408f4 into main Dec 7, 2023
11 checks passed
@jhump jhump deleted the jh/deadline-exceeded-issues branch December 7, 2023 18:01
@jhump jhump added the bug Something isn't working label Dec 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants