Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix flaky RPC TLS enforcement test #18155

Merged
merged 1 commit into from
Aug 7, 2023
Merged

Conversation

tgross
Copy link
Member

@tgross tgross commented Aug 4, 2023

The RPC TLS enforcment test creates network connections to a server and these are occassionally failing in testing with write: broken pipe errors. This has been an ongoing issue where it'll appear to get fixed, then reoccur, and no one seems to be able to reproduce outside of CI. The test assertion itself is reliable and correct, which is why it's been hard to spend effort to hunt this down.

We've eliminated RPC connection limits as a possibility, because we'd be getting a different error. But I've been able to trigger the error locally by setting a low RPCHandshakeTimeout, which suggests it might be the cause of the flaky tests.

This changeset sets the RPCHandshakeTimeout to 10s, twice its default. It also cleans up the test code to use current idioms around using shoenig/test and t.Cleanup functions, and configures the test servers to log verbosely in hopes that if this error continues we'll have more data to debug with.

Fixes: #16253

Comment on lines -1148 to -1152
defer func() {
//TODO Avoid panics from logging during shutdown
time.Sleep(1 * time.Second)
}()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the one nice thing about this code is that when this test failed in CI and you went to look at the code, you could immediately tell that this was a flaky test by this janky defer right at the top. 😅

(I'm happy to see it gone)

@tgross
Copy link
Member Author

tgross commented Aug 4, 2023

Test run https://github.com/hashicorp/nomad/actions/runs/5765563770/job/15631783891?pr=18155 contains a failed test still. Note that we do see the "failed TLS handshake" error in the Nomad server's logs, but that error doesn't seem to have made it back to the client:

=== RUN   TestRPC_TLS_Enforcement_RPC/other_region_client/servers_only_rpc/nomad_RPC:_rpc=Eval.Update_verify_hostname=true_leader=true
node-753 2023-08-04T19:05:12.626Z [WARN]  nomad/rpc.go:317: nomad.rpc: failed TLS handshake: remote_addr=127.0.0.1:58546 error="invalid role or region for certificate"
    assert.go:11: 
        rpc_test.go:1330: expected error to contain substring
        ↪ substring: certificate
        ↪       err: write tcp 127.0.0.1:58546->127.0.0.1:36303: write: broken pipe
--- FAIL: TestRPC_TLS_Enforcement_RPC/other_region_client/servers_only_rpc/nomad_RPC:_rpc=Eval.Update_verify_hostname=true_leader=true (0.00s)

This actually suggests to me the test is wrong -- why would we expect to get an error message back?

@tgross
Copy link
Member Author

tgross commented Aug 4, 2023

Ok, so these connections that are failing are ones that are never supposed to work b/c they fail our TLS cert role validation. The error message is coming from the TLS handshake error. The RPC connection handler closes the connection immediately on getting the error from the TLS handshake. The stdlib's TLS library flushes the connection's buffer before returning the error. So my theory is that in the failing case we get:

  • TLS library fails handshake
  • TLS library calls flush on buffer but OS doesn't write bytes onto the wire yet (or they're not received)
  • TLS library returns error
  • rpc.go closes the connection
  • client in rpc_test.go never receives the error, so it writes the message and gets broken pipe

And in the passing case we get:

  • TLS library fails handshake
  • TLS library calls flush on buffer
  • OS writes bytes onto the wire and they're received
  • TLS library returns error
  • rpc.go closes the connection
  • client in rpc_test.go got the error, so it accepts that and the assertion passes

I might be able to force this to occur with a little more debugging; if I can reproduce that I'll be able to fix this issue permanently by simply accepting the broken pipe as an acceptable error message too. Note that the Raft TLS enforcement tests already seem to do this; they check for the string certificate or EOF.

The RPC TLS enforcment test creates network connections to a server and these
are occassionally failing in testing with `write: broken pipe` errors. This has
been an ongoing issue where it'll appear to get fixed, then reoccur, and no one
seems to be able to reproduce outside of CI. The test assertion itself is
reliable, which is why it's been hard to spend effort to hunt this down.

The failing test cases are ones that are never supposed to work b/c they fail
our TLS cert role validation. The error message is coming from the TLS handshake
error. The RPC connection handler closes the connection immediately on getting
the error from the TLS handshake. The stdlib's TLS library flushes the
connection's buffer before returning the error. So the theory is that in the
failing case we don't get the error message before the connection is closed, but
do get the error return that allows the client to move on to a write, which
tries to write on the closed pipe.

I've been unable to reproduce this exactly, as the race is effectively between
the OS and the runtime. The equivalent test of the Raft TLS enforcement includes
handling of a EOF intead of the certificate error, so it appears this actually
expected (or at least known) behavior. Because the code under test is operating
as expected, this changeset updates the assertion to accept the error.
@tgross
Copy link
Member Author

tgross commented Aug 7, 2023

I might be able to force this to occur with a little more debugging

I haven't been able to reproduce but seeing as how the race in the test assertion is not about the code under test I'm also having to make a tradeoff in terms of time invested here. I'm going to accept the alternate error message and some rainy day I'll come back to this whole TLS / RPC connection handling and see if I can reproduce.

@tgross tgross changed the title test: debugging of flaky RPC TLS enforcement test test: fix flaky RPC TLS enforcement test Aug 7, 2023
@tgross tgross marked this pull request as ready for review August 7, 2023 15:16
@tgross tgross merged commit 5d2c1d1 into main Aug 7, 2023
21 checks passed
@tgross tgross deleted the rpc_tls_enforcement_test branch August 7, 2023 15:17
@tgross tgross added backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line labels Aug 7, 2023
tgross added a commit that referenced this pull request Aug 7, 2023
The RPC TLS enforcment test creates network connections to a server and these
are occassionally failing in testing with `write: broken pipe` errors. This has
been an ongoing issue where it'll appear to get fixed, then reoccur, and no one
seems to be able to reproduce outside of CI. The test assertion itself is
reliable, which is why it's been hard to spend effort to hunt this down.

The failing test cases are ones that are never supposed to work b/c they fail
our TLS cert role validation. The error message is coming from the TLS handshake
error. The RPC connection handler closes the connection immediately on getting
the error from the TLS handshake. The stdlib's TLS library flushes the
connection's buffer before returning the error. So the theory is that in the
failing case we don't get the error message before the connection is closed, but
do get the error return that allows the client to move on to a write, which
tries to write on the closed pipe.

I've been unable to reproduce this exactly, as the race is effectively between
the OS and the runtime. The equivalent test of the Raft TLS enforcement includes
handling of a EOF intead of the certificate error, so it appears this actually
expected (or at least known) behavior. Because the code under test is operating
as expected, this changeset updates the assertion to accept the error.
@tgross
Copy link
Member Author

tgross commented Aug 7, 2023

BPA backported to 1.6.x and 1.5.x. Backported to 1.4.x by hand because BPA gave up on that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line theme/flaky-tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

flaky tests: TestRPC_TLS_ are still unhappy
2 participants