test: fix flaky RPC TLS enforcement test #18155

tgross · 2023-08-04T18:49:30Z

The RPC TLS enforcment test creates network connections to a server and these are occassionally failing in testing with write: broken pipe errors. This has been an ongoing issue where it'll appear to get fixed, then reoccur, and no one seems to be able to reproduce outside of CI. The test assertion itself is reliable and correct, which is why it's been hard to spend effort to hunt this down.

We've eliminated RPC connection limits as a possibility, because we'd be getting a different error. But I've been able to trigger the error locally by setting a low RPCHandshakeTimeout, which suggests it might be the cause of the flaky tests.

This changeset sets the RPCHandshakeTimeout to 10s, twice its default. It also cleans up the test code to use current idioms around using shoenig/test and t.Cleanup functions, and configures the test servers to log verbosely in hopes that if this error continues we'll have more data to debug with.

Fixes: #16253

schmichael · 2023-08-04T19:08:53Z

nomad/rpc_test.go

-	defer func() {
-		//TODO Avoid panics from logging during shutdown
-		time.Sleep(1 * time.Second)
-	}()
-


I mean the one nice thing about this code is that when this test failed in CI and you went to look at the code, you could immediately tell that this was a flaky test by this janky defer right at the top. 😅

(I'm happy to see it gone)

tgross · 2023-08-04T19:22:42Z

Test run https://github.com/hashicorp/nomad/actions/runs/5765563770/job/15631783891?pr=18155 contains a failed test still. Note that we do see the "failed TLS handshake" error in the Nomad server's logs, but that error doesn't seem to have made it back to the client:

=== RUN   TestRPC_TLS_Enforcement_RPC/other_region_client/servers_only_rpc/nomad_RPC:_rpc=Eval.Update_verify_hostname=true_leader=true
node-753 2023-08-04T19:05:12.626Z [WARN]  nomad/rpc.go:317: nomad.rpc: failed TLS handshake: remote_addr=127.0.0.1:58546 error="invalid role or region for certificate"
    assert.go:11: 
        rpc_test.go:1330: expected error to contain substring
        ↪ substring: certificate
        ↪       err: write tcp 127.0.0.1:58546->127.0.0.1:36303: write: broken pipe
--- FAIL: TestRPC_TLS_Enforcement_RPC/other_region_client/servers_only_rpc/nomad_RPC:_rpc=Eval.Update_verify_hostname=true_leader=true (0.00s)

This actually suggests to me the test is wrong -- why would we expect to get an error message back?

tgross · 2023-08-04T20:19:36Z

Ok, so these connections that are failing are ones that are never supposed to work b/c they fail our TLS cert role validation. The error message is coming from the TLS handshake error. The RPC connection handler closes the connection immediately on getting the error from the TLS handshake. The stdlib's TLS library flushes the connection's buffer before returning the error. So my theory is that in the failing case we get:

TLS library fails handshake
TLS library calls flush on buffer but OS doesn't write bytes onto the wire yet (or they're not received)
TLS library returns error
rpc.go closes the connection
client in rpc_test.go never receives the error, so it writes the message and gets broken pipe

And in the passing case we get:

TLS library fails handshake
TLS library calls flush on buffer
OS writes bytes onto the wire and they're received
TLS library returns error
rpc.go closes the connection
client in rpc_test.go got the error, so it accepts that and the assertion passes

I might be able to force this to occur with a little more debugging; if I can reproduce that I'll be able to fix this issue permanently by simply accepting the broken pipe as an acceptable error message too. Note that the Raft TLS enforcement tests already seem to do this; they check for the string certificate or EOF.

The RPC TLS enforcment test creates network connections to a server and these are occassionally failing in testing with `write: broken pipe` errors. This has been an ongoing issue where it'll appear to get fixed, then reoccur, and no one seems to be able to reproduce outside of CI. The test assertion itself is reliable, which is why it's been hard to spend effort to hunt this down. The failing test cases are ones that are never supposed to work b/c they fail our TLS cert role validation. The error message is coming from the TLS handshake error. The RPC connection handler closes the connection immediately on getting the error from the TLS handshake. The stdlib's TLS library flushes the connection's buffer before returning the error. So the theory is that in the failing case we don't get the error message before the connection is closed, but do get the error return that allows the client to move on to a write, which tries to write on the closed pipe. I've been unable to reproduce this exactly, as the race is effectively between the OS and the runtime. The equivalent test of the Raft TLS enforcement includes handling of a EOF intead of the certificate error, so it appears this actually expected (or at least known) behavior. Because the code under test is operating as expected, this changeset updates the assertion to accept the error.

tgross · 2023-08-07T14:38:48Z

I might be able to force this to occur with a little more debugging

I haven't been able to reproduce but seeing as how the race in the test assertion is not about the code under test I'm also having to make a tradeoff in terms of time invested here. I'm going to accept the alternate error message and some rainy day I'll come back to this whole TLS / RPC connection handling and see if I can reproduce.

The RPC TLS enforcment test creates network connections to a server and these are occassionally failing in testing with `write: broken pipe` errors. This has been an ongoing issue where it'll appear to get fixed, then reoccur, and no one seems to be able to reproduce outside of CI. The test assertion itself is reliable, which is why it's been hard to spend effort to hunt this down. The failing test cases are ones that are never supposed to work b/c they fail our TLS cert role validation. The error message is coming from the TLS handshake error. The RPC connection handler closes the connection immediately on getting the error from the TLS handshake. The stdlib's TLS library flushes the connection's buffer before returning the error. So the theory is that in the failing case we don't get the error message before the connection is closed, but do get the error return that allows the client to move on to a write, which tries to write on the closed pipe. I've been unable to reproduce this exactly, as the race is effectively between the OS and the runtime. The equivalent test of the Raft TLS enforcement includes handling of a EOF intead of the certificate error, so it appears this actually expected (or at least known) behavior. Because the code under test is operating as expected, this changeset updates the assertion to accept the error.

tgross · 2023-08-07T15:19:45Z

BPA backported to 1.6.x and 1.5.x. Backported to 1.4.x by hand because BPA gave up on that.

vercel bot deployed to Preview – nomad-storybook-and-ui August 4, 2023 18:52 View deployment

tgross mentioned this pull request Aug 4, 2023

flaky tests: TestRPC_TLS_ are still unhappy #16253

Closed

tgross added the theme/flaky-tests label Aug 4, 2023

schmichael approved these changes Aug 4, 2023

View reviewed changes

tgross force-pushed the rpc_tls_enforcement_test branch from fa9b9b6 to 65501ff Compare August 7, 2023 14:36

vercel bot deployed to Preview – nomad-storybook-and-ui August 7, 2023 14:38 View deployment

tgross changed the title ~~test: debugging of flaky RPC TLS enforcement test~~ test: fix flaky RPC TLS enforcement test Aug 7, 2023

tgross marked this pull request as ready for review August 7, 2023 15:16

tgross merged commit 5d2c1d1 into main Aug 7, 2023
21 checks passed

tgross deleted the rpc_tls_enforcement_test branch August 7, 2023 15:17

tgross added backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line labels Aug 7, 2023

This was referenced Aug 7, 2023

Backport of test: fix flaky RPC TLS enforcement test into release/1.6.x #18166

Merged

Backport of test: fix flaky RPC TLS enforcement test into release/1.5.x #18167

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: fix flaky RPC TLS enforcement test #18155

test: fix flaky RPC TLS enforcement test #18155

tgross commented Aug 4, 2023 •

edited

Loading

schmichael Aug 4, 2023

tgross commented Aug 4, 2023 •

edited

Loading

tgross commented Aug 4, 2023 •

edited

Loading

tgross commented Aug 7, 2023

tgross commented Aug 7, 2023

test: fix flaky RPC TLS enforcement test #18155

test: fix flaky RPC TLS enforcement test #18155

Conversation

tgross commented Aug 4, 2023 • edited Loading

schmichael Aug 4, 2023

Choose a reason for hiding this comment

tgross commented Aug 4, 2023 • edited Loading

tgross commented Aug 4, 2023 • edited Loading

tgross commented Aug 7, 2023

tgross commented Aug 7, 2023

tgross commented Aug 4, 2023 •

edited

Loading

tgross commented Aug 4, 2023 •

edited

Loading

tgross commented Aug 4, 2023 •

edited

Loading