test: fix flaky RPC TLS enforcement test #18155
defer func() {
	//TODO Avoid panics from logging during shutdown
	time.Sleep(1 * time.Second)
}()
I mean the one nice thing about this code is that when this test failed in CI and you went to look at the code, you could immediately tell that this was a flaky test by this janky defer right at the top. 😅
(I'm happy to see it gone)
Test run https://github.com/hashicorp/nomad/actions/runs/5765563770/job/15631783891?pr=18155 still contains a failed test. Note that we do see the "failed TLS handshake" error in the Nomad server's logs, but that error doesn't seem to have made it back to the client:
This actually suggests to me that the test is wrong: why would we expect to get an error message back?
Ok, so these connections that are failing are ones that are never supposed to work b/c they fail our TLS cert role validation. The error message is coming from the TLS handshake error. The RPC connection handler closes the connection immediately on getting the error from the TLS handshake. The stdlib's TLS library flushes the connection's buffer before returning the error. So my theory is that in the failing case we get:
And in the passing case we get:
I might be able to force this to occur with a little more debugging; if I can reproduce that I'll be able to fix this issue permanently by simply accepting the
The RPC TLS enforcement test creates network connections to a server and these are occasionally failing in testing with `write: broken pipe` errors. This has been an ongoing issue where it'll appear to get fixed, then reoccur, and no one seems to be able to reproduce outside of CI. The test assertion itself is reliable, which is why it's been hard to spend effort to hunt this down. The failing test cases are ones that are never supposed to work because they fail our TLS cert role validation. The error message is coming from the TLS handshake error. The RPC connection handler closes the connection immediately on getting the error from the TLS handshake. The stdlib's TLS library flushes the connection's buffer before returning the error. So the theory is that in the failing case we don't get the error message before the connection is closed, but do get the error return that allows the client to move on to a write, which tries to write on the closed pipe. I've been unable to reproduce this exactly, as the race is effectively between the OS and the runtime. The equivalent test of the Raft TLS enforcement includes handling of an EOF instead of the certificate error, so it appears this is actually expected (or at least known) behavior. Because the code under test is operating as expected, this changeset updates the assertion to accept the error.
fa9b9b6 to 65501ff
I haven't been able to reproduce this, but since the race in the test assertion is not about the code under test, I'm also having to make a tradeoff in terms of time invested here. I'm going to accept the alternate error message, and some rainy day I'll come back to this whole TLS / RPC connection handling and see if I can reproduce it.
BPA backported to 1.6.x and 1.5.x. Backported to 1.4.x by hand because BPA gave up on that.
The RPC TLS enforcement test creates network connections to a server and these are occasionally failing in testing with `write: broken pipe` errors. This has been an ongoing issue where it'll appear to get fixed, then reoccur, and no one seems to be able to reproduce outside of CI. The test assertion itself is reliable and correct, which is why it's been hard to spend effort to hunt this down.
We've eliminated RPC connection limits as a possibility, because we'd be getting a different error. But I've been able to trigger the error locally by setting a low `RPCHandshakeTimeout`, which suggests it might be the cause of the flaky tests.
This changeset sets the `RPCHandshakeTimeout` to 10s, twice its default. It also cleans up the test code to use current idioms around using `shoenig/test` and `t.Cleanup` functions, and configures the test servers to log verbosely in hopes that if this error continues we'll have more data to debug with.
Fixes: #16253