Flaky RPC tests #318
Comments
Observations:
It seems likely that some of these are race conditions that are timing-dependent enough that the noise from other tests is needed to trigger them. Also, I managed to trigger a failure other than the above:
When it hangs instead, it's on line 1816, which is a call to .Struct() on a result.
Previously, these tests used the same connection for both tests. This fixes intermittent test failures on the second test, as discussed in capnproto#318.

I think, in principle, the tests *should* pass in that case anyway, but:

1. Keeping the cases isolated makes this easier to understand
2. This way we can (and now do) run the tests in parallel
3. While we should better test the scenarios where:
   - a connection is used after a bootstrap fails
   - bootstrap is called more than once
   - etc.

   Doing so without tests dedicated to those things will be very difficult to maintain; we could add some but that should be a separate (low priority) task.
@lthibault, #320 fixes both failure modes for
A trace for the
I'll pick this back up tomorrow, and start by auditing places where we remove stuff from questions.
...I think I've seen that before, so we should investigate that here as well.
This was manifesting as occasional failures of TestBootstrapReceiverAnswerRpc, as discussed in capnproto#318. When this was introduced, I attempted to avoid double-rejecting the promise by removing it from the table after the first rejection, but this is in fact incorrect in the case where we cancel the question, because the entry needs to stick around until the return message comes in. With this patch, instead, we solve the problem by just having releaseQuestions() check the `finished` flag before calling Reject.
#322 fixes
...and let the method implementation check the context itself. This solves an occasional hang (observed in capnproto#318) in TestRecvCancel, which involves a method implementation that waits for its context to be canceled and then closes a channel. Without this patch, if the context is canceled early enough for handleCall to see it, handleCall won't run the method at all, so the channel will never be closed, causing the test to hang on a receive from that channel.
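The hang can be reproduced in miniature. The sketch below is an assumption-laden illustration, not the real handleCall: `method`, `dispatchSkipping`, `dispatchAlways`, `closedWithin`, and `runCanceled` are hypothetical names, and a short timeout stands in for the test hanging indefinitely.

```go
package main

import (
	"context"
	"time"
)

// method is a hypothetical stand-in for an RPC method implementation.
type method func(ctx context.Context) error

// dispatchSkipping mimics the pre-patch behavior: if the context is
// already canceled, the method is never run.
func dispatchSkipping(ctx context.Context, m method) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	return m(ctx)
}

// dispatchAlways mimics the patched behavior: always run the method and
// let it check the context itself.
func dispatchAlways(ctx context.Context, m method) error {
	return m(ctx)
}

// closedWithin reports whether done is closed before d elapses.
func closedWithin(done <-chan struct{}, d time.Duration) bool {
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// runCanceled dispatches, with an already-canceled context, a method
// that waits for cancellation and then closes a channel. It reports
// whether the channel was closed.
func runCanceled(dispatch func(context.Context, method) error) bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // cancel early, before the dispatcher looks at ctx

	done := make(chan struct{})
	m := func(ctx context.Context) error {
		<-ctx.Done()
		close(done)
		return ctx.Err()
	}
	_ = dispatch(ctx, m)
	return closedWithin(done, 50*time.Millisecond)
}
```

Calling `runCanceled(dispatchSkipping)` reports that the channel was never closed, which is the condition that left the test blocked on its receive; `runCanceled(dispatchAlways)` reports that it was closed.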
#324 fixes
#327 fixes
...to try to better shake out flaky tests & race conditions. Many of the tests listed in capnproto#318 "usually" passed, which is how they snuck in in the first place. This adjusts our CI to run the tests in the rpc package 500 times.
Closing, since all the relevant PRs have been merged.
There are a few tests in the RPC package that usually pass, but fail intermittently:

- TestRecvCancel (hangs)
- TestSendCancel (maybe; see Flaky RPC tests #318 (comment))
- TestHandleReturn_regression (sometimes hangs, sometimes actually fails; see Flaky RPC tests #318 (comment))
- TestBootstrapReceiverAnswerRpc: see Flaky RPC tests #318 (comment)
- TestRecvAbort:

There may be one or two more. These crop up when running the tests locally, often enough to be easily reproducible but not every time, so at some point they managed to slip past CI and get merged. We should fix these; I don't want to keep having to wonder, when working on something else, whether I've broken something or a test is just flaky. This slows down development across the board.