Flaky RPC tests #318

zenhack · 2022-10-09T02:10:40Z

There are a few tests in the RPC package that usually pass, but fail intermittently:

TestRecvCancel (hangs)
TestSendCancel (maybe; see Flaky RPC tests #318 (comment))
TestHandleReturn_regression (Sometimes hangs, sometimes actually fails, see Flaky RPC tests #318 (comment) )
TestBootstrapReceiverAnswerRpc: see Flaky RPC tests #318 (comment)
TestRecvAbort:

level0_test.go:148: 
        	Error Trace:	level0_test.go:148
        	Error:      	Received unexpected error:
        	            	context deadline exceeded
        	Test:       	TestRecvAbort
        	Messages:   	should resolve bootstrap capability

There may be one or two more. These crop up running the tests locally often enough to be easily reproducible but not every time, so at some point they managed to slip past CI and get merged. We should fix these; I don't want to keep having to wonder when working on something else whether I've broken something or a test is just flaky; this slows down development across the board.

The text was updated successfully, but these errors were encountered:

zenhack · 2022-10-12T23:26:03Z

Observations:

TestRecvCancel only ever hangs when run in parallel with other tests. It passes consistently if you either do go test -run TestRecvCancel or delete the t.Parallel() at the top of the function. I don't yet understand why that would happen; it doesn't look like there are any shared resources with other tests that might interfere, and there's no timeouts or anything...
The timeout in TestRecvAbort trigger the above failure isn't theoretically impossible, though a 1 second timeout seems generous. I'm having trouble reproducing the issue right now, but if I was able to reproduce it semi-reliably, my first thought would be to try commenting out t.Parallel() to see if that fixes anything.

It seems likely that some of these are race conditions that are sufficiently timing dependent that the noise from other tests is needed to trigger them.

Also, I managed to trigger a failure other than the above:

--- FAIL: TestBootstrapReceiverAnswerRpc (0.00s)
panic: test.capnp:CapArgsTest.call: rpc: connection closed [recovered]
	panic: test.capnp:CapArgsTest.call: rpc: connection closed

goroutine 27 [running]:
testing.tRunner.func1.2({0x667820, 0xc000586060})
	/usr/lib/go/src/testing/testing.go:1396 +0x24e
testing.tRunner.func1()
	/usr/lib/go/src/testing/testing.go:1399 +0x39f
panic({0x667820, 0xc000586060})
	/usr/lib/go/src/runtime/panic.go:884 +0x212
capnproto.org/go/capnp/v3/rpc.chkfatal(...)
	/home/isd/src/go-wk/go-capnproto2/rpc/receiveranswer_test.go:51
capnproto.org/go/capnp/v3/rpc.TestBootstrapReceiverAnswerRpc(0xc0000eed00)
	/home/isd/src/go-wk/go-capnproto2/rpc/receiveranswer_test.go:87 +0x4f3
testing.tRunner(0xc0000eed00, 0x6ca630)
	/usr/lib/go/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
	/usr/lib/go/src/testing/testing.go:1493 +0x35f
exit status 2
FAIL	capnproto.org/go/capnp/v3/rpc	0.008s

zenhack · 2022-10-13T00:14:09Z

TestHandleReturn_regression sometimes also fails (without hanging), with:

--- FAIL: TestHandleReturn_regression (0.00s)
    --- FAIL: TestHandleReturn_regression/BootstrapWithExpiredContext (0.00s)
        level0_test.go:1820: 
            	Error Trace:	level0_test.go:1820
            	Error:      	Target error should be in err chain:
            	            	expected: "context canceled"
            	            	in chain: "test.capnp:PingPong.echoNum: rpc: send message: rpc: connection closed"
            	            		"test.capnp:PingPong.echoNum: rpc: send message: rpc: connection closed"
            	            		"rpc: send message: rpc: connection closed"
            	            		"send message: rpc: connection closed"
            	            		"rpc: connection closed"
            	            		"connection closed"
            	Test:       	TestHandleReturn_regression/BootstrapWithExpiredContext
FAIL
exit status 1
FAIL	capnproto.org/go/capnp/v3/rpc	0.107s

When it hangs instead, it's on line 1816, which is a call to .Struct() on a result.

Previously, these tests used the same connection for both tests. This fixes intermittent test failures on the second test, as discussed in capnproto#318. I think, in principle, the tests *should* pass in that case anyway, but: 1. Keeping the cases isolated makes this easier to understand 2. This way we can (and now do) run the tests in parallel 3. While we should better test the scenarios where: - a connection is used after a bootstrap fails - bootstrap is called more than once - etc. Doing so without tests dedicated to those things will be very difficult to maintain; we could add some but that should be a separate (low priority) task.

zenhack · 2022-10-13T01:00:55Z

@lthibault, #320 fixes both failure modes for TestHandleReturn_regression.

zenhack · 2022-10-13T01:42:07Z

A trace for the TestBootstrapReceiverAnswerRpc is below. It looks like the client is sending a finish message for the bootstrap it's done with, and then getting confused when it finally gets the return message because it's forgotten about the question entirely. Probably there's somewhere where we're being too aggressive about removing stuff from the questions table and need to leave some record of this. I think we should only ever remove stuff from the table in two places:

handleReturn
when sending the initial question fails (and thus we won't receive a return)

I'll pick this back up tomorrow, and start by auditing places where we remove stuff from questions.

( "Send"
, Message
    { union' =
        Message'bootstrap
          Bootstrap { questionId = 0 , deprecatedObjectId = Nothing }
    }
)
( "Send"
, Message
    { union' =
        Message'call
          Call
            { questionId = 1
            , target =
                MessageTarget
                  { union' =
                      MessageTarget'promisedAnswer
                        PromisedAnswer { questionId = 0 , transform = [] }
                  }
            , interfaceId = 13288942172782596300
            , methodId = 0
            , params =
                Payload
                  { content = Nothing
                  , capTable =
                      [ CapDescriptor
                          { attachedFd = 255
                          , union' =
                              CapDescriptor'receiverAnswer
                                PromisedAnswer { questionId = 0 , transform = [] }
                          }
                      ]
                  }
            , sendResultsTo =
                Call'sendResultsTo' { union' = Call'sendResultsTo'caller }
            , allowThirdPartyTailCall = False
            }
    }
)
( "Send"
, Message
    { union' =
        Message'finish Finish { questionId = 0 , releaseResultCaps = True }
    }
)
( "Recv"
, Message
    { union' =
        Message'return
          Return
            { answerId = 0
            , releaseParamCaps = False
            , union' =
                Return'results
                  Payload
                    { content = Nothing
                    , capTable =
                        [ CapDescriptor
                            { attachedFd = 255 , union' = CapDescriptor'senderHosted 0 }
                        ]
                    }
            }
    }
)
( "Send"
, Message
    { union' =
        Message'abort
          Exception
            { reason = "rpc: incoming return: question 0 does not exist"
            , obsoleteIsCallersFault = False
            , obsoleteDurability = 0
            , type_ = Exception'Type'failed
            }
    }
)

zenhack · 2022-10-13T01:46:49Z

TestSendCancel failed in CI: https://github.com/capnproto/go-capnproto2/actions/runs/3238937452/jobs/5307722709#step:5:1483

...I think I've seen that before, so we should investigate that here as well.

This was manifesting as occasional failures of TestBootstrapReceiverAnswerRpc, as discussed in capnproto#318. When this was introduced, I attempted to avoid double-rejecting the promise by removing it from the table after the first rejection, but this is in fact incorrect in the case where we cancel the question, because the entry needs to stick around until the return message comes in. With this patch, instead, we solve the problem by just having releaseQuestions() check the `finished` flag before calling Reject.

zenhack · 2022-10-14T01:47:23Z

#322 fixes TestBootstrapReceiverAnswerRpc.

...and let the method implementation check the context itself. This solves an occasional hang (observed in capnproto#318) in TestRecvCancel, which involves a method implementation that waits for its context to be canceled and then closes a channel. Without this patch, if the context is cancelled early enough for handleCall to see it, it won't run the method at all, and so the channel will never be closed, causing the test to hang on a receive on that channel.

zenhack · 2022-10-14T04:13:00Z

#324 fixes TestRecvCancel. I'll tackle the rest tomorrow.

zenhack · 2022-10-17T00:43:26Z

#327 fixes TestRecvAbort.

...to try to better shake out flaky tests & race conditions. Many of the tests listed in capnproto#318 "usually" passed, which is how they snuck in in the first place. This adjusts our CI to run the tests in the rpc package 500 times.

zenhack · 2022-10-24T18:09:43Z

Closing, since all the relevant prs have been merged.

zenhack mentioned this issue Oct 13, 2022

Better isolate the tests under TestHandleReturn_regression #320

Merged

zenhack mentioned this issue Oct 14, 2022

Fix a regression introduced by releaseQuestions() #322

Merged

zenhack mentioned this issue Oct 14, 2022

Add Go 1.19 to the build matrix. #323

Merged

zenhack mentioned this issue Oct 14, 2022

Server.handleCall: Run methods when the ctx is already canceled #324

Merged

lthibault mentioned this issue Oct 14, 2022

Race condition in capnp #325

Closed

zenhack mentioned this issue Oct 17, 2022

CI: run RPC tests repeatedly #328

Merged

zenhack closed this as completed Oct 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flaky RPC tests #318

Flaky RPC tests #318

zenhack commented Oct 9, 2022 •

edited by lthibault

Loading

zenhack commented Oct 12, 2022

zenhack commented Oct 13, 2022

zenhack commented Oct 13, 2022

zenhack commented Oct 13, 2022

zenhack commented Oct 13, 2022

zenhack commented Oct 14, 2022

zenhack commented Oct 14, 2022

zenhack commented Oct 17, 2022

zenhack commented Oct 24, 2022

Flaky RPC tests #318

Flaky RPC tests #318

Comments

zenhack commented Oct 9, 2022 • edited by lthibault Loading

zenhack commented Oct 12, 2022

zenhack commented Oct 13, 2022

zenhack commented Oct 13, 2022

zenhack commented Oct 13, 2022

zenhack commented Oct 13, 2022

zenhack commented Oct 14, 2022

zenhack commented Oct 14, 2022

zenhack commented Oct 17, 2022

zenhack commented Oct 24, 2022

zenhack commented Oct 9, 2022 •

edited by lthibault

Loading