Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky RPC tests #318

Closed
5 tasks done
zenhack opened this issue Oct 9, 2022 · 9 comments
Closed
5 tasks done

Flaky RPC tests #318

zenhack opened this issue Oct 9, 2022 · 9 comments

Comments

@zenhack
Copy link
Contributor

zenhack commented Oct 9, 2022

There are a few tests in the RPC package that usually pass, but fail intermittently:

level0_test.go:148: 
        	Error Trace:	level0_test.go:148
        	Error:      	Received unexpected error:
        	            	context deadline exceeded
        	Test:       	TestRecvAbort
        	Messages:   	should resolve bootstrap capability

There may be one or two more. These crop up running the tests locally often enough to be easily reproducible but not every time, so at some point they managed to slip past CI and get merged. We should fix these; I don't want to keep having to wonder when working on something else whether I've broken something or a test is just flaky; this slows down development across the board.

@zenhack
Copy link
Contributor Author

zenhack commented Oct 12, 2022

Observations:

  • TestRecvCancel only ever hangs when run in parallel with other tests. It passes consistently if you either do go test -run TestRecvCancel or delete the t.Parallel() at the top of the function. I don't yet understand why that would happen; it doesn't look like there are any shared resources with other tests that might interfere, and there's no timeouts or anything...
  • The timeout in TestRecvAbort trigger the above failure isn't theoretically impossible, though a 1 second timeout seems generous. I'm having trouble reproducing the issue right now, but if I was able to reproduce it semi-reliably, my first thought would be to try commenting out t.Parallel() to see if that fixes anything.

It seems likely that some of these are race conditions that are sufficiently timing dependent that the noise from other tests is needed to trigger them.


Also, I managed to trigger a failure other than the above:

--- FAIL: TestBootstrapReceiverAnswerRpc (0.00s)
panic: test.capnp:CapArgsTest.call: rpc: connection closed [recovered]
	panic: test.capnp:CapArgsTest.call: rpc: connection closed

goroutine 27 [running]:
testing.tRunner.func1.2({0x667820, 0xc000586060})
	/usr/lib/go/src/testing/testing.go:1396 +0x24e
testing.tRunner.func1()
	/usr/lib/go/src/testing/testing.go:1399 +0x39f
panic({0x667820, 0xc000586060})
	/usr/lib/go/src/runtime/panic.go:884 +0x212
capnproto.org/go/capnp/v3/rpc.chkfatal(...)
	/home/isd/src/go-wk/go-capnproto2/rpc/receiveranswer_test.go:51
capnproto.org/go/capnp/v3/rpc.TestBootstrapReceiverAnswerRpc(0xc0000eed00)
	/home/isd/src/go-wk/go-capnproto2/rpc/receiveranswer_test.go:87 +0x4f3
testing.tRunner(0xc0000eed00, 0x6ca630)
	/usr/lib/go/src/testing/testing.go:1446 +0x10b
created by testing.(*T).Run
	/usr/lib/go/src/testing/testing.go:1493 +0x35f
exit status 2
FAIL	capnproto.org/go/capnp/v3/rpc	0.008s

@zenhack
Copy link
Contributor Author

zenhack commented Oct 13, 2022

TestHandleReturn_regression sometimes also fails (without hanging), with:

--- FAIL: TestHandleReturn_regression (0.00s)
    --- FAIL: TestHandleReturn_regression/BootstrapWithExpiredContext (0.00s)
        level0_test.go:1820: 
            	Error Trace:	level0_test.go:1820
            	Error:      	Target error should be in err chain:
            	            	expected: "context canceled"
            	            	in chain: "test.capnp:PingPong.echoNum: rpc: send message: rpc: connection closed"
            	            		"test.capnp:PingPong.echoNum: rpc: send message: rpc: connection closed"
            	            		"rpc: send message: rpc: connection closed"
            	            		"send message: rpc: connection closed"
            	            		"rpc: connection closed"
            	            		"connection closed"
            	Test:       	TestHandleReturn_regression/BootstrapWithExpiredContext
FAIL
exit status 1
FAIL	capnproto.org/go/capnp/v3/rpc	0.107s

When it hangs instead, it's on line 1816, which is a call to .Struct() on a result.

zenhack added a commit to zenhack/go-capnp that referenced this issue Oct 13, 2022
Previously, these tests used the same connection for both tests.
This fixes intermittent test failures on the second test, as discussed
in capnproto#318.

I think, in principle, the tests *should* pass in that case anyway,
but:

1. Keeping the cases isolated makes this easier to understand
2. This way we can (and now do) run the tests in parallel
3. While we should better test the scenarios where:
   - a connection is used after a bootstrap fails
   - bootstrap is called more than once
   - etc.
   Doing so without tests dedicated to those things will be very
   difficult to maintain; we could add some but that should be a
   separate (low priority) task.
@zenhack
Copy link
Contributor Author

zenhack commented Oct 13, 2022

@lthibault, #320 fixes both failure modes for TestHandleReturn_regression.

@zenhack
Copy link
Contributor Author

zenhack commented Oct 13, 2022

A trace for the TestBootstrapReceiverAnswerRpc is below. It looks like the client is sending a finish message for the bootstrap it's done with, and then getting confused when it finally gets the return message because it's forgotten about the question entirely. Probably there's somewhere where we're being too aggressive about removing stuff from the questions table and need to leave some record of this. I think we should only ever remove stuff from the table in two places:

  1. handleReturn
  2. when sending the initial question fails (and thus we won't receive a return)

I'll pick this back up tomorrow, and start by auditing places where we remove stuff from questions.

( "Send"
, Message
    { union' =
        Message'bootstrap
          Bootstrap { questionId = 0 , deprecatedObjectId = Nothing }
    }
)
( "Send"
, Message
    { union' =
        Message'call
          Call
            { questionId = 1
            , target =
                MessageTarget
                  { union' =
                      MessageTarget'promisedAnswer
                        PromisedAnswer { questionId = 0 , transform = [] }
                  }
            , interfaceId = 13288942172782596300
            , methodId = 0
            , params =
                Payload
                  { content = Nothing
                  , capTable =
                      [ CapDescriptor
                          { attachedFd = 255
                          , union' =
                              CapDescriptor'receiverAnswer
                                PromisedAnswer { questionId = 0 , transform = [] }
                          }
                      ]
                  }
            , sendResultsTo =
                Call'sendResultsTo' { union' = Call'sendResultsTo'caller }
            , allowThirdPartyTailCall = False
            }
    }
)
( "Send"
, Message
    { union' =
        Message'finish Finish { questionId = 0 , releaseResultCaps = True }
    }
)
( "Recv"
, Message
    { union' =
        Message'return
          Return
            { answerId = 0
            , releaseParamCaps = False
            , union' =
                Return'results
                  Payload
                    { content = Nothing
                    , capTable =
                        [ CapDescriptor
                            { attachedFd = 255 , union' = CapDescriptor'senderHosted 0 }
                        ]
                    }
            }
    }
)
( "Send"
, Message
    { union' =
        Message'abort
          Exception
            { reason = "rpc: incoming return: question 0 does not exist"
            , obsoleteIsCallersFault = False
            , obsoleteDurability = 0
            , type_ = Exception'Type'failed
            }
    }
)

@zenhack
Copy link
Contributor Author

zenhack commented Oct 13, 2022

TestSendCancel failed in CI: https://github.com/capnproto/go-capnproto2/actions/runs/3238937452/jobs/5307722709#step:5:1483

...I think I've seen that before, so we should investigate that here as well.

zenhack added a commit to zenhack/go-capnp that referenced this issue Oct 14, 2022
This was manifesting as occasional failures of
TestBootstrapReceiverAnswerRpc, as discussed in capnproto#318. When this was
introduced, I attempted to avoid double-rejecting the promise by
removing it from the table after the first rejection, but this is in
fact incorrect in the case where we cancel the question, because the
entry needs to stick around until the return message comes in.

With this patch, instead, we solve the problem by just having
releaseQuestions() check the `finished` flag before calling Reject.
@zenhack
Copy link
Contributor Author

zenhack commented Oct 14, 2022

#322 fixes TestBootstrapReceiverAnswerRpc.

zenhack added a commit to zenhack/go-capnp that referenced this issue Oct 14, 2022
...and let the method implementation check the context itself.

This solves an occasional hang (observed in capnproto#318) in TestRecvCancel,
which involves a method implementation that waits for its context to be
canceled and then closes a channel. Without this patch, if the context
is cancelled early enough for handleCall to see it, it won't run the
method at all, and so the channel will never be closed, causing the test
to hang on a receive on that channel.
@zenhack
Copy link
Contributor Author

zenhack commented Oct 14, 2022

#324 fixes TestRecvCancel. I'll tackle the rest tomorrow.

@zenhack
Copy link
Contributor Author

zenhack commented Oct 17, 2022

#327 fixes TestRecvAbort.

zenhack added a commit to zenhack/go-capnp that referenced this issue Oct 17, 2022
...to try to better shake out flaky tests & race conditions. Many of the
tests listed in capnproto#318 "usually" passed, which is how they snuck in in the
first place. This adjusts our CI to run the tests in the rpc package 500
times.
@zenhack
Copy link
Contributor Author

zenhack commented Oct 24, 2022

Closing, since all the relevant prs have been merged.

@zenhack zenhack closed this as completed Oct 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant