Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky asynchronous test fails with wrong spec state #465

Closed
williammartin opened this issue May 1, 2018 · 4 comments
Closed

Flaky asynchronous test fails with wrong spec state #465

williammartin opened this issue May 1, 2018 · 4 comments
Assignees
Labels

Comments

@williammartin
Copy link
Sponsor Collaborator

Occasionally when I run the ginkgo tests I see the following failure:

• Failure [0.030 seconds]
Shared RunnableNode behavior JustBeforeEach Nodes asynchronous functions when the function times out [It] should return the timeout 
/tmp/build/a94a8fe5/gopath/src/github.com/onsi/ginkgo/internal/leafnodes/shared_runner_test.go:216

  Expected
      <types.SpecState>: 5
  to equal
      <types.SpecState>: 6

  /tmp/build/a94a8fe5/gopath/src/github.com/onsi/ginkgo/internal/leafnodes/shared_runner_test.go:220

  Full Stack Trace
  	/tmp/build/a94a8fe5/gopath/src/github.com/onsi/gomega/internal/assertion/assertion.go:69 +0x25b
  github.com/onsi/gomega/internal/assertion.(*Assertion).Should(0xc42041e5c0, 0xb0dc40, 0xc4203504e0, 0x0, 0x0, 0x0, 0xc42041e5c0)
  	/tmp/build/a94a8fe5/gopath/src/github.com/onsi/gomega/internal/assertion/assertion.go:27 +0xd0
  github.com/onsi/ginkgo/internal/leafnodes_test.AsynchronousSharedRunnerBehaviors.func2.5.2()
  	/usr/local/go/src/testing/testing.go:777 +0x16e
  created by testing.(*T).Run
  	/usr/local/go/src/testing/testing.go:824 +0x565
@williammartin
Copy link
Sponsor Collaborator Author

williammartin commented May 1, 2018

This test is failing on the following line: https://github.com/onsi/ginkgo/blob/master/internal/leafnodes/shared_runner_test.go#L220

If we look at the spec states (https://github.com/onsi/ginkgo/blob/master/types/types.go#L144-L145) we are expecting a state of timeout but instead get one of panicked.

Looking back at the test we can see there is a panic: https://github.com/onsi/ginkgo/blob/master/internal/leafnodes/shared_runner_test.go#L212

If I was a betting man I'd say that it's likely on a constrained test environment, regardless of whether we timeout or not, there's probably a race in the run function where we might panic and trigger the defer block (https://github.com/onsi/ginkgo/blob/master/internal/leafnodes/runner.go#L76), setting the spec state to panicked.

This can be reproduced with two extra lines:

time.Sleep(time.Second) as the last statement in the runAsync goroutine
time.Sleep(time.Second * 2) on the line before `r.failer.Drain`

Effectively what this means is that we would set the spec status to timedout, and then just before we drain that status, the panic fires, triggering the defer block and setting the status to panicked.

@williammartin williammartin added this to the Tests are not flaky milestone May 1, 2018
@williammartin williammartin self-assigned this May 1, 2018
@nodo
Copy link
Collaborator

nodo commented May 16, 2018

@williammartin I am not sure I have understood your point. Where would you put the sleep exactly?

A possible the root cause (maybe different from the one above) is the too narrow difference in time between the timeout (set to 10ms) and the sleep in the async func (set to 20ms)

Suppose that there is a sleep long enough before the select statement here (e.g. 50ms) . In that case, the timeout does not occur (rightfully), because the whole runAsync function is taking too long and the test panics. If this is the problem we could just add a bit of delay here.

However, is it possible that this sleep would happen? 🤔

@nodo
Copy link
Collaborator

nodo commented May 22, 2018

@williammartin any thoughts on the comment above?

@williammartin
Copy link
Sponsor Collaborator Author

Yes but I thought we could discuss when you're back. Not the easiest to describe racy behaviour over GitHub.

@nodo nodo closed this as completed in 294d8f4 May 29, 2018
blgm pushed a commit that referenced this issue Jul 10, 2018
The tests was occasionally failing due to
a race between the goroutine waiting for a timeout and the goroutine
setting the test state.

This commit changes the test so that it does not close the channel in
time, so that the chance of flakiness is significantely reduced.

Fixes #465

Co-authored-by: William Martin <wmartin@pivotal.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants