Fix flaky `operator debug` test #12501

tgross · 2022-04-07T18:33:14Z

We introduced a `pprof-interval` argument to `operator debug` in #11938, and unfortunately this has resulted in a lot of test flakes. The command itself is mostly fine (although I've fixed some quirks here); what's really happened is that the change has revealed some existing issues in the tests. Best reviewed commit-by-commit, but a summary of the changes is below. (No changelog entry because this has only shipped in 1.3-beta.1.)

* Make first pprof collection synchronous to preserve the existing behavior for the common case where the pprof interval matches the duration.
* Clamp `operator debug` pprof timing to that of the command. The `pprof-duration` should be no more than `duration` and the `pprof-interval` should be no more than `pprof-duration`. Clamp the values rather than throwing errors, which could change the commands that existing users might already have in debugging scripts.
* Testing: remove test parallelism.

  The `operator debug` tests that stand up servers can't be run in parallel, because we don't have a way of canceling the API calls for pprof. The agent will still be running the last pprof when we exit, and that breaks the next test that talks to that same agent. (Because you can only run one pprof at a time on any process!)

  We could split off each subtest into its own server, but this test suite is already very slow. In future work we should fix this "for real" by making the API call cancelable.
* Testing: assert against unexpected errors in `operator debug` tests.

  If we assert there are no unexpected error outputs, it's easier for the developer to debug when something is going wrong with the tests because the error output will be presented as a failing test, rather than just a failing exit code check. Or worse, no failing exit code check!

  This also forces us to be explicit about which tests will return 0 exit codes but still emit (presumably ignorable) error outputs.

Additional minor bug fixes (mostly in tests) and test refactorings:

* Fix text alignment on pprof Duration in `operator debug` output
* Remove "done" channel from `operator debug` event stream test. The goroutine we're blocking for here already tells us it's done by sending a value, so block on that instead of an extraneous channel
* Event stream test timer should start at current time, not zero
* Remove noise from `operator debug` test log output. The `t.Logf` calls already are picked out from the rest of the test output by being prefixed with the filename.
* Remove explicit pprof args so we use the defaults clamped from duration/interval

shoenig

LGTM!

github-actions · 2022-10-22T02:42:47Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

tgross added 9 commits April 7, 2022 14:12

remove noise from operator debug test log output

3ded774

The `t.Logf` calls already are picked out from the rest of the test output by being prefixed with the filename.

event stream test timer should start at current time, not zero

8fedd1d

remove done channel from operator debug event stream test

c7abac2

The goroutine we're blocking for here already tells us it's done by sending a value, so block on that instead of an extraneous channel

fix text alignment on pprof Duration in operator debug output

45ebe98

make first pprof collection synchronous

056d59c

remove pprof args so we use the defaults for duration/interval

93061f4

tgross requested review from davemay99, shoenig and DerekStrickland April 7, 2022 18:33

tgross added this to the 1.3.0 milestone Apr 7, 2022

tgross added theme/cli type/bug theme/flaky-tests labels Apr 7, 2022

shoenig approved these changes Apr 7, 2022

tgross merged commit ab6f13d into main Apr 7, 2022

tgross deleted the b-flaky-test-operator-debug branch April 7, 2022 19:00

github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022
