Fix flaky operator debug test (#12501)
We introduced a `pprof-interval` argument to `operator debug` in #11938, and unfortunately this has resulted in a lot of test flakes. The command itself is mostly fine (although I've fixed some quirks here); what's really happened is that the change has revealed some existing issues in the tests. Summary of changes:

* Make first pprof collection synchronous to preserve the existing
  behavior for the common case where the pprof interval matches the
  duration.

* Clamp `operator debug` pprof timing to that of the command. The
  `pprof-duration` should be no more than `duration` and the
  `pprof-interval` should be no more than `pprof-duration`. Clamp the
  values rather than returning errors, which would break commands that
  existing users may already have in debugging scripts.

* Testing: remove test parallelism

  The `operator debug` tests that stand up servers can't be run in
  parallel, because we don't have a way of canceling the API calls for
  pprof. The agent will still be running the last pprof when we exit,
  and that breaks the next test that talks to that same agent.
  (Because you can only run one pprof at a time on any process!)

  We could split off each subtest into its own server, but this test
  suite is already very slow. In future work we should fix this "for
  real" by making the API call cancelable.


* Testing: assert against unexpected errors in `operator debug` tests.

  If we assert there are no unexpected error outputs, it's easier for
  the developer to debug when something is going wrong with the tests
  because the error output will be presented as a failing test, rather
  than just a failing exit code check. Or worse, no failing exit code
  check!

  This also forces us to be explicit about which tests will return 0
  exit codes but still emit (presumably ignorable) error outputs.
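The shape of that assertion can be sketched as a small helper; the `assertNoErrorOutput` name is illustrative, not what the test suite actually calls it:

```go
package main

import (
	"fmt"
	"strings"
)

// assertNoErrorOutput treats any text on the command's error buffer as a
// failure in its own right, so the message itself shows up in the test
// report instead of hiding behind a passing exit-code check.
func assertNoErrorOutput(stderr string) error {
	if msg := strings.TrimSpace(stderr); msg != "" {
		return fmt.Errorf("expected empty error output, got: %q", msg)
	}
	return nil
}

func main() {
	fmt.Println(assertNoErrorOutput(""))                // passes: <nil>
	fmt.Println(assertNoErrorOutput("rpc error: EOF"))  // fails with the message
}
```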

Additional minor bug fixes (mostly in tests) and test refactorings:

* Fix text alignment on pprof Duration in `operator debug` output

* Remove "done" channel from `operator debug` event stream test. The
  goroutine we're blocking for here already tells us it's done by
  sending a value, so block on that instead of an extraneous channel.
* Event stream test timer should start at current time, not zero

* Remove noise from `operator debug` test log output. The `t.Logf`
  calls are already picked out from the rest of the test output by
  being prefixed with the filename.

* Remove explicit pprof args so we use the defaults clamped from
  duration/interval.
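The "done" channel removal above boils down to this pattern; the `waitForResult` helper is made up for illustration:

```go
package main

import "fmt"

// waitForResult blocks on the goroutine's result channel directly. The
// send already signals completion, so a separate "done" channel would
// just be a second synchronization point to get wrong.
func waitForResult(work func() string) string {
	results := make(chan string)
	go func() {
		results <- work() // the send doubles as the completion signal
	}()
	return <-results // no extraneous done channel to close or select on
}

func main() {
	fmt.Println(waitForResult(func() string { return "event received" }))
}
```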
tgross committed Apr 7, 2022
1 parent 2c6e84c commit ab6f13d
Showing 4 changed files with 139 additions and 136 deletions.
80 changes: 42 additions & 38 deletions command/operator_debug.go
@@ -183,11 +183,12 @@ Debug Options:
     Filter client nodes based on node class.

   -pprof-duration=<duration>
-    Duration for pprof collection. Defaults to 1s.
+    Duration for pprof collection. Defaults to 1s or -duration, whichever is less.

   -pprof-interval=<pprof-interval>
     The interval between pprof collections. Set interval equal to
-    duration to capture a single snapshot. Defaults to 250ms.
+    duration to capture a single snapshot. Defaults to 250ms or
+    -pprof-duration, whichever is less.

   -server-id=<server1>,<server2>
     Comma separated list of Nomad server names to monitor for logs, API
@@ -406,27 +407,27 @@ func (c *OperatorDebugCommand) Run(args []string) int {
 		return 1
 	}

-	// Parse the pprof capture interval
-	pi, err := time.ParseDuration(pprofInterval)
-	if err != nil {
-		c.Ui.Error(fmt.Sprintf("Error parsing pprof-interval: %s: %s", pprofInterval, err.Error()))
-		return 1
-	}
-	c.pprofInterval = pi
-
-	// Parse the pprof capture duration
+	// Parse and clamp the pprof capture duration
 	pd, err := time.ParseDuration(pprofDuration)
 	if err != nil {
 		c.Ui.Error(fmt.Sprintf("Error parsing pprof duration: %s: %s", pprofDuration, err.Error()))
 		return 1
 	}
+	if pd.Seconds() > d.Seconds() {
+		pd = d
+	}
 	c.pprofDuration = pd

-	// Validate pprof interval
-	if pi.Seconds() > pd.Seconds() {
-		c.Ui.Error(fmt.Sprintf("pprof-interval %s must be less than pprof-duration %s", pprofInterval, pprofDuration))
+	// Parse and clamp the pprof capture interval
+	pi, err := time.ParseDuration(pprofInterval)
+	if err != nil {
+		c.Ui.Error(fmt.Sprintf("Error parsing pprof-interval: %s: %s", pprofInterval, err.Error()))
 		return 1
 	}
+	if pi.Seconds() > pd.Seconds() {
+		pi = pd
+	}
+	c.pprofInterval = pi

 	// Parse event stream topic filter
 	t, err := topicsFromString(eventTopic)
@@ -615,7 +616,7 @@ func (c *OperatorDebugCommand) Run(args []string) int {
 	}
 	c.Ui.Output(fmt.Sprintf("  Interval: %s", interval))
 	c.Ui.Output(fmt.Sprintf("  Duration: %s", duration))
-	c.Ui.Output(fmt.Sprintf(" pprof Interval: %s", pprofInterval))
+	c.Ui.Output(fmt.Sprintf("  pprof Interval: %s", pprofInterval))
 	if c.pprofDuration.Seconds() != 1 {
 		c.Ui.Output(fmt.Sprintf("  pprof Duration: %s", c.pprofDuration))
 	}
@@ -684,7 +685,7 @@ func (c *OperatorDebugCommand) collect(client *api.Client) error {
 	c.collectVault(clusterDir, vaultAddr)

 	c.collectAgentHosts(client)
-	go c.collectPeriodicPprofs(client)
+	c.collectPeriodicPprofs(client)

 	c.collectPeriodic(client)
@@ -898,31 +899,34 @@ func (c *OperatorDebugCommand) collectAgentHost(path, id string, client *api.Cli
 }

 func (c *OperatorDebugCommand) collectPeriodicPprofs(client *api.Client) {
-	duration := time.After(c.pprofDuration)
-	// Create a ticker to execute on every interval ticks
-	ticker := time.NewTicker(c.pprofInterval)
-
-	var pprofIntervalCount int
-	var name string
-
-	// Additionally, an out of loop execute to imitate first tick
-	c.collectPprofs(client, pprofIntervalCount)
-
-	for {
-		select {
-		case <-duration:
-			return
-
-		case <-ticker.C:
-			name = fmt.Sprintf("%04d", pprofIntervalCount)
-			c.Ui.Output(fmt.Sprintf(" Capture pprofInterval %s", name))
-			c.collectPprofs(client, pprofIntervalCount)
-			pprofIntervalCount++
+	// Take the first set of pprofs synchronously...
+	c.Ui.Output(" Capture pprofInterval 0000")
+	c.collectPprofs(client, 0)
+	if c.pprofInterval == c.pprofDuration {
+		return
+	}

-		case <-c.ctx.Done():
-			return
-		}
-	}
+	// ... and then move the rest off into a goroutine
+	go func() {
+		ctx, cancel := context.WithTimeout(c.ctx, c.duration)
+		defer cancel()
+		timer, stop := helper.NewSafeTimer(c.pprofInterval)
+		defer stop()
+
+		pprofIntervalCount := 1
+		for {
+			select {
+			case <-ctx.Done():
+				return
+			case <-timer.C:
+				c.Ui.Output(fmt.Sprintf(" Capture pprofInterval %04d", pprofIntervalCount))
+				c.collectPprofs(client, pprofIntervalCount)
+				timer.Reset(c.pprofInterval)
+				pprofIntervalCount++
+			}
+		}
+	}()
 }

// collectPprofs captures the /agent/pprof for each listed node
