Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Goroutine leak detection with unit tests #4416

Merged

Conversation

OhmSpectator
Copy link
Member

@OhmSpectator OhmSpectator commented Nov 1, 2024

This pull request introduces a goroutine leak detection system in the watcher component, along with a comprehensive test suite to validate the accuracy and reliability of leak identification. Additionally, a minor update ensures stack dumps work seamlessly within watcher by verifying that the required directory is created as needed.

Key Changes:

  • Goroutine Leak Detection in watcher
    The new detection system monitors goroutine counts to identify potential leaks by combining a threshold check for immediate spikes with trend-based analysis. This approach smooths data over time and uses dynamically adjusted thresholds to flag gradual increases, potentially indicating leaks.

  • Testing for Leak Scenarios
    A suite of tests simulates various scenarios including growth patterns, spikes, gradual leaks, and edge cases to ensure the detector accurately distinguishes between normal variations and true leaks.

TODO:

  • Test in a known leak scenario (in progress)
  • Make the detector parameters configurable
  • Add the documentation

@OhmSpectator OhmSpectator force-pushed the feature/goroutine-count-to-watcher branch from f9cd1d6 to e6478ad Compare November 5, 2024 20:59
@github-actions github-actions bot requested a review from eriknordmark November 5, 2024 21:00
@OhmSpectator OhmSpectator force-pushed the feature/goroutine-count-to-watcher branch 4 times, most recently from f494457 to cc28001 Compare November 6, 2024 13:41
@OhmSpectator OhmSpectator changed the title [WIP] pillar/watcher: Add goroutine leak detection. Add Goroutine leak detection with unit tests Nov 6, 2024
@OhmSpectator OhmSpectator marked this pull request as ready for review November 6, 2024 13:50
Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@deitch deitch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is impressive. I liked the advanced thought in this. A few questions:

  1. Is it possible to change the constants for timing and thresholds to variables, with the defaults being constants, and they can be overridden via eve config? There may be legitimate scenarios where what the constants consider a problem is perfectly normal operation. We don't want to have to upgrade EVE on devices to change them when a config would do.
  2. Can we document this?

@OhmSpectator
Copy link
Member Author

This is impressive. I liked the advanced thought in this. A few questions:

  1. Is it possible to change the constants for timing and thresholds to variables, with the defaults being constants, and they can be overridden via eve config? There may be legitimate scenarios where what the constants consider a problem is perfectly normal operation. We don't want to have to upgrade EVE on devices to change them when a config would do.
  2. Can we document this?

Yep, I want to do both. I should have done it before converting from Draft =)

@OhmSpectator OhmSpectator marked this pull request as draft November 7, 2024 11:49
@OhmSpectator OhmSpectator force-pushed the feature/goroutine-count-to-watcher branch from cc28001 to d0bf014 Compare November 7, 2024 14:01
@github-actions github-actions bot requested a review from eriknordmark November 7, 2024 14:02
@OhmSpectator OhmSpectator force-pushed the feature/goroutine-count-to-watcher branch 4 times, most recently from 065127d to 04f15cd Compare November 7, 2024 19:07
@OhmSpectator
Copy link
Member Author

This is impressive. I liked the advanced thought in this. A few questions:

  1. Is it possible to change the constants for timing and thresholds to variables, with the defaults being constants, and they can be overridden via eve config? There may be legitimate scenarios where what the constants consider a problem is perfectly normal operation. We don't want to have to upgrade EVE on devices to change them when a config would do.
  2. Can we document this?

Done!

@OhmSpectator OhmSpectator marked this pull request as ready for review November 7, 2024 19:08
@OhmSpectator OhmSpectator force-pushed the feature/goroutine-count-to-watcher branch from 04f15cd to 259c801 Compare November 7, 2024 19:18
@OhmSpectator OhmSpectator force-pushed the feature/goroutine-count-to-watcher branch from e7b6985 to f6ccdc1 Compare November 8, 2024 10:14
@github-actions github-actions bot requested a review from eriknordmark November 8, 2024 10:14
…in DumpAllStacks.

Modify DumpAllStacks to create the debug directory if it doesn't already
exist, ensuring that stack dumps can always be written to the specified
location.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
The doc gives a short overview of the compoent and explains the manual
garbage collection functionality.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Introduce a detection system to identify potential goroutine leaks by
combining a simple threshold check with trend-based monitoring. This
approach periodically tracks the number of active goroutines, checking
against a predefined maximum to catch immediate issues. Additionally, it
applies a smoothing function to analyze longer-term trends within a
sliding window.

The detection logic includes:

* A basic threshold check: If the goroutine count exceeds a set maximum,
  an alert is triggered.

* A moving average to smooth fluctuations in goroutine counts
  over time.

* Statistical analysis of the rate of change in goroutine counts,
  setting a dynamic threshold that adapts to typical variations. If the
  rate of increase consistently surpasses this threshold, it indicates a
  potential leak.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
@OhmSpectator OhmSpectator force-pushed the feature/goroutine-count-to-watcher branch from f6ccdc1 to a763555 Compare November 8, 2024 11:28
@OhmSpectator OhmSpectator force-pushed the feature/goroutine-count-to-watcher branch 3 times, most recently from 9a7e141 to 9bf7913 Compare November 8, 2024 13:48
@OhmSpectator
Copy link
Member Author

Fixed races for the logger in the tests

Introduce a suite of tests to verify the functionality of the goroutine
leak detection system, simulating various scenarios to ensure robust
leak identification.

Growth scenarios: Test scenarios for initial goroutine growth with
stabilization, sudden spikes, and decreases after spikes to validate
that these typical patterns do not trigger false positives.

Leak scenario: Include tests where the goroutine count gradually
increases, simulating a leak. The test verifies that the detector
correctly identifies this pattern as a leak.

Monitoring tests: Implement tests for the `goroutinesMonitor` function
to check detection behavior at regular intervals.

Edge cases for the helper function: Add tests to handle empty data
inputs, minimal data length, and cases where the moving average window
exceeds data length, ensuring resilience against edge cases.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Document the goroutine leak detection approach, including methods to
monitor and identify abnormal increases in goroutines to support
proactive system maintenance.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Added new configuration options for the goroutine leak detector,
enabling adjustment of parameters such as threshold, check interval,
analysis window, stats retention, and cooldown period through global
settings. This enhancement allows for dynamic tuning of leak detection
based on deployment needs without code modification.

Introduced `GoroutineLeakDetectionParams` struct to manage these
settings, along with validation to ensure parameters meet minimum
requirements. Updated the `watcher` component to use this configurable
setup, and integrated new global config keys to hold these values.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
…ine leak detection.

Extended tests to validate that the goroutine leak detection respects
updates to configuration values at runtime. Specifically, the tests
confirm that adjustments to the `keepStatsFor` parameter dynamically
alter the stats slice size, with changes logged as expected.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
Enable the `goroutinesMonitor` to be stopped via a cancellable context
within `GoroutineLeakDetectionParams`, allowing controlled termination
of the monitoring goroutine. This change introduces a `MakeStoppable`
method to set up a cancellable context and a `Stop` method to trigger
the cancellation, allowing tests to end monitoring cleanly.

Additionally, `checkStopCondition` was added to periodically check if
the goroutine should stop.

Updated tests to utilize this functionality, adding verification for
proper start and stop messages in the log output, ensuring that the
monitor operates correctly in both stoppable and unstoppable modes.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
@OhmSpectator OhmSpectator force-pushed the feature/goroutine-count-to-watcher branch from 9bf7913 to 9ee63bb Compare November 8, 2024 14:00
@OhmSpectator
Copy link
Member Author

Suppressed output in the tests, where unnecessary

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anything use the Stop() function?
Is the intent that setting some ConfigItems to a value of zero will cause a stop and setting that value back to non-zero will cause a Start()?

Copy link
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-run tests

@OhmSpectator
Copy link
Member Author

Does anything use the Stop() function? Is the intent that setting some ConfigItems to a value of zero will cause a stop and setting that value back to non-zero will cause a Start()?

It's used in the tests only, not to run the monitor several times in one package. See #4416 (comment)

Add a step to copy the goroutine leak detector stack dump (sigusr1) from
/persist/agentdebug/watcher/ to the collected output directory as
`goroutin-leak-detector-stacks-dump`, if present.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
@OhmSpectator
Copy link
Member Author

Added the stacks dump file into collect-info if exists.

@eriknordmark eriknordmark merged commit e2a67f5 into lf-edge:master Nov 9, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants