pillar/docs: Add doc about goroutine leak detector.
Document the goroutine leak detection approach, including methods to
monitor and identify abnormal increases in goroutines to support
proactive system maintenance.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
OhmSpectator committed Nov 7, 2024
1 parent a616d52 commit a62ab72
Showing 1 changed file with 44 additions and 0 deletions.
44 changes: 44 additions & 0 deletions pkg/pillar/docs/watcher.md
@@ -41,3 +41,47 @@
By adaptively triggering garbage collection based on actual memory pressure and
allocation patterns, we ensure efficient memory usage and maintain system
performance. This approach helps prevent potential memory-related issues by
proactively managing resources.

## Goroutine Leak Detector

We have implemented a system to detect potential goroutine leaks by monitoring
the number of active goroutines over time. This proactive approach helps us
identify unusual increases that may indicate a leak.

To achieve this, we collect data on the number of goroutines at regular
intervals within the `goroutinesMonitor` function. However, raw data can be
noisy due to normal fluctuations in goroutine usage. To mitigate this, we apply
a moving average to the collected data using the `movingAverage` function. This
smoothing process reduces short-term variations and highlights longer-term
trends, making it easier to detect significant changes in the goroutine count.
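
The smoothing step can be illustrated with a minimal sketch of a simple moving
average. The document names `movingAverage` but does not show its signature, so
the parameter names, types, and window handling below are assumptions made for
illustration only.

```go
package main

// movingAverage returns the simple moving average of data over a sliding
// window of windowSize samples. The signature is hypothetical; the real
// function in the watcher may differ.
func movingAverage(data []int, windowSize int) []float64 {
	// Not enough samples (or an invalid window): nothing to smooth.
	if windowSize <= 0 || len(data) < windowSize {
		return nil
	}
	smoothed := make([]float64, 0, len(data)-windowSize+1)
	sum := 0
	for i, v := range data {
		sum += v
		// Drop the sample that just left the window.
		if i >= windowSize {
			sum -= data[i-windowSize]
		}
		// Emit an average once the first full window is available.
		if i >= windowSize-1 {
			smoothed = append(smoothed, float64(sum)/float64(windowSize))
		}
	}
	return smoothed
}
```

With a window of 3, the samples `1, 2, 3, 4, 5` smooth to `2, 3, 4`: each
output is the average of three consecutive inputs, so a single spike moves the
result by only a third of its size.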

After smoothing the data, we calculate the rate of change by determining the
difference between consecutive smoothed values. This rate of change reflects how
quickly the number of goroutines is increasing or decreasing over time. To
analyze this effectively, we compute the mean and standard deviation of the rate
of change using the `calculateMeanStdDev` function. These statistical measures
provide insights into the typical behavior and variability within our system.
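
The rate-of-change and statistics steps might look roughly like the sketch
below. `calculateMeanStdDev` is named in the text, but its signature is not
shown, so the one here is assumed. `rateOfChange` is a hypothetical helper, and
the use of the population standard deviation (dividing by N rather than N-1) is
also an assumption.

```go
package main

import "math"

// rateOfChange returns the differences between consecutive smoothed values.
// Hypothetical helper, not taken from the watcher source.
func rateOfChange(smoothed []float64) []float64 {
	if len(smoothed) < 2 {
		return nil
	}
	rates := make([]float64, len(smoothed)-1)
	for i := 1; i < len(smoothed); i++ {
		rates[i-1] = smoothed[i] - smoothed[i-1]
	}
	return rates
}

// calculateMeanStdDev returns the mean and the population standard deviation
// of values. The signature is assumed for illustration.
func calculateMeanStdDev(values []float64) (mean, stdDev float64) {
	if len(values) == 0 {
		return 0, 0
	}
	for _, v := range values {
		mean += v
	}
	mean /= float64(len(values))

	var variance float64
	for _, v := range values {
		d := v - mean
		variance += d * d
	}
	variance /= float64(len(values))
	return mean, math.Sqrt(variance)
}
```

A positive mean rate says the goroutine count is growing on average, while the
standard deviation describes how much that rate normally varies from one sample
to the next.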

Within the `detectGoroutineLeaks` function, we use the standard deviation to
set a dynamic threshold that adapts to the system's normal operating
conditions. If both the mean rate of change and the latest observed rate exceed
this threshold, it indicates an abnormal increase in goroutine count, signaling
a potential leak. This method reduces false positives by accounting for natural
fluctuations and focusing on significant deviations from expected patterns.
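
The text does not give the exact threshold formula used by
`detectGoroutineLeaks`, so the following is only a sketch of the decision rule
under one assumption: the threshold is the standard deviation scaled by a
factor `k`. Both `isPotentialLeak` and `k` are hypothetical names.

```go
package main

import "math"

// isPotentialLeak reports whether both the mean rate of change and the latest
// rate exceed a threshold derived from the standard deviation of the rates.
// Assumption: threshold = k * stdDev. The real formula may differ.
func isPotentialLeak(rates []float64, k float64) bool {
	if len(rates) == 0 {
		return false
	}
	n := float64(len(rates))

	var mean float64
	for _, r := range rates {
		mean += r
	}
	mean /= n

	var variance float64
	for _, r := range rates {
		d := r - mean
		variance += d * d
	}
	stdDev := math.Sqrt(variance / n)

	threshold := k * stdDev
	latest := rates[len(rates)-1]
	// Both conditions must hold: sustained growth and growth right now.
	return mean > threshold && latest > threshold
}
```

Requiring both conditions means that rates which oscillate around zero are not
flagged even when a single sample is large, whereas steady growth with little
variation is.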

When a potential leak is detected, we respond by dumping the stack traces of all
goroutines using the `handlePotentialGoroutineLeak` function. This action
provides detailed information that can help diagnose the source of the leak, as
it reveals where goroutines are being created and potentially not terminated
properly.
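
For reference, one standard way to capture the stack traces of all goroutines
in Go is the `goroutine` profile from `runtime/pprof`. The sketch below shows
that technique only. Whether `handlePotentialGoroutineLeak` uses it is not
stated in this document, and `goroutineStacks` is a hypothetical name.

```go
package main

import (
	"bytes"
	"runtime/pprof"
)

// goroutineStacks returns the stack traces of all current goroutines as text.
// Hypothetical helper; debug level 2 prints the stacks in the same format as
// an unrecovered panic.
func goroutineStacks() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "", err
	}
	return buf.String(), nil
}
```

In a dump like this, a leak typically shows up as many goroutines sharing the
same stack, often blocked on the same channel operation or lock.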

The goroutine stacks are collected and written to a file in
`/persist/agentdebug/watcher/sigusr1` for further analysis. A warning message
is also logged to alert the user to the potential goroutine leak. To find the
relevant log messages, grep for `Potential goroutine leak` or
`Number of goroutines exceeds threshold`.

To prevent repeated handling of the same issue within a short time frame, we
incorporate a cooldown period in the `goroutinesMonitor` function. This ensures
that resources are not wasted on redundant operations and that the monitoring
system remains efficient.
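
A cooldown of this kind can be sketched as a small gate that remembers when the
last report was handled. The type, field, and method names below are
hypothetical, and the cooldown length used by `goroutinesMonitor` is not given
in this document.

```go
package main

import "time"

// cooldown suppresses repeated handling within a fixed period.
// Hypothetical type for illustration.
type cooldown struct {
	period time.Duration
	last   time.Time // time of the last handled event; zero if none yet
}

// allow reports whether an event at time now should be handled, and records
// it if so. Events inside the cooldown period are rejected.
func (c *cooldown) allow(now time.Time) bool {
	if !c.last.IsZero() && now.Sub(c.last) < c.period {
		return false
	}
	c.last = now
	return true
}
```

The monitor loop would then call `allow(time.Now())` before dumping stacks, so
a leak that persists across many sampling intervals produces one dump per
cooldown period rather than one per interval.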
