From d0bf0140541eafb54341775d1f4e267976804e83 Mon Sep 17 00:00:00 2001
From: Nikolay Martyanov
Date: Thu, 7 Nov 2024 14:58:49 +0100
Subject: [PATCH] pillar/docs: Add doc about goroutine leak detector.

Document the goroutine leak detection approach, including methods to monitor
and identify abnormal increases in goroutines to support proactive system
maintenance.

Signed-off-by: Nikolay Martyanov
---
 pkg/pillar/docs/watcher.md | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/pkg/pillar/docs/watcher.md b/pkg/pillar/docs/watcher.md
index 6f2e0f8941..2401b58c29 100644
--- a/pkg/pillar/docs/watcher.md
+++ b/pkg/pillar/docs/watcher.md
@@ -41,3 +41,41 @@ By adaptively triggering garbage collection based on actual memory pressure and
 allocation patterns, we ensure efficient memory usage and maintain system
 performance. This approach helps prevent potential memory-related issues by
 proactively managing resources.
+
+### Goroutine Leak Detector
+
+We have implemented a system to detect potential goroutine leaks by monitoring
+the number of active goroutines over time. This proactive approach helps us
+identify unusual increases that may indicate a leak.
+
+To achieve this, we collect data on the number of goroutines at regular
+intervals within the `goroutinesMonitor` function. However, raw data can be
+noisy due to normal fluctuations in goroutine usage. To mitigate this, we apply
+a moving average to the collected data using the `movingAverage` function. This
+smoothing process reduces short-term variations and highlights longer-term
+trends, making it easier to detect significant changes in the goroutine count.
+
+After smoothing the data, we calculate the rate of change by determining the
+difference between consecutive smoothed values. This rate of change reflects how
+quickly the number of goroutines is increasing or decreasing over time. To
+analyze this effectively, we compute the mean and standard deviation of the rate
+of change using the `calculateMeanStdDev` function. These statistical measures
+provide insights into the typical behavior and variability within our system.
+
+Using the standard deviation, we set a dynamic threshold that adapts to the
+system's normal operating conditions within the `detectGoroutineLeaks` function.
+If both the mean rate of change and the latest observed rate exceed this
+threshold, it indicates an abnormal increase in goroutine count, signaling a
+potential leak. This method reduces false positives by accounting for natural
+fluctuations and focusing on significant deviations from expected patterns.
+
+When a potential leak is detected, we respond by dumping the stack traces of all
+goroutines using the `handlePotentialGoroutineLeak` function. This action
+provides detailed information that can help diagnose the source of the leak, as
+it reveals where goroutines are being created and potentially not terminated
+properly.
+
+To prevent repeated handling of the same issue within a short time frame, we
+incorporate a cooldown period in the `goroutinesMonitor` function. This ensures
+that resources are not wasted on redundant operations and that the monitoring
+system remains efficient.
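
The detection steps described in the added section (smoothing, rate of change, mean and standard deviation, threshold check) can be illustrated with a minimal Go sketch. This is not the pillar implementation: the helper names `smooth`, `ratesOfChange`, `meanStdDev`, and `looksLikeLeak` are hypothetical stand-ins for the roles the text assigns to `movingAverage`, `calculateMeanStdDev`, and `detectGoroutineLeaks`, and the signatures, the window size, and the threshold multiplier `k` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// smooth returns the simple moving average of counts over a sliding window.
// Hypothetical helper; the real movingAverage signature may differ.
func smooth(counts []int, window int) []float64 {
	if window <= 0 || len(counts) < window {
		return nil
	}
	out := make([]float64, 0, len(counts)-window+1)
	sum := 0
	for i, c := range counts {
		sum += c
		if i >= window {
			sum -= counts[i-window] // drop the sample that left the window
		}
		if i >= window-1 {
			out = append(out, float64(sum)/float64(window))
		}
	}
	return out
}

// ratesOfChange returns the differences between consecutive smoothed values.
func ratesOfChange(smoothed []float64) []float64 {
	if len(smoothed) < 2 {
		return nil
	}
	rates := make([]float64, len(smoothed)-1)
	for i := 1; i < len(smoothed); i++ {
		rates[i-1] = smoothed[i] - smoothed[i-1]
	}
	return rates
}

// meanStdDev returns the mean and the population standard deviation of xs.
func meanStdDev(xs []float64) (mean, stdDev float64) {
	if len(xs) == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		stdDev += (x - mean) * (x - mean)
	}
	stdDev = math.Sqrt(stdDev / float64(len(xs)))
	return mean, stdDev
}

// looksLikeLeak flags a potential leak when both the mean rate of change and
// the latest rate exceed a threshold derived from the standard deviation.
// Assumption: the threshold is k * stdDev; the real formula may differ.
func looksLikeLeak(counts []int, window int, k float64) bool {
	rates := ratesOfChange(smooth(counts, window))
	if len(rates) == 0 {
		return false
	}
	mean, sd := meanStdDev(rates)
	threshold := k * sd
	latest := rates[len(rates)-1]
	return mean > threshold && latest > threshold
}

func main() {
	growing := []int{10, 15, 20, 25, 30, 35} // steady growth in goroutine count
	noisy := []int{10, 12, 10, 12, 10, 12}   // fluctuation around a stable level
	fmt.Println(looksLikeLeak(growing, 3, 1.0)) // prints "true"
	fmt.Println(looksLikeLeak(noisy, 2, 1.0))   // prints "false"
}
```

In the sketch, steady growth yields a positive mean rate with little variance, so both checks pass, while fluctuation around a stable level is flattened by the moving average and produces no alert. A real monitor would sample `runtime.NumGoroutine()` periodically to build `counts` and would apply the cooldown described in the section before handling another detection.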