Problem. I've got over 500 active alert checks pulling thousands of Graphite metrics. Most of the checks run every 5 min, all at the same moment, and I suspect that is what causes the regular CPU load spikes. Arguably the load should be spread more evenly, leaving more headroom for the few important checks that run every minute.
Proposal. I have an idea, but since I am very new to Bosun land I want to validate it with you first. The scheduler could add a random jitter before the very first run of each alert's check. That would shift each check forward in time by a random number of seconds and so spread the load.
E.g., given three alerts and a 5 min check period, the per-minute CPU load graph looks like this:
now: ▅▁▁▁▁▅▁▁▁▁▅▁▁▁▁
after: ▂▂▂▁▁▂▂▂▁▁▂▂▂▁▁
☺
Implementation-wise, it could be a one-time random delay of at most DefaultRunEvery * CheckFrequency seconds before kicking off each check for the first time. The jitter could of course be a system and/or rule configuration option, in case someone needs a deterministic schedule.
What do you think?