Distributed alert checks to prevent high load spikes #2249
Conversation
@captncraig any idea what this might do in terms of dependencies and unknown grouping? I think there may be instances where we assume things are running in the same time slice.
cmd/bosun/sched/alertRunner.go
Outdated
    // Every alert gets a small shift in time.
    // This way the alerts with the same period are not fired
    // simultaneously, but are distributed.
    circular_shifts := make(map[int]int)
Perhaps just a note here about what the keys and values represent in this map. I believe it is "run every value" to the next offset. Probably just comment that.
The map is indeed `run period → time shift to add`. I will add a comment. Thanks for the feedback!
This seems like a really nice solution. I think we need a configuration value to enable (or maybe disable) it. Sometimes when developing, it is extremely useful for checks to run immediately when the app starts up. But I'm not sure if that use case is significant enough to make your behaviour an option you need to enable, or if it should be the default. I'll let @kylebrandt make the call on that.
@kylebrandt has a great point I didn't consider about dependencies. Let me dig a bit on that.
Unknowns will work fine. That is just based on the last event timestamp. As long as it runs consistently every N times, it will be fine.
And it looks like the unevaluated / dependency things are run straight out of the data store from the last known run. There's a bit of a race condition involved there, so I wouldn't expect this to "break" that. Staggering could either increase or decrease the response time for dependency status flowing to other alerts (depending on which side runs faster), but I don't think this change will break things.
Would be great to have this functionality merged! Any way to help make it happen? (-:
I'm for merging this once there is a configuration option (via the toml file) to enable and disable (default state disabled).
Ah, fair enough. What would you like it to be named?
@kylebrandt Clear. I completely agree.
Thinking about this a little more, I have a couple of thoughts / questions (stepping back from naming, the suggested name sounds fine):
@kylebrandt Re point 2. What exactly do you mean by jitter? Random spreading? Or do you mean having an even more flattened spread: not only over the integer cycles given by the check frequency?
@kylebrandt Re point 1. We can make it so that the first time everything runs right away, and the shifts are applied to all subsequent runs. This will increase the complexity of the code, of course. Would that be better?
I will add the config option. Thanks for the feedback.
@grzkv I haven't fully thought it out yet. But for example, if we have a check frequency of 1 minute, checks are spread out throughout the minute. If the RunEvery is 5 minutes, the checks will be spread somewhere within a minute of that 5 minute cycle... Here is how I think of it in the big picture: the ideal way to spread out the checks depends generally on the duration of your queries (Little's law - see https://serverfault.com/a/345446/2561 for a picture). If your checks are every minute, but your queries respond in, say, an average of 5 seconds (when not all fired in a bunch at once), then it would be better to spread them out through the minute, instead of hitting the system with fewer checks that still all arrive at the same time, resulting in higher than ideal concurrency. So we want to spread the arrival rate out to reduce concurrency, and doing that over smaller amounts of time might be a better way to do it, depending on the average duration of queries in a non-overloaded system.
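To put rough numbers on that reasoning (the figures below are hypothetical, not measured from any real deployment), Little's law relates average concurrency to arrival rate and average query duration:

```latex
L = \lambda W, \qquad
\lambda = \frac{120\ \text{checks}}{60\,\mathrm{s}} = 2\ \text{checks/s}, \quad
W = 5\,\mathrm{s} \;\Rightarrow\; L = 10\ \text{concurrent queries on average}
```

Fired all at the top of the minute instead, the same 120 checks would put roughly 120 queries in flight at once, so spreading arrivals within the check frequency cuts peak concurrency by about an order of magnitude.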
However, if this code is working well I don't want to block getting it in. We should just make the option a validated string, so if we want to try other things later the config won't change from a bool to a different type.
@kylebrandt You have inspired me to start reading about scheduling strategies 😄 I 100% agree with your reasoning; I was thinking along the same lines. Let's proceed with the current solution for now. I have some ideas on how we can improve further, and I will share them after we merge this.
For practical purposes, running queries sequentially without any kind of time distribution is almost always enough to deal with concurrency concerns. Even assuming the working set for the TSDB fits in memory, you're still not using up more than a single core; I haven't heard of any TSDB bothering to parallelize individual queries. Yes, it does not look pretty on the graphs, they are spiky, but why should anyone care?
We also rely on fairly regular check times for cache usage on TSDB queries. We could potentially scatter the checks and still switch out the cache at fixed regular intervals, but I think this approach accomplishes the goal rather well.
IIRC, within time slices we still set the query time to align with the run interval regardless of when the check actually runs. Have we checked that, with this method, when we run something offset by a whole run interval, the query time does not point to the previous run interval? I haven't looked at that code in a while...
@kylebrandt @captncraig Added the config option. Please have a look.
@kylebrandt Could you please elaborate on what additional checks need to be made?
cmd/bosun/conf/system.go
Outdated
    UnknownThreshold       int
    CheckFrequency         Duration // Time between alert checks: 5m
    DefaultRunEvery        int      // Default number of check intervals to run each alert: 1
    AlertCheckDistribution bool     // Should the alert rule checks be scattered across their running period?
I think this should be a string so we could implement other types of alert check distribution. Then add something to loadSystemConfig in cmd/bosun/conf/system.go to make sure it is a valid option. Valid string would be "" for default, and whatever you name this one.
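Something along these lines is presumably what the suggestion above amounts to. This is only a sketch: it assumes the field keeps the name `AlertCheckDistribution` from the snippet above and that the non-default mode ends up being called `"simple"`; neither the helper name nor the value string is taken from the actual PR.

```go
package conf

import "fmt"

// validateAlertCheckDistribution rejects anything other than the known
// distribution modes. Hypothetical helper; the names and values here are
// assumptions, not the code merged in this PR.
func validateAlertCheckDistribution(v string) error {
	switch v {
	case "", "simple": // "" keeps the default, non-scattered behavior
		return nil
	default:
		return fmt.Errorf("invalid AlertCheckDistribution value: %q", v)
	}
}
```

A check like this would be called from loadSystemConfig, so an invalid value fails at startup rather than silently falling back to the default.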
Changed config option to string type.
So now you enable this by setting `AlertCheckDistribution` in the system config?
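For anyone reading along, a sketch of what the relevant part of the system TOML config might look like with this enabled. The value `"simple"` is an assumption for illustration; per the review comments above, the empty string (or omitting the key) keeps the default, non-scattered behavior, and the accepted values are whatever the system configuration docs list.

```toml
# Hypothetical example, not copied from the PR's docs.
CheckFrequency = "5m"
DefaultRunEvery = 1
# "" keeps the default behavior; a non-empty value enables check scattering.
AlertCheckDistribution = "simple"
```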
I think just an update to the system config docs and it should be ready to merge.
@kylebrandt Added the docs.
Hi @kylebrandt! This PR is just a tiny bit away from getting merged. What can we do to help make it happen? (-:
@kylebrandt @captncraig I guess something is still missing since the PR is not merged. Please tell us what, and I will add it.
This is a solution for #2065
The idea behind this is simple. Every check run is slightly shifted so that the checks are distributed uniformly.
For the subset of checks that run with period `T`, a shift is added to every check. The shift ranges from `0` to `T-1`, and the shifts are assigned incrementally. For example, if we have 6 checks that run every 5 minutes (`T=5`), the shifts will be `0`, `1`, `2`, `3`, `4`, `0`. Without the patch, all 6 checks would happen at times `0`, `5`, and so on; with the patch, two checks happen at time `0`, one at `1`, one at `2`, and so on. The total number of checks and the check period stay the same.

Here is the test that shows the effect of the patch on system load. Note that the majority of checks in this system have a 5 minute period.
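To make the example above concrete, here is a small self-contained sketch of the scattering rule as described (incremental shifts cycling from `0` to `T-1` per run period). It illustrates the idea only and is not the code from this PR; the function and variable names are invented for the example.

```go
package main

import "fmt"

// assignShifts gives each alert a shift based on its run period (in check
// intervals): shifts cycle 0, 1, ..., period-1, so alerts sharing a period
// are spread across the whole cycle instead of all firing at the same time.
func assignShifts(periods []int) []int {
	next := make(map[int]int) // run period -> next shift to hand out
	shifts := make([]int, len(periods))
	for i, p := range periods {
		shifts[i] = next[p]
		next[p] = (next[p] + 1) % p
	}
	return shifts
}

func main() {
	// Six alerts that all run every 5 check intervals (T = 5), as in the example above.
	fmt.Println(assignShifts([]int{5, 5, 5, 5, 5, 5})) // [0 1 2 3 4 0]
}
```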