
Distributed alert checks to prevent high load spikes #2249

Merged

Conversation

@grzkv (Contributor) commented May 9, 2018

This is a solution for #2065

The idea behind this is simple. Every check run is slightly shifted so that the checks are distributed uniformly.

For the subset of checks that run with period T, a shift is added to every check. The shift ranges from 0 to T-1, and the shifts are assigned incrementally. For example, if we have 6 checks running every 5 minutes (T=5), the shifts will be 0, 1, 2, 3, 4, 0. Without the patch, all 6 checks would fire at times 0 and 5; with the patch, two checks fire at time 0, one at 1, one at 2, and so on. The total number of checks and the check period stay the same.
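To make the scheme concrete, here is a minimal, self-contained sketch of the shift assignment (illustrative code, not the actual patch; the function name and the slice-based bookkeeping are made up for this example):

```go
package main

import "fmt"

// assignShifts hands out shifts 0..T-1 round-robin to checks that share the
// same period T, so checks with equal periods no longer all fire at once.
func assignShifts(periods []int) []int {
	circularShifts := make(map[int]int) // key: period T, value: next shift to hand out
	shifts := make([]int, len(periods))
	for i, T := range periods {
		shifts[i] = circularShifts[T]
		circularShifts[T] = (circularShifts[T] + 1) % T
	}
	return shifts
}

func main() {
	// Six checks with a 5-minute period get shifts 0, 1, 2, 3, 4, 0.
	fmt.Println(assignShifts([]int{5, 5, 5, 5, 5, 5}))
}
```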

Here is a test that shows the effect of the patch on system load. Note that the majority of checks in this system have a 5-minute period.
(Screenshot: patch_test, showing system load before and after the patch.)

@kylebrandt (Member)

@captncraig any idea what this might do in terms of dependencies and unknown grouping? I think there may be instances where we assume things are running in the same time slice.

// Every alert gets a small shift in time.
// This way the alerts with the same period are not fired
// simultaneously, but are distributed.
circular_shifts := make(map[int]int)
Review comment (Contributor):

Perhaps just a note here about what the keys and values represent in this map. I believe it maps the "run every" value to the next offset. Probably just comment that.

@grzkv (Contributor, Author) replied May 9, 2018

The map is indeed run period → time shift to add.
I will add a comment.
Thanks for the feedback!

@captncraig (Contributor)

This seems like a really nice solution. I think we need a configuration value to enable (or maybe disable) it. Sometimes when developing, it is extremely useful for checks to run immediately when the app starts up.

But I'm not sure if that use case is significant enough to make your behaviour an option you need to enable, or if it should be the default. I'll let @kylebrandt make the call on that.

@captncraig (Contributor)

@kylebrandt has a great point I didn't consider about dependencies. Let me dig a bit on that.

@captncraig (Contributor)

Unknowns will work fine. That is just based on the last event timestamp. As long as it runs consistently every N times, it will be fine.

@captncraig (Contributor) commented May 9, 2018

And it looks like the unevaluated / dependency things are run straight out of the data store from the last known run. There's a bit of a race condition involved there, so I wouldn't expect this to "break" that. Staggering could either increase or decrease the response time for dependency status flowing to other alerts (depending on which side runs faster), but I don't think this change will break things.

@mvuets (Contributor) commented May 14, 2018

Would be great to have this functionality merged! Any way to help make it happen? (-:

@kylebrandt (Member)

I'm for merging this once there is a configuration option (via the toml file) to enable and disable (default state disabled).

@mvuets (Contributor) commented May 14, 2018

Ah, fair enough. What would you like it to be named?

@grzkv (Contributor, Author) commented May 14, 2018

@kylebrandt Clear. I completely agree.
We could go with something like AlertCheckDistribution. How about that?

@kylebrandt (Member)

Thinking about this a little more, I have a couple of thoughts / questions (stepping back from naming, the suggested name sounds fine):

  1. Everything running at start (more or less), as Craig pointed out, is pretty useful when editing rules. After saving the config, your change should impact the alert relatively quickly.
  2. I'm wondering if we should consider spreading the checks by moving them around with some jitter, as opposed to whole cycles?

@grzkv (Contributor, Author) commented May 14, 2018

@kylebrandt Re point 2. What exactly do you mean by jitter? Random spreading?

Or do you mean an even more flattened spread: not only over integer cycles given by GetCheckFrequency()?

@grzkv (Contributor, Author) commented May 14, 2018

@kylebrandt Re point 1. We can make it so that everything runs right away the first time, and the shifts are applied to all subsequent runs. This will increase the complexity of the code, of course. Would that be better?

@grzkv (Contributor, Author) commented May 14, 2018

I will add the config option. Thanks for the feedback.

@kylebrandt (Member) commented May 14, 2018

@grzkv I haven't fully thought it out yet. But for example, if we have a check frequency of 1 minute, checks are spread out throughout the minute. If the RunEvery is 5 minutes, the checks will be spread somewhere within a minute of that 5-minute cycle...

Here is how I think of it in the big picture:
The objective is to reduce concurrent sessions within the system so it doesn't overload Bosun and/or (more likely) the time series database(s) used by Bosun. We don't want so many queries at once that queries end up taking longer because of the amount of concurrency, and we also don't want to impact other users of the TSDB (e.g. Grafana or other dashboard systems).

So the ideal way to spread out the checks depends on the duration of your queries (Little's law; see https://serverfault.com/a/345446/2561 for a picture). If your checks run every minute but your queries respond in, say, an average of 5 seconds (when not issued all at once), then it would be better to spread them out through the minute, instead of hitting the system with fewer checks that still all arrive at the same moment and drive concurrency higher than ideal. So we want to spread the arrival rate out to reduce concurrency, and doing that over smaller amounts of time might be a better way, depending on the average duration of queries in a non-overloaded system.
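To put rough, purely illustrative numbers on this: by Little's law, average concurrency ≈ arrival rate × average query duration. If 600 one-minute checks all fire at the top of the minute, the burst approaches 600 concurrent queries; the same 600 checks spread evenly over the minute arrive at 10 per second and, with ~5-second queries, keep concurrency around 10 × 5 = 50.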

@kylebrandt (Member)

However, if this code is working well I don't want to block getting it in. We should just make the option a validated string, so if we want to try other things later the config won't change from a bool to a different type.

@grzkv (Contributor, Author) commented May 15, 2018

@kylebrandt You have inspired me to start reading about scheduling strategies 😄

I 100% agree with your reasoning. I thought something along the same lines. Let's proceed with the current solution for now.

I have some ideas on how we can improve further. I will share them after we merge this.

@earwin (Contributor) commented May 15, 2018

For practical purposes, running queries sequentially without any kind of time distribution is almost always enough to deal with concurrency concerns.

Even assuming the working set for the TSDB fits in memory, you're still not using up more than a single core; I haven't heard of any TSDB that bothers to parallelize individual queries.
Yes, it does not look pretty on the graphs, they are spiky, but why should anyone care?

@captncraig (Contributor) commented May 15, 2018 via email

@kylebrandt (Member)

IIRC, within time slices we still set the time of the query to align with the runinterval time regardless of when it actually runs. Have we checked that, with this method, when we run something offset by a whole runinterval, the query time does not point back to the previous runinterval?

I haven't looked at that code in a while...

@grzkv (Contributor, Author) commented May 15, 2018

@kylebrandt @captncraig Added the config option. Please, have a look.

@grzkv (Contributor, Author) commented May 18, 2018

@kylebrandt Could you please elaborate on what additional checks need to be made?

UnknownThreshold int
CheckFrequency Duration // Time between alert checks: 5m
DefaultRunEvery int // Default number of check intervals to run each alert: 1
AlertCheckDistribution bool // Should the alert rule checks be scattered across their running period?
Review comment (Member):

I think this should be a string so we could implement other types of alert check distribution. Then add something to loadSystemConfig in cmd/bosun/conf/system.go to make sure it is a valid option. Valid strings would be "" for the default, and whatever you name this one.
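A minimal sketch of what that validation might look like (assuming the option stays a plain string; the helper name is illustrative, and "simple" is the value later used to enable this PR's behaviour):

```go
package conf

import "fmt"

// validateAlertCheckDistribution is an illustrative stand-in for the check
// loadSystemConfig could run after decoding the TOML: "" keeps the default
// behaviour, "simple" enables the circular-shift distribution from this PR,
// and anything else is rejected.
func validateAlertCheckDistribution(v string) error {
	switch v {
	case "", "simple":
		return nil
	default:
		return fmt.Errorf("invalid AlertCheckDistribution value: %q", v)
	}
}
```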

@grzkv (Contributor, Author) commented May 30, 2018

Changed config option to string type.
@kylebrandt please check it out.

@captncraig (Contributor)

So now you enable this by setting AlertCheckDistribution = simple? Seems reasonable. I'm good with this pending your OK, @kylebrandt.

@kylebrandt (Member)

I think just an update to the system config docs and it should be ready to merge.

@grzkv (Contributor, Author) commented Jun 8, 2018

@kylebrandt Added the docs.

@mvuets (Contributor) commented Jun 15, 2018

Hi @kylebrandt! This PR is just a tiny bit away from getting merged. What can we do to help make it happen? (-:

@grzkv (Contributor, Author) commented Sep 7, 2018

@kylebrandt @captncraig I guess something is still missing since the PR is not merged. Please tell us what, and I will add it.

@kylebrandt merged commit 357ba20 into bosun-monitor:master on Sep 12, 2018