
Distributed alert checks to prevent high load spikes #2249

Merged

Conversation

@grzkv (Contributor) commented May 9, 2018

This is a solution for #2065

The idea behind this is simple. Every check run is slightly shifted so that the checks are distributed uniformly.

For the subset of checks that run with period T, a shift is added to every check. The shift ranges from 0 to T-1, and the shifts are assigned incrementally. For example, if we have 6 checks running every 5 minutes (T=5), the shifts will be 0, 1, 2, 3, 4, 0. Without the patch, all 6 checks would fire at times 0 and 5; with the patch, two checks fire at time 0, one at 1, one at 2, and so on. The total number of checks and the check period stay the same.
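To make the scheme concrete, here is a minimal, self-contained sketch of the shift assignment (illustrative code, not the actual patch; the function name and the slice-based bookkeeping are made up for this example):

```go
package main

import "fmt"

// assignShifts hands out shifts 0..T-1 round-robin to checks that share the
// same period T, so checks with equal periods no longer all fire at once.
func assignShifts(periods []int) []int {
	circularShifts := make(map[int]int) // key: period T, value: next shift to hand out
	shifts := make([]int, len(periods))
	for i, T := range periods {
		shifts[i] = circularShifts[T]
		circularShifts[T] = (circularShifts[T] + 1) % T
	}
	return shifts
}

func main() {
	// Six checks with a 5-minute period get shifts 0, 1, 2, 3, 4, 0.
	fmt.Println(assignShifts([]int{5, 5, 5, 5, 5, 5}))
}
```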

Here is a test that shows the effect of the patch on system load. Note that the majority of checks in this system have a 5-minute period.
(Screenshot: patch_test, showing system load before and after the patch.)

@kylebrandt (Member)

@captncraig any idea what this might do in terms of dependencies and unknown grouping? I think there may be instances where we assume things are running in the same time slice.

// Every alert gets a small shift in time.
// This way the alerts with the same period are not fired
// simultaneously, but are distributed.
circular_shifts := make(map[int]int)
Review comment (Contributor):

Perhaps just a note here about what the keys and values represent in this map. I believe it maps the "run every" value to the next offset. Probably just comment that.

@grzkv (Contributor, Author) replied May 9, 2018

The map is indeed run period → time shift to add.
I will add a comment.
Thanks for the feedback!

@captncraig (Contributor)

This seems like a really nice solution. I think we need a configuration value to enable (or maybe disable) it. Sometimes when developing, it is extremely useful for checks to run immediately when the app starts up.

But I'm not sure if that use case is significant enough to make your behaviour an option you need to enable, or if it should be the default. I'll let @kylebrandt make the call on that.

@captncraig (Contributor)

@kylebrandt has a great point I didn't consider about dependencies. Let me dig a bit on that.

@captncraig (Contributor)

Unknowns will work fine. That is just based on the last event timestamp. As long as it runs consistently every N times, it will be fine.

@captncraig (Contributor) commented May 9, 2018

And it looks like the unevaluated / dependency things are run straight out of the data store from the last known run. There's a bit of a race condition involved there, so I wouldn't expect this to "break" that. Staggering could either increase or decrease the response time for dependency status flowing to other alerts (depending on which side runs faster), but I don't think this change will break things.

@mvuets (Contributor) commented May 14, 2018

Would be great to have this functionality merged! Any way to help make it happen? (-:

@kylebrandt (Member)

I'm for merging this once there is a configuration option (via the toml file) to enable and disable (default state disabled).

@mvuets (Contributor) commented May 14, 2018

Ah, fair enough. What would you like it to be named?

@grzkv (Contributor, Author) commented May 14, 2018

@kylebrandt Clear. I completely agree.
We could go with something like AlertCheckDistribution. How about that?

@kylebrandt (Member)

Thinking about this a little more, I have a couple of thoughts / questions (stepping back from naming, the suggested name sounds fine):

  1. Everything running at start (more or less), as Craig pointed out, is pretty useful when editing rules. After saving the config, your change should impact the alert relatively quickly.
  2. I'm wondering if we should consider spreading the checks by moving them around with some jitter, as opposed to whole cycles?

@grzkv (Contributor, Author) commented May 14, 2018

@kylebrandt Re point 2. What exactly do you mean by jitter? Random spreading?

Or do you mean an even more flattened spread: not only over integer cycles given by GetCheckFrequency()?

@grzkv (Contributor, Author) commented May 14, 2018

@kylebrandt Re point 1. We can make it so that everything runs right away the first time, and the shifts are applied to all subsequent runs. This will increase the complexity of the code, of course. Would that be better?

@grzkv (Contributor, Author) commented May 14, 2018

I will add the config option. Thanks for the feedback.

@kylebrandt (Member) commented May 14, 2018

@grzkv I haven't fully thought it out yet. But for example, if we have a check frequency of 1 minute, checks are spread out throughout the minute. If the RunEvery is 5 minutes, the checks will be spread somewhere within a minute of that 5-minute cycle...

Here is how I think of it in the big picture:
The objective is to reduce concurrent sessions within the system so it doesn't overload Bosun and/or (more likely) the time series database(s) used by Bosun. We don't want so many queries at once that queries end up taking longer because of the amount of concurrency, and we also don't want to impact other users of the TSDB (e.g. Grafana or other dashboard systems).

So the ideal way to spread out the checks depends on the duration of your queries (Little's law; see https://serverfault.com/a/345446/2561 for a picture). If your checks run every minute but your queries respond in, say, an average of 5 seconds (when not issued all at once), then it would be better to spread them out through the minute, instead of hitting the system with fewer checks that still all arrive at the same moment and drive concurrency higher than ideal. So we want to spread the arrival rate out to reduce concurrency, and doing that over smaller amounts of time might be a better way, depending on the average duration of queries in a non-overloaded system.
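To put rough, purely illustrative numbers on this: by Little's law, average concurrency ≈ arrival rate × average query duration. If 600 one-minute checks all fire at the top of the minute, the burst approaches 600 concurrent queries; the same 600 checks spread evenly over the minute arrive at 10 per second and, with ~5-second queries, keep concurrency around 10 × 5 = 50.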

@kylebrandt (Member)

However, if this code is working well I don't want to block getting it in. We should just make the option a validated string, so if we want to try other things later the config won't change from a bool to a different type.

@grzkv (Contributor, Author) commented May 15, 2018

@kylebrandt You have inspired me to start reading about scheduling strategies 😄

I 100% agree with your reasoning. I thought something along the same lines. Let's proceed with the current solution for now.

I have some ideas on how we can improve further. I will share them after we merge this.

@earwin (Contributor) commented May 15, 2018

For practical purposes, running queries sequentially without any kind of time distribution is almost always enough to deal with concurrency concerns.

Even assuming the working set for the TSDB fits in memory, you're still not using up more than a single core; I haven't heard of any TSDB that bothers to parallelize individual queries.
Yes, it does not look pretty on the graphs, they are spiky, but why should anyone care?

@captncraig (Contributor) commented May 15, 2018 via email

@kylebrandt (Member)

IIRC, within time slices we still set the time of the query to align with the runinterval time regardless of when it actually runs. Have we checked that, with this method, when we run something offset by a whole runinterval, the query time does not point back to the previous runinterval?

I haven't looked at that code in a while...

@grzkv (Contributor, Author) commented May 15, 2018

@kylebrandt @captncraig Added the config option. Please, have a look.

@grzkv (Contributor, Author) commented May 18, 2018

@kylebrandt Could you please elaborate on what additional checks need to be made?

UnknownThreshold int
CheckFrequency Duration // Time between alert checks: 5m
DefaultRunEvery int // Default number of check intervals to run each alert: 1
AlertCheckDistribution bool // Should the alert rule checks be scattered across their running period?
Review comment (Member):

I think this should be a string so we could implement other types of alert check distribution. Then add something to loadSystemConfig in cmd/bosun/conf/system.go to make sure it is a valid option. Valid strings would be "" for the default, and whatever you name this one.
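A minimal sketch of what that validation might look like (assuming the option stays a plain string; the helper name is illustrative, and "simple" is the value later used to enable this PR's behaviour):

```go
package conf

import "fmt"

// validateAlertCheckDistribution is an illustrative stand-in for the check
// loadSystemConfig could run after decoding the TOML: "" keeps the default
// behaviour, "simple" enables the circular-shift distribution from this PR,
// and anything else is rejected.
func validateAlertCheckDistribution(v string) error {
	switch v {
	case "", "simple":
		return nil
	default:
		return fmt.Errorf("invalid AlertCheckDistribution value: %q", v)
	}
}
```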

@grzkv (Contributor, Author) commented May 30, 2018

Changed config option to string type.
@kylebrandt please check it out.

@captncraig (Contributor)

So now you enable this by setting AlertCheckDistribution = simple? Seems reasonable. I'm good with this pending your OK, @kylebrandt.

@kylebrandt (Member)

I think just an update to the system config docs and it should be ready to merge.

@grzkv (Contributor, Author) commented Jun 8, 2018

@kylebrandt Added the docs.

@mvuets (Contributor) commented Jun 15, 2018

Hi @kylebrandt! This PR is just a tiny bit away from getting merged. What can we do to help make it happen? (-:

@grzkv (Contributor, Author) commented Sep 7, 2018

@kylebrandt @captncraig I guess something is still missing since the PR is not merged. Please tell us what, and I will add it.

@kylebrandt merged commit 357ba20 into bosun-monitor:master on Sep 12, 2018