Burnrate alerts aren't working correctly #47
Comments
Also, I have 3 alerts associated with an SLO: 10m-1h, 30m-6h, and 6h-24h. In Prometheus, the alerts aren't duplicated because they're grouped together (as you can see in the query in the article), but in Sumo I got 3 emails per SLO while the system was down.
Looking into this a bit more thoroughly, it looks like the monitor is evaluated over the long period: if combined_burn exceeds 1 at any time in that period, the monitor won't resolve. That means combined_burn would have to stay at 1 or lower for the whole long period before the alert clears. I think we might have to change the monitor to be evaluated over the short period instead, and move the combined_burn calculation into a scheduled search so that it can still be evaluated over a period of time.
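The resolution behaviour described above can be sketched in a few lines of Python. This is only an illustration, assuming one combined_burn sample per minute and a monitor that resolves once every sample in its trailing evaluation window is at or below 1. The function name `first_resolved_minute` and the sample data are hypothetical and do not reflect how Sumo Logic implements monitors internally.

```python
def first_resolved_minute(combined_burn, window, threshold=1.0):
    """Return the first minute, after a breach, at which every sample in the
    trailing `window` minutes is at or below `threshold` (None if never)."""
    breached = False
    for t in range(len(combined_burn)):
        start = max(0, t - window + 1)
        window_ok = all(v <= threshold for v in combined_burn[start:t + 1])
        if not window_ok:
            breached = True
        elif breached:
            return t
    return None


# Hypothetical data: combined_burn is 2.0 for minutes 10..19, otherwise 0.0.
series = [0.0] * 600
for minute in range(10, 20):
    series[minute] = 2.0

# Evaluated over the long period (6h = 360 minutes).
print(first_resolved_minute(series, window=360))  # 379
# Evaluated over the short period (30 minutes).
print(first_resolved_minute(series, window=30))   # 49
```

With a 10-minute breach, the 360-minute evaluation only resolves 360 minutes after the last bad sample, while the 30-minute evaluation resolves 30 minutes after it.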
It looks like a scheduled search wouldn't do it, but a scheduled view would. You can pre-populate the scheduled view with the current
Also, I've noticed that I am using the trigger for "Warning" and "ResolvedWarning" which is tripped when the
Also, looking into https://sre.google/workbook/alerting-on-slos/ more, it seems that they combine alerts based on the notification type. For example:
This query means that both SLO alerts are combined. If either one is triggered, it sends the same email, so there is only one notification that the alert has been triggered and no duplicated alerts.
I think it might be worthwhile updating the SLO configuration to the latest OpenSLO spec. It adds a few objects such as "AlertPolicies", which have 1 or more "Alert Conditions". This would allow the configuration to group all of the "long/short burn rate" conditions into 1 alert.
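As a rough sketch of that grouping idea: each condition requires both its short and long window to exceed the threshold, and the policy fires once if any condition does. The class and function names below are hypothetical (they are not OpenSLO or Sumo Logic APIs), and the thresholds and burn-rate values are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnCondition:
    short_window: str
    long_window: str
    threshold: float


def condition_fires(cond, burn_rates):
    # Both windows must exceed the threshold for a condition to fire.
    return (burn_rates[cond.short_window] > cond.threshold
            and burn_rates[cond.long_window] > cond.threshold)


def policy_fires(conditions, burn_rates):
    # One notification for the whole policy, however many conditions fire.
    return any(condition_fires(c, burn_rates) for c in conditions)


# The three window pairs from this thread; thresholds are assumed values.
conditions = [
    BurnCondition("10m", "1h", 14.4),
    BurnCondition("30m", "6h", 6.0),
    BurnCondition("6h", "24h", 1.0),
]

# Hypothetical burn rates measured during an outage.
burn_rates = {"10m": 20.0, "1h": 15.0, "30m": 8.0, "6h": 7.0, "24h": 0.5}

firing = [c for c in conditions if condition_fires(c, burn_rates)]
print(len(firing))                           # 2 conditions are firing
print(policy_fires(conditions, burn_rates))  # True, but only one alert
```

Evaluating the conditions as a group is what keeps the outage from producing one email per window pair.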
Ah damn, it looks like OpenSLO's oslo doesn't support the latest OpenSLO spec.
hey @lswith, i will discuss the monitor not resolving with the monitors team and get back to you on it by tomorrow. the update to oslo is currently blocked for two reasons: 1) they haven't updated oslo and 2) it doesn't support multi burn rate monitors yet.
the monitor team is working on adding a configurable resolution window for monitors. after that, setting the resolve window to the short-burn period will give us the correct behaviour required for these alerts. cc: @tarunk2
I have an SLO with a 30m short window and a 6h long window. I've set the same threshold on both.
When the SLO alert was triggered, it was quite quick (within 5m), but the alert took 6 hours to resolve after things went back to normal.
I would have expected it to be resolved quickly according to https://sre.google/workbook/alerting-on-slos/
Looking into this a bit deeper, I think that the threshold values on the monitor take 6 hours to evaluate, so it might not be possible to do "Multiwindow, Multi-Burn-Rate Alerts" using Sumo Logic's monitors.
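For reference, here is a minimal Python sketch of the behaviour I expected from a 30m/6h multiwindow alert. It assumes one error-ratio sample per minute, a 99.9% SLO (error budget 0.001), a burn-rate threshold of 6 on both windows, and a 10-minute total outage. All names and numbers are hypothetical and are not taken from my actual monitor configuration.

```python
def burn_rate(error_ratios, t, window, error_budget):
    """Mean error ratio over the trailing `window` minutes ending at minute t,
    divided by the SLO error budget."""
    start = max(0, t - window + 1)
    samples = error_ratios[start:t + 1]
    return (sum(samples) / len(samples)) / error_budget


def firing_minutes(error_ratios, short, long, threshold, error_budget):
    """Minutes at which BOTH the short and long window burn rates exceed
    the threshold (the multiwindow condition)."""
    return [
        t for t in range(len(error_ratios))
        if burn_rate(error_ratios, t, short, error_budget) > threshold
        and burn_rate(error_ratios, t, long, error_budget) > threshold
    ]


# Hypothetical outage: every request fails during minutes 360..369.
ratios = [0.0] * 800
for minute in range(360, 370):
    ratios[minute] = 1.0

firing = firing_minutes(ratios, short=30, long=360, threshold=6.0,
                        error_budget=0.001)
print(firing[0])   # 362: fires about two minutes into the outage
print(firing[-1])  # 398: last firing minute, 29 minutes after recovery
```

In this sketch the long window stays above the threshold for hours, but the short window drops to zero 30 minutes after recovery, so the combined condition clears then rather than after 6 hours.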