Alerting rules can end up in a state where they stop running indefinitely until a user intervenes to fix the problem #119650
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
There is still an open question with regard to rule actions.
@arisonl and I are trying to get an answer to this question, but it should become part of this research issue once we have one. |
I believe that in practice, if an action is set up, receiving the actions in the integrated system is an integral part of the rule from a use case perspective. We should assume that if it goes out, it is very important and there is a workflow associated with it, and hence having the rule running but the action failing should be anticipated to be just as bad as the rule not running at all, at least for a number of use cases. |
Added a link to a related issue (it was already described in the issue, but wasn't linked to the source issue). |
After some research, we've concluded that there are two types of problems that alerting rules can encounter:
As part of the research, we've identified the scenarios that can lead to the specified problems. More details are available in the research document. As a result of this research, the following issues have been created:
Closing this research issue in favor of the linked issues. |
Alerting rules should operate continually. When a user enables the alerting rule, it shouldn't stop running or fail indefinitely until the rule is disabled.
We should audit our codebase and identify scenarios where rules stop running indefinitely. Then, based on the findings, we should propose fixes or mitigations for the scenarios that can be addressed.
As a starting point, I am aware of the following scenarios where rules stop running indefinitely. Since I only did brief research, we should still analyze our code in depth and prioritize the findings afterward, so we can start discussing solutions and priorities for each.