Should we rename the resolved action group? #83464
Comments
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
We'll likely want to change the event generated for these as well, currently |
Also, it was brought up in our last meeting that it would be nice to have "verb forms" of "new-instance" and "resolved-instance", as we at least informally often refer to active "rules" (née alerts) as "firing" or "triggering" or "scheduling action groups". Having a well-known term for this would be good, and the same for "resolved", but I suppose that's implicit in its name and in the suggestions of "cleared" and "recovered".
I looked at what terminology some other applications use and saw a few uses of |
I agree with @ymao1, both |
Can we queue up a discussion on all these verb-y things?
|
Yup, this has definitely been my experience - that alerts recover, but much of that experience is rooted in one tool (as ubiquitous as it might be), Nagios, so I don't know how representative it is. |
I've done some research:
I also looked at InfoSec platforms, such as the Hive, and they don't seem to have any kind of auto-resolution of alerts, as alerts need to be manually investigated. For such use cases, this seems less relevant... which makes sense, as this aligns with Case and how they use our framework.
@pmuellr definitely worth doing, but probably belongs more in the scope of Working Group's terminology discussion than in this issue, right? Fancy spinning off a new issue, otherwise this discussion might get lost in this one (We can also make this a meta issue and open two smaller issues for these two things, up to you). |
I know Datadog use the term "OK" for the state, but they also have a concept of "recovery" thresholds which is similar to what this action group does: https://docs.datadoghq.com/monitors/faq/what-are-recovery-thresholds/ |
Good point, and that further strengthens the argument for Recover over Resolve, as they use Resolve for an explicit action by a user, but Recover for something that (even though defined in advance) is automatic from the perspective of the user investigating the alert.
++ on this, that was exactly the point bringing up |
If you are not alerting on a negative condition, but are instead alerting for something you want to happen, then "recovered" doesn't make sense in that case. Maybe something like "condition met"/"condition not met"? |
This is a fair point, but I feel condition not met might suffer from a similar issue in an alert where a user never defines a clear condition to be met. |
I think the problem there is the use case determines whether recovered makes sense, not the alert type. So you may be using an index threshold alert both for negative alerting ("something went wrong") and for positive alerting ("something good happened"). |
I think we probably need a term to use when talking about alerts in the abstract, in terms of "recovered/resolved" - as well as a verb that makes them "active" (regardless of which action group they indicate they are in). There's a separate notion of what the alert should "show" to the user. I think "recovered" works fine when talking in the abstract. Even if you narrow the scope to the index threshold alert, what you should "show" to the user seems hard, since you could use the alert for negative and positive checks, and so "recovered" seems fine for negative checks, but seems wrong for positive checks. I wonder if this sort of confusion is limited to "generic" alerts like index threshold. If we end up with alert types that have action groups like "warning", "error", "critical", it seems fine that "recovered" is what you want. It's also not clear:
A fun experiment would be to build some kind of real-ish positive alert with index threshold and see what the UIs, action messages, etc. look like. I wonder whether, if the alert is named well enough to indicate its positive-ness, that would be enough to allow us to use "recovered" for these anyway.
We do use |
It occurred to me last night that, as we group all of the domain-specific action groups under an Active label, perhaps we should label the lack of an action group (which is what this is from both the user's and the domain's perspective) as Inactive? Instead of Recovered, which does suggest something about the unique identifier's new state, we could simply call it Inactive, which would make sense alongside the Active alerts. The obvious downside to this is that it doesn't really mean anything other than being the opposite of Active, which makes me wonder if even Active is actually the right term. I fear we're on the edge of being so generic that the term loses all meaning.
I'd prefer we keep the terms distinct between Rules and Alerts. |
I think we're on the same page but using different terminology. I'd prefer we keep communicating using the current terminology until the upcoming terminology is used outside the working group.
This is already happening because the alert and alert instance use "OK" status. Maybe we should change the alert "OK" status to reflect the new "resolved" action group. In your case "inactive" instead of "OK". This way the user can see a relation between the status and the "Run When" field. |
Haha, no, that was me using Rules to mean Alerts and Alert to mean AlertInstances 😆 |
The "ok" and "active" terminology is, I believe, only used by us and not used in any way by customers (e.g., in action parameter templates), though it does show up in the various status values we return from APIs. If we changed "ok" to "inactive", we'd break any customers using the API and depending on this value, but I'd guess there are few if any customers doing this today. It's certainly a breaking API change, but low risk, and if we have to do it, the sooner the better. We could also come up with a new field name for the new values of the terms, and return the old terms in the old field for some deprecation time ...
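To make the "new field name plus old field for a deprecation period" idea concrete, here is a rough sketch of what such a response could look like. Every name in it (`InstanceStatusResponse`, `statusV2`, `buildStatusResponse`) is hypothetical and made up for illustration, not taken from the actual Kibana API, and "inactive" is only one of the terms floated in this thread.

```typescript
// Hypothetical sketch: none of these names exist in Kibana's alerting API.

// Old terms that existing API consumers may already depend on.
type LegacyStatus = 'ok' | 'active';

// New terms, assuming "ok" were renamed to "inactive".
type Status = 'inactive' | 'active';

interface InstanceStatusResponse {
  /** @deprecated Still returns the old terms during the deprecation period. */
  status: LegacyStatus;
  /** Hypothetical new field carrying the renamed terms. */
  statusV2: Status;
}

// Map a new term back to the old term for the deprecated field.
function toLegacyStatus(status: Status): LegacyStatus {
  return status === 'inactive' ? 'ok' : 'active';
}

// Build a response that carries both the old and the new term.
function buildStatusResponse(status: Status): InstanceStatusResponse {
  return { status: toLegacyStatus(status), statusV2: status };
}
```

With something like this, anyone reading the old field keeps seeing "ok" until that field is removed, while new consumers can move to the new field right away.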
It's time to make a decision, so given all the back and forth, the team had a quick synchronous 👍 / 👎 and we landed on the following: We will change the default term from “Resolved” to “Recovered”, as it fits most use cases and we feel users are most likely to understand its meaning across domains. |
@arisonl, @gmmorris and @bmcconaghy shared a concern on the action group name for the new alert on resolve feature. Some proposed alternatives: cleared and recovered, with a stronger preference for cleared so far.