Should we rename the resolved action group? #83464

Closed
mikecote opened this issue Nov 16, 2020 · 23 comments · Fixed by #84123
Labels: discuss, Feature:Actions, Feature:Alerting, Team:ResponseOps

Comments

@mikecote (Contributor)

@arisonl, @gmmorris and @bmcconaghy shared a concern about the action group name for the new alert-on-resolve feature. Some proposed alternatives: cleared and recovered, with a stronger preference for cleared so far.

mikecote added the discuss, Feature:Alerting, Feature:Actions, and Team:ResponseOps labels on Nov 16, 2020
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member) commented Nov 17, 2020

We'll likely want to change the event generated for these as well, currently resolved-instance (it has peers new-instance and active-instance). We only use this internally during the calculation of the alert instance summary, but we can accommodate a new name here by checking for both the old one and the new one.
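For anyone following along, a minimal sketch of the "check for both the old name and the new one" idea; the constant values and the event shape below are illustrative assumptions, not the actual event-log schema:

```ts
// Sketch only: treat the old and the renamed event action as equivalent
// while computing an alert instance summary across the rename.
const LEGACY_RECOVERED_ACTION = 'resolved-instance'; // existing name
const RECOVERED_ACTION = 'recovered-instance';       // hypothetical new name

interface EventLogEntry {
  event: { action: string };
  kibana?: { alerting?: { instance_id?: string } };
}

function isRecoveredEvent(entry: EventLogEntry): boolean {
  // Accept both names so summaries keep working before and after the rename.
  return (
    entry.event.action === RECOVERED_ACTION ||
    entry.event.action === LEGACY_RECOVERED_ACTION
  );
}

function recoveredInstanceIds(entries: EventLogEntry[]): Set<string> {
  const ids = new Set<string>();
  for (const entry of entries) {
    const id = entry.kibana?.alerting?.instance_id;
    if (id && isRecoveredEvent(entry)) {
      ids.add(id);
    }
  }
  return ids;
}
```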

@pmuellr (Member) commented Nov 17, 2020

Also, it was brought up in our last meeting that it would be nice to have "verb forms" of "new-instance" and "resolved-instance", as we at least informally often refer to active "rules" (née alerts) as "firing" or "triggering" or "scheduling action groups". Having a well-known term for this would be good, and the same for "resolved", but I suppose that's implicit in its name and in the suggestions of "cleared" and "recovered".

@ymao1 (Contributor) commented Nov 18, 2020

I looked at what terminology some other applications use and saw a few uses of recovered. To me, both cleared and resolved seem to imply some sort of external intervention while recovered does not. That might just be me though :)

@arisonl (Contributor) commented Nov 19, 2020

I agree with @ymao1: both resolved and cleared imply, to me, an intervention that takes the alert out of an active status, for example through a manual action or with a timer (resolve it if it's on for more than X minutes). Cleared less so, and I understand if others prefer cleared over recovered (does recovered sound ambiguous?). Bear in mind that the competition also has the notion of recovery thresholds for mitigating alert flapping: thresholds lower than the alert threshold that give you wiggle room before you consider the alert recovered, when the value fluctuates around the alert threshold.

@pmuellr (Member) commented Nov 19, 2020

Can we queue up a discussion on all these verb-y things?

  • rules which schedule actions that didn't before - triggered, activated, fired, etc
  • rules which don't schedule actions that did before - recovered, resolved, cleared
  • do we need a verb for other "state-y" changes - moving from one action group to another, "started generating errors", the eventual "no data" "state", etc
  • also need to distinguish between an alert being active and actually scheduling actions - an alert can be active but throttled, so it doesn't schedule actions; what I'm looking for is a verb indicating actions are actually being scheduled, something better than "scheduled actions" (how I refer to it today)

@gmmorris (Contributor)

> I looked at what terminology some other applications use and saw a few uses of recovered. To me, both cleared and resolved seem to imply some sort of external intervention while recovered does not. That might just be me though :)

Yup, this has definitely been my experience - that alerts recover, but much of that experience is rooted in one tool (as ubiquitous as it might be), Nagios, so I don't know how representative it is.

gmmorris self-assigned this on Nov 20, 2020
@gmmorris (Contributor) commented Nov 20, 2020

I've done some research:

| Term | Used by | Thoughts |
| --- | --- | --- |
| Resolved | Prometheus, PagerDuty | We feel this term implies intervention that took the alert out of an active status, but in our case it's that the rule no longer detects the conditions it was detecting before. That said, Prometheus use it to denote auto-resolution, which is the same as us. This feels different. DataDog actually use this for a button which manually resets a monitor, which further confirms the feeling that it implies intervention. |
| Recovered | DataDog, Nagios, New Relic | This term is used by Nagios and New Relic to imply that a data source enters a non-violating state after being in a violating state. Additionally, DataDog use "Recovery Threshold" to define a threshold at which the monitor is no longer in a violating state. This feels in line with our approach, where a detected condition, which caused an alert, can no longer be detected. |
| Cleared | | I can't actually find anyone who uses it. Anyone know? |
| Completed | openDistro | This just seems wrong to me. What has completed? The alert execution? The actions? The user has completed the investigation? Unclear and not very helpful. |
| OK | DataDog | This implies that the monitored source of the alert is actively reporting that it's OK. This is a different concept than our framework, where we think in detection rather than monitoring. |
| Auto-Close | OpsGenie | This happens in response to an Auto-Close Policy, which is a condition for detecting resolution. We don't detect resolution, but rather stop detecting an alerting condition. |
| Auto-Resolved | VictorOps | Like Resolved, but makes it clear it was resolved automatically. 🤔 |

I also looked at InfoSec platforms, such as the Hive, and they don't seem to have any kind of auto-resolution of alerts, as these need to be manually investigated. For such use cases this seems less relevant... which makes sense, as it aligns with Cases and how they use our framework.

@gmmorris (Contributor) commented Nov 20, 2020

> Can we queue up a discussion on all these verb-y things?

@pmuellr definitely worth doing, but it probably belongs in the scope of the Working Group's terminology discussion rather than in this issue, right? Fancy spinning off a new issue? Otherwise this discussion might get lost in this one. (We can also make this a meta issue and open two smaller issues for these two things, up to you.)

@peterschretlen (Contributor)

I know Datadog use the term "OK" for the state, but they also have a concept of "recovery" thresholds which is similar to what this action group does: https://docs.datadoghq.com/monitors/faq/what-are-recovery-thresholds/

@gmmorris (Contributor)

> I know Datadog use the term "OK" for the state, but they also have a concept of "recovery" thresholds which is similar to what this action group does: https://docs.datadoghq.com/monitors/faq/what-are-recovery-thresholds/

Good point, and that further strengthens the argument for Recover over Resolve, as they use Resolve for an explicit action by a user, but Recover for something that (even though defined in advance) is automatic from the perspective of the user investigating the alert.

@arisonl (Contributor) commented Nov 21, 2020

++ on this, that was exactly the point of bringing up recovery thresholds here #83464 (comment): recovery thresholds are a no-intervention mechanism, so I'd say they better represent how DD treats the concept under discussion.

@bmcconaghy (Contributor)

If you are not alerting on a negative condition, but are instead alerting on something you want to happen, then "recovered" doesn't make sense. Maybe something like "condition met"/"condition not met"?

@gmmorris (Contributor)

> If you are not alerting on a negative condition, but are instead alerting on something you want to happen, then "recovered" doesn't make sense. Maybe something like "condition met"/"condition not met"?

This is a fair point, but I feel condition not met might suffer from a similar issue for an alert where the user never defines a clear condition to be met.
Do you think having Recovered as a default is OK if we give Alert Types a way of providing a custom label that is more accurate for the domain of that alert type?

@bmcconaghy (Contributor)

I think the problem there is the use case determines whether recovered makes sense, not the alert type. So you may be using an index threshold alert both for negative alerting ("something went wrong") and for positive alerting ("something good happened").

@pmuellr (Member) commented Nov 23, 2020

I think we probably need a term to use when talking about alerts in the abstract, in terms of "recovered/resolved" - as well as a verb that makes them "active" (regardless of which action group they indicate they are in). There's a separate notion of what the alert should "show" to the user. I think "recovered" works fine when talking in the abstract.

Even if you narrow the scope to the index threshold alert, what you should "show" to the user seems hard, since you could use the alert for negative and positive checks, and so "recovered" seems fine for negative checks, but seems wrong for positive checks. I wonder if this sort of confusion is limited to "generic" alerts like index threshold. If we end up with alert types that have action groups like "warning", "error", "critical", it seems fine that "recovered" is what you want.

It's also not clear:

  • how many cases we'll have of customers doing "positive" checks
  • how bad is it, if we did use "recover" for those "positive" checks
  • is consistency of that term, across all alerts, important? Will customers be confused if it's "recovered" for some alerts, but different (eg, "threshold not met") for other alerts? Perhaps some other UI affordance (a special icon) could be used for the "recovered" action group, to help identify its meaning.

A fun experiment would be to build some kind of real-ish positive alert with index threshold and see what the UIs, action messages, etc. look like. I wonder whether, if the alert is named well enough to indicate its positive-ness, that would be enough to allow us to use "recovered" for these anyway.

@mikecote (Contributor, Author)

> OK (DataDog): This implies that the monitored source of the alert is actively reporting that it's OK. This is a different concept than our framework, where we think in detection rather than monitoring.

We do use OK as an execution status and display it to the user as the alert status (instead of recovered or whatever other term we plan on using). OK does relate to which "hook" their actions would fire on (Run When: OK), though it doesn't re-notify, throttle, etc.

@gmmorris (Contributor)

It occurred to me last night that, as we group all of the domain-specific action groups under an Active label, perhaps we should label the lack of an action group (which is what this is from both the user's and the domain's perspective) as Inactive?

Instead of Recovered, which does suggest something about the unique identifier's new state, we could simply call it Inactive, which would make sense alongside the Active alerts.

The obvious downside to this is that it doesn't really mean anything other than being the opposite of Active, which makes me wonder if even Active is actually the right term. I fear we're on the edge of being so generic that the term loses all meaning.

@gmmorris (Contributor)

> > OK (DataDog): This implies that the monitored source of the alert is actively reporting that it's OK. This is a different concept than our framework, where we think in detection rather than monitoring.
>
> We do use OK as an execution status and display it to the user as the alert status (instead of recovered or whatever other term we plan on using). OK does relate to which "hook" their actions would fire on (Run When: OK), though it doesn't re-notify, throttle, etc.

I'd prefer we keep the terms distinct between Rules and Alerts.
My fear is users might confuse their Rule's status for the Alert's status... ensuring we don't use the same term in both should help reduce the danger of that specific confusion.

@mikecote (Contributor, Author)

> I'd prefer we keep the terms distinct between Rules and Alerts.

I think we're on the same page but using different terminology. I'd prefer we keep communicating using the current terminology until the upcoming terminology is used outside the working group.

> My fear is users might confuse their Rule's status for the Alert's status... ensuring we don't use the same term in both should help reduce the danger of that specific confusion.

This is already happening because the alert and the alert instance both use the "OK" status. Maybe we should change the alert "OK" status to reflect the new "resolved" action group, in your case "inactive" instead of "OK". This way the user can see a relation between the status and the "Run When" field.

@gmmorris (Contributor) commented Nov 24, 2020

> I think we're on the same page but using different terminology. I'd prefer we keep communicating using the current terminology until the upcoming terminology is used outside the working group.

Haha, no, that was me using Rules to mean Alerts and Alert to mean AlertInstances 😆
Which ended up just confusing you... sorry

@pmuellr (Member) commented Nov 24, 2020

The "ok" and "active" terminology is, I believe, only used by us and not used in any way by customers (eg, in action parameter templates), though it does show up in the various status values we return from APIs. If we changed "ok" to "inactive", we'd break any customers using the API and depending on this value, but I'd guess there are few if any customers doing this today. It's certainly a breaking API change, but low risk, and if we have to do it, the sooner the better.

We could also come up with a new field name for the new values of the terms, and return the old terms in the old field for some deprecation time ...
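A minimal sketch of that deprecation approach, purely for illustration; the field names (`status`, `executionStatus`) and the status values below are assumptions, not the actual API shape:

```ts
// Sketch only: keep returning the old terms in the old field for a
// deprecation period, while a new field carries the renamed values.
type LegacyStatus = 'ok' | 'active' | 'error';
type Status = 'inactive' | 'active' | 'error';

interface StatusResponse {
  status: LegacyStatus;    // existing field, unchanged for now
  executionStatus: Status; // hypothetical new field with the new terms
}

function toStatusResponse(legacy: LegacyStatus): StatusResponse {
  const renamed: Record<LegacyStatus, Status> = {
    ok: 'inactive',
    active: 'active',
    error: 'error',
  };
  return { status: legacy, executionStatus: renamed[legacy] };
}
```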

@gmmorris (Contributor) commented Nov 24, 2020

It's time to make a decision, so given all the back and forth, the team had a quick synchronous 👍 / 👎 and we landed on the following:

We will change the default term from “Resolved” to “Recovered”, as it fits most use cases and we feel users are most likely to understand its meaning across domains.
That said, the concern about it being incorrect for certain use cases (such as Maps) seems solid enough to require addressing, so we’re also going to add the ability for an Alert Type to specify its own label, so that recovered alerts are labeled in a manner that makes sense in that Alert Type’s domain.
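As a rough illustration of what "an Alert Type specifies its own label" could look like (the type names, the `recoveryActionGroup` property, and the example alert type are sketched assumptions, not the final implementation from the linked PR):

```ts
// Sketch only: an alert type opting into a domain-specific recovery label.
interface ActionGroup {
  id: string;
  name: string; // label shown to the user, e.g. in the "Run When" dropdown
}

interface AlertTypeDefinition {
  id: string;
  name: string;
  actionGroups: ActionGroup[];
  // When omitted, the framework would fall back to a default
  // { id: 'recovered', name: 'Recovered' } group.
  recoveryActionGroup?: ActionGroup;
}

// A geo-containment style alert type (the kind of Maps use case mentioned
// above) where "Recovered" reads oddly, so it supplies its own label.
const geoContainmentAlertType: AlertTypeDefinition = {
  id: 'example.geo-containment',
  name: 'Tracking containment',
  actionGroups: [{ id: 'containment', name: 'Tracking containment met' }],
  recoveryActionGroup: { id: 'notContained', name: 'No longer contained' },
};
```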

kobelb added the needs-team label on Jan 31, 2022
botelastic bot removed the needs-team label on Jan 31, 2022