Should we rename the resolved action group? #83464

Closed
mikecote opened this issue Nov 16, 2020 · 23 comments · Fixed by #84123
Labels: discuss, Feature:Actions, Feature:Alerting, Team:ResponseOps

Comments

@mikecote (Contributor)

@arisonl, @gmmorris and @bmcconaghy shared a concern about the action group name for the new alert-on-resolve feature. Some proposed alternatives: cleared and recovered, with a stronger preference for cleared so far.

mikecote added the discuss, Feature:Alerting, Feature:Actions, and Team:ResponseOps labels on Nov 16, 2020
@elasticmachine (Contributor)

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr (Member) commented Nov 17, 2020

We'll likely want to change the event generated for these as well, currently resolved-instance (it has peers new-instance and active-instance). We only use this internally during the calculation of the alert instance summary, but we can accommodate a new name here by checking for both the old one and the new one.
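For anyone following along, a minimal sketch of the "check for both the old name and the new one" idea; the constant values and the event shape below are illustrative assumptions, not the actual event-log schema:

```ts
// Sketch only: treat the old and the renamed event action as equivalent
// while computing an alert instance summary across the rename.
const LEGACY_RECOVERED_ACTION = 'resolved-instance'; // existing name
const RECOVERED_ACTION = 'recovered-instance';       // hypothetical new name

interface EventLogEntry {
  event: { action: string };
  kibana?: { alerting?: { instance_id?: string } };
}

function isRecoveredEvent(entry: EventLogEntry): boolean {
  // Accept both names so summaries keep working before and after the rename.
  return (
    entry.event.action === RECOVERED_ACTION ||
    entry.event.action === LEGACY_RECOVERED_ACTION
  );
}

function recoveredInstanceIds(entries: EventLogEntry[]): Set<string> {
  const ids = new Set<string>();
  for (const entry of entries) {
    const id = entry.kibana?.alerting?.instance_id;
    if (id && isRecoveredEvent(entry)) {
      ids.add(id);
    }
  }
  return ids;
}
```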

@pmuellr (Member) commented Nov 17, 2020

Also, it was brought up in our last meeting that it would be nice to have "verb forms" of "new-instance" and "resolved-instance", as we at least informally often refer to active "rules" (née alerts) as "firing" or "triggering" or "scheduling action groups". Having a well-known term for this would be good, and the same for "resolved", but I suppose that's implicit in its name and in the suggestions of "cleared" and "recovered".

@ymao1 (Contributor) commented Nov 18, 2020

I looked at what terminology some other applications use and saw a few uses of recovered. To me, both cleared and resolved seem to imply some sort of external intervention while recovered does not. That might just be me though :)

@arisonl (Contributor) commented Nov 19, 2020

I agree with @ymao1: both resolved and cleared imply, to me, an intervention that takes the alert out of an active status, for example through a manual action or with a timer (resolve it if it's on for more than X minutes). Cleared less so, and I understand if others prefer cleared over recovered (does recovered sound ambiguous?). Bear in mind that the competition also has the notion of recovery thresholds for mitigating alert flapping: thresholds lower than the alert threshold that give you wiggle room before you consider the alert recovered, when the value fluctuates around the alert threshold.

@pmuellr (Member) commented Nov 19, 2020

Can we queue up a discussion on all these verb-y things?

  • rules which schedule actions that didn't before - triggered, activated, fired, etc
  • rules which don't schedule actions that did before - recovered, resolved, cleared
  • do we need a verb for other "state-y" changes - moving from one action group to another, "started generating errors", the eventual "no data" "state", etc
  • also need to distinguish between an alert being active and actually scheduling actions - an alert can be active but throttled, so it doesn't schedule actions; what I'm looking for is a verb indicating actions are actually being scheduled, something better than "scheduled actions" (how I refer to it today)

@gmmorris (Contributor)

> I looked at what terminology some other applications use and saw a few uses of recovered. To me, both cleared and resolved seem to imply some sort of external intervention while recovered does not. That might just be me though :)

Yup, this has definitely been my experience - that alerts recover, but much of that experience is rooted in one tool (as ubiquitous as it might be), Nagios, so I don't know how representative it is.

gmmorris self-assigned this on Nov 20, 2020
@gmmorris (Contributor) commented Nov 20, 2020

I've done some research:

| Term | Used by | Thoughts |
| --- | --- | --- |
| Resolved | Prometheus, PagerDuty | We feel this term implies intervention that took the alert out of an active status, but in our case it's that the rule no longer detects the conditions it was detecting before. That said, Prometheus use it to denote auto-resolution, which is the same as us. This feels different. DataDog actually use this for a button which manually resets a monitor, which further confirms the feeling that it implies intervention. |
| Recovered | DataDog, Nagios, New Relic | This term is used by Nagios and New Relic to imply that a data source enters a non-violating state after being in a violating state. Additionally, DataDog use "Recovery Threshold" to define a threshold at which the monitor is no longer in a violating state. This feels in line with our approach, where a detected condition, which caused an alert, can no longer be detected. |
| Cleared | | I can't actually find anyone who uses it. Anyone know? |
| Completed | openDistro | This just seems wrong to me. What has completed? The alert execution? The actions? The user has completed the investigation? Unclear and not very helpful. |
| OK | DataDog | This implies that the monitored source of the alert is actively reporting that it's OK. This is a different concept than our framework, where we think in detection rather than monitoring. |
| Auto-Close | OpsGenie | This happens in response to an Auto-Close Policy, which is a condition for detecting resolution. We don't detect resolution, but rather stop detecting an alerting condition. |
| Auto-Resolved | VictorOps | Like Resolved, but makes it clear it was resolved automatically. 🤔 |

I also looked at InfoSec platforms, such as the Hive, and they don't seem to have any kind of auto-resolution of alerts, as these need to be manually investigated. For such use cases this seems less relevant... which makes sense, as it aligns with Cases and how they use our framework.

@gmmorris (Contributor) commented Nov 20, 2020

> Can we queue up a discussion on all these verb-y things?

@pmuellr definitely worth doing, but it probably belongs in the scope of the Working Group's terminology discussion rather than in this issue, right? Fancy spinning off a new issue? Otherwise this discussion might get lost in this one. (We can also make this a meta issue and open two smaller issues for these two things, up to you.)

@peterschretlen (Contributor)

I know Datadog use the term "OK" for the state, but they also have a concept of "recovery" thresholds which is similar to what this action group does: https://docs.datadoghq.com/monitors/faq/what-are-recovery-thresholds/

@gmmorris (Contributor)

> I know Datadog use the term "OK" for the state, but they also have a concept of "recovery" thresholds which is similar to what this action group does: https://docs.datadoghq.com/monitors/faq/what-are-recovery-thresholds/

Good point, and that further strengthens the argument for Recover over Resolve, as they use Resolve for an explicit action by a user, but Recover for something that (even though defined in advance) is automatic from the perspective of the user investigating the alert.

@arisonl (Contributor) commented Nov 21, 2020

++ on this, that was exactly the point of bringing up recovery thresholds here #83464 (comment): recovery thresholds are a no-intervention mechanism, so I'd say they better represent how DD treats the concept under discussion.

@bmcconaghy (Contributor)

If you are not alerting on a negative condition, but are instead alerting on something you want to happen, then "recovered" doesn't make sense. Maybe something like "condition met"/"condition not met"?

@gmmorris (Contributor)

> If you are not alerting on a negative condition, but are instead alerting on something you want to happen, then "recovered" doesn't make sense. Maybe something like "condition met"/"condition not met"?

This is a fair point, but I feel condition not met might suffer from a similar issue for an alert where the user never defines a clear condition to be met.
Do you think having Recovered as a default is OK if we give Alert Types a way of providing a custom label that is more accurate for the domain of that alert type?

@bmcconaghy (Contributor)

I think the problem there is the use case determines whether recovered makes sense, not the alert type. So you may be using an index threshold alert both for negative alerting ("something went wrong") and for positive alerting ("something good happened").

@pmuellr (Member) commented Nov 23, 2020

I think we probably need a term to use when talking about alerts in the abstract, in terms of "recovered/resolved" - as well as a verb that makes them "active" (regardless of which action group they indicate they are in). There's a separate notion of what the alert should "show" to the user. I think "recovered" works fine when talking in the abstract.

Even if you narrow the scope to the index threshold alert, what you should "show" to the user seems hard, since you could use the alert for negative and positive checks, and so "recovered" seems fine for negative checks, but seems wrong for positive checks. I wonder if this sort of confusion is limited to "generic" alerts like index threshold. If we end up with alert types that have action groups like "warning", "error", "critical", it seems fine that "recovered" is what you want.

It's also not clear:

  • how many cases we'll have of customers doing "positive" checks
  • how bad is it, if we did use "recover" for those "positive" checks
  • is consistency of that term, across all alerts, important? Will customers be confused if it's "recovered" for some alerts, but different (eg, "threshold not met") for other alerts? Perhaps some other UI affordance (a special icon) could be used for the "recovered" action group, to help identify its meaning.

A fun experiment would be to build some kind of real-ish positive alert with index threshold and see what the UIs, action messages, etc. look like. I wonder whether, if the alert is named well enough to indicate its positive-ness, that would be enough to allow us to use "recovered" for these anyway.

@mikecote (Contributor, Author)

> OK (DataDog): This implies that the monitored source of the alert is actively reporting that it's OK. This is a different concept than our framework, where we think in detection rather than monitoring.

We do use OK as an execution status and display it to the user as the alert status (instead of recovered or whatever other term we plan on using). OK does relate to which "hook" their actions would fire on (Run When: OK), though it doesn't re-notify, throttle, etc.

@gmmorris (Contributor)

It occurred to me last night that, as we group all of the domain-specific action groups under an Active label, perhaps we should label the lack of an action group (which is what this is from both the user's and the domain's perspective) as Inactive?

Instead of Recovered, which does suggest something about the unique identifier's new state, we could simply call it Inactive, which would make sense alongside the Active alerts.

The obvious downside to this is that it doesn't really mean anything other than being the opposite of Active, which makes me wonder if even Active is actually the right term. I fear we're on the edge of being so generic that the term loses all meaning.

@gmmorris (Contributor)

> > OK (DataDog): This implies that the monitored source of the alert is actively reporting that it's OK. This is a different concept than our framework, where we think in detection rather than monitoring.
>
> We do use OK as an execution status and display it to the user as the alert status (instead of recovered or whatever other term we plan on using). OK does relate to which "hook" their actions would fire on (Run When: OK), though it doesn't re-notify, throttle, etc.

I'd prefer we keep the terms distinct between Rules and Alerts.
My fear is users might confuse their Rule's status for the Alert's status... ensuring we don't use the same term in both should help reduce the danger of that specific confusion.

@mikecote (Contributor, Author)

> I'd prefer we keep the terms distinct between Rules and Alerts.

I think we're on the same page but using different terminology. I'd prefer we keep communicating using the current terminology until the upcoming terminology is used outside the working group.

> My fear is users might confuse their Rule's status for the Alert's status... ensuring we don't use the same term in both should help reduce the danger of that specific confusion.

This is already happening because the alert and the alert instance both use the "OK" status. Maybe we should change the alert "OK" status to reflect the new "resolved" action group, in your case "inactive" instead of "OK". This way the user can see a relation between the status and the "Run When" field.

@gmmorris (Contributor) commented Nov 24, 2020

> I think we're on the same page but using different terminology. I'd prefer we keep communicating using the current terminology until the upcoming terminology is used outside the working group.

Haha, no, that was me using Rules to mean Alerts and Alert to mean AlertInstances 😆
Which ended up just confusing you... sorry

@pmuellr (Member) commented Nov 24, 2020

The "ok" and "active" terminology is, I believe, only used by us and not used in any way by customers (eg, in action parameter templates), though it does show up in the various status values we return from APIs. If we changed "ok" to "inactive", we'd break any customers using the API and depending on this value, but I'd guess there are few if any customers doing this today. It's certainly a breaking API change, but low risk, and if we have to do it, the sooner the better.

We could also come up with a new field name for the new values of the terms, and return the old terms in the old field for some deprecation time ...
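A minimal sketch of that deprecation approach, purely for illustration; the field names (`status`, `executionStatus`) and the status values below are assumptions, not the actual API shape:

```ts
// Sketch only: keep returning the old terms in the old field for a
// deprecation period, while a new field carries the renamed values.
type LegacyStatus = 'ok' | 'active' | 'error';
type Status = 'inactive' | 'active' | 'error';

interface StatusResponse {
  status: LegacyStatus;    // existing field, unchanged for now
  executionStatus: Status; // hypothetical new field with the new terms
}

function toStatusResponse(legacy: LegacyStatus): StatusResponse {
  const renamed: Record<LegacyStatus, Status> = {
    ok: 'inactive',
    active: 'active',
    error: 'error',
  };
  return { status: legacy, executionStatus: renamed[legacy] };
}
```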

@gmmorris (Contributor) commented Nov 24, 2020

It's time to make a decision, so given all the back and forth, the team had a quick synchronous 👍 / 👎 and we landed on the following:

We will change the default term from “Resolved” to “Recovered”, as it fits most use cases and we feel users are most likely to understand its meaning across domains.
That said, the concern about it being incorrect for certain use cases (such as Maps) seems solid enough to require addressing, so we’re also going to add the ability for an Alert Type to specify its own label, so that recovered alerts are labeled in a manner that makes sense in that Alert Type’s domain.
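As a rough illustration of what "an Alert Type specifies its own label" could look like (the type names, the `recoveryActionGroup` property, and the example alert type are sketched assumptions, not the final implementation from the linked PR):

```ts
// Sketch only: an alert type opting into a domain-specific recovery label.
interface ActionGroup {
  id: string;
  name: string; // label shown to the user, e.g. in the "Run When" dropdown
}

interface AlertTypeDefinition {
  id: string;
  name: string;
  actionGroups: ActionGroup[];
  // When omitted, the framework would fall back to a default
  // { id: 'recovered', name: 'Recovered' } group.
  recoveryActionGroup?: ActionGroup;
}

// A geo-containment style alert type (the kind of Maps use case mentioned
// above) where "Recovered" reads oddly, so it supplies its own label.
const geoContainmentAlertType: AlertTypeDefinition = {
  id: 'example.geo-containment',
  name: 'Tracking containment',
  actionGroups: [{ id: 'containment', name: 'Tracking containment met' }],
  recoveryActionGroup: { id: 'notContained', name: 'No longer contained' },
};
```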

kobelb added the needs-team label on Jan 31, 2022
botelastic bot removed the needs-team label on Jan 31, 2022