
[Metrics] POC for Alerting on Metric Threshold #47165

Closed
wants to merge 6 commits

Conversation


@Zacqary Zacqary commented Oct 2, 2019

Closes #46511

(DON'T MERGE THIS TO MASTER)

This is a proof of concept for a metric alerting system. Through three endpoints, you can execute CRUD operations (well, CRD in this proof of concept) to manage a set of metric threshold alerts for any part of your infrastructure.

Design Intent

There is no UI yet, but I designed this system with the expectation that a user would:

  • Enter a search query in the Metrics Explorer
  • Select one of the resulting charts and choose "Alert on this" from the options menu

Out of the three basic types of search queries you might execute:

  1. Searching for a metric on a specific host/agent/instance/other thing
  2. Using a groupBy query to search for metrics on every host/agent/instance/etc
  3. Searching for the aggregate of a metric across the entire infrastructure

This proof of concept allows you to create alerts for types 1 and 2. It should be easy to add support for type 3 by borrowing code from the Snapshot query, but in the interest of timeboxing this POC I stopped short of doing that.

The endpoints in this POC allow you to input all the parameters of a metric query, plus:

  • A threshold value
  • Whether to alert when the metric is greater than or less than (or >=/<=) this threshold
  • A Slack channel webhook to send alerts to

(You can also have it send a server log to Kibana on alert, but this is primarily for testing. There's a parameter for email alerts as well but I didn't get those working yet.)

Testing

To test this, make sure to explicitly enable both required plugins in your Kibana config file:

xpack.actions.enabled: true
xpack.alerting.enabled: true

You will also need to run Elasticsearch and Kibana with SSL enabled in order to use the Alerting APIs:

$ yarn es snapshot --ssl
$ yarn start --ssl

The shared Observability clusters don't seem to work with SSL enabled in Kibana. Try a Cloud account if you want to test this with more complexity than you can reproduce locally.

How It Works

  • Use the Create Alert API to define a metric, aggregator, threshold, and what part of the infrastructure to query. Also provide an interval.
  • Every time the interval passes, the alert will measure that metric over a time bucket equal to the interval length. If the value has crossed the threshold, the alert will go into an Alert state.
  • When the value recovers and crosses back over the threshold, the alert will return to an OK state.
  • The alert will only send a notification when it changes state. If you have an alert set to run at 1-minute intervals, you won't keep getting a notification every single minute while it remains in the Alert state.

groupBy queries are implemented by creating one alert for each possible group with a single API call. There's probably a more efficient way to do this using an Elasticsearch query and one single alert.
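
As a rough sketch of what that single API call could look like for a groupBy-style alert, here is a hypothetical request using the parameters documented in the API Reference below. The PR describes them as query parameters, but since searchField and actions are objects I'm showing them as a JSON body; the URL, credentials, and field values are placeholders, not taken from this PR:

$ curl -k -X POST "https://localhost:5601/api/infra/alerts/metric_threshold" \
    -H "kbn-xsrf: true" -H "Content-Type: application/json" \
    -u elastic:changeme \
    -d '{
      "metric": "system.load.1",
      "aggregator": "avg",
      "comparator": ">",
      "threshold": 4,
      "interval": "1m",
      "searchField": { "name": "host.name", "value": "*" },
      "indexPattern": "metricbeat-*",
      "actions": { "log": true }
    }'
# With "value": "*", this should fan out into one child alert per distinct host.name,
# all tracked under a single parent alert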

API Reference

POST /api/infra/alerts/metric_threshold - Create Alert

This API creates a new metric threshold alert. The /metric_threshold URL is a convention I'm using under the assumption that we might add more types of metric alerts besides simple threshold, such as rate of change alerts, anomaly or outlier detection, forecasting, etc.

Query parameters

metric::
(Required, string) The metric to measure and alert on, e.g. system.load.1

aggregator::
(Required, string) Valid options are avg, max, min, cardinality, rate, and count

comparator::
(Required, string) Valid options are >, <, >=, or <=

threshold::
(Required, number) An alert will fire when the metric is >, <, >=, or <= this value (as defined by the comparator)

interval::
(Required, string) Must be a valid calendar interval. This is how often to run the alert, and also the length of the time bucket that it will evaluate data over

searchField::
(Required, object) Takes a name and value param. This defines the field to retrieve metric data from.

name::
(Required, string) e.g. host.hostname, agent.id, etc.

value::
(Required, string) If this is a specific value, a single alert for that value (e.g. host.name: myHost) will be created. If this is *, a multi-alert will be created for every possible value of searchField.name. Essentially it will track every chart that you'd get back in a groupBy query on the Metrics Explorer.

indexPattern::
(Required, string) The index pattern to query for metric data, e.g. metricbeat-*

actions::
(Required, object) This can contain one or more of:

slack::
(Optional, string) A webhook URL for a Slack channel. When this alert fires, it will send notifications to this channel.

log::
(Optional, boolean) If true, this will log out a message to the Kibana server when the alert fires.

email::
(Not yet implemented)
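
To tie the parameters together, here is a hypothetical single-host request with a Slack action (same caveats as the earlier sketch: the hostname, webhook URL, and credentials are placeholders, and the JSON-body shape is an assumption on my part):

$ curl -k -X POST "https://localhost:5601/api/infra/alerts/metric_threshold" \
    -H "kbn-xsrf: true" -H "Content-Type: application/json" \
    -u elastic:changeme \
    -d '{
      "metric": "system.cpu.total.pct",
      "aggregator": "avg",
      "comparator": ">=",
      "threshold": 0.9,
      "interval": "5m",
      "searchField": { "name": "host.name", "value": "myHost" },
      "indexPattern": "metricbeat-*",
      "actions": {
        "slack": "https://hooks.slack.com/services/REPLACE/ME",
        "log": true
      }
    }'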

GET /api/infra/alerts/list - List Alerts

This API will return a JSON array of all the currently created metric alerts, plus their current alert states. The value of currentAlertState can be:

  • 0 - The alert is in an "OK" state
  • 1 - The alert is in an "ALERT" state

Included in the AlertStates enum, but not yet implemented, are:

  • 2 - A "WARN" state
  • 3 - A "NO DATA" state, for when the alert queries the metric and receives no data back
  • 4 - A "SNOOZED" state, for when we don't want the alert to fire right now

For multi-alerts, this API will return a parent alert that lists the IDs of an individual child alert for each grouping. In this POC, you will need to refer to each child alert to determine the overall alert state. We can automate this in a later iteration.
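
For illustration, a list call and a guess at the response shape (only currentAlertState is documented above; the other field names are assumptions):

$ curl -k -u elastic:changeme "https://localhost:5601/api/infra/alerts/list"
# Hypothetical response:
# [
#   { "id": "abc123", "metric": "system.load.1", "currentAlertState": 1 },
#   { "id": "def456", "metric": "system.load.1", "currentAlertState": 0 }
# ]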

DELETE /api/infra/alerts - Delete an Alert

This API will delete an alert that you've created. This is a wrapper for the Alerting API's delete system, with some additional features:

  • It will delete all children of a multi-alert
  • It will prevent you from deleting just one child of a multi-alert, and insist that you delete the parent instead
  • It will clean up additional SavedObjects that the infrastructure alert system uses to track metric alerts

Query parameters

id::
(Required, string) The ID of the alert you'd like to delete
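
A matching delete call, using the ?id= syntax described under Known issue below (the ID and credentials are placeholders):

$ curl -k -X DELETE -H "kbn-xsrf: true" -u elastic:changeme \
    "https://localhost:5601/api/infra/alerts?id=abc123"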

Known issue

The syntax for this query is /api/infra/alerts?id=<id>. I would prefer to do it like /api/infra/alerts/<id> but I don't actually know how to configure a path parameter using our routing system, so if someone could tell me how to do that, that would be great.

Feedback for the Alerting Team

API Limitations

Due to limitations in the way the Alerting API handles saved objects, our Create Alert API will:

  • Call the alerting API to register a new alert instance
  • Add a SavedObject of type infrastructure-alert in order to keep track of which alerts were created by the infrastructure app

The List Alerts API retrieves its alerts from the infrastructure-alert SavedObject collection. If there were a way to add tags to created alert instances so that we could retrieve them later, we might not have to maintain as much of a separate SavedObject database.

The Alerting API also doesn't allow you to retrieve the current state of an alert instance. Therefore, every time an alert evaluates, I have it update its infrastructure-alert SavedObject with its current state so that the List API can display it. I'd prefer to have an endpoint in the Alerting API that would allow me to do this.

#47379 is necessary for this POC to work. I cherry-picked the essential parts of it, but please definitely merge that.

Documentation

While I did end up figuring out what actionGroups were for, and I used them to differentiate between sending a fired notification and a recovered notification, I do agree with the note on #46547 that the documentation could be clearer about them.

It was difficult for me to figure out the {{{context}}} convention of templating alert messages. I copied this from the APM POC, but I'm still not sure where the docs actually explain that.

@Zacqary Zacqary added the WIP (Work in progress), Feature:Alerting, Feature:Metrics UI, and Team:Infra Monitoring UI - DEPRECATED labels Oct 2, 2019
@elasticmachine
Contributor

Pinging @elastic/infra-logs-ui (Team:infra-logs-ui)

@elasticmachine
Contributor

💔 Build Failed

@elasticmachine
Contributor

💔 Build Failed

@elasticmachine
Contributor

💔 Build Failed

@Zacqary
Contributor Author

Zacqary commented Oct 8, 2019

Known issue: Alert instance state seems to reset when the Kibana server restarts.

@elasticmachine
Contributor

💔 Build Failed

@Zacqary Zacqary marked this pull request as ready for review October 9, 2019 18:03
@Zacqary Zacqary requested a review from a team as a code owner October 9, 2019 18:03
@Zacqary Zacqary requested a review from a team October 9, 2019 18:03
@Zacqary Zacqary removed the WIP Work in progress label Oct 9, 2019
@jasonrhodes
Member

For PRs like this I would just keep them as a DRAFT PR and then the "do not merge" thing is implicit. But since GitHub makes it impossible to put a PR back into draft state, no worries on this one! 👍

@Zacqary Zacqary changed the title [Infra] POC for Alerting on Metric Threshold [Metrics] POC for Alerting on Metric Threshold Oct 18, 2019
@jasonrhodes
Member

jasonrhodes commented Oct 22, 2019

Questions I'd love to see us answer during/after the R&D Review meeting:

  1. Does the current API for Kibana alerts feel mature enough for us to file tickets to start building some real implementations? (How were the docs, how was it to get help from the alert team, are there glaring shortcomings, etc)
  2. When an alert is created, what are the options for managing that alert currently? (Disabling, deleting, updating, etc) -- do we have to build our own UI to do this right now?
  3. Are there features that this POC hasn't attempted but that we should test out in addition to this POC?
  4. What do we think the highest value, lowest effort (MVP) version of an alert could be if we wanted to add one into one or both of our metrics and logging apps soon?

@sgrodzicki
Contributor

@Zacqary @jasonrhodes should we keep this open or maybe use it as a reference for our alerting efforts and close it?

@Zacqary
Contributor Author

Zacqary commented Dec 18, 2019

I think it can close since it's not going to be merged.
