
[Metrics] POC for Alerting on Metric Threshold #47165

Closed
wants to merge 6 commits

Conversation


@Zacqary Zacqary commented Oct 2, 2019

Closes #46511

(DON'T MERGE THIS TO MASTER)

This is a proof of concept for a metric alerting system. Through three endpoints, you can execute CRUD operations (well, CRD in this proof of concept) to manage a set of metric threshold alerts for any part of your infrastructure.

Design Intent

There is no UI yet, but I designed this system with the expectation that a user would:

  • Enter a search query in the Metrics Explorer
  • Select one of the resulting charts and choose "Alert on this" from the options menu

Out of the three basic types of search queries you might execute:

  1. Searching for a metric on a specific host/agent/instance/other thing
  2. Using a groupBy query to search for metrics on every host/agent/instance/etc
  3. Searching for the aggregate of a metric across the entire infrastructure

This proof of concept allows you to create alerts for types 1 and 2. It should be easy to add support for type 3 by borrowing code from the Snapshot query, but in the interest of timeboxing this POC I stopped short of doing that.

The endpoints in this POC allow you to input all the parameters of a metric query, plus:

  • A threshold value
  • Whether to alert when the metric is greater than or less than (or >=/<=) this threshold
  • A Slack channel webhook to send alerts to

(You can also have it send a server log to Kibana on alert, but this is primarily for testing. There's a parameter for email alerts as well but I didn't get those working yet.)

Testing

To test this, make sure to explicitly enable both required plugins in your Kibana config file:

xpack.actions.enabled: true
xpack.alerting.enabled: true

You will also need to run Elasticsearch and Kibana with SSL enabled in order to use the Alerting APIs:

$ yarn es snapshot --ssl
$ yarn start --ssl

The shared Observability clusters don't seem to work with SSL enabled in Kibana. Try a Cloud account if you want to test this with more complexity than you can reproduce locally.

How It Works

  • Use the Create Alert API to define a metric, aggregator, threshold, and what part of the infrastructure to query. Also provide an interval.
  • Every time the interval passes, the alert will measure that metric over a time bucket equal to the interval length. If the value has crossed the threshold, the alert will go into an Alert state.
  • When the value recovers and crosses back over the threshold, the alert will return to an OK state.
  • The alert will only send a notification when it changes state. If you have an alert set to run at 1-minute intervals, you won't keep getting a notification every single minute while it remains in the Alert state.

groupBy queries are implemented by creating one alert for each possible group with a single API call. There's probably a more efficient way to do this using an Elasticsearch query and one single alert.
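
As a rough sketch of what that single API call could look like for a groupBy-style alert, here is a hypothetical request using the parameters documented in the API Reference below. The PR describes them as query parameters, but since searchField and actions are objects I'm showing them as a JSON body; the URL, credentials, and field values are placeholders, not taken from this PR:

$ curl -k -X POST "https://localhost:5601/api/infra/alerts/metric_threshold" \
    -H "kbn-xsrf: true" -H "Content-Type: application/json" \
    -u elastic:changeme \
    -d '{
      "metric": "system.load.1",
      "aggregator": "avg",
      "comparator": ">",
      "threshold": 4,
      "interval": "1m",
      "searchField": { "name": "host.name", "value": "*" },
      "indexPattern": "metricbeat-*",
      "actions": { "log": true }
    }'
# With "value": "*", this should fan out into one child alert per distinct host.name,
# all tracked under a single parent alert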

API Reference

POST /api/infra/alerts/metric_threshold - Create Alert

This API creates a new metric threshold alert. The /metric_threshold URL is a convention I'm using under the assumption that we might add more types of metric alerts besides simple threshold, such as rate of change alerts, anomaly or outlier detection, forecasting, etc.

Query parameters

metric::
(Required, string) The metric to measure and alert on, e.g. system.load.1

aggregator::
(Required, string) Valid options are avg, max, min, cardinality, rate, and count

comparator::
(Required, string) Valid options are >, <, >=, or <=

threshold::
(Required, number) An alert will fire when the metric is >, <, >=, or <= this value (as defined by the comparator)

interval::
(Required, string) Must be a valid calendar interval. This is how often to run the alert, and also the length of the time bucket that it will evaluate data over

searchField::
(Required, object) Takes a name and value param. This defines the field to retrieve metric data from.

name::
(Required, string) e.g. host.hostname, agent.id, etc.

value::
(Required, string) If this is a specific value, a single alert for that value (e.g. host.name: myHost) will be created. If this is *, a multi-alert will be created for every possible value of searchField.name. Essentially it will track every chart that you'd get back in a groupBy query on the Metrics Explorer.

indexPattern::
(Required, string) The index pattern to query for metric data, e.g. metricbeat-*

actions::
(Required, object) This can contain one or more of:

slack::
(Optional, string) A webhook URL for a Slack channel. When this alert fires, it will send notifications to this channel.

log::
(Optional, boolean) If true, this will log out a message to the Kibana server when the alert fires.

email::
(Not yet implemented)
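
To tie the parameters together, here is a hypothetical single-host request with a Slack action (same caveats as the earlier sketch: the hostname, webhook URL, and credentials are placeholders, and the JSON-body shape is an assumption on my part):

$ curl -k -X POST "https://localhost:5601/api/infra/alerts/metric_threshold" \
    -H "kbn-xsrf: true" -H "Content-Type: application/json" \
    -u elastic:changeme \
    -d '{
      "metric": "system.cpu.total.pct",
      "aggregator": "avg",
      "comparator": ">=",
      "threshold": 0.9,
      "interval": "5m",
      "searchField": { "name": "host.name", "value": "myHost" },
      "indexPattern": "metricbeat-*",
      "actions": {
        "slack": "https://hooks.slack.com/services/REPLACE/ME",
        "log": true
      }
    }'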

GET /api/infra/alerts/list - List Alerts

This API will return a JSON array of all the currently created metric alerts, plus their current alert states. The value of currentAlertState can be:

  • 0 - The alert is in an "OK" state
  • 1 - The alert is in an "ALERT" state

Included in the AlertStates enum, but not yet implemented, are:

  • 2 - A "WARN" state
  • 3 - A "NO DATA" state, for when the alert queries the metric and receives no data back
  • 4 - A "SNOOZED" state, for when we don't want the alert to fire right now

For multi-alerts, this API will return a parent alert that lists the IDs of an individual child alert for each grouping. In this POC, you will need to refer to each child alert to determine the overall alert state. We can automate this in a later iteration.
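
For illustration, a list call and a guess at the response shape (only currentAlertState is documented above; the other field names are assumptions):

$ curl -k -u elastic:changeme "https://localhost:5601/api/infra/alerts/list"
# Hypothetical response:
# [
#   { "id": "abc123", "metric": "system.load.1", "currentAlertState": 1 },
#   { "id": "def456", "metric": "system.load.1", "currentAlertState": 0 }
# ]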

DELETE /api/infra/alerts - Delete an Alert

This API will delete an alert that you've created. This is a wrapper for the Alerting API's delete system, with some additional features:

  • It will delete all children of a multi-alert
  • It will prevent you from deleting just one child of a multi-alert, and insist that you delete the parent instead
  • It will clean up additional SavedObjects that the infrastructure alert system uses to track metric alerts

Query parameters

id::
(Required, string) The ID of the alert you'd like to delete
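
A matching delete call, using the ?id= syntax described under Known issue below (the ID and credentials are placeholders):

$ curl -k -X DELETE -H "kbn-xsrf: true" -u elastic:changeme \
    "https://localhost:5601/api/infra/alerts?id=abc123"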

Known issue

The syntax for this query is /api/infra/alerts?id=<id>. I would prefer to do it like /api/infra/alerts/<id> but I don't actually know how to configure a path parameter using our routing system, so if someone could tell me how to do that, that would be great.

Feedback for the Alerting Team

API Limitations

Due to limitations in the way the Alerting API handles saved objects, our Create Alert API will:

  • Call the alerting API to register a new alert instance
  • Add a SavedObject of type infrastructure-alert in order to keep track of which alerts were created by the infrastructure app

The List Alerts API retrieves its alerts from the infrastructure-alert SavedObject collection. If there were a way to add tags to created alert instances so that we could retrieve them later, we might not have to maintain as much of a separate SavedObject database.

The Alerting API also doesn't allow you to retrieve the current state of an alert instance. Therefore, every time an alert evaluates, I have it update its infrastructure-alert SavedObject with its current state so that the List API can display it. I'd prefer to have an endpoint in the Alerting API that would allow me to do this.

#47379 is necessary for this POC to work. I cherry-picked the essential parts of it, but please definitely merge that.

Documentation

While I did end up figuring out what actionGroups were for, and I used them to differentiate between sending a fired notification and a recovered notification, I do agree with the note on #46547 that the documentation could be clearer about them.

It was difficult for me to figure out the {{{context}}} convention of templating alert messages. I copied this from the APM POC, but I'm still not sure where the docs actually explain that.

@Zacqary Zacqary added the WIP (Work in progress), Feature:Alerting, Feature:Metrics UI, and Team:Infra Monitoring UI - DEPRECATED labels Oct 2, 2019
@elasticmachine
Contributor

Pinging @elastic/infra-logs-ui (Team:infra-logs-ui)

@elasticmachine
Contributor

💔 Build Failed

@elasticmachine
Contributor

💔 Build Failed

@elasticmachine
Contributor

💔 Build Failed

@Zacqary
Contributor Author

Zacqary commented Oct 8, 2019

Known issue: Alert instance state seems to reset when the Kibana server restarts.

@elasticmachine
Contributor

💔 Build Failed

@Zacqary Zacqary marked this pull request as ready for review October 9, 2019 18:03
@Zacqary Zacqary requested a review from a team as a code owner October 9, 2019 18:03
@Zacqary Zacqary requested a review from a team October 9, 2019 18:03
@Zacqary Zacqary removed the WIP Work in progress label Oct 9, 2019
@jasonrhodes
Member

For PRs like this I would just keep them as a DRAFT PR and then the "do not merge" thing is implicit. But since GitHub makes it impossible to put a PR back into draft state, no worries on this one! 👍

@Zacqary Zacqary changed the title [Infra] POC for Alerting on Metric Threshold [Metrics] POC for Alerting on Metric Threshold Oct 18, 2019
@jasonrhodes
Member

jasonrhodes commented Oct 22, 2019

Questions I'd love to see us answer during/after the R&D Review meeting:

  1. Does the current API for Kibana alerts feel mature enough for us to file tickets to start building some real implementations? (How were the docs, how was it to get help from the alert team, are there glaring shortcomings, etc)
  2. When an alert is created, what are the options for managing that alert currently? (Disabling, deleting, updating, etc) -- do we have to build our own UI to do this right now?
  3. Are there features that this POC hasn't attempted but that we should test out in addition to this POC?
  4. What do we think the highest value, lowest effort (MVP) version of an alert could be if we wanted to add one into one or both of our metrics and logging apps soon?

@sgrodzicki
Contributor

@Zacqary @jasonrhodes should we keep this open or maybe use it as a reference for our alerting efforts and close it?

@Zacqary
Contributor Author

Zacqary commented Dec 18, 2019

I think it can close since it's not going to be merged.
