This is a proof of concept for a metric alerting system. Through three endpoints, you can execute CRUD operations (well, CRD in this proof of concept) to manage a set of metric threshold alerts for any part of your infrastructure.
Design Intent
There is no UI yet, but I designed this system with the expectation that a user would:
Enter a search query in the Metrics Explorer
Select one of the resulting charts and choose "Alert on this" from the options menu
Out of the three basic types of search queries you might execute:
Searching for a metric on a specific host/agent/instance/other thing
Using a groupBy query to search for metrics on every host/agent/instance/etc
Searching for the aggregate of a metric across the entire infrastructure
This proof of concept allows you to create alerts for query types 1 and 2. It should be easy to add support for type 3 by borrowing code from the Snapshot query, but in the interest of timeboxing this POC I stopped short of doing that.
The endpoints in this POC allow you to input all the parameters of a metric query, plus:
A threshold value
Whether to alert when the metric is greater than or less than (or >=/<=) this threshold
A Slack channel webhook to send alerts to
(You can also have it send a server log to Kibana on alert, but this is primarily for testing. There's a parameter for email alerts as well but I didn't get those working yet.)
Testing
To test this, make sure to explicitly enable both required plugins in your Kibana config file:
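The exact flag names aren't shown here; assuming the two required plugins are the alerting and actions plugins, the kibana.yml entries would presumably look something like this (the flag names are an assumption and may differ by Kibana version):

```yaml
# kibana.yml — assumed flags for the two required plugins
xpack.alerting.enabled: true
xpack.actions.enabled: true
```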
You will also need to run Elasticsearch and Kibana with SSL enabled in order to use the Alerting APIs:
```shell
yarn es snapshot --ssl
yarn start --ssl
```
The shared Observability clusters don't seem to work with SSL enabled in Kibana. Try a Cloud account if you want to test this with more complexity than you can locally.
How It Works
Use the Create Alert API to define a metric, aggregator, threshold, and what part of the infrastructure to query. Also provide an interval.
Every time the interval passes, the alert will measure that metric over a time bucket equal to the interval length. If it has crossed the threshold, the alert will go into an Alert state.
When the value recovers and crosses back to the OK side of the threshold, the alert will return to an OK state.
The alert only sends a notification when it changes state. If an alert runs at 1-minute intervals, you won't keep getting a notification every single minute while it remains in the Alert state.
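As a rough sketch of the evaluation logic described above (the names here are hypothetical; the POC's actual implementation lives in the alert executor):

```typescript
type Comparator = '>' | '<' | '>=' | '<=';
type AlertState = 'OK' | 'ALERT';

const COMPARATORS: Record<Comparator, (value: number, threshold: number) => boolean> = {
  '>': (v, t) => v > t,
  '<': (v, t) => v < t,
  '>=': (v, t) => v >= t,
  '<=': (v, t) => v <= t,
};

// Evaluate one interval's metric value. Returns the new state and whether a
// notification should be sent — only state *changes* produce notifications.
function evaluate(
  previous: AlertState,
  value: number,
  threshold: number,
  comparator: Comparator
): { state: AlertState; notify: boolean } {
  const state: AlertState = COMPARATORS[comparator](value, threshold) ? 'ALERT' : 'OK';
  return { state, notify: state !== previous };
}
```

For example, with threshold 0.8 and comparator `>`, a series of values 0.5, 0.9, 0.9, 0.4 would notify only twice: once when entering the Alert state and once on recovery.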
groupBy queries are implemented by creating one alert for each possible group with a single API call. There's probably a more efficient way to do this using an Elasticsearch query and one single alert.
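The fan-out for groupBy queries might look roughly like this (hypothetical shapes; the real POC presumably derives the group values from a terms aggregation on `searchField.name`):

```typescript
interface AlertParams {
  searchField: { name: string; value: string };
}

// Expand a wildcard searchField into one child alert per observed group value.
// A concrete value passes through unchanged as a single alert.
function expandGroupBy(params: AlertParams, groupValues: string[]): AlertParams[] {
  if (params.searchField.value !== '*') return [params];
  return groupValues.map((value) => ({
    ...params,
    searchField: { name: params.searchField.name, value },
  }));
}
```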
API Reference
POST /api/infra/alerts/metric_threshold - Create Alert
This API creates a new metric threshold alert. The /metric_threshold URL is a convention I'm using under the assumption that we might add more types of metric alerts besides simple threshold, such as rate of change alerts, anomaly or outlier detection, forecasting, etc.
Query parameters
metric::
(Required, string) The metric to measure and alert on, e.g. system.load.1
aggregator::
(Required, string) Valid options are avg, max, min, cardinality, rate, and count
comparator::
(Required, string) Valid options are >, <, >=, or <=
threshold::
(Required, number) An alert will fire when the metric is >, <, >=, or <= this value (as defined by the comparator)
interval::
(Required, string) Must be a valid calendar interval. This is how often to run the alert, and also the length of the time bucket that it will evaluate data over
searchField::
(Required, object) Takes name and value params. This defines the field to retrieve metric data from.
name::
(Required, string) e.g. host.hostname, agent.id, etc.
value::
(Required, string) If this is a specific value, a single alert for that value (e.g. host.name: myHost) will be created. If this is *, a multi-alert will be created for every possible value of searchField.name. Essentially it will track every chart you'd get back from a groupBy query in the Metrics Explorer.
indexPattern::
(Required, string) The index pattern to query for metric data, e.g. metricbeat-*
actions::
(Required, object) This can contain one or more of:
slack::
(Optional, string) A webhook URL for a Slack channel. When this alert fires, it will send notifications to this channel.
log::
(Optional, boolean) If true, this will log a message to the Kibana server when the alert fires.
email::
(Not yet implemented)
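Taken together, the Create Alert parameters can be modeled roughly like this (a sketch based on the reference above; the POC's actual types may differ):

```typescript
interface MetricThresholdAlertParams {
  metric: string;                                // e.g. 'system.load.1'
  aggregator: 'avg' | 'max' | 'min' | 'cardinality' | 'rate' | 'count';
  comparator: '>' | '<' | '>=' | '<=';
  threshold: number;
  interval: string;                              // a valid calendar interval, e.g. '1m'
  searchField: { name: string; value: string };  // value '*' creates a multi-alert
  indexPattern: string;                          // e.g. 'metricbeat-*'
  actions: { slack?: string; log?: boolean };    // email not yet implemented
}

// An illustrative request: alert when average system load across all hosts
// exceeds 4, checked every minute.
const example: MetricThresholdAlertParams = {
  metric: 'system.load.1',
  aggregator: 'avg',
  comparator: '>',
  threshold: 4,
  interval: '1m',
  searchField: { name: 'host.name', value: '*' },
  indexPattern: 'metricbeat-*',
  actions: { log: true },
};
```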
GET /api/infra/alerts/list - List Alerts
This API will return a JSON array of all the currently created metric alerts, plus their current alert states. The value of currentAlertState can be:
0 - The alert is in an "OK" state
1 - The alert is in an "ALERT" state
Included in the AlertStates enum, but not yet implemented, are:
2 - A "WARN" state
3 - A "NO DATA" state, for when the alert queries the metric and receives no data back
4 - A "SNOOZED" state, for when we don't want the alert to fire right now
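Based on the values listed above, the enum presumably looks something like this (the exact TypeScript is an assumption):

```typescript
// Possible values of currentAlertState; WARN, NO_DATA, and SNOOZED are
// defined but not yet implemented in this POC.
enum AlertStates {
  OK = 0,
  ALERT = 1,
  WARN = 2,
  NO_DATA = 3,
  SNOOZED = 4,
}
```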
For multi-alerts, this API will return a parent alert that lists the ID of an individual child alert for each grouping. In this POC, you will need to check each child alert to determine the overall alert state. We can automate this in a later iteration.
DELETE /api/infra/alerts - Delete an Alert
This API will delete an alert that you've created. This is a wrapper for the Alerting API's delete system, with some additional features:
It will delete all children of a multi-alert
It will prevent you from deleting a single child of a multi-alert, and insist that you delete the parent instead
It will clean up additional SavedObjects that the infrastructure alert system uses to track metric alerts
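A sketch of the multi-alert guard rules above (the shapes are hypothetical, not the POC's actual SavedObject schema):

```typescript
interface StoredAlert {
  id: string;
  parentId?: string;   // set on children of a multi-alert
  childIds?: string[]; // set on the parent of a multi-alert
}

// Resolve which alert IDs to delete: deleting a parent cascades to all of its
// children, while deleting a lone child of a multi-alert is refused.
function resolveDeletions(alert: StoredAlert): string[] {
  if (alert.parentId) {
    throw new Error('Cannot delete one child of a multi-alert; delete the parent instead');
  }
  return [alert.id, ...(alert.childIds ?? [])];
}
```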
Query parameters
id::
(Required, string) The ID of the alert you'd like to delete
Known issue
The syntax for this query is /api/infra/alerts?id=<id>. I would prefer to do it like /api/infra/alerts/<id> but I don't actually know how to configure a path parameter using our routing system, so if someone could tell me how to do that, that would be great.
Feedback for the Alerting Team
API Limitations
Due to limitations in the way the Alerting API handles saved objects, our Create Alert API will:
Call the alerting API to register a new alert instance
Add a SavedObject of type infrastructure-alert in order to keep track of which alerts were created by the infrastructure app
The List Alerts API retrieves its alerts from the infrastructure-alert SavedObject collection. If there were a way to add tags to created alert instances so that we could retrieve them later, we might not have to maintain as much of a separate SavedObject database.
The Alerting API also doesn't allow you to retrieve the current state of an alert instance. Therefore, every time an alert evaluates, I have it update its infrastructure-alert SavedObject with its current state so that the List API can display it. I'd prefer to have an endpoint in the Alerting API that would allow me to do this.
#47379 is necessary for this POC to work. I cherry-picked the essential parts of it, but please definitely merge that.
Documentation
While I did end up figuring out what actionGroups were for, and I used them to differentiate between sending a fired notification and a recovered notification, I do agree with the note on #46547 that the documentation could be clearer about them.
It was difficult for me to figure out the {{{context}}} convention of templating alert messages. I copied this from the APM POC, but I'm still not sure where the docs actually explain that.
For PRs like this I would just keep them as a DRAFT PR and then the "do not merge" thing is implicit. But since GitHub makes it impossible to put a PR back into draft state, no worries on this one! 👍
Zacqary changed the title from "[Infra] POC for Alerting on Metric Threshold" to "[Metrics] POC for Alerting on Metric Threshold" on Oct 18, 2019.
Questions I'd love to see us answer during/after the R&D Review meeting:
Does the current API for Kibana alerts feel mature enough for us to file tickets to start building some real implementations? (How were the docs, how easy was it to get help from the Alerting team, are there glaring shortcomings, etc.)
When an alert is created, what are the options for managing that alert currently? (Disabling, deleting, updating, etc) -- do we have to build our own UI to do this right now?
Are there features that this POC hasn't attempted but that we should test out in addition to this POC?
What do we think the highest value, lowest effort (MVP) version of an alert could be if we wanted to add one into one or both of our metrics and logging apps soon?
Closes #46511
(DON'T MERGE THIS TO MASTER)