[alerting] sorted limit of groups in index threshold alert #58905

Closed
pmuellr opened this issue Feb 28, 2020 · 4 comments · Fixed by #60120
Labels
Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.7.0

Comments

@pmuellr
Member

pmuellr commented Feb 28, 2020

The Watcher index threshold alert, which the new Kibana alerting index threshold alert is based on, has an option to limit the number of "groups" returned (when using groupField). The Kibana alert supports this too, but Watcher labels the option "Top n of ...", implying that the groups are sorted before limiting, presumably so that you see the most relevant groups.

It's not quite clear how this should work, given all the aggregation functions. I think for count, average, max and sum, you'd basically want to pick the groups with the highest values. For min, you'd want the lowest. For between? And I added a "notBetween" to the Kibana alert. I think maybe we just don't sort for those. Note: between is a comparator, not an aggregation.

We'll need to figure out how to work this into the query DSL that we are sending. I could see some sorting being done with the size limiter. I'm not quite sure that's still applicable, given we're running a different query than Watcher did, but it seems like a start.

@pmuellr pmuellr added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Feb 28, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@pmuellr pmuellr changed the title from "[alerting]: sorted limit of groups in index threshold alert" to "[alerting] sorted limit of groups in index threshold alert" Feb 28, 2020
@pmuellr
Member Author

pmuellr commented Mar 12, 2020

Where this gets interesting is when the aggregation function changes: count(), avg(), min(), max(), sum(). The query for count() is different from the others: we don't need a metric aggregation, because ES provides the doc counts in the date range buckets.

Somehow near the size: 42 part of the request, we'll want to apply some kind of ordering. This seems to be the best reference on this:

request/response with count() over top 42 host.name.keyword

request
{
    "index": [
        "es-apm-sys-sim"
    ],
    "body": {
        "size": 0,
        "query": {
            "bool": {
                "filter": {
                    "range": {
                        "@timestamp": {
                            "gte": "2020-03-12T17:57:36.229Z",
                            "lt": "2020-03-12T17:58:26.229Z",
                            "format": "strict_date_time"
                        }
                    }
                }
            }
        },
        "aggs": {
            "groupAgg": {
                "terms": {
                    "field": "host.name.keyword",
                    "size": 42
                },
                "aggs": {
                    "dateAgg": {
                        "date_range": {
                            "field": "@timestamp",
                            "ranges": [
                                {
                                    "from": "2020-03-12T17:57:36.229Z",
                                    "to": "2020-03-12T17:58:26.229Z"
                                }
                            ]
                        }
                    }
                }
            }
        }
    },
    "ignoreUnavailable": true,
    "allowNoIndices": true,
    "ignore": [
        404
    ]
}
response
{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 196,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "groupAgg": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "host-A",
                    "doc_count": 49,
                    "dateAgg": {
                        "buckets": [
                            {
                                "key": "2020-03-12T17:57:36.229Z-2020-03-12T17:58:26.229Z",
                                "from": 1584035856229,
                                "from_as_string": "2020-03-12T17:57:36.229Z",
                                "to": 1584035906229,
                                "to_as_string": "2020-03-12T17:58:26.229Z",
                                "doc_count": 49
                            }
                        ]
                    }
                },
                {
                    "key": "host-B",
                    "doc_count": 49,
                    "dateAgg": {
                        "buckets": [
                            {
                                "key": "2020-03-12T17:57:36.229Z-2020-03-12T17:58:26.229Z",
                                "from": 1584035856229,
                                "from_as_string": "2020-03-12T17:57:36.229Z",
                                "to": 1584035906229,
                                "to_as_string": "2020-03-12T17:58:26.229Z",
                                "doc_count": 49
                            }
                        ]
                    }
                },
                {
                    "key": "host-C",
                    "doc_count": 49,
                    "dateAgg": {
                        "buckets": [
                            {
                                "key": "2020-03-12T17:57:36.229Z-2020-03-12T17:58:26.229Z",
                                "from": 1584035856229,
                                "from_as_string": "2020-03-12T17:57:36.229Z",
                                "to": 1584035906229,
                                "to_as_string": "2020-03-12T17:58:26.229Z",
                                "doc_count": 49
                            }
                        ]
                    }
                },
                {
                    "key": "host-D",
                    "doc_count": 49,
                    "dateAgg": {
                        "buckets": [
                            {
                                "key": "2020-03-12T17:57:36.229Z-2020-03-12T17:58:26.229Z",
                                "from": 1584035856229,
                                "from_as_string": "2020-03-12T17:57:36.229Z",
                                "to": 1584035906229,
                                "to_as_string": "2020-03-12T17:58:26.229Z",
                                "doc_count": 49
                            }
                        ]
                    }
                }
            ]
        }
    }
}

request/response with avg(system.cpu.total.norm.pct) over top 42 host.name.keyword

request
{
    "index": [
        "es-apm-sys-sim"
    ],
    "body": {
        "size": 0,
        "query": {
            "bool": {
                "filter": {
                    "range": {
                        "@timestamp": {
                            "gte": "2020-03-12T18:04:00.650Z",
                            "lt": "2020-03-12T18:04:50.650Z",
                            "format": "strict_date_time"
                        }
                    }
                }
            }
        },
        "aggs": {
            "groupAgg": {
                "terms": {
                    "field": "host.name.keyword",
                    "size": 42
                },
                "aggs": {
                    "dateAgg": {
                        "date_range": {
                            "field": "@timestamp",
                            "ranges": [
                                {
                                    "from": "2020-03-12T18:04:00.650Z",
                                    "to": "2020-03-12T18:04:50.650Z"
                                }
                            ]
                        },
                        "aggs": {
                            "metricAgg": {
                                "avg": {
                                    "field": "system.cpu.total.norm.pct"
                                }
                            }
                        }
                    }
                }
            }
        }
    },
    "ignoreUnavailable": true,
    "allowNoIndices": true,
    "ignore": [
        404
    ]
}
response
{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 196,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "groupAgg": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "host-A",
                    "doc_count": 49,
                    "dateAgg": {
                        "buckets": [
                            {
                                "key": "2020-03-12T18:04:00.650Z-2020-03-12T18:04:50.650Z",
                                "from": 1584036240650,
                                "from_as_string": "2020-03-12T18:04:00.650Z",
                                "to": 1584036290650,
                                "to_as_string": "2020-03-12T18:04:50.650Z",
                                "doc_count": 49,
                                "metricAgg": {
                                    "value": 0.8295918362481254
                                }
                            }
                        ]
                    }
                },
                {
                    "key": "host-B",
                    "doc_count": 49,
                    "dateAgg": {
                        "buckets": [
                            {
                                "key": "2020-03-12T18:04:00.650Z-2020-03-12T18:04:50.650Z",
                                "from": 1584036240650,
                                "from_as_string": "2020-03-12T18:04:00.650Z",
                                "to": 1584036290650,
                                "to_as_string": "2020-03-12T18:04:50.650Z",
                                "doc_count": 49,
                                "metricAgg": {
                                    "value": 0.608163266765828
                                }
                            }
                        ]
                    }
                },
                {
                    "key": "host-C",
                    "doc_count": 49,
                    "dateAgg": {
                        "buckets": [
                            {
                                "key": "2020-03-12T18:04:00.650Z-2020-03-12T18:04:50.650Z",
                                "from": 1584036240650,
                                "from_as_string": "2020-03-12T18:04:00.650Z",
                                "to": 1584036290650,
                                "to_as_string": "2020-03-12T18:04:50.650Z",
                                "doc_count": 49,
                                "metricAgg": {
                                    "value": 0.44183673238267707
                                }
                            }
                        ]
                    }
                },
                {
                    "key": "host-D",
                    "doc_count": 49,
                    "dateAgg": {
                        "buckets": [
                            {
                                "key": "2020-03-12T18:04:00.650Z-2020-03-12T18:04:50.650Z",
                                "from": 1584036240650,
                                "from_as_string": "2020-03-12T18:04:00.650Z",
                                "to": 1584036290650,
                                "to_as_string": "2020-03-12T18:04:50.650Z",
                                "doc_count": 49,
                                "metricAgg": {
                                    "value": 0.11428571690102013
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
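
Responses of this shape can be reduced to one value per group. The following is a minimal sketch, not the code Kibana uses: the type names and the `getGroupValues` helper are hypothetical, and only the response fields visible in the dumps above (`groupAgg`, `dateAgg`, `metricAgg`, `doc_count`) are assumed. For count() there is no `metricAgg`, so the date range bucket's `doc_count` serves as the value.

```typescript
// Hypothetical types covering only the response fields used below.
interface DateBucket {
  doc_count: number;
  metricAgg?: { value: number | null };
}

interface GroupBucket {
  key: string;
  doc_count: number;
  dateAgg: { buckets: DateBucket[] };
}

interface SearchResponse {
  aggregations?: { groupAgg?: { buckets: GroupBucket[] } };
}

interface GroupValue {
  group: string;
  value: number | null;
}

// Returns one value per group, read from the last date_range bucket.
// When the bucket has no metricAgg (the count() case), doc_count is used.
function getGroupValues(response: SearchResponse): GroupValue[] {
  const groups = response.aggregations?.groupAgg?.buckets ?? [];
  return groups.map((group) => {
    const dateBuckets = group.dateAgg.buckets;
    const last = dateBuckets[dateBuckets.length - 1];
    if (last === undefined) {
      // No date range bucket came back for this group.
      return { group: group.key, value: null };
    }
    const value = last.metricAgg ? last.metricAgg.value : last.doc_count;
    return { group: group.key, value };
  });
}
```

Reading the last bucket matches the alert executor case described later in the thread, where only a single date range is requested.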

Here's where to instrument the code to get this kind of data dump:

logger.debug(`${logPrefix} call: ${JSON.stringify(esQuery)}`);
try {
  esResult = await callCluster('search', esQuery);
} catch (err) {
  logger.warn(`${logPrefix} error: ${JSON.stringify(err.message)}`);
  throw new Error('error running search');
}
logger.debug(`${logPrefix} result: ${JSON.stringify(esResult)}`);

@pmuellr pmuellr self-assigned this Mar 13, 2020
@pmuellr
Member Author

pmuellr commented Mar 13, 2020

Here is some relevant text from the doc on using order with a terms aggregation:

It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long as the aggregations path are of a single-bucket type, where the last aggregation in the path may either be a single-bucket one or a metrics one.

In our case, the aggs path includes a multi-bucket date_range agg, so we can't reference the existing "leaf" metric agg this way. We'd need to create a new agg over the entire date range, independent of the date_range agg.

Warning: Sorting by ascending _count or by sub aggregation is discouraged as it increases the error on document counts. ... errors are unbounded. One particular case that could still be useful is sorting by min or max aggregation: counts will not be accurate but at least the top buckets will be correctly picked.

So that means the sort would work fine for min(), max(), and count() (the default sort criteria anyway), leaving avg() and sum() as "error" prone.

@pmuellr
Member Author

pmuellr commented Mar 13, 2020

Seems worth noting as well that we're using the same query in the alert executor AND the time series query to render the viz graph in the alert ui. It's far more important to get the former right, than the latter.

The alert executor only ends up with a single date_range bucket, so seems like we could optimize on that, or not optimize and use the whole date range. Add a new agg to get the calculated metric with a single bucket agg based on the last/only date range being requested (or whole date range), and then reference that agg in the order param.

For count(), we don't need to do anything. No new agg required, no order required. It should work as currently coded. For min(), we'd want to sort ascending; for all the others, we'll sort descending.
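
The approach in the last two paragraphs could be sketched roughly as follows. This is an illustration under assumptions, not the eventual implementation: `buildGroupAgg`, `GroupAggParams`, and the `sortValueAgg` agg name are hypothetical, while `groupAgg`, `dateAgg`, and `metricAgg` are the names from the request dumps above. The sort agg sits directly under the terms agg, so it covers the whole queried time range and can be referenced from `order`.

```typescript
type AggType = 'count' | 'avg' | 'min' | 'max' | 'sum';

// Hypothetical parameter bag; field names are illustrative only.
interface GroupAggParams {
  aggType: AggType;
  aggField?: string; // required unless aggType is 'count'
  termField: string;
  termSize: number;
  timeField: string;
  dateStart: string;
  dateEnd: string;
}

interface GroupAgg {
  terms: { field: string; size: number; order?: Record<string, 'asc' | 'desc'> };
  aggs: Record<string, unknown>;
}

function buildGroupAgg(p: GroupAggParams): { groupAgg: GroupAgg } {
  const dateAgg: Record<string, unknown> = {
    date_range: {
      field: p.timeField,
      ranges: [{ from: p.dateStart, to: p.dateEnd }],
    },
  };
  const terms: GroupAgg['terms'] = { field: p.termField, size: p.termSize };
  const aggs: Record<string, unknown> = { dateAgg };

  // count() needs neither a metric agg nor an explicit order.
  if (p.aggType !== 'count') {
    if (p.aggField === undefined) {
      throw new Error(`aggField is required for aggType ${p.aggType}`);
    }
    const metric = { [p.aggType]: { field: p.aggField } };
    // Same metric twice: per date range (for values), and once over the
    // whole queried range, beside dateAgg, purely for sorting the terms.
    dateAgg['aggs'] = { metricAgg: metric };
    aggs['sortValueAgg'] = metric;
    terms.order = { sortValueAgg: p.aggType === 'min' ? 'asc' : 'desc' };
  }

  return { groupAgg: { terms, aggs } };
}
```

Because the sort agg is a plain metric agg directly under the terms agg, the `order` path does not pass through the multi-bucket date_range agg, which is the restriction quoted from the docs earlier in the thread.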

This doesn't take into account the comparators < <= > >= between notbetween, and you'd kind of think that it might ... should these two alerts sort differently?

avg(some-field) < threshold-value
avg(some-field) > threshold-value

Maybe? Like you'd want to sort ascending for the first, but descending for the second. That leads to quandaries like how would you sort

avg(some-field) between (threshold-value-1, threshold-value-2)

That would involve sorting by the distance of bucket's average from a total average, or something? Seems hard-to-impossible, and beyond the scope of just getting basic ordering in, so will defer working on that part, for the first PR to address this issue.

pmuellr added a commit to pmuellr/kibana that referenced this issue Mar 16, 2020
The current index threshold alert uses a `size` limit on term aggregation, when used, but does not sort the buckets, so it's just using descending count on the grouped buckets as the sort to determine what to return.

The watcher API for the index threshold notes this as "top N of", implying a sort.

This PR applies sorting when using `groupBy: top` and `aggType != count`.  For count, ES is already sorting the way we want.

The sort is calculated as a separate agg beside the date_range aggregation, which is the same metrics agg specified in the query - `aggType(aggField)`.  This field is then referenced in a new `order` property in the terms agg, using 'asc' sorting for `min`, and `desc` sorting for `avg`, `max`, and `sum`.

This doesn't change the shape of the output at all, just changes which term buckets will be returned, if there are more term buckets than requested with the `termSize` parameter.
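
To illustrate that last point, here is a small client-side simulation of the selection. It is only a sketch of the effect: in the PR, Elasticsearch does the ordering and limiting, and the `pickTopGroups` helper and its types are hypothetical. The sample values are the avg() values from the response earlier in this issue, rounded.

```typescript
type SortedAggType = 'avg' | 'min' | 'max' | 'sum';

interface GroupMetric {
  group: string;
  value: number;
}

// Mimics "order then size": ascending for min, descending otherwise,
// then keep only the first termSize groups.
function pickTopGroups(
  groups: GroupMetric[],
  aggType: SortedAggType,
  termSize: number
): GroupMetric[] {
  const direction = aggType === 'min' ? 1 : -1;
  return groups
    .slice() // copy, so the caller's array is not reordered
    .sort((a, b) => direction * (a.value - b.value))
    .slice(0, termSize);
}

const sampleGroups: GroupMetric[] = [
  { group: 'host-C', value: 0.44 },
  { group: 'host-A', value: 0.83 },
  { group: 'host-D', value: 0.11 },
  { group: 'host-B', value: 0.61 },
];

// Descending for avg: keeps host-A and host-B.
const topByAvg = pickTopGroups(sampleGroups, 'avg', 2).map((g) => g.group);
// Ascending for min: keeps host-D and host-C.
const topByMin = pickTopGroups(sampleGroups, 'min', 2).map((g) => g.group);
```

Each returned element has the same shape as before; only the set of groups that survives the `termSize` cut changes.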
pmuellr added a commit that referenced this issue Mar 17, 2020
pmuellr added a commit to pmuellr/kibana that referenced this issue Mar 17, 2020
gmmorris added a commit to gmmorris/kibana that referenced this issue Mar 17, 2020
* master: (30 commits)
  [TSVB] fix text color when using custom background color (elastic#60261)
  Fix import to timefilter from in TSVB (elastic#60296)
  [NP] Get rid of usage redirectWhenMissing service (elastic#59777)
  [SIEM] Fix Timeline footer styling (elastic#59587)
  [ML] Fixes to error handling for analytics jobs and file data viz (elastic#60249)
  Give better stack traces for Unhandled Promise Rejection warnings (elastic#60235)
  resolves elastic#58905 (elastic#60120)
  Added variables button for text fields in Pagerduty component. (elastic#60189)
  adds test that action vars are rendered for alert action parms (elastic#60310)
  Closes 59786 by removing the update toast (elastic#60172)
  [EPM] Packages list tabs (elastic#60167)
  Added message variables button for Webhook body form field (elastic#60174)
  Revert "adds new test (elastic#60064)"
  [Maps] move MapSavedObject type out of telemetry (elastic#60127)
  [Reporting] Fix error handling for job handler in route (elastic#60161)
  [Endpoint] TEST: verify alerts page header says 'Alerts' (elastic#60206)
  EMT-248: implement ack resource to accept event payload to acknowledge agent actions (elastic#60218)
  Migrate dual validated range (elastic#59689)
  Embeddable triggers (elastic#58440)
  [Endpoint] Sample data generator CLI script (elastic#59952)
  ...
pmuellr added a commit that referenced this issue Mar 17, 2020
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022

4 participants