
[Metrics UI] Alerts fail when hitting the bucket limit #68492

Closed
hendry-lim opened this issue Jun 8, 2020 · 15 comments
Labels
  • bug (Fixes for quality problems that affect the customer experience)
  • Feature:Metrics UI (Metrics UI feature)
  • Team:Infra Monitoring UI - DEPRECATED (Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services)
  • triage_needed

Comments

@hendry-lim

hendry-lim commented Jun 8, 2020

Maintainer Edit

  • Metric threshold/inventory alerts are unable to handle a Too Many Buckets exception within the alert executor.
  • Metric threshold queries sometimes override the range filter and query too much data, triggering a Too Many Buckets exception.
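The first point can be sketched as a guard in the alert executor. This is a hypothetical illustration, not the actual Kibana implementation; the interface shape and helper names are invented for the example:

```typescript
// Hypothetical sketch: Elasticsearch reports the bucket limit as a
// too_many_buckets_exception, usually nested inside a caused_by chain.
interface EsErrorCause {
  type?: string;
  reason?: string;
  caused_by?: EsErrorCause;
}

// Walk the caused_by chain looking for the bucket-limit error type.
function isTooManyBucketsError(cause?: EsErrorCause): boolean {
  while (cause) {
    if (cause.type === 'too_many_buckets_exception') {
      return true;
    }
    cause = cause.caused_by;
  }
  return false;
}

// Illustrative use inside an executor (runAlertQuery is hypothetical):
//
//   try {
//     await runAlertQuery();
//   } catch (e) {
//     if (isTooManyBucketsError(e.body?.error)) {
//       // Surface a readable error state instead of creating no instances.
//     } else {
//       throw e;
//     }
//   }
```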

Original Submitted Issue

Kibana version: 7.7.1

Elasticsearch version: 7.7.1

Server OS version: RHEL 8

Browser version: 83.0.4103.97

Browser OS version: Windows 10

Original install method (e.g. download page, yum, from source, etc.): Docker

Describe the bug:
Alert instances were not created with the following filter in Metric Threshold alert:
NOT host.name:dv* and NOT host.name:ts*

However, alert instances were created if we only used the following:
NOT host.name:dv*

There are other hosts that exceeded the memory threshold other than those that matched dv* and ts*.

Steps to reproduce:

  1. Create a Metric Threshold
  2. Condition: Average of system.memory.used.pct is above or equals 0.8
  3. For the last 5 minutes
  4. Filter: NOT host.name:dv* and NOT host.name:ts*
  5. Alert per host.name

Expected behavior:
Alert instances should be created with either/both filters applied as long as there are hosts that exceed the memory threshold.

@hendry-lim hendry-lim changed the title [Alerting] Metric threshold with multiple filters did not work [Alerting] Metric threshold with multiple filters is not working Jun 8, 2020
@flash1293 added the Team:ResponseOps and triage_needed labels Jun 10, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote added the Team:Infra Monitoring UI - DEPRECATED label and removed the Team:ResponseOps label Jun 10, 2020
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@hendry-lim
Author

Issue persists in 7.8.0.

@Zacqary
Contributor

Zacqary commented Jun 24, 2020

Tested this on 7.8 and I wasn't able to reproduce. When alerting per host.name I was always able to see one instance per matched host.name regardless of how many NOT clauses I added to the filter. Maybe there's something else going on with your data that I'm not seeing on my end?

What happens when you go to the Metrics Explorer and search for the data like this?
[Screenshot: Metrics Explorer search]

  • Metric: system.memory.used.pct
  • Graph per: host.name
  • Filter: NOT host.name:dv* and NOT host.name:ts*

How many graphs show up when you enter those search terms? And do you see any noticeable gaps in the lines on the graphs, where they're not reporting any data at all? I'm trying to find a clue as to what could be causing this problem.

@hendry-lim
Author

hendry-lim commented Jun 25, 2020

The filters and grouping work fine in both Inventory and Metrics Explorer. However, no alert instances are created when we apply the same filter in an alert.
We have also tried creating the Metric Threshold alert both manually and through the Metrics Explorer UI, with the same result.

We have also noticed discrepancies between the chart shown on the alert flyout and the Metrics Explorer charts. I am not sure if this is related, but please refer to the following screenshots.

Metrics Explorer

[Screenshot: Metrics Explorer charts]

Alert flyout

[Screenshot: alert flyout chart]

Based on the above screenshots, you may notice that we have 2 hosts with CPU usage exceeding 50% most of the time. However, the chart in the alert flyout only plots the usage for one host. Does this mean the chart in the alert flyout does not support grouping in 7.8 yet?

Now, based on the same Metrics Explorer charts and alert configuration, we know there should be at least 1 alert instance every time: the 2nd chart on the first row shows that host's CPU usage consistently hovering around 56%-58%. Yet we are seeing nothing in the alert instance list.

[Screenshot: empty alert instance list]

If we remove the 2nd condition: and not host.name:*doi*, we have alert instances created:

[Screenshot: alert instances created]

@Zacqary
Contributor

Zacqary commented Jun 26, 2020

Still trying to figure out how to reproduce this result on my end, but in the meantime I can answer:

Does this mean the chart in the alert flyout does not support grouping in 7.8 yet?

Correct, we only show you one sample group on the chart and don't yet have a way to paginate through all the rest of them. We do have #67684 coming up in 7.9 which can at least tell you if some of your matched groups will cause the alert to fire, but we can talk about adding pagination if that'd be a good UX improvement.

Will update this issue once we get closer to figuring out what's causing your problem.

@hendry-lim
Author

Yup, sure: my initial post was based on our customer's production and DR environments, while my subsequent post was based on our demo/test environment. We can reproduce the issue in both.
Can't wait for 7.9; #67684 will be very useful.

@hendry-lim
Author

hendry-lim commented Jul 1, 2020

Noticed the following query error in ES; it looks like the query failed to execute. When I tried to run the query manually, I got the error Trying to create too many buckets. Must be less than or equal to: [20000] but was [21410]. Our current search.max_buckets setting is 20000.

[0], node[Cfor_9aZTfa_WQZL2bk8Sw], [R], s[STARTED], a[id = wNfykpUlRKGaSOaP7o6ajQ]: Failed to execute[SearchRequest
    {
        searchType = QUERY_THEN_FETCH,
        indices = [metricbeat-7.8.0-2020.06.19-000001, metricbeat-7.8.0, metricbeat-7.8.0-2020.06.26-000002],
        indicesOptions = IndicesOptions[ignore_unavailable = false, allow_no_indices = true, expand_wildcards_open = true, expand_wildcards_closed = false, expand_wildcards_hidden = false, allow_aliases_to_multiple_indices = true, forbid_closed_indices = true, ignore_aliases = false, ignore_throttled = true],
        types = [],
        routing = 'null',
        preference = 'null',
        requestCache = null,
        scroll = null,
        maxConcurrentShardRequests = 0,
        batchedReduceSize = 512,
        preFilterShardSize = null,
        allowPartialSearchResults = true,
        localClusterAlias = emcprod,
        getOrCreateAbsoluteStartMillis = 1593595703152,
        ccsMinimizeRoundtrips = true,
        source =
        {
            "from": 0,
            "size": 0,
            "query":
            {
                "bool":
                {
                    "filter": [
                        {
                            "bool":
                            {
                                "filter": [
                                    {
                                        "bool":
                                        {
                                            "must_not": [
                                                {
                                                    "bool":
                                                    {
                                                        "should": [
                                                            {
                                                                "query_string":
                                                                {
                                                                    "query": "alpine*",
                                                                    "fields": ["host.name^1.0"],
                                                                    "type": "best_fields",
                                                                    "default_operator": "or",
                                                                    "max_determinized_states": 10000,
                                                                    "enable_position_increments": true,
                                                                    "fuzziness": "AUTO",
                                                                    "fuzzy_prefix_length": 0,
                                                                    "fuzzy_max_expansions": 50,
                                                                    "phrase_slop": 0,
                                                                    "escape": false,
                                                                    "auto_generate_synonyms_phrase_query": true,
                                                                    "fuzzy_transpositions": true,
                                                                    "boost": 1.0
                                                                }
                                                            }
                                                        ],
                                                        "adjust_pure_negative": true,
                                                        "minimum_should_match": "1",
                                                        "boost": 1.0
                                                    }
                                                }
                                            ],
                                            "adjust_pure_negative": true,
                                            "boost": 1.0
                                        }
                                    },
                                    {
                                        "bool":
                                        {
                                            "must_not": [
                                                {
                                                    "bool":
                                                    {
                                                        "should": [
                                                            {
                                                                "query_string":
                                                                {
                                                                    "query": "doi*",
                                                                    "fields": ["host.name^1.0"],
                                                                    "type": "best_fields",
                                                                    "default_operator": "or",
                                                                    "max_determinized_states": 10000,
                                                                    "enable_position_increments": true,
                                                                    "fuzziness": "AUTO",
                                                                    "fuzzy_prefix_length": 0,
                                                                    "fuzzy_max_expansions": 50,
                                                                    "phrase_slop": 0,
                                                                    "escape": false,
                                                                    "auto_generate_synonyms_phrase_query": true,
                                                                    "fuzzy_transpositions": true,
                                                                    "boost": 1.0
                                                                }
                                                            }
                                                        ],
                                                        "adjust_pure_negative": true,
                                                        "minimum_should_match": "1",
                                                        "boost": 1.0
                                                    }
                                                }
                                            ],
                                            "adjust_pure_negative": true,
                                            "boost": 1.0
                                        }
                                    }
                                ],
                                "adjust_pure_negative": true,
                                "boost": 1.0
                            }
                        }
                    ],
                    "adjust_pure_negative": true,
                    "boost": 1.0
                }
            },
            "aggregations":
            {
                "groupings":
                {
                    "composite":
                    {
                        "size": 10,
                        "sources": [
                            {
                                "groupBy":
                                {
                                    "terms":
                                    {
                                        "field": "host.name",
                                        "missing_bucket": false,
                                        "order": "asc"
                                    }
                                }
                            }
                        ],
                        "after":
                        {
                            "groupBy": "repos02"
                        }
                    },
                    "aggregations":
                    {
                        "aggregatedIntervals":
                        {
                            "date_histogram":
                            {
                                "field": "@timestamp",
                                "fixed_interval": "5m",
                                "offset": -188000,
                                "order":
                                {
                                    "_key": "asc"
                                },
                                "keyed": false,
                                "min_doc_count": 0,
                                "extended_bounds":
                                {
                                    "min": 1593594112712,
                                    "max": 1593595612712
                                }
                            },
                            "aggregations":
                            {
                                "aggregatedValue":
                                {
                                    "avg":
                                    {
                                        "field": "system.memory.actual.used.pct"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
]lastShard[true]

@pmuellr
Member

pmuellr commented Jul 2, 2020

Exceeding a bucket limit could explain why @Zacqary was unable to reproduce this.

I believe the index threshold alert does some kind of calculation to determine how many buckets would be created, and does something if it's over some limit (e.g., 10K), such as reducing the number of date ranges.

Also looking at that query, wondering if the filter can be moved out of the query structure - it would then get cached, which might be good or bad :-). I believe we do have it outside the query for the index threshold.

Should also note that the index threshold only does elaborate date range aggregations like this for generating the data for the graph it shows - the query used when the alert runs only looks at one specific date range. I wonder if this query was actually for a graph visualization rather than what the alert ran.

@hendry-lim
Author

hendry-lim commented Jul 2, 2020

This error occurs at regular intervals, so I don't think it is caused by the graph visualization. The alert flyout was also not open when this error appeared in the ES log.

@Zacqary
Contributor

Zacqary commented Jul 2, 2020

@hendry-lim Thanks for that error, that's helpful! Metric threshold alerts don't have a too many buckets handler, so it looks like that's what we'll need to add.

@Zacqary Zacqary added bug Fixes for quality problems that affect the customer experience Feature:Metrics UI Metrics UI feature labels Jul 2, 2020
@Zacqary Zacqary changed the title [Alerting] Metric threshold with multiple filters is not working [Metrics UI] Alerts fail when hitting the bucket limit Jul 2, 2020
@Zacqary
Contributor

Zacqary commented Jul 2, 2020

On closer investigation, it looks like your query is missing a range filter, which I think is due to a bug in the way we construct the alert query. It should be producing a maximum of 30 buckets (6 groups, 5-minute intervals, 25 minutes of data to look at as a buffer), but because the range filter got overwritten, it's actually looking back at all data for all time.

It would still be valuable to handle the bucket limit in case you wanted to alert on 4000 groups at once, but this particular query shouldn't be hitting the bucket limit at all. So that's two bugs for us to fix.
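The arithmetic above (6 groups, 5-minute intervals, 25 minutes of data) can be checked directly. A small sketch (the function is illustrative, not Kibana code) of why a query with the range filter intact stays around 30 buckets, while one that looks back over all time blows past the 20000 limit reported earlier in the thread:

```typescript
// Estimate date_histogram buckets for a grouped metric query:
// one time bucket per interval, per group.
function estimateBuckets(groups: number, windowMinutes: number, intervalMinutes: number): number {
  return groups * Math.ceil(windowMinutes / intervalMinutes);
}

// Range filter intact: 6 groups, 25-minute lookback, 5-minute interval.
const bounded = estimateBuckets(6, 25, 5); // 30

// Range filter overwritten: the lookback is effectively "all time".
// For example, ~30 days of data at 5-minute intervals across 6 hosts:
const unbounded = estimateBuckets(6, 30 * 24 * 60, 5); // 51840, far past 20000
```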

@Zacqary
Contributor

Zacqary commented Jul 2, 2020

#70672 should fix the root cause of this issue.

@simianhacker told me that the getAllCompositeData function may already be able to handle an unlimited amount of groups, so that PR may resolve this whole thing without us needing to write an additional handler for a Too Many Buckets error. I'd like to stress test it to make sure, though.
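For context, a composite aggregation pages through groups via an after key (visible in the query log above as "after": {"groupBy": "repos02"}): each response returns an after_key that is fed back into the next request until no key remains. A minimal sketch of that loop, with a mocked search callback standing in for the real Elasticsearch client (the function name echoes getAllCompositeData but this is not the actual Kibana code):

```typescript
interface CompositePage<K> {
  buckets: Array<{ key: K }>;
  after_key?: K; // present while more pages remain
}

// Drain every page of a composite aggregation by feeding each response's
// after_key back into the next request. `search` stands in for an ES call.
async function getAllCompositeBuckets<K>(
  search: (after?: K) => Promise<CompositePage<K>>
): Promise<Array<{ key: K }>> {
  const all: Array<{ key: K }> = [];
  let after: K | undefined;
  do {
    const page = await search(after);
    all.push(...page.buckets);
    after = page.after_key;
  } while (after !== undefined);
  return all;
}
```

Because each page is bounded (size 10 in the query above), paging like this keeps any single request small even when the total number of groups is large.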

@hendry-lim
Author

Thanks a lot for your help @Zacqary @pmuellr

@hendry-lim
Author

Looking good in 7.8.1.

6 participants