
[Metrics UI] Alerts fail when hitting the bucket limit #68492

Closed
hendry-lim opened this issue Jun 8, 2020 · 15 comments
Labels
  • bug (Fixes for quality problems that affect the customer experience)
  • Feature:Metrics UI (Metrics UI feature)
  • Team:Infra Monitoring UI - DEPRECATED (Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services)
  • triage_needed

Comments

@hendry-lim

hendry-lim commented Jun 8, 2020

Maintainer Edit

  • Metric threshold/inventory alerts are unable to handle a Too Many Buckets exception within the alert executor.
  • Metric threshold queries sometimes override the range filter and query too much data, triggering a Too Many Buckets exception.
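The first point can be sketched as a guard in the alert executor. This is a hypothetical illustration, not the actual Kibana implementation; the interface shape and helper names are invented for the example:

```typescript
// Hypothetical sketch: Elasticsearch reports the bucket limit as a
// too_many_buckets_exception, usually nested inside a caused_by chain.
interface EsErrorCause {
  type?: string;
  reason?: string;
  caused_by?: EsErrorCause;
}

// Walk the caused_by chain looking for the bucket-limit error type.
function isTooManyBucketsError(cause?: EsErrorCause): boolean {
  while (cause) {
    if (cause.type === 'too_many_buckets_exception') {
      return true;
    }
    cause = cause.caused_by;
  }
  return false;
}

// Illustrative use inside an executor (runAlertQuery is hypothetical):
//
//   try {
//     await runAlertQuery();
//   } catch (e) {
//     if (isTooManyBucketsError(e.body?.error)) {
//       // Surface a readable error state instead of creating no instances.
//     } else {
//       throw e;
//     }
//   }
```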

Original Submitted Issue

Kibana version: 7.7.1

Elasticsearch version: 7.7.1

Server OS version: RHEL 8

Browser version: 83.0.4103.97

Browser OS version: Windows 10

Original install method (e.g. download page, yum, from source, etc.): Docker

Describe the bug:
Alert instances were not created with the following filter in Metric Threshold alert:
NOT host.name:dv* and NOT host.name:ts*

However, alert instances were created if we only used the following:
NOT host.name:dv*

There are other hosts that exceeded the memory threshold other than those that matched dv* and ts*.

Steps to reproduce:

  1. Create a Metric Threshold
  2. Condition: Average of system.memory.used.pct is above or equals 0.8
  3. For the last 5 minutes
  4. Filter: NOT host.name:dv* and NOT host.name:ts*
  5. Alert per host.name

Expected behavior:
Alert instances should be created with either/both filters applied as long as there are hosts that exceed the memory threshold.

@hendry-lim hendry-lim changed the title [Alerting] Metric threshold with multiple filters did not work [Alerting] Metric threshold with multiple filters is not working Jun 8, 2020
@flash1293 added the Team:ResponseOps and triage_needed labels Jun 10, 2020
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@mikecote added the Team:Infra Monitoring UI - DEPRECATED label and removed the Team:ResponseOps label Jun 10, 2020
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@hendry-lim
Author

Issue persists in 7.8.0.

@Zacqary
Contributor

Zacqary commented Jun 24, 2020

Tested this on 7.8 and I wasn't able to reproduce. When alerting per host.name I was always able to see one instance per matched host.name regardless of how many NOT clauses I added to the filter. Maybe there's something else going on with your data that I'm not seeing on my end?

What happens when you go to the Metrics Explorer and search for the data like this?
[Screenshot: Metrics Explorer search]

  • Metric: system.memory.used.pct
  • Graph per: host.name
  • Filter: NOT host.name:dv* and NOT host.name:ts*

How many graphs show up when you enter those search terms? And do you see any noticeable gaps in the lines on the graphs, where they're not reporting any data at all? I'm trying to find a clue as to what could be causing this problem.

@hendry-lim
Author

hendry-lim commented Jun 25, 2020

The filters and grouping work fine in both Inventory and Metrics Explorer. However, no alert instances are created when we apply the same filter in an alert.
We have also tried creating the Metric Threshold alert both manually and through the Metrics Explorer UI, with the same result.

We have also noticed discrepancies between the chart shown on the alert flyout and the Metrics Explorer charts. I am not sure if this is related, but please refer to the following screenshots.

Metrics Explorer

[Screenshot: Metrics Explorer charts]

Alert flyout

[Screenshot: alert flyout chart]

Based on the above screenshots, you may notice that we have 2 hosts with CPU usage exceeding 50% most of the time. However, the chart in the alert flyout only plots the usage for one host. Does this mean the chart in the alert flyout does not support grouping in 7.8 yet?

Now, based on the same Metrics Explorer charts and alert configuration, we know there should be at least 1 alert instance every time: the 2nd chart on the first row shows that host's CPU usage consistently hovering around 56%-58%. Yet we are seeing nothing in the alert instance list.

[Screenshot: empty alert instance list]

If we remove the 2nd condition: and not host.name:*doi*, we have alert instances created:

[Screenshot: alert instances created]

@Zacqary
Contributor

Zacqary commented Jun 26, 2020

Still trying to figure out how to reproduce this result on my end, but in the meantime I can answer:

Does this mean the chart in the alert flyout does not support grouping in 7.8 yet?

Correct, we only show you one sample group on the chart and don't yet have a way to paginate through all the rest of them. We do have #67684 coming up in 7.9 which can at least tell you if some of your matched groups will cause the alert to fire, but we can talk about adding pagination if that'd be a good UX improvement.

Will update this issue once we get closer to figuring out what's causing your problem.

@hendry-lim
Author

Yup, sure: my initial post was based on our customer's production and DR environments, while my subsequent post was based on our demo/test environment. We can reproduce the issue in both.
Can't wait for 7.9; #67684 will be very useful.

@hendry-lim
Author

hendry-lim commented Jul 1, 2020

Noticed the following query error in ES; it looks like the query failed to execute. When I tried to run the query manually, I got the error Trying to create too many buckets. Must be less than or equal to: [20000] but was [21410]. Our current search.max_buckets setting is 20000.

[0], node[Cfor_9aZTfa_WQZL2bk8Sw], [R], s[STARTED], a[id = wNfykpUlRKGaSOaP7o6ajQ]: Failed to execute[SearchRequest
    {
        searchType = QUERY_THEN_FETCH,
        indices = [metricbeat-7.8.0-2020.06.19-000001, metricbeat-7.8.0, metricbeat-7.8.0-2020.06.26-000002],
        indicesOptions = IndicesOptions[ignore_unavailable = false, allow_no_indices = true, expand_wildcards_open = true, expand_wildcards_closed = false, expand_wildcards_hidden = false, allow_aliases_to_multiple_indices = true, forbid_closed_indices = true, ignore_aliases = false, ignore_throttled = true],
        types = [],
        routing = 'null',
        preference = 'null',
        requestCache = null,
        scroll = null,
        maxConcurrentShardRequests = 0,
        batchedReduceSize = 512,
        preFilterShardSize = null,
        allowPartialSearchResults = true,
        localClusterAlias = emcprod,
        getOrCreateAbsoluteStartMillis = 1593595703152,
        ccsMinimizeRoundtrips = true,
        source =
        {
            "from": 0,
            "size": 0,
            "query":
            {
                "bool":
                {
                    "filter": [
                        {
                            "bool":
                            {
                                "filter": [
                                    {
                                        "bool":
                                        {
                                            "must_not": [
                                                {
                                                    "bool":
                                                    {
                                                        "should": [
                                                            {
                                                                "query_string":
                                                                {
                                                                    "query": "alpine*",
                                                                    "fields": ["host.name^1.0"],
                                                                    "type": "best_fields",
                                                                    "default_operator": "or",
                                                                    "max_determinized_states": 10000,
                                                                    "enable_position_increments": true,
                                                                    "fuzziness": "AUTO",
                                                                    "fuzzy_prefix_length": 0,
                                                                    "fuzzy_max_expansions": 50,
                                                                    "phrase_slop": 0,
                                                                    "escape": false,
                                                                    "auto_generate_synonyms_phrase_query": true,
                                                                    "fuzzy_transpositions": true,
                                                                    "boost": 1.0
                                                                }
                                                            }
                                                        ],
                                                        "adjust_pure_negative": true,
                                                        "minimum_should_match": "1",
                                                        "boost": 1.0
                                                    }
                                                }
                                            ],
                                            "adjust_pure_negative": true,
                                            "boost": 1.0
                                        }
                                    },
                                    {
                                        "bool":
                                        {
                                            "must_not": [
                                                {
                                                    "bool":
                                                    {
                                                        "should": [
                                                            {
                                                                "query_string":
                                                                {
                                                                    "query": "doi*",
                                                                    "fields": ["host.name^1.0"],
                                                                    "type": "best_fields",
                                                                    "default_operator": "or",
                                                                    "max_determinized_states": 10000,
                                                                    "enable_position_increments": true,
                                                                    "fuzziness": "AUTO",
                                                                    "fuzzy_prefix_length": 0,
                                                                    "fuzzy_max_expansions": 50,
                                                                    "phrase_slop": 0,
                                                                    "escape": false,
                                                                    "auto_generate_synonyms_phrase_query": true,
                                                                    "fuzzy_transpositions": true,
                                                                    "boost": 1.0
                                                                }
                                                            }
                                                        ],
                                                        "adjust_pure_negative": true,
                                                        "minimum_should_match": "1",
                                                        "boost": 1.0
                                                    }
                                                }
                                            ],
                                            "adjust_pure_negative": true,
                                            "boost": 1.0
                                        }
                                    }
                                ],
                                "adjust_pure_negative": true,
                                "boost": 1.0
                            }
                        }
                    ],
                    "adjust_pure_negative": true,
                    "boost": 1.0
                }
            },
            "aggregations":
            {
                "groupings":
                {
                    "composite":
                    {
                        "size": 10,
                        "sources": [
                            {
                                "groupBy":
                                {
                                    "terms":
                                    {
                                        "field": "host.name",
                                        "missing_bucket": false,
                                        "order": "asc"
                                    }
                                }
                            }
                        ],
                        "after":
                        {
                            "groupBy": "repos02"
                        }
                    },
                    "aggregations":
                    {
                        "aggregatedIntervals":
                        {
                            "date_histogram":
                            {
                                "field": "@timestamp",
                                "fixed_interval": "5m",
                                "offset": -188000,
                                "order":
                                {
                                    "_key": "asc"
                                },
                                "keyed": false,
                                "min_doc_count": 0,
                                "extended_bounds":
                                {
                                    "min": 1593594112712,
                                    "max": 1593595612712
                                }
                            },
                            "aggregations":
                            {
                                "aggregatedValue":
                                {
                                    "avg":
                                    {
                                        "field": "system.memory.actual.used.pct"
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
]lastShard[true]

@pmuellr
Member

pmuellr commented Jul 2, 2020

Exceeding a bucket limit could explain why @Zacqary was unable to reproduce this.

I believe the index threshold alert does some kind of calculation to determine how many buckets would be created, and does something if it's over some limit (e.g., 10K), such as reducing the number of date ranges.

Also looking at that query, wondering if the filter can be moved out of the query structure - it would then get cached, which might be good or bad :-). I believe we do have it outside the query for the index threshold.

Should also note that the index threshold only does elaborate date range aggregations like this for generating the data for the graph it shows - the query used when the alert runs only looks at one specific date range. I wonder if this query was actually for a graph visualization rather than what the alert ran.

@hendry-lim
Author

hendry-lim commented Jul 2, 2020

This error occurs at regular intervals, so I don't think it is caused by the graph visualization. The alert flyout was also not open when this error appeared in the ES log.

@Zacqary
Contributor

Zacqary commented Jul 2, 2020

@hendry-lim Thanks for that error, that's helpful! Metric threshold alerts don't have a too many buckets handler, so it looks like that's what we'll need to add.

@Zacqary Zacqary added bug Fixes for quality problems that affect the customer experience Feature:Metrics UI Metrics UI feature labels Jul 2, 2020
@Zacqary Zacqary changed the title [Alerting] Metric threshold with multiple filters is not working [Metrics UI] Alerts fail when hitting the bucket limit Jul 2, 2020
@Zacqary
Contributor

Zacqary commented Jul 2, 2020

On closer investigation, it looks like your query is missing a range filter, which I think is due to a bug in the way we construct the alert query. It should be producing a maximum of 30 buckets (6 groups, 5-minute intervals, 25 minutes of data to look at as a buffer), but because the range filter got overwritten, it's actually looking back at all data for all time.

It would still be valuable to handle the bucket limit in case you wanted to alert on 4000 groups at once, but this particular query shouldn't be hitting the bucket limit at all. So that's two bugs for us to fix.
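The arithmetic above (6 groups, 5-minute intervals, 25 minutes of data) can be checked directly. A small sketch (the function is illustrative, not Kibana code) of why a query with the range filter intact stays around 30 buckets, while one that looks back over all time blows past the 20000 limit reported earlier in the thread:

```typescript
// Estimate date_histogram buckets for a grouped metric query:
// one time bucket per interval, per group.
function estimateBuckets(groups: number, windowMinutes: number, intervalMinutes: number): number {
  return groups * Math.ceil(windowMinutes / intervalMinutes);
}

// Range filter intact: 6 groups, 25-minute lookback, 5-minute interval.
const bounded = estimateBuckets(6, 25, 5); // 30

// Range filter overwritten: the lookback is effectively "all time".
// For example, ~30 days of data at 5-minute intervals across 6 hosts:
const unbounded = estimateBuckets(6, 30 * 24 * 60, 5); // 51840, far past 20000
```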

@Zacqary
Contributor

Zacqary commented Jul 2, 2020

#70672 should fix the root cause of this issue.

@simianhacker told me that the getAllCompositeData function may already be able to handle an unlimited amount of groups, so that PR may resolve this whole thing without us needing to write an additional handler for a Too Many Buckets error. I'd like to stress test it to make sure, though.
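For context, a composite aggregation pages through groups via an after key (visible in the query log above as "after": {"groupBy": "repos02"}): each response returns an after_key that is fed back into the next request until no key remains. A minimal sketch of that loop, with a mocked search callback standing in for the real Elasticsearch client (the function name echoes getAllCompositeData but this is not the actual Kibana code):

```typescript
interface CompositePage<K> {
  buckets: Array<{ key: K }>;
  after_key?: K; // present while more pages remain
}

// Drain every page of a composite aggregation by feeding each response's
// after_key back into the next request. `search` stands in for an ES call.
async function getAllCompositeBuckets<K>(
  search: (after?: K) => Promise<CompositePage<K>>
): Promise<Array<{ key: K }>> {
  const all: Array<{ key: K }> = [];
  let after: K | undefined;
  do {
    const page = await search(after);
    all.push(...page.buckets);
    after = page.after_key;
  } while (after !== undefined);
  return all;
}
```

Because each page is bounded (size 10 in the query above), paging like this keeps any single request small even when the total number of groups is large.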

@hendry-lim
Author

Thanks a lot for your help @Zacqary @pmuellr

@hendry-lim
Author

Looking good in 7.8.1.

6 participants