Terms agg: calculate aggs on 'other' bucket #12411

j0hnsmith · 2015-07-23T08:33:46Z

The terms aggregation now provides an 'other' bucket with a count. I'd like to see the same aggregations performed on the 'other' bucket. E.g. if I'm doing a stats aggregation, I get stats (sum) for docs with term foo and bar, but not for docs where the field is missing or has a null value.

This is really important for analytics type services as all the values must add up to 100% of the total.

There's quite a bit of discussion about it in #5324.

clintongormley · 2015-07-23T12:11:07Z

@j0hnsmith calculating sub-aggs on an other bucket requires two round trips. The first calculates the top-ten terms (plus their sub-aggs). The second calculates the sub-aggs on everything except the top-ten terms.

To support this in Elasticsearch, we'd need to implement #12316 first. However this is something you can do yourself today.
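As a rough illustration of the two round trips, here is a minimal Python sketch that builds both request bodies as plain dicts. The function names and the aggregation names `top` and `other` are made up for this example. It assumes a single-valued field and that "other" means documents that have a value outside the top terms, so documents without a value are excluded by the `exists` clause.

```python
from typing import Any


def first_pass_body(field: str, size: int, sub_aggs: dict[str, Any]) -> dict[str, Any]:
    """Request 1: the top-`size` terms of `field`, each with `sub_aggs`."""
    return {
        "size": 0,
        "aggs": {"top": {"terms": {"field": field, "size": size}, "aggs": sub_aggs}},
    }


def second_pass_body(
    field: str, top_response: dict[str, Any], sub_aggs: dict[str, Any]
) -> dict[str, Any]:
    """Request 2: the same sub-aggs over documents outside the top terms."""
    # Collect the term keys returned by the first request.
    top_keys = [b["key"] for b in top_response["aggregations"]["top"]["buckets"]]
    other_filter = {
        "bool": {
            # Only documents that actually have a value for the field...
            "filter": [{"exists": {"field": field}}],
            # ...and whose value is not one of the top terms.
            "must_not": [{"terms": {field: top_keys}}],
        }
    }
    return {
        "size": 0,
        "aggs": {"other": {"filter": other_filter, "aggs": sub_aggs}},
    }
```

The two requests are not atomic: if the index changes between them, the top terms and the "other" figures can disagree slightly.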

j0hnsmith · 2015-07-24T11:14:27Z

I know there are workarounds, but they get progressively more complex with every level of sub-aggregation. Native support could simplify some very complex queries.

vivekmoosani · 2016-01-14T06:04:09Z

+1

PaulGrandperrin · 2016-01-15T18:03:26Z

+1

powermik · 2016-02-19T12:39:42Z

+1

dynomeat · 2016-03-10T01:49:48Z

+1

EdwardKaravakis · 2016-05-25T13:15:02Z

+1

markharwood · 2018-03-16T17:23:08Z

cc @elastic/es-search-aggs

timroes · 2019-07-01T08:15:58Z

I just wanted to check in on this issue and ping the current team @elastic/es-analytics-geo (since team names seem to have changed).

We currently use the workaround described above in Kibana to calculate the "Other bucket" for terms aggregations, and it's causing us a lot of pain. For one thing, it's the only case where we need a second request to gather all the information a visualization needs to render, which requires special handling in our infrastructure. Also, since that "Other bucket" is not really a bucket in ES terms, we need a lot of special handling for it, e.g. the filter creation logic needs to handle these buckets individually. We also see a lot of issues where our code doesn't work properly with more complex aggregation configurations for a visualization. And last but not least, since they are not real buckets, we can never make them work properly with Bucket Script or Bucket Selector, which we want to implement in the future; we would then need yet more special handling, or have to disable those features for Other again.

Having the "Other bucket" feature available in Elasticsearch (most importantly for Terms, but we've also seen users asking for it on the Filters and Significant Terms aggs, and I assume in the future also on the Rare Terms agg) would be a really huge win for Kibana visualizations and its infrastructure. If there is anything we can support you with, please let me know.

polyfractal · 2019-07-01T15:41:47Z

Not overly familiar with the issue, but is there a reason that the missing aggregation and/or missing parameter on terms agg aren't sufficient?

E.g. the missing aggregation can go next to the terms aggregation and give you all the documents that don't have the particular field which is being aggregated in the terms agg, and allow you to perform sub-aggs there.

Alternatively, you can set the missing param on a terms agg to something unique (__$MISSING$__ or whatever), and then use that bucket for sub-aggregations.
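For illustration, the two options above might look like the following aggregation bodies, written as Python dicts. The aggregation names `by_term` and `no_value` and the `__missing__` sentinel are made up for this sketch. Both options cover only documents that have no value for the field.

```python
from typing import Any

# Hypothetical sentinel; pick a value that cannot occur in the real data.
MISSING_KEY = "__missing__"


def terms_with_missing_param(
    field: str, size: int, sub_aggs: dict[str, Any]
) -> dict[str, Any]:
    """Option A: route documents without a value into a sentinel terms bucket."""
    return {
        "by_term": {
            "terms": {"field": field, "size": size, "missing": MISSING_KEY},
            "aggs": sub_aggs,
        }
    }


def terms_with_missing_sibling(
    field: str, size: int, sub_aggs: dict[str, Any]
) -> dict[str, Any]:
    """Option B: a sibling `missing` aggregation carrying the same sub-aggs."""
    return {
        "by_term": {"terms": {"field": field, "size": size}, "aggs": sub_aggs},
        "no_value": {"missing": {"field": field}, "aggs": sub_aggs},
    }
```

With option A the sentinel bucket competes with the real terms for a place in the top `size` buckets, so it is only returned if it ranks high enough. Option B always returns its own bucket.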

timroes · 2019-07-02T07:59:08Z

I think there is some confusion between missing and "Other documents" here :-) Missing is a bucket containing all the documents that have no value in that specific field; they are assigned the missing value instead. The Other bucket should contain all documents not returned in any of the other buckets. These documents can perfectly well have a value for that field, just not one of the top x terms requested. So if you request the top 5 terms and a document is not in one of those buckets, it should be in the other bucket. That document may still have a value for that field, just not one of the 5 most common ones.

I can also briefly explain why this has been such a highly requested feature in Kibana for so long that we decided to implement it on our side (in the not very stable way described above): pie charts :-)

If you want to draw the top 5 countries in a pie chart without the other value, they would always make up 100% of the chart. That can be very confusing depending on what you're trying to visualize, because those countries may not make up 100% of the data at all, but perhaps just 30%. Users want to see that, with 70% drawn as "others" and the top 5 values as individual slices. Similar things apply to other chart types as well.
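For plain document counts, the pie-chart case can already be served from a single terms response, because it reports `sum_other_doc_count` alongside the buckets. The sketch below (the function name `pie_slices` and the sample numbers are invented for illustration) turns such a response into slice fractions. It assumes a single-valued field, and it does not help with metrics such as sums, which is what this issue is about.

```python
from typing import Any


def pie_slices(terms_agg: dict[str, Any], other_label: str = "Other") -> list[tuple[str, float]]:
    """Return (label, fraction) pairs for the top buckets plus an 'Other' slice."""
    buckets = terms_agg["buckets"]
    # Count of documents that fell into buckets not returned in the response.
    other = terms_agg.get("sum_other_doc_count", 0)
    total = sum(b["doc_count"] for b in buckets) + other
    if total == 0:
        return []
    slices = [(str(b["key"]), b["doc_count"] / total) for b in buckets]
    if other:
        slices.append((other_label, other / total))
    return slices


# Invented sample: the top 5 countries hold 30 of 100 documents.
example = {
    "sum_other_doc_count": 70,
    "buckets": [
        {"key": "US", "doc_count": 10},
        {"key": "DE", "doc_count": 8},
        {"key": "FR", "doc_count": 6},
        {"key": "GB", "doc_count": 4},
        {"key": "IN", "doc_count": 2},
    ],
}
slices = pie_slices(example)
```

Here the last slice is `("Other", 0.7)`, so the top 5 countries are drawn as 30% of the chart rather than 100%.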

Just FYI: we also expose the Missing setting and, as you described above, internally use a unique identifier for that bucket so we can find it later, but this is completely different functionality.

polyfractal · 2019-07-02T12:55:57Z

Gotcha, makes sense 👍 . Thanks for the extra details @timroes. Mark added the team-discuss label to this, so we'll chat about it in the next meeting.

polyfractal · 2019-07-03T15:33:18Z

Chatted in analytics meeting, and unfortunately we're at the same roadblock as four years ago. We can't calculate an "other" bucket before we know the global top-n results... and at that point it's too late to build an "other" bucket because we are reducing on the coordinating node. To do this we need two-pass/multi-pass support in the aggregation framework which doesn't exist today (although we have been talking about how we could potentially do it). First pass to find top-n, second pass to collect everything else that wasn't in the top-n into a bucket.
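To make the ordering problem concrete, here is a toy in-memory model in Python. It is not how the aggregation framework is implemented: shards are plain lists of dicts, every shard reports full per-term counts (real shards return only a limited number of terms), and the name `two_pass_other` is invented. It only shows that the "other" metric cannot be collected until the global top-n is known.

```python
from collections import Counter
from typing import Any


def two_pass_other(
    shards: list[list[dict[str, Any]]], field: str, metric: str, n: int
) -> tuple[set[Any], float]:
    """Return the global top-n terms and the summed metric of all other docs."""
    # Pass 1: merge per-shard term counts to find the global top-n terms.
    merged: Counter = Counter()
    for docs in shards:
        merged.update(d[field] for d in docs if field in d)
    top = {term for term, _ in merged.most_common(n)}

    # Pass 2: only now can each shard tell which documents belong to "other".
    other_sum = sum(
        d[metric]
        for docs in shards
        for d in docs
        if field in d and d[field] not in top
    )
    return top, other_sum


shards = [
    [{"user": "a", "number": 1}, {"user": "a", "number": 2}, {"user": "b", "number": 5}],
    [{"user": "b", "number": 1}, {"user": "b", "number": 1}, {"user": "c", "number": 10}, {"number": 100}],
]
top, other_sum = two_pass_other(shards, "user", "number", 1)
```

In this sample, `a` is the most frequent term on the first shard but `b` wins globally, so a shard deciding on its own what counts as "other" would get it wrong.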

How is Kibana performing the two passes today? What does the structure of the second query look like (you can point me at code too, that's fine)? Trying to see if there is something we can do to help make the second pass easier until the agg framework has true multi-pass capability. Is it a filter agg with must_not: <top-n terms> or similar?

As a side note, when/if Jim's API (#26472) is implemented, external two-pass implementations will at least have a consistent view of the index which will make the situation a little better.

timroes · 2019-07-04T14:29:32Z

Yes, Kibana will basically build a must_not: <top-n terms> from the first request.

Here is an example query where I enabled Other Bucket for two nested terms aggregations with one sum metric:

Actual query

{
  "aggs": {
    "2": {
      "terms": {
        "field": "user",
        "order": {
          "1": "desc"
        },
        "size": 5
      },
      "aggs": {
        "1": {
          "sum": {
            "field": "number"
          }
        },
        "3": {
          "terms": {
            "field": "state",
            "order": {
              "1": "desc"
            },
            "size": 1
          },
          "aggs": {
            "1": {
              "sum": {
                "field": "number"
              }
            }
          }
        }
      }
    }
  },
  "size": 0,
  "_source": {
    "excludes": []
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {
    "is_bug": {
      "script": {
        "source": "return doc['labels'].contains('bug') ? 1 : 0",
        "lang": "painless"
      }
    }
  },
  "docvalue_fields": [
    {
      "field": "closed_at.time",
      "format": "date_time"
    },
    {
      "field": "created_at.time",
      "format": "date_time"
    },
    {
      "field": "last_crawled_at",
      "format": "date_time"
    },
    {
      "field": "updated_at.time",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}

Level 1 (user) other bucket query

{
  "aggs": {
    "other-filter": {
      "aggs": {
        "1": {
          "sum": {
            "field": "number"
          }
        },
        "3": {
          "terms": {
            "field": "state",
            "order": {
              "1": "desc"
            },
            "size": 1
          },
          "aggs": {
            "1": {
              "sum": {
                "field": "number"
              }
            }
          }
        }
      },
      "filters": {
        "filters": {
          "": {
            "bool": {
              "must": [
                {
                  "exists": {
                    "field": "user"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "spalger"
                    }
                  }
                },
                {
                  "match_phrase": {
                    "user": {
                      "query": "nreese"
                    }
                  }
                },
                {
                  "match_phrase": {
                    "user": {
                      "query": "cjcenizal"
                    }
                  }
                },
                {
                  "match_phrase": {
                    "user": {
                      "query": "kibanamachine"
                    }
                  }
                },
                {
                  "match_phrase": {
                    "user": {
                      "query": "stacey-gammon"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "size": 0,
  "_source": {
    "excludes": []
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {
    "is_bug": {
      "script": {
        "source": "return doc['labels'].contains('bug') ? 1 : 0",
        "lang": "painless"
      }
    }
  },
  "docvalue_fields": [
    {
      "field": "closed_at.time",
      "format": "date_time"
    },
    {
      "field": "created_at.time",
      "format": "date_time"
    },
    {
      "field": "last_crawled_at",
      "format": "date_time"
    },
    {
      "field": "updated_at.time",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}

Level 2 (state) other bucket query

{
  "aggs": {
    "other-filter": {
      "aggs": {
        "1": {
          "sum": {
            "field": "number"
          }
        }
      },
      "filters": {
        "filters": {
          "-spalger": {
            "bool": {
              "must": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "spalger"
                    }
                  }
                },
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          },
          "-nreese": {
            "bool": {
              "must": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "nreese"
                    }
                  }
                },
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          },
          "-cjcenizal": {
            "bool": {
              "must": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "cjcenizal"
                    }
                  }
                },
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          },
          "-kibanamachine": {
            "bool": {
              "must": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "kibanamachine"
                    }
                  }
                },
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          },
          "-stacey-gammon": {
            "bool": {
              "must": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "stacey-gammon"
                    }
                  }
                },
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          },
          "-__other__": {
            "bool": {
              "must": [
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "bool": {
                    "should": [
                      {
                        "match_phrase": {
                          "user": "spalger"
                        }
                      },
                      {
                        "match_phrase": {
                          "user": "nreese"
                        }
                      },
                      {
                        "match_phrase": {
                          "user": "cjcenizal"
                        }
                      },
                      {
                        "match_phrase": {
                          "user": "kibanamachine"
                        }
                      },
                      {
                        "match_phrase": {
                          "user": "stacey-gammon"
                        }
                      }
                    ],
                    "minimum_should_match": 1
                  }
                },
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "size": 0,
  "_source": {
    "excludes": []
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {
    "is_bug": {
      "script": {
        "source": "return doc['labels'].contains('bug') ? 1 : 0",
        "lang": "painless"
      }
    }
  },
  "docvalue_fields": [
    {
      "field": "closed_at.time",
      "format": "date_time"
    },
    {
      "field": "created_at.time",
      "format": "date_time"
    },
    {
      "field": "last_crawled_at",
      "format": "date_time"
    },
    {
      "field": "updated_at.time",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}
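The shape of the Level 2 `filters` map above can be reproduced with a short Python sketch. This is not Kibana's code: the helper names and argument layout are invented, and the empty `filter` and `should` clauses are omitted. It assumes you already have the top child terms under each parent bucket (from the actual query) and under the parent-level other bucket (from the Level 1 query).

```python
from typing import Any


def phrase(field: str, value: str) -> dict[str, Any]:
    """A match_phrase clause in the long form used by the queries above."""
    return {"match_phrase": {field: {"query": value}}}


def level2_other_filters(
    parent_field: str,
    child_field: str,
    children_by_parent: dict[str, list[str]],
    children_of_other: list[str],
) -> dict[str, Any]:
    """Build one filter per parent bucket, plus one for the parent 'other' bucket."""
    filters: dict[str, Any] = {}
    for parent, children in children_by_parent.items():
        filters[f"-{parent}"] = {
            "bool": {
                # Inside this parent bucket, with a value for the child field...
                "must": [phrase(parent_field, parent), {"exists": {"field": child_field}}],
                # ...but not in any child bucket already returned for it.
                "must_not": [phrase(child_field, c) for c in children],
            }
        }
    filters["-__other__"] = {
        "bool": {
            "must": [{"exists": {"field": child_field}}],
            "must_not": [
                # Exclude every document that belongs to a top parent bucket.
                {
                    "bool": {
                        "should": [phrase(parent_field, p) for p in children_by_parent],
                        "minimum_should_match": 1,
                    }
                },
                *[phrase(child_field, c) for c in children_of_other],
            ],
        }
    }
    return filters
```

Calling `level2_other_filters("user", "state", {"spalger": ["closed"], "nreese": ["closed"]}, ["closed"])` yields the keys `-spalger`, `-nreese` and `-__other__`. The map grows with the number of parent buckets, and every further nesting level multiplies it again.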

In general, you can see those requests by creating a terms aggregation for a visualization in Kibana, enabling Other Buckets on it, then using the Inspect button at the top and switching from the tabular data view to the request view. It will show all the requests made.

As for the implementation, every aggregation can execute a post-flight request. The post-flight request for terms can be found in the terms.js file, with most of the actual merging and filtering logic happening in _terms_other_bucket_helper.js. If you inspect that code, you'll also get a good feeling for why I would prefer that logic to live inside Elasticsearch :-)

wchaparro · 2024-02-15T20:42:18Z

closing as not planned.

j0hnsmith changed the title ~~Calculate aggregations on 'other' bucket~~ Terms agg: calculate aggs on 'other' bucket Jul 23, 2015

clintongormley added the discuss and :Analytics/Aggregations labels Jul 23, 2015

clintongormley added stalled and removed discuss labels Jul 24, 2015

tbragin mentioned this issue Sep 18, 2015

Add "missing" and "other" values to terms agg elastic/kibana#1961

Closed

Bargs mentioned this issue Jun 15, 2016

Added option to display 'Others' buckets in the pie chart elastic/kibana#7464

Closed

clintongormley mentioned this issue Jun 30, 2016

non-binary gender option in term aggr. example #19188

Merged

ppisljar mentioned this issue Sep 21, 2016

Bar graph "order by" incorrect with double split terms elastic/kibana#5512

Closed

colings86 added the >feature label Apr 24, 2018

not-napoleon added the team-discuss label Jul 1, 2019

polyfractal removed the team-discuss label Jul 3, 2019

polyfractal mentioned this issue Jan 10, 2020

Multi-pass aggregation support #50863

Open

rjernst added the Team:Analytics label May 4, 2020

FrankHassanabad mentioned this issue Apr 26, 2021

[Security Solutions] We should change our query to use a real "others", rather than a "missing" query elastic/kibana#98350

Closed

wchaparro closed this as not planned Feb 15, 2024
