-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Terms agg: calculate aggs on 'other' bucket #12411
Comments
@j0hnsmith calculating sub-aggs on an other bucket requires two round trips. The first calculates the top-ten terms (plus their sub-aggs). The second calculates the sub-aggs on everything except the top-ten terms. To support this in Elasticsearch, we'd need to implement #12316 first. However this is something you can do yourself today. |
I know there are workarounds, but with every level of sub aggregation they get progressively more complex, this could simplify some very complex queries. |
+1 |
4 similar comments
+1 |
+1 |
+1 |
+1 |
cc @elastic/es-search-aggs |
I just wanted to check on in this issue and ping the current team @elastic/es-analytics-geo (since team names seem to have been changed). We currently use that workaround described above in Kibana to calculate the "Other bucket" for terms aggregations and it's causing us a lot of pain. For one thing, it's the only thing we need to do a second request to gather all information needed for a visualization to render, thus needing some special handling in our infrastructure. Also since that "Other bucket" itself is not really a bucket in the terms of ES, we need a lot of special handling for those, e.g. the filter creation logic need to handle them individually. We also see a lot of issues where (our code) doesn't work properly when having more complex aggregation configurations for your visualization. And last but not least, since they are not real buckets we can never make them properly work with Bucket Script or Bucket Selector, which we want to implement in the future, but then need some special handling or disable them for Other again. Having the "Other bucket" feature available in Elasticsearch (most importantly for Terms, but we've also seen user asking for it on the Filters Agg and Significant Terms aggs, and I assume in the future also on Rare Terms agg), would be one of the really huge wins for Kibana visualizations and its infrastructure. If there is anything we can support you with, please let me know. |
Not overly familiar with the issue, but is there a reason that the E.g. the missing aggregation can go next to the terms aggregation and give you all the documents that don't have the particular field which is being aggregated in the terms agg, and allow you to perform sub-aggs there. Alternatively, you can set the |
I think there is some confusing between missing and "Other documents" here :-) Missing is a bucket containing all the documents not having a value in that specific field and instead they will get that missing value. Other bucket should contain all documents not returned in as a bucket otherwise. These can totally fine have values for that field, but just not be under the top x terms requested. So if you request the top 5 terms, and a document is not in one of those buckets it should be in the other bucket. It could still be that this document has a value for that field, just not one of the top 5 common ones. I can also explain shortly why that was such a highly requested feature in Kibana for a long time, that we decided to implement in (in the above described not very stable way) on our side: Pie charts :-) If you want to draw the top 5 countries in a pie chart, without the other value, they would always make up 100% of the chart. That can be very confusing depending on what you're trying to visualize, because those countries are not making up at all 100% of the data, but maybe just 30%. Users want to see that, by having 70% drawn as others and these top 5 values as individual slices. Similar things apply for other chart types as well. Just FYI: we also expose the Missing setting and use as you described above internally a unique identifier for that bucket, so we can find it later, but this is a completely different functionality: |
Gotcha, makes sense 👍 . Thanks for the extra details @timroes. Mark added the team-discuss label to this, so we'll chat about it in the next meeting. |
Chatted in analytics meeting, and unfortunately we're at the same roadblock as four years ago. We can't calculate an "other" bucket before we know the global How is Kibana performing the two-passes today? What does the structure of the second query look like (you can point me at code too, that's fine)? Trying to see if there is something we can do to help make the second pass easier until the agg framework has true multi-pass capability. Is it a As a side note, when/if Jim's API (#26472) is implemented, external two-pass implementations will at least have a consistent view of the index which will make the situation a little better. |
Yes Kibana will basically build a Here is an example query where I enabled Other Bucket for two nested term queries with one sum metric: Actual query{
"aggs": {
"2": {
"terms": {
"field": "user",
"order": {
"1": "desc"
},
"size": 5
},
"aggs": {
"1": {
"sum": {
"field": "number"
}
},
"3": {
"terms": {
"field": "state",
"order": {
"1": "desc"
},
"size": 1
},
"aggs": {
"1": {
"sum": {
"field": "number"
}
}
}
}
}
}
},
"size": 0,
"_source": {
"excludes": []
},
"stored_fields": [
"*"
],
"script_fields": {
"is_bug": {
"script": {
"source": "return doc['labels'].contains('bug') ? 1 : 0",
"lang": "painless"
}
}
},
"docvalue_fields": [
{
"field": "closed_at.time",
"format": "date_time"
},
{
"field": "created_at.time",
"format": "date_time"
},
{
"field": "last_crawled_at",
"format": "date_time"
},
{
"field": "updated_at.time",
"format": "date_time"
}
],
"query": {
"bool": {
"must": [],
"filter": [
{
"match_all": {}
}
],
"should": [],
"must_not": []
}
}
} Level 1 (user) other bucket query{
"aggs": {
"other-filter": {
"aggs": {
"1": {
"sum": {
"field": "number"
}
},
"3": {
"terms": {
"field": "state",
"order": {
"1": "desc"
},
"size": 1
},
"aggs": {
"1": {
"sum": {
"field": "number"
}
}
}
}
},
"filters": {
"filters": {
"": {
"bool": {
"must": [
{
"exists": {
"field": "user"
}
}
],
"filter": [],
"should": [],
"must_not": [
{
"match_phrase": {
"user": {
"query": "spalger"
}
}
},
{
"match_phrase": {
"user": {
"query": "nreese"
}
}
},
{
"match_phrase": {
"user": {
"query": "cjcenizal"
}
}
},
{
"match_phrase": {
"user": {
"query": "kibanamachine"
}
}
},
{
"match_phrase": {
"user": {
"query": "stacey-gammon"
}
}
}
]
}
}
}
}
}
},
"size": 0,
"_source": {
"excludes": []
},
"stored_fields": [
"*"
],
"script_fields": {
"is_bug": {
"script": {
"source": "return doc['labels'].contains('bug') ? 1 : 0",
"lang": "painless"
}
}
},
"docvalue_fields": [
{
"field": "closed_at.time",
"format": "date_time"
},
{
"field": "created_at.time",
"format": "date_time"
},
{
"field": "last_crawled_at",
"format": "date_time"
},
{
"field": "updated_at.time",
"format": "date_time"
}
],
"query": {
"bool": {
"must": [],
"filter": [
{
"match_all": {}
}
],
"should": [],
"must_not": []
}
}
} Level 2 (State) other bucket query{
"aggs": {
"other-filter": {
"aggs": {
"1": {
"sum": {
"field": "number"
}
}
},
"filters": {
"filters": {
"-spalger": {
"bool": {
"must": [
{
"match_phrase": {
"user": {
"query": "spalger"
}
}
},
{
"exists": {
"field": "state"
}
}
],
"filter": [],
"should": [],
"must_not": [
{
"match_phrase": {
"state": {
"query": "closed"
}
}
}
]
}
},
"-nreese": {
"bool": {
"must": [
{
"match_phrase": {
"user": {
"query": "nreese"
}
}
},
{
"exists": {
"field": "state"
}
}
],
"filter": [],
"should": [],
"must_not": [
{
"match_phrase": {
"state": {
"query": "closed"
}
}
}
]
}
},
"-cjcenizal": {
"bool": {
"must": [
{
"match_phrase": {
"user": {
"query": "cjcenizal"
}
}
},
{
"exists": {
"field": "state"
}
}
],
"filter": [],
"should": [],
"must_not": [
{
"match_phrase": {
"state": {
"query": "closed"
}
}
}
]
}
},
"-kibanamachine": {
"bool": {
"must": [
{
"match_phrase": {
"user": {
"query": "kibanamachine"
}
}
},
{
"exists": {
"field": "state"
}
}
],
"filter": [],
"should": [],
"must_not": [
{
"match_phrase": {
"state": {
"query": "closed"
}
}
}
]
}
},
"-stacey-gammon": {
"bool": {
"must": [
{
"match_phrase": {
"user": {
"query": "stacey-gammon"
}
}
},
{
"exists": {
"field": "state"
}
}
],
"filter": [],
"should": [],
"must_not": [
{
"match_phrase": {
"state": {
"query": "closed"
}
}
}
]
}
},
"-__other__": {
"bool": {
"must": [
{
"exists": {
"field": "state"
}
}
],
"filter": [],
"should": [],
"must_not": [
{
"bool": {
"should": [
{
"match_phrase": {
"user": "spalger"
}
},
{
"match_phrase": {
"user": "nreese"
}
},
{
"match_phrase": {
"user": "cjcenizal"
}
},
{
"match_phrase": {
"user": "kibanamachine"
}
},
{
"match_phrase": {
"user": "stacey-gammon"
}
}
],
"minimum_should_match": 1
}
},
{
"match_phrase": {
"state": {
"query": "closed"
}
}
}
]
}
}
}
}
}
},
"size": 0,
"_source": {
"excludes": []
},
"stored_fields": [
"*"
],
"script_fields": {
"is_bug": {
"script": {
"source": "return doc['labels'].contains('bug') ? 1 : 0",
"lang": "painless"
}
}
},
"docvalue_fields": [
{
"field": "closed_at.time",
"format": "date_time"
},
{
"field": "created_at.time",
"format": "date_time"
},
{
"field": "last_crawled_at",
"format": "date_time"
},
{
"field": "updated_at.time",
"format": "date_time"
}
],
"query": {
"bool": {
"must": [],
"filter": [
{
"match_all": {}
}
],
"should": [],
"must_not": []
}
}
} In general you can see those requests when doing a terms aggregation for a visualization in Kibana, enable Other Buckets on that, and then use the Inspect button on top and switch from the tabular data view to the request view. It will show all requests done. In general the implementation itself works that every aggregation can execute a post flight request, and the post flight request for terms can be found in the terms.js file with most of the actual merging and filtering logic happening in _terms_other_bucket_helper.js. If you inspect that code you'll also get a good feeling why I would prefer that logic to live inside Elasticsearch :-) |
closing as not planned. |
The terms aggregation now provides an 'other' bucket with a count, I'd like to see the same aggregations performed on the 'other' bucket. Eg if I'm doing a stats aggregation I have stats (sum) for docs with term
foo
andbar
but not for docs where the field is missing or has anull
value.This is really important for analytics type services as all the values must add up to 100% of the total.
There's quite a bit of discussion about it in #5324.
The text was updated successfully, but these errors were encountered: