Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terms agg: calculate aggs on 'other' bucket #12411

Closed
j0hnsmith opened this issue Jul 23, 2015 · 15 comments
Closed

Terms agg: calculate aggs on 'other' bucket #12411

j0hnsmith opened this issue Jul 23, 2015 · 15 comments
Labels
:Analytics/Aggregations Aggregations >feature stalled Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@j0hnsmith
Copy link

The terms aggregation now provides an 'other' bucket with a count, I'd like to see the same aggregations performed on the 'other' bucket. Eg if I'm doing a stats aggregation I have stats (sum) for docs with term foo and bar but not for docs where the field is missing or has a null value.

This is really important for analytics type services as all the values must add up to 100% of the total.

There's quite a bit of discussion about it in #5324.

@j0hnsmith j0hnsmith changed the title Calculate aggregations on 'other' bucket Terms agg: calculate aggs on 'other' bucket Jul 23, 2015
@clintongormley
Copy link
Contributor

@j0hnsmith calculating sub-aggs on an other bucket requires two round trips. The first calculates the top-ten terms (plus their sub-aggs). The second calculates the sub-aggs on everything except the top-ten terms.

To support this in Elasticsearch, we'd need to implement #12316 first. However this is something you can do yourself today.

@j0hnsmith
Copy link
Author

I know there are workarounds, but with every level of sub aggregation they get progressively more complex, this could simplify some very complex queries.

@vivekmoosani
Copy link

+1

4 similar comments
@PaulGrandperrin
Copy link

+1

@powermik
Copy link

+1

@dynomeat
Copy link

+1

@EdwardKaravakis
Copy link

+1

@markharwood
Copy link
Contributor

cc @elastic/es-search-aggs

@timroes
Copy link
Contributor

timroes commented Jul 1, 2019

I just wanted to check on in this issue and ping the current team @elastic/es-analytics-geo (since team names seem to have been changed).

We currently use that workaround described above in Kibana to calculate the "Other bucket" for terms aggregations and it's causing us a lot of pain. For one thing, it's the only thing we need to do a second request to gather all information needed for a visualization to render, thus needing some special handling in our infrastructure. Also since that "Other bucket" itself is not really a bucket in the terms of ES, we need a lot of special handling for those, e.g. the filter creation logic need to handle them individually. We also see a lot of issues where (our code) doesn't work properly when having more complex aggregation configurations for your visualization. And last but not least, since they are not real buckets we can never make them properly work with Bucket Script or Bucket Selector, which we want to implement in the future, but then need some special handling or disable them for Other again.

Having the "Other bucket" feature available in Elasticsearch (most importantly for Terms, but we've also seen user asking for it on the Filters Agg and Significant Terms aggs, and I assume in the future also on Rare Terms agg), would be one of the really huge wins for Kibana visualizations and its infrastructure. If there is anything we can support you with, please let me know.

@polyfractal
Copy link
Contributor

Not overly familiar with the issue, but is there a reason that the missing aggregation and/or missing parameter on terms agg aren't sufficient?

E.g. the missing aggregation can go next to the terms aggregation and give you all the documents that don't have the particular field which is being aggregated in the terms agg, and allow you to perform sub-aggs there.

Alternatively, you can set the missing param on a terms agg to something unique (__$MISSING$__ or whatever), and then use that bucket for sub-aggregations.

@timroes
Copy link
Contributor

timroes commented Jul 2, 2019

I think there is some confusing between missing and "Other documents" here :-) Missing is a bucket containing all the documents not having a value in that specific field and instead they will get that missing value. Other bucket should contain all documents not returned in as a bucket otherwise. These can totally fine have values for that field, but just not be under the top x terms requested. So if you request the top 5 terms, and a document is not in one of those buckets it should be in the other bucket. It could still be that this document has a value for that field, just not one of the top 5 common ones.

I can also explain shortly why that was such a highly requested feature in Kibana for a long time, that we decided to implement in (in the above described not very stable way) on our side: Pie charts :-)

If you want to draw the top 5 countries in a pie chart, without the other value, they would always make up 100% of the chart. That can be very confusing depending on what you're trying to visualize, because those countries are not making up at all 100% of the data, but maybe just 30%. Users want to see that, by having 70% drawn as others and these top 5 values as individual slices. Similar things apply for other chart types as well.

Just FYI: we also expose the Missing setting and use as you described above internally a unique identifier for that bucket, so we can find it later, but this is a completely different functionality:

screenshot-20190702-095745

@polyfractal
Copy link
Contributor

Gotcha, makes sense 👍 . Thanks for the extra details @timroes. Mark added the team-discuss label to this, so we'll chat about it in the next meeting.

@polyfractal
Copy link
Contributor

Chatted in analytics meeting, and unfortunately we're at the same roadblock as four years ago. We can't calculate an "other" bucket before we know the global top-n results... and at that point it's too late to build an "other" bucket because we are reducing on the coordinating node. To do this we need two-pass/multi-pass support in the aggregation framework which doesn't exist today (although we have been talking about how we could potentially do it). First pass to find top-n, second pass to collect everything else that wasn't in the top-n into a bucket.

How is Kibana performing the two-passes today? What does the structure of the second query look like (you can point me at code too, that's fine)? Trying to see if there is something we can do to help make the second pass easier until the agg framework has true multi-pass capability. Is it a filter agg for must_not: <top-n terms> or similar?

As a side note, when/if Jim's API (#26472) is implemented, external two-pass implementations will at least have a consistent view of the index which will make the situation a little better.

@timroes
Copy link
Contributor

timroes commented Jul 4, 2019

Yes Kibana will basically build a must_not: <top-n terms> from the first request.

Here is an example query where I enabled Other Bucket for two nested term queries with one sum metric:

Actual query
{
  "aggs": {
    "2": {
      "terms": {
        "field": "user",
        "order": {
          "1": "desc"
        },
        "size": 5
      },
      "aggs": {
        "1": {
          "sum": {
            "field": "number"
          }
        },
        "3": {
          "terms": {
            "field": "state",
            "order": {
              "1": "desc"
            },
            "size": 1
          },
          "aggs": {
            "1": {
              "sum": {
                "field": "number"
              }
            }
          }
        }
      }
    }
  },
  "size": 0,
  "_source": {
    "excludes": []
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {
    "is_bug": {
      "script": {
        "source": "return doc['labels'].contains('bug') ? 1 : 0",
        "lang": "painless"
      }
    }
  },
  "docvalue_fields": [
    {
      "field": "closed_at.time",
      "format": "date_time"
    },
    {
      "field": "created_at.time",
      "format": "date_time"
    },
    {
      "field": "last_crawled_at",
      "format": "date_time"
    },
    {
      "field": "updated_at.time",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}
Level 1 (user) other bucket query
{
  "aggs": {
    "other-filter": {
      "aggs": {
        "1": {
          "sum": {
            "field": "number"
          }
        },
        "3": {
          "terms": {
            "field": "state",
            "order": {
              "1": "desc"
            },
            "size": 1
          },
          "aggs": {
            "1": {
              "sum": {
                "field": "number"
              }
            }
          }
        }
      },
      "filters": {
        "filters": {
          "": {
            "bool": {
              "must": [
                {
                  "exists": {
                    "field": "user"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "spalger"
                    }
                  }
                },
                {
                  "match_phrase": {
                    "user": {
                      "query": "nreese"
                    }
                  }
                },
                {
                  "match_phrase": {
                    "user": {
                      "query": "cjcenizal"
                    }
                  }
                },
                {
                  "match_phrase": {
                    "user": {
                      "query": "kibanamachine"
                    }
                  }
                },
                {
                  "match_phrase": {
                    "user": {
                      "query": "stacey-gammon"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "size": 0,
  "_source": {
    "excludes": []
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {
    "is_bug": {
      "script": {
        "source": "return doc['labels'].contains('bug') ? 1 : 0",
        "lang": "painless"
      }
    }
  },
  "docvalue_fields": [
    {
      "field": "closed_at.time",
      "format": "date_time"
    },
    {
      "field": "created_at.time",
      "format": "date_time"
    },
    {
      "field": "last_crawled_at",
      "format": "date_time"
    },
    {
      "field": "updated_at.time",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}
Level 2 (State) other bucket query
{
  "aggs": {
    "other-filter": {
      "aggs": {
        "1": {
          "sum": {
            "field": "number"
          }
        }
      },
      "filters": {
        "filters": {
          "-spalger": {
            "bool": {
              "must": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "spalger"
                    }
                  }
                },
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          },
          "-nreese": {
            "bool": {
              "must": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "nreese"
                    }
                  }
                },
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          },
          "-cjcenizal": {
            "bool": {
              "must": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "cjcenizal"
                    }
                  }
                },
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          },
          "-kibanamachine": {
            "bool": {
              "must": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "kibanamachine"
                    }
                  }
                },
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          },
          "-stacey-gammon": {
            "bool": {
              "must": [
                {
                  "match_phrase": {
                    "user": {
                      "query": "stacey-gammon"
                    }
                  }
                },
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          },
          "-__other__": {
            "bool": {
              "must": [
                {
                  "exists": {
                    "field": "state"
                  }
                }
              ],
              "filter": [],
              "should": [],
              "must_not": [
                {
                  "bool": {
                    "should": [
                      {
                        "match_phrase": {
                          "user": "spalger"
                        }
                      },
                      {
                        "match_phrase": {
                          "user": "nreese"
                        }
                      },
                      {
                        "match_phrase": {
                          "user": "cjcenizal"
                        }
                      },
                      {
                        "match_phrase": {
                          "user": "kibanamachine"
                        }
                      },
                      {
                        "match_phrase": {
                          "user": "stacey-gammon"
                        }
                      }
                    ],
                    "minimum_should_match": 1
                  }
                },
                {
                  "match_phrase": {
                    "state": {
                      "query": "closed"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "size": 0,
  "_source": {
    "excludes": []
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {
    "is_bug": {
      "script": {
        "source": "return doc['labels'].contains('bug') ? 1 : 0",
        "lang": "painless"
      }
    }
  },
  "docvalue_fields": [
    {
      "field": "closed_at.time",
      "format": "date_time"
    },
    {
      "field": "created_at.time",
      "format": "date_time"
    },
    {
      "field": "last_crawled_at",
      "format": "date_time"
    },
    {
      "field": "updated_at.time",
      "format": "date_time"
    }
  ],
  "query": {
    "bool": {
      "must": [],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  }
}

In general you can see those requests when doing a terms aggregation for a visualization in Kibana, enable Other Buckets on that, and then use the Inspect button on top and switch from the tabular data view to the request view. It will show all requests done.

In general the implementation itself works that every aggregation can execute a post flight request, and the post flight request for terms can be found in the terms.js file with most of the actual merging and filtering logic happening in _terms_other_bucket_helper.js. If you inspect that code you'll also get a good feeling why I would prefer that logic to live inside Elasticsearch :-)

@wchaparro
Copy link
Member

closing as not planned.

@wchaparro wchaparro closed this as not planned Won't fix, can't repro, duplicate, stale Feb 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >feature stalled Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests