Feature request: Aggregation to produce buckets with a fixed number of documents in them #50120
Comments
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
Hm, interesting. How do you decide which documents make it into the bucket? E.g. if you specify We have two sampler aggs which might work for you: Sampler aggregation and Diversified sampler aggregation. Sampler selects Diversified is similar, except it limits the number of documents per "value" in a field. E.g. you set
In the example I provided, the What I want to do is take a population of documents, sort them, and divide them into buckets with an equal number of documents in each bucket. The sampler aggregation just shrinks the population; it does not support dividing up the population into equal-sized buckets. It is also important for the statistical properties of an NP-chart that each bucket be exactly the same size (apart from the remainder) and it looks like the sampler aggregation's
Ah gotcha, I overlooked the In any case, I agree it'd be nice to have a fixed-width histogram. I'm not sure we will be able to support this at the moment though. Currently the aggregation framework can only do a single pass over the data. It has some facility to merge buckets as it proceeds (à la the We are discussing how to implement multiple passes, which I think is one way to implement this: a first pass to find the extent/bounds of the data and summaries of how the data is distributed, then a second pass to calculate the actual aggs. Another approach could be a layer on top of the composite agg, which rolls up buckets as it pages through the results and returns the final result to the user.
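A rough sketch of the "layer on top of the composite agg" idea, written client-side in Python. It assumes each page is a list of `(key, doc_count)` pairs already in key order, which is how paged composite results would be consumed; the function name `roll_up` and the `target` parameter are hypothetical. Because an incoming bucket cannot be split, each merged bucket holds at least `target` documents rather than exactly `target`, so this only approximates the fixed-size buckets requested here.

```python
from typing import Any, Dict, Iterable, List, Tuple

Page = List[Tuple[Any, int]]  # (bucket key, doc_count), in key order


def roll_up(pages: Iterable[Page], target: int) -> List[Dict[str, Any]]:
    """Merge fine-grained buckets into coarser ones of at least `target` docs.

    Hypothetical helper: `pages` stands in for successive pages of
    composite aggregation results. The last merged bucket may be smaller
    than `target` (the remainder).
    """
    if target <= 0:
        raise ValueError("target must be positive")

    merged: List[Dict[str, Any]] = []
    current = None
    for page in pages:
        for key, count in page:
            if current is None:
                # Start a new merged bucket at this key.
                current = {"from": key, "to": key, "doc_count": 0}
            current["to"] = key
            current["doc_count"] += count
            if current["doc_count"] >= target:
                merged.append(current)
                current = None
    if current is not None:
        merged.append(current)  # remainder bucket
    return merged
```

Only the running merged bucket is held in memory while paging, which is what makes this viable when the raw documents are too numerous to load.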
Regarding ties, I imagine it'd work the same way as all the other sorting Elasticsearch does - IIRC defaulting to document ID as you suggested. In fact I'd be okay if the only supported ordering was document ID, since my particular application would be sorting by date/time anyway and document ID is a reasonable proxy. Understood about the cross-shard ordering issue. One possible approach would be to ignore the issue and just return potentially overlapping buckets. It wouldn't be ideal, obviously, but it might be close enough. A more complex but presumably more accurate variant on that idea could be to add a
Blocked by #50863 - multi-pass aggregation support
At the review meeting today, we also looked at #50386 (also stalled on multi-pass aggregations). That sounds like a very similar use case to this one; we might be able to fit both asks at the same time.
I would like to create an NP chart and I can't find a way to do so with Elasticsearch currently.
An NP chart is a line chart in which the value of each point is a percentage of a fixed number of items that meet some criteria. For example, if my data looks like this:
Then I want to write a histogram like this:
Which should return buckets like this:
Obviously if there are this few documents I could load the documents into memory and parse them manually, but I'd like to have up to a few thousand documents per bucket and that's too much to process that way.
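To make the requested behavior concrete, here is a minimal in-memory sketch in Python of what the aggregation would compute: sort the documents, cut them into consecutive groups of N, and count how many in each group meet the criteria. The function name `np_chart_buckets` and the field names `ts` and `failed` in the usage example are hypothetical, not part of any Elasticsearch API.

```python
from typing import Any, Callable, Dict, Iterable, List


def np_chart_buckets(
    docs: Iterable[Dict[str, Any]],
    sort_key: Callable[[Dict[str, Any]], Any],
    meets_criteria: Callable[[Dict[str, Any]], bool],
    bucket_size: int,
) -> List[Dict[str, Any]]:
    """Split sorted docs into consecutive buckets of `bucket_size` docs.

    Every bucket has exactly `bucket_size` documents except possibly the
    last one, which holds the remainder.
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")

    ordered = sorted(docs, key=sort_key)
    buckets = []
    for start in range(0, len(ordered), bucket_size):
        chunk = ordered[start:start + bucket_size]
        buckets.append({
            "from": sort_key(chunk[0]),
            "to": sort_key(chunk[-1]),
            "doc_count": len(chunk),
            # Number of docs in this bucket that meet the criteria.
            "matching": sum(1 for d in chunk if meets_criteria(d)),
        })
    return buckets


# Usage with made-up data: 7 docs, buckets of 3.
docs = [{"ts": i, "failed": i % 2 == 0} for i in range(7)]
buckets = np_chart_buckets(
    docs, lambda d: d["ts"], lambda d: d["failed"], bucket_size=3
)
```

Dividing `matching` by `doc_count` gives the percentage plotted for each point.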
There's a variant of the NP chart where instead of dividing all the documents into groups of N, we first make a date histogram and then take a random sample of N documents from each day. The proposal above would support both cases.
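The per-day variant can be sketched the same way, again in plain Python rather than as an Elasticsearch query. The helper `sample_per_day` and its `day_of` callback are hypothetical; days with fewer than N documents simply return all of them, which is an assumption about how a short day should be handled.

```python
import random
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, Iterable, List, Optional


def sample_per_day(
    docs: Iterable[Dict[str, Any]],
    day_of: Callable[[Dict[str, Any]], Hashable],
    n: int,
    seed: Optional[int] = None,
) -> Dict[Hashable, List[Dict[str, Any]]]:
    """Group docs by day, then draw up to `n` docs at random from each day."""
    rng = random.Random(seed)  # seedable so results can be reproduced
    by_day: Dict[Hashable, List[Dict[str, Any]]] = defaultdict(list)
    for doc in docs:
        by_day[day_of(doc)].append(doc)
    return {
        day: rng.sample(group, min(n, len(group)))
        for day, group in sorted(by_day.items())
    }
```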
I think this is different from the other requests for variable-width histograms I've been able to find, but please correct me if this has been proposed elsewhere.