Feature request: Aggregation to produce buckets with a fixed number of documents in them #50120

IceCreamYou · 2019-12-12T08:54:41Z

I would like to create an NP chart and I can't find a way to do so with ElasticSearch currently.

An NP chart is a line chart in which the value of each point is a percentage of a fixed number of items that meets some criteria. For example, if my data looks like this:

[
	{ date: '2019-01-01', failed: true },
	{ date: '2019-01-01', failed: true },
	{ date: '2019-01-02', failed: false },
	{ date: '2019-01-04', failed: false },
	{ date: '2019-01-05', failed: false },
	{ date: '2019-01-08', failed: true },
	{ date: '2019-01-08', failed: false },
	{ date: '2019-01-08', failed: false },
]

Then I want to write a histogram like this:

aggs: {
	np_chart: {
		fixed_size_buckets: {
			max_bucket_count: 10,
			max_bucket_documents: 3,
			sort: [{
				date: {
					order: 'asc',
				},
			}],
		},
		aggs: {
			failed_count: {
				filter: {
					term: {
						'failed': true,
					},
				},
			},
		},
	},
},

Which should return buckets like this:

[
	{
		key: ...,
		key_as_string: '2019-01-01',
		doc_count: 3,
		failed_count: { doc_count: 2 },
	},
	{
		key: ...,
		key_as_string: '2019-01-04',
		doc_count: 3,
		failed_count: { doc_count: 1 },
	},
	{
		key: ...,
		key_as_string: '2019-01-08',
		doc_count: 2,
		failed_count: { doc_count: 0 },
	},
]

Obviously if there are this few documents I could load the documents into memory and parse them manually, but I'd like to have up to a few thousand documents per bucket and that's too much to process that way.

There's a variant of the NP chart where instead of dividing all the documents into groups of N, we first make a date histogram and then take a random sample of N documents from each day. The proposal above would support both cases.

I think this is different than other requests for variable width histograms I've been able to find but please correct me if this has been proposed elsewhere.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-12-12T09:12:11Z

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

polyfractal · 2019-12-12T14:39:11Z

Hm, interesting. How do you decide which documents make it into the bucket? E.g. if you specify max_bucket_documents: 3 but there are 1000 docs that match, which three make it into the bucket? Random? Best "scoring" in some way?

We have two sampler aggs which might work for you: Sampler aggregation and Diversified sampler aggregation. Sampler selects n documents (per shard) and sub-aggs of the sampler only see those documents.

Diversified is similar, except it limits the number of documents per "value" in a field. E.g. you set max_docs_per_value: 3 over color field, and sub-aggs will only receive three documents for each color.

IceCreamYou · 2019-12-12T21:17:06Z

How do you decide which documents make it into the bucket?

In the example I provided, the sort parameter defines that. Think of it like making a query where each bucket has one page of the response.

What I want to do is take a population of documents, sort them, and divide them into buckets with an equal number of buckets in each document. The sampler aggregation just shrinks the population; it does not support dividing up the population into equal-sized buckets. It is also important for the statistical properties of an NP-chart that each bucket be exactly the same size (apart from the remainder) and it looks like the sampler aggregation's shard_size property only limits the number of documents evaluated per shard, rather than directly affecting the size of the returned bucket. Also, in my particular use case I am actually working inside of a nested aggregation, and I'm not sure how a sampler aggregation would know how to sort child documents in that scenario.

polyfractal · 2019-12-12T23:02:52Z

Ah gotcha, I overlooked the sort parameter. Makes sense, although I think there are still some issues to resolve around breaking "ties". E.g. if you sort on date and want 10 documents per bucket, you might still have >10 documents with the same date. So we'd probably need some kind of tie-breaker like document ID or something as well.

In any case, I agree it'd be nice to have a fixed-width histogram. I'm not sure we will be able to support this at the moment though. Currently the aggregation framework can only do a single pass over the data.

It has some facility to merge buckets as it proceeds (ala the auto_date_histogram, and similar method will be used by the variable-width histo. Essentially a form of clustering). But I think this would require the opposite: splitting buckets once it finds new candidates that should reside inside an existing bucket but which is already "full". That's fine on a single shard, but once we ship the results off to the coordinator and start merging shard results, we can no longer split a bucket because the sub-agg data is only mergeable, not splittable (e.g. can't split an avg after it has been calculated).

We are discussing how to implement multiple passes, which I think is one way to implement this: first pass to find extent/bounds of data and summaries about how data is distributed, second pass to calculate actual aggs.

Another approach could be a layer on top of the composite agg, which rolls up buckets as it pages through the results and returns the final result to the user.

IceCreamYou · 2019-12-12T23:26:26Z

Regarding ties, I imagine it'd work the same way as all the other sorting ElasticSearch does - IIRC defaulting to document ID as you suggested. In fact I'd be okay if the only supported ordering was document ID, since my particular application would be sorting by date/time anyway and document ID is a reasonable proxy.

Understood about the cross-shard ordering issue. One possible approach would be to ignore the issue and just return potentially overlapping buckets. It wouldn't be ideal, obviously, but it might be close enough. A more complex but presumably more accurate variant on that idea could be to add a shard_factor parameter such that each shard produces buckets of max_bucket_documents / shard_factor documents and the coordinator merges them until max_bucket_documents is reached. So for example if max_bucket_documents = 1000 and shard_factor = 10, each shard could produce buckets of 100 and then the coordinator could merge 10 shard-buckets to produce returned buckets.

not-napoleon · 2020-06-25T20:19:10Z

Blocked by #50863 - multi-pass aggregation support

not-napoleon · 2020-07-23T16:32:26Z

At the review meeting today, we also looked at #50386 (also stalled on multi-pass aggregations). That sounds like a very similar use case to this, we might be able to fit both asks at the same time.

pgomulka added the :Analytics/Aggregations Aggregations label Dec 12, 2019

rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020

not-napoleon added the stalled label Jun 25, 2020

$@polyfractal$ polyfractal mentioned this issue Jul 28, 2020

Freedman-Diaconis histogram #60312

Closed

wchaparro added the >enhancement label May 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: Aggregation to produce buckets with a fixed number of documents in them #50120

Feature request: Aggregation to produce buckets with a fixed number of documents in them #50120

IceCreamYou commented Dec 12, 2019 •

edited

Loading

elasticmachine commented Dec 12, 2019

polyfractal commented Dec 12, 2019

IceCreamYou commented Dec 12, 2019

polyfractal commented Dec 12, 2019

IceCreamYou commented Dec 12, 2019

not-napoleon commented Jun 25, 2020

not-napoleon commented Jul 23, 2020

Feature request: Aggregation to produce buckets with a fixed number of documents in them #50120

Feature request: Aggregation to produce buckets with a fixed number of documents in them #50120

Comments

IceCreamYou commented Dec 12, 2019 • edited Loading

elasticmachine commented Dec 12, 2019

polyfractal commented Dec 12, 2019

IceCreamYou commented Dec 12, 2019

polyfractal commented Dec 12, 2019

IceCreamYou commented Dec 12, 2019

not-napoleon commented Jun 25, 2020

not-napoleon commented Jul 23, 2020

IceCreamYou commented Dec 12, 2019 •

edited

Loading