Fail variable_width_histogram that collects from many #58619

Merged

Conversation

@nik9000 nik9000 (Member) commented Jun 26, 2020

Adds an explicit check to variable_width_histogram to stop it from
trying to collect from many buckets because it can't. I tried to make it
do so but that is more than an afternoon's project, sadly. So for now we
just disallow it.

Relates to #42035
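
For illustration, the check described above might look roughly like the sketch below. This is reconstructed from the description, not copied from the diff; the collectsFromSingleBucket flag, the helper name, and the message text are assumptions.

```java
// Sketch only, reconstructed from the PR description rather than the diff.
final class CollectsFromSingleBucketCheck {
    // Refuse to build the aggregator when it would have to collect from more
    // than one owning bucket; the exception message here is an approximation.
    static void check(boolean collectsFromSingleBucket) {
        if (collectsFromSingleBucket == false) {
            throw new IllegalArgumentException(
                "[variable_width_histogram] cannot be nested under an aggregation that collects from many buckets"
            );
        }
    }
}
```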

@elasticmachine elasticmachine (Collaborator) commented

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added the Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)) label Jun 26, 2020
@nik9000 nik9000 requested a review from polyfractal June 26, 2020 19:37
@nik9000 nik9000 (Member, Author) commented Jun 26, 2020

I think to fix this we'd want to think about replacing mergeBuckets(long[]) with mergeBuckets(LongFunction), and maybe give it a treatment similar to variable_width_histogram's, where a "budgeting" mechanism accumulates a pile of buckets to merge and then does the expensive merge process fewer times.
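
The "budgeting" idea reads roughly like the sketch below; the class name and numbers are invented for illustration and are not taken from either aggregator.

```java
// Illustrative sketch of a "wasted bucket" budget: defer the expensive merge
// until enough buckets are wasted, then double the budget so the number of
// merges stays small as the bucket count grows.
class WastedBucketBudget {
    private long budget = 1_024; // starting budget, arbitrary for this sketch
    private long wasted = 0;

    /** Record newly wasted buckets; returns true when a merge should run now. */
    boolean shouldMerge(long newlyWasted) {
        wasted += newlyWasted;
        if (wasted < budget) {
            return false;
        }
        wasted = 0;
        budget *= 2; // doubling amortizes the cost of the repeated rewrites
        return true;
    }
}
```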

@jamesdorfman jamesdorfman (Contributor) commented

Hey, good catch with this!

I think the main cause of this problem relates to how the initial_buffer is initialized.
In VariableWidthHistogramAggregationBuilder we do

```java
private int initialBuffer = Math.min(10 * this.shardSize, 50000);
```

This will certainly lead to an explosion of buckets if the aggregation is nested, since each sub-aggregation could have 50k buckets.
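
To put rough numbers on that, the worst case under a multi-bucket parent looks like this (the parent bucket count and shardSize below are assumptions for illustration, not values from this thread):

```java
// Illustrative arithmetic only: a parent agg producing 1,000 buckets, each
// owning a variable_width_histogram buffer sized by the formula above.
public class BufferExplosionExample {
    public static void main(String[] args) {
        long parentBuckets = 1_000;                                 // assumed parent bucket count
        long bufferPerParentBucket = Math.min(10 * 5_000, 50_000);  // shardSize = 5,000 assumed -> 50,000
        System.out.println(parentBuckets * bufferPerParentBucket);  // 50,000,000 buffered values per shard
    }
}
```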

One way to fix this could be to just lower the default size of the initial_buffer parameter.
Math.min(10 * this.shardSize, 50000) seems very high, and if I recall correctly I chose it fairly arbitrarily.

I'm not quite sure what you mean by

replacing mergeBuckets(long[]) with mergeBuckets(LongFunction)

would you mind explaining?

@nik9000 nik9000 (Member, Author) commented Jun 29, 2020

I'm not quite sure what you mean by

replacing mergeBuckets(long[]) with mergeBuckets(LongFunction)

would you mind explaining?

It's pretty much the TODO that you left. I'm not 100% sure it is needed, but it feels like it'd be fairly important.

The trouble is right here. That assertion isn't valid when collecting from many buckets. The second parameter, which I usually call owningBucketOrd, is set to the ordinal of the bucket that "owns" this collection. It's only 0 for the first owning bucket, meaning that if the agg is at the top level it is always 0. Aggregations, post the removal of asMultiBucketAggregator, have to be "aware" of the bucket they are collecting for.

It'd be fairly simple to handle this by changing the collector member to an ObjectArray<CollectionStrategy> and looking up the strategy with the owningBucketOrd. That works fine. You have to "shift" the ordinals of the buckets that you collect, but that is fine because the agg has a maximum number of buckets that each owningBucketOrd can collect, so it is enough to shift by owningBucketOrd * numBuckets.
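
A minimal sketch of that ordinal bookkeeping, with names invented for illustration rather than taken from the Elasticsearch source:

```java
// Each owningBucketOrd gets its own disjoint range of bucket ordinals, so the
// "shift" is just owningBucketOrd * maxBucketsPerOwner.
class PerOwnerOrdinals {
    private final long maxBucketsPerOwner; // the agg's per-owner bucket cap

    PerOwnerOrdinals(long maxBucketsPerOwner) {
        this.maxBucketsPerOwner = maxBucketsPerOwner;
    }

    long globalOrd(long owningBucketOrd, long localOrd) {
        assert localOrd < maxBucketsPerOwner;
        return owningBucketOrd * maxBucketsPerOwner + localOrd;
    }

    long localOrd(long globalOrd) {
        return globalOrd % maxBucketsPerOwner;
    }

    long owningOrd(long globalOrd) {
        return globalOrd / maxBucketsPerOwner;
    }
}
```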

The trouble comes when merging buckets. It's both tricky to implement and inefficient. Tricky because the mergeMap that you get is only for the owningBucketOrd that you are collecting, but the rewrites have to cover all of the owningBucketOrds that you have already collected, so you have to build the right mergeMap. It's inefficient because that mergeMap is quite large and because the rewrites have to rewrite all of the buckets, not just the buckets in this owningBucketOrd.

auto_date_histogram deals with the inefficiency by only rewriting when a certain number of buckets are "wasted". It has a fairly high budget and doubles it every time it rewrites, so the total rewriting work stays amortized linear no matter how many buckets it sees. You could do that too, but I'm not sure it is worth it. auto_date_histogram comes by this behavior fairly naturally; you'd need some juggling here to get it.

I think it'd be OK to do it the "inefficient but correct" way, but just getting it correct is tricky.

And now we come to my comment about mergeBuckets: doing it the inefficient-but-correct way is going to make a zillion mergeMaps, and they are going to be big. But a LongFunction, or rather a LongBinaryOperator, could replace them. In the worst case you'd wrap the mergeMap, maybe even wrapping a "smaller" merge map. In the best case, like for moveLastCluster, it would be something like i -> i < index ? i : i + 1. So it isn't required, but it feels nice.
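
The shape of that change might look something like the sketch below. The signatures are an assumption about the proposal, not code from this PR; it is shown with a LongUnaryOperator because the example merge function above takes a single ordinal, while a LongBinaryOperator variant could additionally take the owningBucketOrd.

```java
import java.util.function.LongUnaryOperator;

// Illustrative signatures only, not the merged code.
abstract class BucketMergingSketch {
    // Current shape (roughly): the caller materializes a full merge map,
    // one slot per existing bucket ordinal.
    abstract void mergeBuckets(long[] mergeMap);

    // Proposed shape: pass a function, so cheap rewrites never allocate a map.
    abstract void mergeBuckets(LongUnaryOperator mergeMap);

    // The moveLastCluster case from the comment above: shift every ordinal at
    // or after `index` up by one to make room for the moved cluster.
    static LongUnaryOperator moveLastClusterMergeMap(long index) {
        return ord -> ord < index ? ord : ord + 1;
    }
}
```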

@polyfractal polyfractal (Contributor) left a comment

👍, and cheers for the explanation. Makes sense to lock this down for now until we modify it to handle multiple buckets. It does seem reasonably tricky to do that efficiently, compared to auto-date.

@jamesdorfman jamesdorfman (Contributor) commented

Thanks for the detailed explanation! That makes a lot of sense :)

@nik9000 nik9000 merged commit 32bdf85 into elastic:master Jun 30, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Jun 30, 2020
nik9000 added a commit that referenced this pull request Jun 30, 2020
Labels
:Analytics/Aggregations (Aggregations), >non-issue, Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), v7.9.0, v8.0.0-alpha1
5 participants