Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937
I did some tests based on a heuristic (e.g. the number of columns in the input / group by) in #6938 and saw both performance improvements (likely on the high cardinality queries) and degradations (it also seems hash partitioning is not really fast at the moment). Also, for distributed systems like Ballista, the partial / final approach probably works better in most cases (even for higher cardinality ones), so I think we would have to make this behaviour configurable.
Yes -- here are some ideas to improve things:
I think we can skip the
DuckDB's partitioned hash table only applies within its pipeline execution model.
Yes I think this is a good strategy 👍
I think this PR contains some nice ideas
I did some experimentation with emitting rows once every 10K is hit (but keeping the grouping) in this branch: https://github.com/Dandandan/arrow-datafusion/tree/emit_rows_aggregate
Doing that reduces the memory usage, but often at a higher cost, which can be seen in the benchmark:
Maybe we can get the performance back somehow (like making the output creation faster) 🤔 Alternatively, we could consider making a single group operator that does the two-phase grouping within itself, so instead of
We would have
And do the repartitioning within the operator itself (and thus, if the first phase isn't helping, we can switch to the second phase). This might impact downstream projects like Ballista that want to distribute the first phase, however 🤔
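A rough sketch of the two shapes being contrasted here, assuming the usual DataFusion two-phase plan (the exact plans from this comment are not shown above):

```text
Today: two operators with a hash repartition in between
    AggregateExec: mode=FinalPartitioned
      RepartitionExec: partitioning=Hash([group keys], N)
        AggregateExec: mode=Partial
          ... input ...

Proposed alternative: a single grouping operator that repartitions internally
(and can stop doing the partial phase if it is not reducing cardinality)
    AggregateExec: mode=<combined two-phase operator>   (hypothetical mode name)
      ... input ...
```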
Is your feature request related to a problem or challenge?
When running a query with "high cardinality" grouping in DataFusion, the memory usage increases linearly with both the number of groups (expected) and the number of cores.
This is the root cause of @ychen7's observation that ClickBench q32 fails, as noted in #5276 (comment).
To reproduce, get the ClickBench data https://github.com/ClickHouse/ClickBench/tree/main#data-loading and run this:
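The exact query is not captured above; a representative high-cardinality grouping query over the ClickBench `hits` data (similar in shape to q32, assuming the data is registered as a table named `hits`, e.g. via `datafusion-cli`) would be something like:

```sql
-- Representative high-cardinality aggregate; not necessarily the exact query from the issue.
SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth")
FROM hits
GROUP BY "WatchID", "ClientIP"
ORDER BY c DESC
LIMIT 10;
```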
This is what the memory usage looks like:
The reason for this behavior can be found in the plan and the multi-stage hash grouping that is done:
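The plan text itself is not captured above; the general shape for this kind of query (a sketch, not exact `EXPLAIN` output) is roughly:

```text
SortExec / limit ...
  AggregateExec: mode=FinalPartitioned, gby=[group keys]   -- one per target partition
    CoalesceBatchesExec
      RepartitionExec: partitioning=Hash([group keys], N)
        AggregateExec: mode=Partial, gby=[group keys]      -- one per target partition,
          ... scan of the input files ...                  --   each builds a full hash table
```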
Specifically, since the groups are arbitrarily distributed in the files, the first `AggregateExec: mode=Partial` has to build a hash table that has entries for all groups. As the number of target partitions goes up, the number of `AggregateExec: mode=Partial`s goes up too, and thus so does the number of copies of the data.

The `AggregateExec: mode=FinalPartitioned`s only see a distinct subset of the keys, and thus as the number of target partitions goes up there are more `AggregateExec: mode=FinalPartitioned`s, each seeing a smaller and smaller subset of the group keys.

In pictures:
Some example data:
Describe the solution you'd like
TLDR: I would like to propose updating the `AggregateExec: mode=Partial` operators to emit their hash tables if they see more than some fixed number of groups (I think @mingmwang said DuckDB uses a value of `10,000` for this).

This approach bounds the memory usage (to some fixed constant * the target partitioning) and also should perform quite well.

In the literature I think this approach could be called "dynamic partitioning", as it switches approaches based on the actual cardinality of the groups in the dataset.
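As a rough illustration of the proposed behavior (the numbers are made up for the example; the 10,000 threshold is the one mentioned above):

```text
AggregateExec: mode=Partial (one per input partition), threshold = 10,000 groups

  while reading input:
    update the group hash table
    if the table holds more than 10,000 groups:
      emit the current groups + partial aggregate state downstream
      clear the table and keep consuming input

  memory per Partial operator     ~ 10,000 group entries (bounded)
  memory across the partial stage ~ 10,000 * target_partitions
  (today it is ~ number_of_distinct_groups * target_partitions)

AggregateExec: mode=FinalPartitioned still merges any duplicate group keys
emitted by different flushes, so query results are unchanged.
```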
Describe alternatives you've considered
One potential thing that might be suggested is simply to repartition the input to the `AggregateExec: mode=Partial`. This approach would definitely reduce the memory requirements, but it would mean that we would have to hash repartition all the input rows, so the number of input values that need to be hashed / copied would likely be much higher (at least as long as the group hasher and hash repartitioner can't share the hashes, which is the case today).

The current strategy actually works very well for low cardinality group bys because the `AggregateExec: mode=Partial` can reduce the size of the intermediate result that needs to be hash repartitioned to a very small size.

Additional context
We saw this in IOx while working on some tracing queries that look very similar to the ClickBench query, something like the following to get the top ten traces
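The query itself is not captured above; a hypothetical shape for such a top-ten-traces query (table and column names are illustrative only, not taken from the issue) might be:

```sql
-- Hypothetical example: table and column names are illustrative only.
SELECT trace_id, COUNT(*) AS span_count
FROM spans
GROUP BY trace_id
ORDER BY span_count DESC
LIMIT 10;
```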