Implement special Groups for StringViews #12771
Labels
enhancement
New feature or request
alamb changed the title from "Implement special Groups or StringViews" to "Implement special Groups for StringViews" on Oct 5, 2024
FYI @Rachelint / @jayzhan211 -- this might be an interesting project
This was referenced Oct 5, 2024
Actually interesting, I am willing to help push it forward
take
I think it will be quite a cool optimization -- specifically, checking for equal values can likely be optimized using the inlined prefix
Is your feature request related to a problem or challenge?
Part of #12680
In #12269 @jayzhan211 made significant improvements to how group values are stored in multi-column aggregations. This requires specialized implementations for different column types.
His initial PR has implementations for PrimitiveArray and String/Binary. However, it does not have a specialization for StringView.
That means queries that group on multiple columns got even faster for String/Binary, but not for StringView. This shows up as an effective slowdown of some ClickBench queries when they are run with StringView:
For example, this query is 10% slower with StringView
Describe the solution you'd like
I would like to make this (and similar) queries faster when StringView is enabled:
Note this is grouping by 2 columns
Here is how to reproduce the issue
Step 1. Get hits.parquet using bench.sh:
cd benchmarks
./bench.sh data clickbench_1
Step 2: Prepare a script with reproducer query:
Step 3: Run query
set datafusion.execution.parquet.schema_force_view_types = true;
--> Elapsed 0.688 seconds.
set datafusion.execution.parquet.schema_force_view_types = false;
--> Elapsed 0.565 seconds.
Describe alternatives you've considered
I suggest implementing something like ByteViewGroupValueBuilder, following the model of ByteGroupValueBuilder:
datafusion/datafusion/physical-plan/src/aggregates/group_values/group_column.rs
Line 177 in 6f8c74c
The in-progress values would be u128s and some buffers (maybe 2MB?).
Implementing equal_to can take advantage of the inlined prefix optimization (i.e. compare the prefix inlined in the u128 and only check the value in the buffer if the prefixes are already equal).
Additional context
No response
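To make the suggestion above more concrete, here is a minimal, std-only sketch of the idea: in-progress group values kept as u128 views plus data buffers, with an equal_to that checks the length and inlined prefix before touching a buffer. It assumes the Arrow view layout (little-endian: 4-byte length, then either up to 12 inline bytes, or a 4-byte prefix, 4-byte buffer index, and 4-byte offset). The names ViewGroupValues, append, make_inline_view, and make_buffer_view are hypothetical and are not DataFusion or arrow-rs APIs; null handling and the real GroupColumn trait are left out.

```rust
// Hypothetical sketch only: a simplified stand-in for the proposed
// ByteViewGroupValueBuilder, not the actual DataFusion implementation.

/// Values of at most this many bytes are stored entirely inside the view.
const MAX_INLINE_LEN: usize = 12;
/// Assumed block size for data buffers ("maybe 2MB?" in the issue).
const BLOCK_SIZE: usize = 2 * 1024 * 1024;

struct ViewGroupValues {
    /// One 16-byte view per group value.
    views: Vec<u128>,
    /// Backing storage for values longer than MAX_INLINE_LEN.
    buffers: Vec<Vec<u8>>,
}

/// Build a view for a short value: length, then the bytes, zero padded.
fn make_inline_view(data: &[u8]) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

/// Build a view for a long value: length, 4-byte prefix, buffer index, offset.
fn make_buffer_view(data: &[u8], buffer_index: u32, offset: u32) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..8].copy_from_slice(&data[0..4]);
    bytes[8..12].copy_from_slice(&buffer_index.to_le_bytes());
    bytes[12..16].copy_from_slice(&offset.to_le_bytes());
    u128::from_le_bytes(bytes)
}

impl ViewGroupValues {
    fn new() -> Self {
        Self { views: Vec::new(), buffers: Vec::new() }
    }

    /// Store a new group value and return its row index.
    fn append(&mut self, value: &str) -> usize {
        let data = value.as_bytes();
        let view = if data.len() <= MAX_INLINE_LEN {
            make_inline_view(data)
        } else {
            // Start a new block when there is none or the current one is full.
            let needs_new_block = match self.buffers.last() {
                Some(block) => block.len() + data.len() > block.capacity(),
                None => true,
            };
            if needs_new_block {
                self.buffers.push(Vec::with_capacity(BLOCK_SIZE.max(data.len())));
            }
            let buffer_index = self.buffers.len() - 1;
            let block = &mut self.buffers[buffer_index];
            let offset = block.len();
            block.extend_from_slice(data);
            make_buffer_view(data, buffer_index as u32, offset as u32)
        };
        self.views.push(view);
        self.views.len() - 1
    }

    /// Compare the stored value at `row` with `candidate`.
    fn equal_to(&self, row: usize, candidate: &str) -> bool {
        let data = candidate.as_bytes();
        let view = self.views[row];
        // The low 32 bits hold the length: a mismatch needs no further work.
        let len = view as u32 as usize;
        if len != data.len() {
            return false;
        }
        // Short values live entirely in the view: one u128 comparison.
        if len <= MAX_INLINE_LEN {
            return view == make_inline_view(data);
        }
        // Long values: check the inlined 4-byte prefix before reading the buffer.
        let prefix = ((view >> 32) as u32).to_le_bytes();
        if &prefix[..] != &data[..4] {
            return false;
        }
        let buffer_index = (view >> 64) as u32 as usize;
        let offset = (view >> 96) as u32 as usize;
        &self.buffers[buffer_index][offset..offset + len] == data
    }
}
```

In this sketch, append returns the row index of the stored value and equal_to(row, candidate) only reads a data buffer when both the length and the 4-byte prefix already match, which is the case the issue expects to be cheap to rule out.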