Improve performance of ClickBench Q18, Q35, #13449
Comments
Perhaps we can use some variant of the topk grouping: https://github.com/apache/datafusion/tree/main/datafusion/physical-plan/src/aggregates/topk |
First step for this project would be to profile the queries and see what we can improve |
I ran them and stored the results here. They can be downloaded for analysis. |
It seems
|
Note that q18 and q35 got some speedup in #12996 |
For q18, I found that string view leads to some regression in my local runs?
|
Yes, I profiled a verbose version flamegraph for I am working on improving |
Also I added the following items to query 18 in the issue description: |
For the queries it seems also possible (but tricky) if the cardinality is high enough (i.e. copying into aggregation columns doesn't reduce memory usage very much), to first execute the aggregation, keep original data in |
Possible improvement for date_part (required to upstream chrono crate) apache/arrow-rs#6746 |
I think we can optimize the plan to improve q35. I found clickhouse has following optimization:
I am trying it in #13617 |
Amazing! q35 gets 1.35x faster in my local runs when I add this optimizer rule!
But it still confuses me why DuckDB runs q35 so fast. As far as I know, DuckDB does not optimize the plan the way ClickHouse does. |
It seems like a hack for a specific query 😆 but still great 👍🏻! |
Similar optimization proposals were found in ClickHouse: |
|
Is your feature request related to a problem or challenge?
While looking at the results of the most recent clickbench run
43.0.0
#13099
Here is the ClickBench page (link)
I see there are a few queries where DataFusion is significantly slower
The queries are:
Q18:
datafusion/benchmarks/queries/clickbench/queries.sql
Line 19 in 73507c3
GroupColumn
#13275
Q35:
datafusion/benchmarks/queries/clickbench/queries.sql
Line 36 in 73507c3
Describe the solution you'd like
I would like the queries to go faster
Describe alternatives you've considered
Both queries look like
In other words they are "top 10 count" style queries
By default, DataFusion will compute the counts for all groups, and then pick only the top 10.
I suspect there is some fancier way to do this, perhaps by finding the top 10 values of count when emitting from the group operator or something. It would be interesting to find out what other engines like DuckDB do with this query.
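To make the "top 10 count" shape concrete, here is a minimal Rust sketch of the baseline described above: count every group first, then keep only the `k` largest counts with a bounded heap rather than sorting all groups. This is not DataFusion's implementation; `top_k_counts` and its signature are hypothetical names used only for illustration, and the input is a plain slice of strings rather than Arrow arrays.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};

/// Hypothetical helper: count occurrences per key, then return the `k`
/// keys with the largest counts, ordered from largest to smallest.
fn top_k_counts<'a>(values: &[&'a str], k: usize) -> Vec<(&'a str, u64)> {
    // Phase 1: full aggregation. One counter per distinct group, which is
    // the expensive part when the number of groups is large.
    let mut counts: HashMap<&'a str, u64> = HashMap::new();
    for v in values {
        *counts.entry(*v).or_insert(0) += 1;
    }

    // Phase 2: bounded min-heap holding at most `k` entries, so the final
    // selection costs O(groups * log k) instead of a sort over all groups.
    let mut heap: BinaryHeap<Reverse<(u64, &'a str)>> = BinaryHeap::new();
    for (key, count) in counts {
        heap.push(Reverse((count, key)));
        if heap.len() > k {
            // `Reverse` turns the max-heap into a min-heap, so this drops
            // the entry with the smallest count seen so far.
            heap.pop();
        }
    }

    // Emit the survivors, largest count first (ties broken by key).
    let mut out: Vec<(&'a str, u64)> = heap
        .into_iter()
        .map(|Reverse((count, key))| (key, count))
        .collect();
    out.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    out
}
```

For example, `top_k_counts(&["a", "b", "a", "c", "a", "b"], 2)` keeps `a` (3 occurrences) and `b` (2 occurrences). The sketch still materializes a counter for every group in phase 1; the open question in this issue is whether that work can be reduced, not just the final selection.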
Additional context
No response