Improve DataFusion scalability as more cores are added #5999
Possibly related: #5995
A suggestion I posted in Slack: I wonder whether there might be some regressions here with respect to scalability with the number of cores. #4219 comes to mind, which makes building the hash map for the build side single-threaded for smaller tables.
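A minimal sketch of that trade-off, under a simplified model (the names, threshold, and partitioning scheme below are illustrative, not DataFusion's actual hash join code):

```rust
use std::collections::HashMap;
use std::thread;

// Hypothetical sketch of the trade-off in #4219: below a size threshold the
// build-side table is built on one thread (no coordination overhead); above
// it, the build is split across threads, each hashing its own slice of keys.
fn build_table(keys: &[u64], n_threads: usize, threshold: usize) -> Vec<HashMap<u64, Vec<usize>>> {
    if keys.len() <= threshold || n_threads == 1 {
        // Single-threaded build: cheap for small tables, but it leaves the
        // remaining cores idle, which shows up as a scaling ceiling.
        let mut table: HashMap<u64, Vec<usize>> = HashMap::new();
        for (i, k) in keys.iter().enumerate() {
            table.entry(*k).or_default().push(i);
        }
        vec![table]
    } else {
        // Parallel build: one table per slice, merged (or probed) later.
        let chunk = (keys.len() + n_threads - 1) / n_threads;
        thread::scope(|s| {
            let handles: Vec<_> = keys
                .chunks(chunk)
                .enumerate()
                .map(|(p, slice)| {
                    s.spawn(move || {
                        let mut table: HashMap<u64, Vec<usize>> = HashMap::new();
                        for (i, k) in slice.iter().enumerate() {
                            table.entry(*k).or_default().push(p * chunk + i);
                        }
                        table
                    })
                })
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        })
    }
}

fn main() {
    let keys: Vec<u64> = (0..1_000).map(|i| i % 64).collect();
    // With an illustrative 100k-row threshold, a 1k-row build side stays
    // single-threaded no matter how many cores the machine has.
    assert_eq!(build_table(&keys, 8, 100_000).len(), 1);
}
```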
For TPC-H Q1 (SF=1), the complete plan with metrics is as follows:
Relevant metrics:
I ran the benchmarks with a simpler version of Q1 with just the table scan and filter:

```sql
select
    *
from
    lineitem
where
    l_shipdate <= date '1998-12-01' - interval '113 days';
```

Here are the results:
Here are the results for just the table scan:

```sql
select
    *
from
    lineitem;
```
This screenshot shows CPU usage with the DataFusion run on the left and the DuckDB run on the right, both using 32 cores.

Here is the directory listing of my lineitem directory:
@Dandandan I added more benchmark results. The issues appear to be related to the …
I could also replicate the issue of non-perfect scaling with loading the tables in memory. One thing I noticed is that the current round-robin repartitioning has a bias for the starting partitions, as the outputs always go first to those channels.
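A minimal sketch of that bias (hypothetical code, not DataFusion's actual `RepartitionExec`): if the cursor always starts at partition 0, any remainder batches land on the low-numbered partitions.

```rust
// Hypothetical round-robin distributor: the cursor starts at partition 0
// for every producer, so when the batch count is not a multiple of the
// partition count, the early partitions always receive the extras.
fn distribute(n_batches: usize, n_partitions: usize) -> Vec<usize> {
    let mut counts = vec![0usize; n_partitions];
    let mut next = 0; // always begins at partition 0
    for _ in 0..n_batches {
        counts[next] += 1;
        next = (next + 1) % n_partitions;
    }
    counts
}

fn main() {
    // 10 batches over 4 partitions: partitions 0 and 1 get 3 batches each,
    // 2 and 3 only get 2, so the starting partitions are consistently favored.
    assert_eq!(distribute(10, 4), vec![3, 3, 2, 2]);
}
```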
Also, I wonder if …
Definitely possible, even if it's only due to changes it triggers in Tokio's scheduler. On the other hand, Tokio's scheduler isn't really optimized for the DataFusion use case in the first place (as @tustvold has probably pointed out before).
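For context, a minimal sketch of sizing a Tokio runtime to the core count (standard Tokio APIs; this is not how DataFusion configures its runtime, and whether explicit sizing helps is workload-dependent):

```rust
use std::thread;
use tokio::runtime::Builder;

fn main() {
    // One worker thread per core. Tokio's work-stealing scheduler was
    // designed around I/O-bound server workloads, not CPU-bound query
    // pipelines, so this only controls parallelism, not scheduling policy.
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let runtime = Builder::new_multi_thread()
        .worker_threads(cores)
        .build()
        .expect("failed to build Tokio runtime");

    runtime.block_on(async {
        // execute the query plan here
    });
}
```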
There is a small application written in Rust called odbc2parquet that reads data from databases and exports Parquet files.
Here are the latest results with DataFusion Python 23, which fixes a bug around how the Tokio runtime was being managed.
Interestingly, the speedup seems bigger for fewer cores. There have been quite a few improvements for Q1 between 21 and 23, for aggregates and casts, which might explain most of the difference.
I believe that #6937 (basically, the fact that we are doing much more work as the core count goes up, since each group gets processed in each core) contributes to this problem.
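A minimal sketch of that effect, assuming a simplified two-phase aggregation model (illustrative, not DataFusion's implementation): with high group-key cardinality, each of the N partitions builds a near-full-size hash table, so total hashing work grows with the core count even though each core sees fewer rows.

```rust
use std::collections::HashMap;

// Phase 1: each partition aggregates its own slice. With 100 distinct keys
// spread across the input, every partition's table ends up holding (almost)
// all 100 groups, regardless of how many partitions there are.
fn partial_aggregate(rows: &[(u32, f64)]) -> HashMap<u32, f64> {
    let mut groups = HashMap::new();
    for &(key, value) in rows {
        *groups.entry(key).or_insert(0.0) += value;
    }
    groups
}

fn main() {
    let n_partitions = 4;
    let rows: Vec<(u32, f64)> = (0..1_000u32).map(|i| (i % 100, 1.0)).collect();

    let partials: Vec<_> = rows
        .chunks(rows.len() / n_partitions)
        .map(partial_aggregate)
        .collect();
    // Every partial table holds all 100 groups: 4 partitions did roughly 4x
    // the hash-table work of a single-threaded pass over the same rows.
    assert!(partials.iter().all(|p| p.len() == 100));

    // Phase 2: the final merge combines the partial tables.
    let mut merged: HashMap<u32, f64> = HashMap::new();
    for partial in partials {
        for (key, value) in partial {
            *merged.entry(key).or_insert(0.0) += value;
        }
    }
    assert_eq!(merged.len(), 100);
}
```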
I produced the chart using scripts at https://github.com/sql-benchmarks/sqlbench-runners/blob/main/scripts |
The regression between 32 and 33 seems to be caused by my "improvement" to filter statistics, although others saw big improvements with 33 (such as https://x.com/mim_djo/status/1725510410567307702?s=46&t=8JleG5FR5SAVU1vl35CPLg), so maybe this is environment specific. I will continue to debug this in the next few days.
I wonder if this is just a bad configuration?
@Dandandan was correct. It was the configs that were the issue. Here is a run of just versions 32 and 33 using default configs. |
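For reference, a minimal sketch of the configuration in play, assuming a recent DataFusion API (method names like `new_with_config` vary across versions, and the Parquet registration here is illustrative): `target_partitions` decides how many partitions, and therefore cores, a query uses, and its default is derived from the CPU count.

```rust
use std::thread;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Pinning target_partitions away from the core count is an easy way
    // to get the "bad configuration" scaling behavior discussed above.
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let config = SessionConfig::new().with_target_partitions(cores);
    let ctx = SessionContext::new_with_config(config);

    // Illustrative registration of the benchmark table.
    ctx.register_parquet("lineitem", "lineitem", ParquetReadOptions::default())
        .await?;
    ctx.sql("select count(*) from lineitem").await?.show().await?;
    Ok(())
}
```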
Is your feature request related to a problem or challenge?
I ran some benchmarks in constrained Docker containers and found that DataFusion is pretty close to DuckDB speed when running on a single core but does not scale as well as DuckDB when more cores are added.
At 16 cores, DuckDB was ~2.9x faster than DataFusion for this particular test.
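For reproduction, one way to constrain a container to a fixed core count (illustrative invocation with placeholder image and arguments; the actual scripts live in the sqlbench-runners repo linked below):

```bash
# --cpus caps the CPU time the container may use, so the same image can
# be benchmarked at 1, 2, 4, 8, and 16 cores by varying this flag.
docker run --cpus=16 --memory=32g <benchmark-image> <benchmark-args>
```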
Describe the solution you'd like
I would like DataFusion to scale as well as DuckDB as more cores are added.
Describe alternatives you've considered
No response
Additional context
Instructions for creating the Docker images can be found at https://github.com/sql-benchmarks/sqlbench-runners