-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update benchmarks on clickbench #5276
Comments
I think this is something we should do as soon as possible. We should not needlessly create a false impression of "slow" in the eyes of the community at large. Why try to correct it later when we can avoid it? Our benchmark was not run on a comparable machine, and it is a very old version. I suggest we run it on We (the Synnada team) is happy to help with investigations and take on improvement tasks. |
I agree this would be great to do. @waitingkuo and @andygrove have worked on this in the past and perhaps they have thoughts. |
I wasn't really too involved with this. I have been focussing more on TPC-H and TPC-DS, and I also see DuckDB performing about 5x faster than DataFusion. I have not spent time determining why this is the case yet. The latest results are here. This is yet another half-finished project of mine, so I apologize for the lack of configuration info. I am mostly running with default configs. https://sqlbenchmarks.io/sqlbench-h/results/env/workstation/sf10/single_node/ |
the codes to reproduce the benchmark are here https://github.com/ClickHouse/ClickBench/tree/main/datafusion The current result listed in the website was benchmarked by datafusion v11, which was release around 6 months ago. to imporve:
this is the original sql queries SELECT URLHash, EventDate, COUNT(*) AS PageViews FROM ... https://github.com/ClickHouse/ClickBench/blob/main/duckdb-parquet/queries.sql#L41 this is the modified version for datafusion SELECT "URLHash", "EventDate"::INT::DATE, COUNT(*) AS PageViews FROM https://github.com/ClickHouse/ClickBench/blob/main/datafusion/queries.sql#L41 I did modify some quries so that it works in datafusion. To fix this, we need to verify whether the new datafusion work or not. If so we could update the quries. If not, we could fire the issue to improve I'll do the first option soon (update to v18), it should be a quick improvement. |
@waitingkuo, if you create the issues that identify what is required to run the queries properly, we can help with addressing them. |
@ozankabak thank you, i'll submit another ticket to do so. I just submit the pr to update it to v18 here's the summary: 1 of the query get 20 times boost except from these 3 outliers, we have 20% improvement in average the histogram chart here's the result query by query Query 0: 0.9776802049030369
Query 1: 1.707491082045184
Query 2: 1.2640586797066016
Query 3: 0.9203807875378623
Query 4: 0.9241755388210855
Query 5: 0.94172723106135
Query 6: 1.649497487437186
Query 7: 1.8705738705738708
Query 8: 0.9532640949554897
Query 9: 0.09891081294396213
Query 10: 0.9790727043117446
Query 11: 1.0668953687821612
Query 12: 1.0256627922162727
Query 13: 0.9520421488719772
Query 14: 1.0535746389404925
Query 15: 1.1839474435875463
Query 16: 1.2163045644807655
Query 17: 1.0196377825847225
Query 18: 1.2491737143968116
Query 19: 0.9741868869385649
Query 20: 1.2916241203256522
Query 21: 0.9167798624671026
Query 22: 0.9743192470077857
Query 23: 1.0809550243337944
Query 24: 1.0715070085911764
Query 25: 1.868698910081744
Query 26: 1.8450380677343134
Query 27: 0.8860306599462013
Query 28: 18.859884645982497
Query 29: 2.9667709147771695
Query 30: 1.0043050430504306
Query 31: 0.9922544851934578
Query 32: doesn't work
Query 33: 1.2395144767868875
Query 34: 1.2316816529558063
Query 35: 1.1153739994479712
Query 36: 0.9655502392344499
Query 37: 0.9329896907216495
Query 38: 0.979089790897909
Query 39: 1.0419161676646707
Query 40: 0.9942028985507246
Query 41: 1.028301886792453
Query 42: 1.039426523297491 outlier queries: Query 9: 0.09891081294396213 -> 10 times worse SELECT "RegionID", SUM("AdvEngineID"), COUNT(*) AS c, AVG("ResolutionWidth"), COUNT(DISTINCT "UserID") FROM hits GROUP BY "RegionID" ORDER BY c DESC LIMIT 10; i have no clue for now, perhaps due to Query 28: 18.859884645982497 -> almost 20 times better SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25; i think this is benifit from the improvement of Query 32: query doesn't work SELECT "WatchID", "ClientIP", COUNT(*) AS c, SUM("IsRefresh"), AVG("ResolutionWidth") FROM hits GROUP BY "WatchID", "ClientIP" ORDER BY c DESC LIMIT 10; when i run it, it's killed (same as the previous version) willy@willybench:~/ClickBench/datafusion$ datafusion-cli -f create.sql q33.sql
DataFusion CLI v18.0.0
0 rows in set. Query took 0.039 seconds.
Killed |
With results from @andygrove being consistent with clickbench (~5x difference compared to duckdb), I think it make more sense to focus more on profiling to narrow the gap, rather than re running the benchmarks. Also when running the benchmarks, make sure you enable important config options mentioned by @korowa: #5142 (comment) |
@waitingkuo I can try to experiment if this is due to distinct. |
@waitingkuo, I have a few quick questions: (1) Which machine did your new results run on, is it the same machine? (2) Which features are missing to run the queries as is? |
@ozankabak i used azure's Standard_F16s_v2 vm to do so which has 16 vcpu and 32GB memory i didnt looks into the issue that the process killed yet. my quick guess is that it runs out of memory |
@comphead thank you and then # Download benchmark target data
wget --continue https://datasets.clickhouse.com/hits_compatible/hits.parquet
# launch datafusion-cli
datafusion-cli
DataFusion CLI v18.0.0
> CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits.parquet';
> SELECT "RegionID", SUM("AdvEngineID"), COUNT(*) AS c, AVG("ResolutionWidth"), COUNT(DISTINCT "UserID") FROM hits GROUP BY "RegionID" ORDER BY c DESC LIMIT 10; to run the full benchmark, you could do
|
AFAIK one big difference regarding performance is that most other engines in the clickbench suite first load the data into memory or some other internal format, while datafusion scans directly from the parquet files. Extending the benchmark runner to support both use cases and report separate results would be very useful. The tcph benchmark runner for example has such a feature |
@jhorstmann thank you for pointing out this some system like duckdb or clickbench does provide two kinds of setting. e.g. the original comparison @dudzicp posted is to compare duckdb(parquet) with datafusion. I think it's fair enough. Perhaps we could rename datafusion to datafusion(parquet) so that other people could easily understanding what they're comparing. (btw, we can judge it by loading time and datasize as well) |
I think this would be very helpful. Yes I think we should be only comparing datafusion performance to parquet files |
@comphead thank you -- it would be super helpful if you could open a ticket with your analysis (if you haven't already) I think doing something like posting the EXPLAIN plan and the output of |
@alamb I'm running this on my machine now, not sure if its DISTINCT, once I have more details I'll file a ticket. But anyway |
I'll file a ticket if that possible to squeeze something from current distinct computation
|
Makes sense Also in general this query looks like it is a test of how fast grouping can be done (in addition to the fact that our DISTINCT aggregator is pretty slow at the moment) There are a bunch of ideas floating around about how to make Grouping faster -- for example #4973 and some details on #1570 I think someone just has to commit the time to hammering it out |
One thing I noticed is that duckdb loads partitioned parquet files: https://github.com/ClickHouse/ClickBench/blob/main/duckdb-parquet/benchmark.sh as opposed to datafusion: https://github.com/ClickHouse/ClickBench/blob/main/datafusion/benchmark.sh. I guess this can impact parallelism? |
Not sure about DuckDB -- it seems to be able to parallelise reads of single file (at least latest version), but it definitely caused lack of parallelism for Datafusion |
The benchmark though should add parallelism by repartitioning after scanning - so I would expect aggregates to be executed in parallel (even though the scan won't be). Distinct count might be mostly slow as the current implementation is very allocation-heavy. An additional suggestion to better performance on the benchmark is to add a target-cpu configuration which might yield a ~10% improvement - currently it compiles with the default target. |
@korowa s added the feature to parallelize reads of a single parquet file in DataFusion. #5295 proposes enabling this by default
👍 |
BTW When I ran
Locally using |
I have profiled the slowest query locally:
@Dandandan you are correct - we are repartitioning after scanning which is consistent with all cpus working at 100% just after the query was submitted. One thing I dont understand - shouldnt the repartitioning be based on grouping key? Also here is the output of
from duckdb. I guess @alamb is correct and grouping/aggregation code needs improvements. |
@dudzicp In certain cases it might be beneficial to do hash repartition before the (partial) |
The last suggestion might be better for running in a non-distributed execution, the current rule is better optimized for distributed execution (as shuffling data might be expensive), so if we get some improvements there, we'll need to make that configurable. |
This is one of the last areas I think there are factors of 2 or more performance to be extracted given the current architecture 👍 |
I've re-run the benchmark on a c6a.4xlarge instance similar to the other stateless engines in the benchmark. Submitted it here -> ClickHouse/ClickBench#81 . For nearly all requests, it performs significantly better, but I wonder if that's due in large part to the |
It would not at all surprise me Thank you @kmitchener |
That's right, duckdb has very impressive grouping/aggregation improvements. Details in blog. ClickBench is almost for |
I recently look at Datafusion vs DuckDB for our use cases and one of our evaluation queries is a variant of ClickBench q23.
it aligns with "scan is not in parallel yet"? More detail here: #5404 |
Filed #5547 to track improving COUNT DISTINCT query performance |
I just run a benchmark on Datafusion v22 (use default setting): ClickHouse/ClickBench#97. I am thinking to rename the result from |
🥂 cheers that #5325 fix the regression on
I also create an issue for the non-working q32: #5969
Summary of queries that is 2x slower than DuckDB
|
yes, it is release mode. For detail, can refer to 'benchmark.sh' in https://github.com/ClickHouse/ClickBench/pull/97/files. btw, q32 not working is a known problem. I didn't find related github issue, so create #5969 to track it |
I expect DataFusion 28 to perform much better (and q32 now works) due to #6904 |
I filed #7108 to track updating the clickbench benchmarks again, let's continue the conversation there. Closing this ticket |
In the clickbench Datafusion is currently ~5x slower than DuckDB on comparable machine. I think it would make sense to update the benchmark results with the latest improvements so that datafusion will be a viable/reasonable option for new projects
The text was updated successfully, but these errors were encountered: