[DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench #12821

alamb · 2024-10-08T20:48:18Z

Is your feature request related to a problem or challenge?

I am mostly writing this up to record what I think is ongoing work by @jayzhan211, @Rachelint, @korowa, and myself.

TL;DR: we are working on (and getting pretty close to) making DataFusion the fastest single-node engine for querying parquet files in ClickBench.

Background:

https://benchmark.clickhouse.com/ shows the results of ClickBench

ClickBench is the benchmark, and it is described here: https://github.com/ClickHouse/ClickBench. I am not personally interested in proprietary file formats that require special loading.

Here is the current leaderboard for partitioned parquet reflecting DataFusion 40.0.0:

Describe the solution you'd like

I would like DataFusion to be the fastest

Describe alternatives you've considered

No response

Additional context

This is also inspired by @ozankabak's call to action on #11442

The scripts to run with datafusion are here: https://github.com/ClickHouse/ClickBench/tree/main/datafusion

Last update is here: ClickHouse/ClickBench#210

alamb · 2024-10-08T20:53:09Z

Changes I think will make these queries significantly faster:

Skipping partial aggregation when it is not helping for high cardinality aggregates #11627 - @korowa
Avoid RowConverter for multi column grouping (10% faster clickbench queries) #12269 - @jayzhan211 @Rachelint
Enable reading StringView by default from Parquet (schema_force_string_view) by default #11682 - team effort (we are close)
@Rachelint also has another potential ~10% improvement with Sketch for aggregation intermediate results blocked management #11943
Enable parquet filter pushdown by default #3463

I think these optimizations are general purpose, not specific to ClickHouse.

jayzhan211 · 2024-10-09T07:19:21Z

Reuse hash for repartition #12526 and avoid copy in coalesce #7957 could probably also provide some improvement

Dandandan · 2024-10-09T07:26:10Z

Nice!

I think one bigger future interesting direction would be further vectorization of the core hash aggregate algorithm (i.e. treating matches as candidates and doing e.g. equality checks in a vectorized way to allow for more specialization / more efficient code).
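
One way to picture the "matches as candidates" idea: the hash-table probe only proposes a candidate group for each input row, and the key comparison then runs as a separate, tight loop over whole columns. The sketch below is a minimal illustration with a single `u64` key column. The function `candidate_matches` and its argument layout are hypothetical, not DataFusion's actual API.

```rust
/// For each input row, check whether its key equals the key of the candidate
/// group proposed by the hash probe. Returns one flag per row.
///
/// Hypothetical helper for illustration; names and layout are assumptions.
fn candidate_matches(row_keys: &[u64], candidates: &[usize], group_keys: &[u64]) -> Vec<bool> {
    assert_eq!(row_keys.len(), candidates.len());
    row_keys
        .iter()
        .zip(candidates)
        // Gather the candidate group's key and compare. There is no early
        // exit, so every row does the same amount of work.
        .map(|(key, &group)| *key == group_keys[group])
        .collect()
}

fn main() {
    let group_keys: [u64; 3] = [10, 20, 30];
    let row_keys: [u64; 4] = [10, 20, 99, 30];
    // Candidate group index per row, as a hash probe might propose them.
    // Row 2 collided with group 1, so its comparison fails.
    let candidates: [usize; 4] = [0, 1, 1, 2];
    let matches = candidate_matches(&row_keys, &candidates, &group_keys);
    println!("{matches:?}");
}
```

Rows whose flag is `false` would go on to the next probe step (or become new groups), while the rows that matched are already resolved.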

Rachelint · 2024-10-09T09:47:41Z

🤔 While reviewing #12697, it seems we can still continue to improve partial skipping?
Right now we can modify the threshold to get a performance improvement, but it may be a bit tricky?

And I think maybe we can make it clearer when partial aggregation helps, and when it actually makes things slower?

alamb · 2024-10-09T21:33:01Z

> And I think maybe we can make it clearer when partial aggregation helps, and when it actually makes things slower?

In my mind the challenge with tweaking the "switch to partial mode" threshold setting is that some queries will likely get faster and some will likely get slower. If we can justify changing the default setting to some different constant, I think it will be fine. However, if we are going to add more complex logic to decide when to switch modes, in my opinion it needs to be significantly better than a static threshold (where significantly means "always better" or close to it).

Rachelint · 2024-10-11T02:59:44Z

> > And I think maybe we can make it clearer when partial aggregation helps, and when it actually makes things slower?
>
> In my mind the challenge with tweaking the "switch to partial mode" threshold setting is that some queries will likely get faster and some will likely get slower. If we can justify changing the default setting to some different constant, I think it will be fine. However, if we are going to add more complex logic to decide when to switch modes, in my opinion it needs to be significantly better than a static threshold (where significantly means "always better" or close to it).

Got it. @jayzhan211 has tried some other values of skip_partial_aggregation_probe_ratio_threshold and skip_partial_aggregation_probe_rows_threshold; some queries seem to improve noticeably in #12697.

And I have some thoughts, like removing the is_locked field?

Currently, we use the first skip_partial_aggregation_probe_rows_threshold rows as a sample to decide whether to skip; once that threshold is exceeded, we do not check again.
But I found that some partial aggregation operators could benefit from skipping, yet never get the chance to switch because of is_locked.
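
To make the is_locked concern concrete, here is a simplified model of the probe as described above: it accumulates input rows until the rows threshold is reached, makes one decision from the groups-to-rows ratio, and never revisits it. The struct and method names are illustrative assumptions, not a copy of DataFusion's implementation, and the threshold values in `main` are made up for the example.

```rust
/// Simplified, hypothetical model of a "skip partial aggregation" probe.
struct SkipProbe {
    rows_threshold: usize, // models skip_partial_aggregation_probe_rows_threshold
    ratio_threshold: f64,  // models skip_partial_aggregation_probe_ratio_threshold
    input_rows: usize,
    should_skip: bool,
    is_locked: bool,
}

impl SkipProbe {
    fn new(rows_threshold: usize, ratio_threshold: f64) -> Self {
        Self {
            rows_threshold,
            ratio_threshold,
            input_rows: 0,
            should_skip: false,
            is_locked: false,
        }
    }

    /// Record one batch: `batch_rows` new input rows and `num_groups`
    /// distinct groups seen so far. Once locked, later batches are ignored.
    fn update(&mut self, batch_rows: usize, num_groups: usize) {
        if self.is_locked {
            return;
        }
        self.input_rows += batch_rows;
        if self.input_rows >= self.rows_threshold {
            let ratio = num_groups as f64 / self.input_rows as f64;
            self.should_skip = ratio >= self.ratio_threshold;
            self.is_locked = true;
        }
    }
}

fn main() {
    let mut probe = SkipProbe::new(1_000, 0.8);
    // The sample is low cardinality: 1,000 rows, only 10 groups.
    probe.update(1_000, 10);
    // Later the input becomes high cardinality, but the decision is locked.
    probe.update(100_000, 95_000);
    println!("should_skip = {}, is_locked = {}", probe.should_skip, probe.is_locked);
}
```

In this model the second batch is almost entirely distinct groups, yet the operator keeps aggregating because the decision was fixed after the first sample.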

jayzhan211 · 2024-10-11T03:14:29Z

#12697 (comment) Only Q0 slows down, but given it has nothing to do with grouping, I think we can ignore it.

These numbers are from another branch that only changes the configuration values, so I think another approach is to remove skip_partial_aggregation_probe_rows_threshold and the related logic entirely and set skip_partial_aggregation_probe_ratio_threshold to 0.1.

jayzhan211 · 2024-10-12T11:33:33Z

> I think one bigger future interesting direction would be further vectorization of the core hash aggregate algorithm

Can we use nightly Rust, which enables std::simd, for vectorization? In arrow-rs the SIMD code has been rewritten to rely on auto-vectorization, but when I checked the generated asm, I didn't see vector instructions for all of the functions (some have them, some don't). I think it would be nice to have explicit SIMD to ensure the code is always vectorized and that the vectorization does not disappear because of a code change or an LLVM change.

jonathanc-n · 2024-10-12T17:48:51Z

@jayzhan211 Yeah, this sounds like a good idea. We could start moving in a direction that makes the execution engine as performant as Velox. In particular, having Arrow as the format should allow us to maximize our use of vectorized execution.
Should I open an issue for this?

alamb · 2024-10-13T12:31:59Z

> Can we use nightly Rust, which enables std::simd, for vectorization? In arrow-rs the SIMD code has been rewritten to rely on auto-vectorization, but when I checked the generated asm, I didn't see vector instructions for all of the functions (some have them, some don't). I think it would be nice to have explicit SIMD to ensure the code is always vectorized and that the vectorization does not disappear because of a code change or an LLVM change.

I think @tustvold found that it is quite hard to make manually written SIMD kernels faster than the auto-vectorized code (i.e. code using the vector instructions) produced by LLVM, and that they are also harder to maintain.

If possible I would suggest we instead focus on improving the code so that LLVM is better able to auto-vectorize it. This is some combination of looking at the resulting assembly code and then making the inner loops simpler (e.g. via #[inline], removing bounds checks with get_unchecked, special cases that avoid checking Option, etc.)
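
As a rough illustration of "making the inner loops simpler", the sketch below sums a nullable `i64` column two ways. The function names are hypothetical, and it uses slice iterators rather than get_unchecked to avoid bounds checks without `unsafe`. Whether LLVM actually vectorizes either loop depends on the compiler version and target flags, so treat it as something to inspect in the assembly rather than a guaranteed result.

```rust
/// Per-element `Option` check: the branch in the loop body tends to get in
/// the way of auto-vectorization.
fn sum_with_option(values: &[Option<i64>]) -> i64 {
    let mut total = 0i64;
    for v in values {
        if let Some(x) = v {
            total = total.wrapping_add(*x);
        }
    }
    total
}

/// Values and validity stored separately, with a special case for "no nulls".
fn sum_columnar(values: &[i64], validity: Option<&[bool]>) -> i64 {
    match validity {
        // No nulls: a plain loop over a slice iterator, with no per-element
        // bounds check and no branch.
        None => values.iter().fold(0i64, |acc, x| acc.wrapping_add(*x)),
        // Nulls present: multiply by 0 or 1 instead of branching.
        Some(valid) => values
            .iter()
            .zip(valid)
            .fold(0i64, |acc, (x, ok)| acc.wrapping_add(x.wrapping_mul(*ok as i64))),
    }
}

fn main() {
    let with_nulls: [Option<i64>; 3] = [Some(1), None, Some(3)];
    let values: [i64; 3] = [1, 0, 3];
    let validity = [true, false, true];
    // Both layouts agree on the result.
    assert_eq!(
        sum_with_option(&with_nulls),
        sum_columnar(&values, Some(&validity[..]))
    );
    println!("{}", sum_columnar(&values, None));
}
```

The design choice here is to decide once per batch whether nulls exist, rather than once per element, so the common null-free case gets the simplest possible loop.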

tustvold · 2024-10-13T12:46:21Z

I found that LLVM is relatively good at vectorizing vertical operations provided:

There are no conditionals within the loop body
You've been careful to avoid inlining too much, as the vectorizer gives up if the code is too complex
You aren't doing bitwise horizontal reductions or masking (although FWIW std::simd struggles with this as well)
You've enabled SIMD instructions in the target ISA

This last point is likely why you aren't seeing anything: the default x86 ISA is over a decade old at this point and supports hardly any SIMD instructions. See the Performance Tips section at the end of https://crates.io/crates/arrow
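
The first condition in the list (no conditionals within the loop body) can often be met by turning a branch into a min/max or select. A minimal sketch with a hypothetical kernel name follows. As noted above, whether vector instructions are emitted also depends on the target ISA, for example building with `RUSTFLAGS="-C target-cpu=native"`.

```rust
/// Clamp negative values to zero, written so that every element takes the
/// same path through the loop body. Hypothetical kernel for illustration.
fn clamp_negatives_to_zero(values: &mut [i32]) {
    for v in values.iter_mut() {
        // `max` expresses the choice as a value computation instead of an
        // `if` that skips work for some elements.
        *v = (*v).max(0);
    }
}

fn main() {
    let mut data: [i32; 5] = [-3, 5, 0, -1, 7];
    clamp_negatives_to_zero(&mut data);
    println!("{data:?}");
}
```

Pasting a function like this into a compiler explorer with and without the target flag is a quick way to see the difference in generated instructions.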

My 2 cents is to get as far as you can without reaching for std::simd; there is a massive maintenance overhead, and with care LLVM can produce code that performs better than naively written manual SIMD. We used to have a fair bit of manual SIMD in arrow-rs, and over time we've removed it as the auto-vectorized code was faster.

I'd recommend getting familiar with tools like https://rust.godbolt.org/ (again being sure to set RUSTFLAGS) and only once you've exhausted that avenue think of reaching for SIMD. Generally the hard part is getting the algorithm structured in such a way that it can be vectorised, regardless of what goes and generates those instructions.

alamb · 2024-10-13T13:03:58Z

Thank you @tustvold -- that content is so good I made a PR to propose putting it in the readme of arrow-rs: apache/arrow-rs#6554

alamb · 2024-10-15T10:41:38Z

After a few more PRs for StringView I think we are pretty close: #12092 (comment)

I'll try and run the numbers at some point to compare to duckdb, but DataFusion is certainly quite a bit faster than 40.0.0 now and will be even more so once we complete the StringView work

alamb · 2024-10-30T18:23:14Z

StringView by default is finally merged into DataFusion: #13101

alamb · 2024-11-04T18:56:49Z

@Rachelint has another non-trivial group by performance improvement that is very close: #12996

alamb · 2024-11-15T10:44:34Z

Update here: the results from @pmcgleenon are looking really nice: #13099 (comment)

Also, BTW, 43.0.0 doesn't include the work from @Rachelint that will likely improve things a few more percent overall (substantially for some queries):

> @Rachelint has another non-trivial group by performance improvement that is very close: #12996

ozankabak · 2024-11-15T11:42:33Z

Love this 🚀 🚀 🚀

alamb · 2024-11-16T12:50:35Z

Here is a fun challenge:

[DISCUSSION] Challenge: Make DataFusion the fastest engine in ClickBench with custom file format #13448

alamb · 2024-11-22T15:33:29Z

While there is definitely more we can do to improve performance, for now I am going to claim we are done here.

The blog is live: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench

🚀

alamb added the enhancement New feature or request label Oct 8, 2024

alamb changed the title ~~[DISCUSSION] Make DataFusion is the fastest engine for querying parquet data in ClickBench~~ [DISCUSSION] Make DataFusion the fastest engine for querying parquet data in ClickBench Oct 8, 2024

alamb mentioned this issue Oct 9, 2024

Release DataFusion 42.1.0 #12813

Closed

4 tasks

alamb mentioned this issue Oct 13, 2024

Minor: Document SIMD rationale and tips apache/arrow-rs#6554

Merged

alamb mentioned this issue Oct 16, 2024

Oct 16, 2024: This week in DataFusion #12973

Closed

Rachelint mentioned this issue Oct 19, 2024

Support vectorized append and compare for multi group by #12996

Merged

This was referenced Oct 21, 2024

Oct 21, 2024: This week in DataFusion #13035

Closed

Update ClickBench benchmarks with DataFusion 43.0.0 #13099

Closed

alamb mentioned this issue Oct 29, 2024

Oct 28, 2024: This week in DataFusion #13167

Closed

3 tasks

alamb mentioned this issue Nov 5, 2024

Nov 5. 2024: This week in DataFusion #13265

Closed

3 tasks

This was referenced Nov 15, 2024

Blog post: How DataFusion became the fastest engine for querying parquet (according to Clickbench) #13436

Closed

[DISCUSSION] Challenge: Make DataFusion the fastest engine in ClickBench with custom file format #13448

Open

alamb closed this as completed Nov 22, 2024
