[EPIC] Improve shuffle performance #1123

andygrove · 2024-11-26T18:59:36Z

This epic is for improving shuffle / ScanExec performance.

Issues

Context

I have been comparing Comet and Ballista performance for TPC-H q3. Both execute similar native plans. I am using the comet-parquet-exec branch which uses DataFusion's ParquetExec.

Ballista is approximately 3x faster than Comet. Given that they are executing similar DataFusion native plans, I would expect performance to be similar.

The main difference between Comet and Ballista is that Comet transfers batches between JVM and native code during shuffle operations.

Most of the native execution time in Comet is spent in ScanExec which is reading Arrow batches from the JVM using Arrow FFI. This time was not included in our metrics prior to #1128 and #1111.

viirya · 2024-11-27T01:24:04Z

Most of the native execution time in Comet is spent in ScanExec which is reading Arrow batches from the JVM using Arrow FFI.

Do you mean when ScanExec is used as a pseudo scan node to read shuffled data into the next native execution?

If so, it makes some sense because the Comet shuffle reader is still JVM-based. We should eventually make it native to boost shuffle read performance. This was on the early roadmap when we started on Comet shuffle, although it was not urgent or high priority at the time. I think now is the time to begin working on it.

Opened: #1125

andygrove · 2024-11-27T16:05:04Z

Do you mean when ScanExec is used as a pseudo scan node to read shuffled data into the next native execution?

Yes, exactly.

andygrove · 2024-12-01T23:23:07Z

If so, it makes some sense because the Comet shuffle reader is still JVM-based.

This is also an issue for shuffle writes. The child node of ShuffleWriterExec is always a ScanNode reading the output from a native plan.

We pay the FFI cost twice - once to import from native plan to JVM, then again to export to the shuffle write native plan. We have the cost of schema serde in both directions. Perhaps there is a way to shortcut this and avoid a full serde because we do not need to read the batch in the JVM in this case, just pass it from one native plan to another.

andygrove · 2024-12-02T14:50:05Z

I created a Google document for collaborating on ideas around improving shuffle performance:

https://docs.google.com/document/d/1rx1ue7UZ4ljzic9Rc2kT-v35bfLB7Rhhe5FW1d0Sw4I/edit?usp=sharing

andygrove · 2024-12-02T18:03:52Z

Update: The ScanExec time comes from calling CometBatchIterator.next(), so it includes the time to fetch input batches as well as the FFI cost. It looks like the cost of fetching the batches is much higher than the FFI cost. This may really be measuring the execution time of the input query, so the metric could be misleading.
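A minimal sketch of how the two costs could be timed separately. This is plain Python, not Comet's code: `measure`, `slow_upstream`, and `import_batch` are hypothetical names, and the 5 ms sleep is a made-up stand-in for an input query that is slow to produce batches.

```python
import time


def measure(upstream, import_batch):
    """Consume `upstream`, timing the fetch and the import step separately."""
    fetch_s = 0.0
    import_s = 0.0
    batches = []
    it = iter(upstream)
    while True:
        t0 = time.perf_counter()
        try:
            raw = next(it)  # time spent waiting on the input query
        except StopIteration:
            break
        t1 = time.perf_counter()
        batches.append(import_batch(raw))  # stand-in for the FFI import
        t2 = time.perf_counter()
        fetch_s += t1 - t0
        import_s += t2 - t1
    return batches, fetch_s, import_s


def slow_upstream(n):
    """Hypothetical input query: each batch takes about 5 ms to produce."""
    for i in range(n):
        time.sleep(0.005)
        yield list(range(i, i + 4))


if __name__ == "__main__":
    batches, fetch_s, import_s = measure(slow_upstream(10), tuple)
    print(f"fetch: {fetch_s:.4f}s, import: {import_s:.4f}s")
```

If only the whole `next()` call were timed, both numbers would be folded into one figure, which is the situation described above.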

andygrove · 2024-12-03T15:48:30Z

Here is an updated diagram showing that most of the native time is spent waiting for batches from the JVM and that the FFI overhead is not an issue.

andygrove · 2024-12-03T16:08:47Z

Breakdown of ScanExec by source:

viirya · 2024-12-03T17:57:48Z

Here is an updated diagram showing that most of the native time is spent waiting for batches from the JVM and that the FFI overhead is not an issue.

That makes more sense. I had doubted that FFI overhead could have a significant effect on the performance numbers. It is designed to be a lightweight way to pass Arrow vectors between processes.

andygrove · 2024-12-11T15:04:48Z

I ran some benchmarks this morning and found that with their respective shuffle managers disabled, Comet and Gluten (+Velox) have fairly similar performance (Gluten was 15% faster), but with shuffle managers enabled, Gluten was 176% faster. I am going to try to learn more about how Gluten+Velox implements shuffle as a background task. I think this validates that DataFusion and Velox likely have similar performance.
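To make the two percentages easier to compare, here is a small arithmetic sketch. It assumes that "N% faster" means the slower run takes (1 + N/100) times as long as the faster one; the thread does not state which definition was used, and the function names are hypothetical.

```python
def speedup_ratio(percent_faster):
    """Convert 'N% faster' into a runtime ratio (slower time / faster time).

    Assumption: 'N% faster' means the slower run takes (1 + N/100) times
    as long as the faster run.
    """
    return 1.0 + percent_faster / 100.0


def relative_runtime(percent_faster):
    """Runtime of the faster system as a fraction of the slower system's."""
    return 1.0 / speedup_ratio(percent_faster)


if __name__ == "__main__":
    cases = (("shuffle managers disabled", 15), ("shuffle managers enabled", 176))
    for label, pct in cases:
        print(f"{label}: {speedup_ratio(pct):.2f}x, "
              f"faster system runs in {relative_runtime(pct):.0%} of the time")
```

Under that reading, the gap grows from a ratio of 1.15x without the shuffle managers to 2.76x with them.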

viirya · 2024-12-11T16:06:48Z

That is interesting. Regarding the background task, do you see Gluten+Velox's shuffle running that way (async?), differently from Spark/Comet?

andygrove · 2024-12-16T22:37:15Z

Here is a comparison of shuffle write metrics between Gluten and Comet. One thing I noticed is that Gluten is writing twice the amount of data when compared to Comet, so I wonder if there is a difference in compression or encoding that accounts for some of the time difference. I will keep investigating.

Dandandan · 2024-12-17T19:26:47Z

One thing to try might be moving from zstd to lz4. The default zstd level (3 I believe) is quite slow for compression. Zstd can be tuned to be comparable in speed (fast mode), but then compression ratio will be lower as well.
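The speed versus ratio trade-off can be measured with a small harness like the one below. Note the substitution: it uses `zlib` from the Python standard library at different levels as a stand-in, not zstd or lz4, so the numbers it prints say nothing about those codecs. `make_payload` and `compress_stats` are hypothetical helpers, and the payload is synthetic rather than real shuffle data.

```python
import time
import zlib


def make_payload(rows):
    """Build a synthetic, moderately repetitive byte payload."""
    return b"".join(
        b"%d,customer_%d,%d\n" % (i, i % 97, i * 31) for i in range(rows)
    )


def compress_stats(data, level):
    """Return (compressed size / original size, seconds spent compressing)."""
    t0 = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - t0
    # Verify the round trip so a broken setting is not mistaken for a fast one.
    assert zlib.decompress(compressed) == data
    return len(compressed) / len(data), elapsed


if __name__ == "__main__":
    payload = make_payload(50_000)
    for level in (1, 6, 9):
        ratio, secs = compress_stats(payload, level)
        print(f"level {level}: ratio {ratio:.3f}, {secs * 1000:.1f} ms")
```

The same shape of measurement (ratio and elapsed time per codec setting, on representative data) would apply when comparing the codecs actually under discussion.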

viirya · 2024-12-17T19:31:05Z

Spark uses lz4 by default. The shuffle codec is not yet configurable in Comet, which uses zstd. @andygrove Did you use the same codec in the comparison?

andygrove · 2024-12-17T20:40:35Z

Thanks @Dandandan. I just discovered this morning that Gluten is using lz4. Comet is using zstd. I plan on trying lz4 in Comet as the next step.

andygrove · 2024-12-17T20:41:15Z

Also, here are updated metrics that now have the correct shuffle write time and also include the encoding time.

andygrove added the enhancement and performance labels Nov 26, 2024

andygrove closed this as completed Nov 26, 2024

andygrove changed the title ~~Improve shuffle performance~~ [EPIC] Improve shuffle performance Nov 26, 2024

andygrove reopened this Nov 26, 2024

This was referenced Nov 26, 2024

[EPIC] Improve performance of TPC-H queries #391

Open

[EPIC] Improve performance of TPC-DS queries #858

Open

andygrove mentioned this issue Nov 27, 2024

with datafusion comet，no performance improvement. #1084

Open

andygrove mentioned this issue Dec 1, 2024

fix: Avoid to call import and export Arrow array for native execution #1055

Closed

andygrove added this to the 0.5.0 milestone Dec 2, 2024

andygrove self-assigned this Dec 17, 2024
