perf: Improve cloud scan performance #19728

nameexhaustion · 2024-11-11T09:15:31Z

Cloud scans were previously slow because we either made a single very large request or made thousands of tiny ones. This PR splits and combines the range requests so that they are evenly distributed with a reasonable chunk size.
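
To illustrate the split/combine idea described above, here is a minimal sketch in Python. It is not the code from this PR (which is in Rust). The function names `merge_ranges`, `split_range` and `plan_requests`, and the default values for `chunk_size` and `max_gap`, are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open byte range [start, end)


def merge_ranges(ranges: List[Range], max_gap: int, max_size: int) -> List[Range]:
    """Coalesce nearby ranges so many tiny requests become a few larger ones."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged:
            prev_start, prev_end = merged[-1]
            close_enough = start - prev_end <= max_gap
            fits = max(end, prev_end) - prev_start <= max_size
            if close_enough and fits:
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


def split_range(r: Range, chunk_size: int) -> List[Range]:
    """Split one range into near-equal parts, each at most chunk_size bytes."""
    start, end = r
    length = end - start
    if length <= 0:
        return []
    n_parts = -(-length // chunk_size)  # ceiling division
    base, extra = divmod(length, n_parts)
    parts: List[Range] = []
    pos = start
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        parts.append((pos, pos + size))
        pos += size
    return parts


def plan_requests(
    ranges: List[Range],
    chunk_size: int = 64 * 1024 * 1024,  # hypothetical default, not from the PR
    max_gap: int = 1024 * 1024,  # hypothetical default, not from the PR
) -> List[Range]:
    """Merge tiny neighbouring ranges, then split anything that is too large."""
    planned: List[Range] = []
    for r in merge_ranges(ranges, max_gap, chunk_size):
        planned.extend(split_range(r, chunk_size))
    return planned
```

With this sketch, one very large range is cut into evenly sized parts, and a run of tiny ranges collapses into a handful of requests. A merged request also covers the gap bytes between the original ranges, so a caller would still need to slice the requested columns out of each downloaded chunk.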

Benchmarks - the source file is 12,000 columns x 24,000 rows on S3 (see linked issue for generator). Tested on EC2.

| query | time (1.12.0, seconds) | time (this PR, seconds) | speedup |
|---|---|---|---|
| `pl.read_parquet(path)` | 29.020907 | 2.771221 | 10.5x |
| `pl.scan_parquet(path).select(cs.by_index(range(6_000))).collect()` | 14.717146 | 1.75916 | 8.4x |
| `pl.scan_parquet(path).select(cs.by_index(range(0, 12_000, 5))).collect()` | 28.716227 | 2.414139 | 11.9x |
| `pl.scan_parquet(path).select(cs.by_index(range(0, 12_000, 3))).collect()` | 28.815121 | 2.515088 | 11.5x |
| `pl.scan_parquet(path).collect(new_streaming=True)` | 28.939848 | 3.001616 | 9.6x |
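
The speedup column is the 1.12.0 time divided by the time with this PR. The short sketch below recomputes it from the timings reported above. The dictionary keys are shorthand labels introduced here for readability, not names from the PR.

```python
# Times in seconds, copied from the benchmark results: (1.12.0, this PR).
timings = {
    "read_parquet": (29.020907, 2.771221),
    "first 6_000 columns": (14.717146, 1.75916),
    "every 5th column": (28.716227, 2.414139),
    "every 3rd column": (28.815121, 2.515088),
    "new_streaming": (28.939848, 3.001616),
}


def speedup(before: float, after: float) -> float:
    """Return how many times faster the new timing is."""
    return before / after


for label, (before, after) in timings.items():
    print(f"{label}: {speedup(before, after):.1f}x")
```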

nameexhaustion · 2024-11-11T09:18:26Z

crates/polars-io/src/cloud/polars_object_store.rs

+
+        // Dropping is delayed for tokio async files so we need to explicitly
+        // flush here (https://github.com/tokio-rs/tokio/issues/2307#issuecomment-596336451).
+        file.sync_all().await.map_err(PolarsError::from)?;


drive-by - moved here from the callsite at file_cache

crates/polars-io/src/parquet/read/async_impl.rs

nameexhaustion · 2024-11-11T09:24:29Z

crates/polars-stream/src/nodes/parquet_source/row_group_data_fetch.rs

-                        // We have a dedicated code-path for a full projection that performs a
-                        // single range request for the entire row group. During testing this
-                        // provided much higher throughput from cloud than making multiple range
-                        // request with `get_ranges()`.


We had this dedicated code path previously. It was fast if the file contained many nicely sized row groups, but not for the file with 12,000 columns in a single row group.

We now just use get_ranges(), which handles the download optimization for us.

ritchie46 · 2024-11-11T09:32:20Z

Great speedups! 🙌

crates/polars-io/src/cloud/polars_object_store.rs

nameexhaustion added 2 commits November 11, 2024 19:36

c

8099aef

c

29eed43

github-actions bot added the performance, python and rust labels Nov 11, 2024


c

8a6da81

nameexhaustion marked this pull request as ready for review November 11, 2024 09:35

nameexhaustion requested review from ritchie46, orlp and c-peters as code owners November 11, 2024 09:35


ritchie46 merged commit c4f0cc2 into pola-rs:main Nov 11, 2024
21 of 22 checks passed

nameexhaustion mentioned this pull request Nov 12, 2024

perf: Adjust coalesce for [<tiny range>, <massive range>] #19730

Merged

nameexhaustion deleted the get-ranges-split branch November 18, 2024 08:21

c-peters added the accepted label Nov 18, 2024

c-peters assigned nameexhaustion Nov 18, 2024
