
perf: Combine small chunks in sinks for streaming pipelines #14346

Merged

Conversation

@itamarst itamarst commented Feb 7, 2024

Fixes #11699

Small chunks add significant overhead. This PR merges them in the sink. If there are lots of small chunks, this should make things faster. If there are lots of large chunks, it adds a little overhead, but the cost is fixed and small per chunk, so that's fine. There may, however, be a slowdown in some in-between edge cases where the chunk merging involves copying extra data.

This doesn't fix the fact that small chunks are being generated, though... I will file a separate PR for the specific edge case in the reproducer script, and file an issue for another likely case if I can reproduce it.
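As a rough sketch of the approach (illustrative only, not the PR's actual code: the type name, threshold, and method names below are invented, while vstack_mut, height, and as_single_chunk_par are existing polars-core DataFrame methods), small incoming frames are stacked into an accumulator and only handed onward once enough rows have piled up:

use polars_core::prelude::*;

// Illustrative threshold; the real value is a tuning decision.
const MIN_ROWS_PER_EMITTED_CHUNK: usize = 50_000;

#[derive(Default)]
struct SmallChunkCombiner {
    acc: Option<DataFrame>,
}

impl SmallChunkCombiner {
    /// Buffer one incoming chunk; return a combined DataFrame once enough
    /// rows have accumulated, otherwise keep buffering.
    fn push(&mut self, df: DataFrame) -> PolarsResult<Option<DataFrame>> {
        match self.acc.take() {
            Some(mut acc) => {
                // vstack_mut appends the incoming frame's chunks without
                // copying any data yet.
                acc.vstack_mut(&df)?;
                self.acc = Some(acc);
            },
            None => self.acc = Some(df),
        }
        let big_enough = self
            .acc
            .as_ref()
            .map_or(false, |acc| acc.height() >= MIN_ROWS_PER_EMITTED_CHUNK);
        if big_enough {
            let mut out = self.acc.take().unwrap();
            // The copy happens here: merge the stacked small chunks into one
            // contiguous chunk, parallelised over columns.
            out.as_single_chunk_par();
            Ok(Some(out))
        } else {
            Ok(None)
        }
    }

    /// Flush whatever is left when the stream ends.
    fn finish(&mut self) -> Option<DataFrame> {
        self.acc.take()
    }
}

A sink would call push() for each incoming chunk and finish() at the end of the stream; chunks that are already large pass through with only the cheap height check.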

Benchmark

Before:

$ python ../11699.py 
SINK 2.8767316341400146
COLLECT+WRITE 0.9260973930358887
STREAMING COLLECT+WRITE 4.061856031417847

After:

$ python ../11699.py
SINK 1.064814805984497
COLLECT+WRITE 0.9025135040283203
STREAMING COLLECT+WRITE 1.1788344383239746

Using this script:

from datetime import datetime

import polars as pl

start_datetime = datetime(2020, 1, 1)
end_datetime = datetime(
    2024,
    1,
    1,
)
N_ids = 10
interval = "1m"
# Cross-join ~2.1 million timestamps (4 years at 1-minute resolution) with
# 10 ids, giving roughly 21 million rows.
df = (
    pl.LazyFrame(
        {
            "time": pl.datetime_range(
                start_datetime, end_datetime, interval=interval, eager=True
            )
        }
    )
    .join(
        pl.LazyFrame(
            {"id": [f"id{i:05d}" for i in range(N_ids)], "value": list(range(N_ids))}
        ),
        how="cross",
    )
    .select("time", "id", "value")
)

# print(df.profile(streaming=True))
from time import time

# Time the three write paths: streaming sink, eager collect + write,
# and streaming collect + write.
start = time()
df.sink_parquet("/tmp/out.parquet")
print("SINK", time() - start)

start = time()
df.collect().write_parquet("/tmp/out2.parquet")
print("COLLECT+WRITE", time() - start)

start = time()
df.collect(streaming=True).write_parquet("/tmp/out3.parquet")
print("STREAMING COLLECT+WRITE", time() - start)

@github-actions github-actions bot added performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars labels Feb 7, 2024
@itamarst itamarst changed the title from "perf: Combine small chunks in sinks for streaming pipelines (#11699)" to "perf: Combine small chunks in sinks for streaming pipelines" on Feb 7, 2024
itamarst commented Feb 7, 2024

Updates:

  1. The failing test is plausibly not this PR's fault?
  2. I tried to see if filtering could result in empirically measurable slowness. My initial attempt failed to show any effects, so I won't be filing a follow-up.

@ritchie46 ritchie46 left a comment

Nice improvement. I have left some comments.

Resolved review threads (now outdated) on crates/polars-pipe/Cargo.toml and crates/polars-pipe/src/operators/chunks.rs.
@itamarst itamarst (Contributor, Author) commented:
@ritchie46 OK hopefully I've finally understood you, sorry it took so long. If I didn't, please feel free to finish this PR.

Final numbers:

SINK 1.0646779537200928
COLLECT+WRITE 0.9343609809875488
STREAMING COLLECT+WRITE 1.083845853805542

I.e. the collect(streaming=True).write_parquet() case is slightly faster due to the as_single_chunk_par() change.
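(For context, as_single_chunk_par() is an existing polars-core DataFrame method that rechunks every column into a single contiguous chunk, parallelising the work across columns. A hedged illustration of the kind of call site meant here, not code from this PR:)

use polars_core::prelude::*;

// Illustrative only: a streaming collect can leave the frame fragmented
// into many tiny chunks; merging them once up front avoids per-chunk
// overhead in whatever consumes the frame next.
fn compact(mut df: DataFrame) -> DataFrame {
    println!("chunks before: {}", df.n_chunks());
    df.as_single_chunk_par();
    println!("chunks after: {}", df.n_chunks()); // 1
    df
}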

@@ -57,32 +55,7 @@ impl Sink for OrderedSink {
self.sort();

let chunks = std::mem::take(&mut self.chunks);
let mut combiner = StreamingVstacker::default();
ritchie46 (Member) commented on this diff:
If we wanted to rechunk, we could simply rechunk here. But I don't want to do that, as that should be left to the consumer of the streaming engine.

itamarst (Contributor, Author) replied:
In that case it should probably be done in write_parquet(); otherwise the collect(streaming=True).write_parquet() case will continue to be slow.

itamarst (Contributor, Author) replied:
Which would require the new struct to be moved into, e.g., the polars-core crate and made public.

Here's the runtime with the latest commit on my computer; the last case is slow again:

SINK 1.0721302032470703
COLLECT+WRITE 0.9235994815826416
STREAMING COLLECT+WRITE 4.254129648208618

itamarst (Contributor, Author) replied:
But maybe also write_feather(), or the cloud parquet writer, etc. (Having it in OrderedSink seemed like a low-cost smoothing of performance bumps, limited to a single place.)

ritchie46 (Member) replied:
Then that logic should indeed be in write_parquet. That writer should check the chunk sizes.

I will first merge this one, and then we can follow up with the write_parquet optimization.
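A hedged sketch of what that writer-side check could look like (illustrative names and threshold; not the actual polars writer code):

use polars_core::prelude::*;

// Illustrative only: rechunk before writing, but only when the frame is
// fragmented into many small chunks, so the common single-chunk case
// pays nothing.
fn maybe_rechunk_for_writer(df: &mut DataFrame) {
    let n_chunks = df.n_chunks();
    if n_chunks > 1 && df.height() / n_chunks < 50_000 {
        // Many small chunks: pay the copy cost once up front instead of
        // per chunk inside the writer.
        df.as_single_chunk_par();
    }
}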

ritchie46 (Member), quoting the comment above, replied:

But maybe also write_feather(), or the cloud parquet writer, etc. (Having it in OrderedSink seemed like a low-cost smoothing of performance bumps, limited to a single place.)

Yes, but it is more expensive for other operations. Operations themselves should know their best chunking strategies.

@ritchie46 ritchie46 (Member) commented:
Thanks for this @itamarst. I think this can be used by more writers. Are you interested in the follow-up PR?

@ritchie46 ritchie46 merged commit 921ddea into pola-rs:main Feb 13, 2024
17 checks passed
@itamarst itamarst deleted the 11699-combine-small-chunks-when-writing branch February 13, 2024 23:49
@itamarst itamarst (Contributor, Author) commented:
Thanks for all the feedback and help, and yes, I'm happy to do a follow-up PR for write_parquet().

Labels
performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
Closes: sink_parquet(...) much slower than .collect().write_parquet(...) (#11699)