
feat(parquet): Target parquet writes by size bytes instead of rows #3457

Merged (18 commits into main on Dec 6, 2024)

Conversation

@colin-ho (Contributor) commented Dec 1, 2024

Addresses #3443

  1. The write operator in swordfish unnecessarily buffers data into morsel_size rows, even though the subsequent writers (particularly the parquet writer, when writing row groups) do their own buffering.
  2. Parquet row group / file sizes can be estimated better by looking at the size_bytes of each micropartition instead of estimating the row size from the schema. Schema-based estimates can be far off, so we could unintentionally write very large row groups and consume a lot of memory doing so (see the sketch after this list).
  3. The inflation factors can be updated adaptively, based on written row groups / rows.
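
As a rough illustration of point 2, here is a minimal sketch (hypothetical names, not the actual Daft code) of deriving a per-row-group row target from the observed size_bytes of incoming data rather than from a schema-based row-size estimate:

```rust
// Illustrative sketch only -- not the actual Daft implementation.
struct RowGroupSizer {
    /// In-memory byte budget per row group, e.g. the on-disk target
    /// multiplied by an inflation factor.
    target_in_memory_bytes: usize,
}

impl RowGroupSizer {
    /// Given a batch's actual in-memory size and row count, compute how
    /// many rows of similar data fit in one target-sized row group.
    fn target_rows(&self, batch_size_bytes: usize, batch_rows: usize) -> usize {
        if batch_rows == 0 || batch_size_bytes == 0 {
            return 1;
        }
        let bytes_per_row = batch_size_bytes as f64 / batch_rows as f64;
        ((self.target_in_memory_bytes as f64 / bytes_per_row) as usize).max(1)
    }
}

fn main() {
    // Wide tensor rows at ~1 MiB each yield a far smaller row target for a
    // 128 MiB row group budget than a schema-based guess typically would.
    let sizer = RowGroupSizer { target_in_memory_bytes: 128 * 1024 * 1024 };
    assert_eq!(sizer.target_rows(1024 * 1024 * 1024, 1024), 128);
}
```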

Below is a before and after of the tensor[uint8] example from the tagged issue with these fixes implemented (running swordfish on a 128-CPU machine).

[Before and after screenshots]

github-actions bot added the enhancement (New feature or request) label on Dec 1, 2024

codspeed-hq bot commented Dec 1, 2024

CodSpeed Performance Report

Merging #3457 will not alter performance

Comparing colin/swordfish-parquet-size-based-writes (a89724a) with main (8009155)

Summary

✅ 17 untouched benchmarks


codecov bot commented Dec 1, 2024

Codecov Report

Attention: Patch coverage is 89.21833% with 40 lines in your changes missing coverage. Please review.

Project coverage is 77.60%. Comparing base (6d30e30) to head (a89724a).
Report is 17 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| daft/io/writer.py | 0.00% | 26 Missing ⚠️ |
| src/daft-writers/src/partition.rs | 50.00% | 6 Missing ⚠️ |
| src/daft-writers/src/file.rs | 96.38% | 3 Missing ⚠️ |
| src/daft-writers/src/lance.rs | 70.00% | 3 Missing ⚠️ |
| src/daft-writers/src/batch.rs | 99.25% | 1 Missing ⚠️ |
| src/daft-writers/src/lib.rs | 98.64% | 1 Missing ⚠️ |
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3457      +/-   ##
==========================================
+ Coverage   77.00%   77.60%   +0.59%     
==========================================
  Files         696      706      +10     
  Lines       86039    86225     +186     
==========================================
+ Hits        66256    66916     +660     
+ Misses      19783    19309     -474     
| Files with missing lines | Coverage Δ |
| --- | --- |
| src/daft-local-execution/src/pipeline.rs | 93.76% <100.00%> (-0.70%) ⬇️ |
| src/daft-local-execution/src/sinks/write.rs | 100.00% <100.00%> (ø) |
| src/daft-writers/src/pyarrow.rs | 99.41% <100.00%> (+0.03%) ⬆️ |
| src/daft-writers/src/test.rs | 96.36% <100.00%> (+0.44%) ⬆️ |
| src/daft-writers/src/batch.rs | 99.10% <99.25%> (-0.11%) ⬇️ |
| src/daft-writers/src/lib.rs | 98.33% <98.64%> (+2.25%) ⬆️ |
| src/daft-writers/src/file.rs | 97.80% <96.38%> (-1.46%) ⬇️ |
| src/daft-writers/src/lance.rs | 95.16% <70.00%> (-4.84%) ⬇️ |
| src/daft-writers/src/partition.rs | 86.79% <50.00%> (-5.13%) ⬇️ |
| daft/io/writer.py | 0.00% <0.00%> (ø) |

... and 99 files with indirect coverage changes

@colin-ho requested a review from samster25 on December 2, 2024
.expect("Micropartitions in target batch writer must be loaded");

if let Some(leftovers) = self.leftovers.take() {
    input = MicroPartition::concat([leftovers, input])?.into();

Member: if we had a bunch of small morsels, we can potentially perform this concat every iteration. Ideally we can just keep a buffer and keep track of the number of bytes / rows that we have.

Author: Makes sense, implemented a simple buffer for this.
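
For illustration, a minimal sketch of such a buffer (hypothetical types; the real code, presumably in src/daft-writers/src/batch.rs, operates on MicroPartition): morsels are stashed and counted, and a single concat happens only once a row or byte threshold is crossed.

```rust
// Hypothetical sketch of a size-tracking write buffer.
struct Batch {
    rows: usize,
    size_bytes: usize,
}

struct WriteBuffer {
    parts: Vec<Batch>,
    buffered_rows: usize,
    buffered_bytes: usize,
    target_rows: usize,
    target_bytes: usize,
}

impl WriteBuffer {
    /// Stash a morsel and report whether enough rows / bytes have
    /// accumulated to justify one concat + write.
    fn push(&mut self, part: Batch) -> bool {
        self.buffered_rows += part.rows;
        self.buffered_bytes += part.size_bytes;
        self.parts.push(part);
        self.buffered_rows >= self.target_rows || self.buffered_bytes >= self.target_bytes
    }

    /// Hand back all buffered morsels for a single concatenation and
    /// reset the counters.
    fn drain(&mut self) -> Vec<Batch> {
        self.buffered_rows = 0;
        self.buffered_bytes = 0;
        std::mem::take(&mut self.parts)
    }
}
```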

cfg.parquet_target_row_group_size as f64,
cfg.parquet_inflation_factor,
);
let (target_in_memory_file_size, target_in_memory_row_group_size) =

Member: discussed offline, but we can expose a tell method on the writer trait and update these values as we write.

Author: Nice suggestion, implemented. Works for S3 as well.
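
Roughly, the idea (a hypothetical trait shape, not the exact Daft signature): the writer reports its current byte position, so the actual compressed bytes written so far can feed back into the inflation-factor estimate; an S3 sink can report the bytes it has handed to its uploader.

```rust
// Hypothetical sketch; the real trait lives in src/daft-writers and
// uses DaftResult rather than std::io::Result.
trait FileWriter {
    type Result;

    fn write(&mut self, data: &[u8]) -> std::io::Result<()>;
    fn close(&mut self) -> std::io::Result<Self::Result>;

    /// Current byte position of the underlying file, if the sink can
    /// report one (e.g. a local file offset, or bytes submitted to an
    /// S3 multipart upload). None if no position is available.
    fn tell(&self) -> std::io::Result<Option<usize>>;
}
```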

@colin-ho commented Dec 3, 2024

Results of the adaptive inflation factor calculations when writing a scale factor 1 lineitem table to S3:

Native runner: 2 files. File 1 (371 MB): 1 row group of 79 MB and 4 row groups of 241 MB. File 2 (320 MB): 1 row group of 122 MB and 2 row groups of 242 MB.

PyRunner: 8 files (due to scan task splitting), each with 2 row groups of 90 MB, except the last, which has 1 row group of 12 MB. When doing into_partitions(1) and then writing, the result is 2 files, each with four 150 MB row groups and one 55 MB row group.

3 resolved review threads on src/daft-writers/src/batch.rs (outdated)

}
}
// Else, we need to split the table
else {

Member: intent is weird

Author: Could you clarify a bit? Is it this else case or the whole loop logic?

Member (@samster25, Dec 5, 2024): whoops, made a typo! I meant to say "indent is weird". The else case is on a new line rather than on the same line as the if's closing bracket.
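
For reference, a small illustration of the requested layout, with else on the same line as the if block's closing bracket:

```rust
// Illustrative only: idiomatic Rust brace placement.
fn write_or_split(remaining_rows: usize, target_rows: usize) {
    if remaining_rows <= target_rows {
        // write the whole table
    } else {
        // else, we need to split the table
    }
}
```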

Resolved review threads (outdated): src/daft-writers/src/file.rs (2), src/daft-writers/src/pyarrow.rs (1)

estimate_in_memory_size_bytes as f64 / actual_on_disk_size_bytes as f64;
let new_num_samples = self.num_samples.fetch_add(1, Ordering::Relaxed) + 1;

let current_factor =

Member: this is pretty bootleg and will probably cause bugs in the future, especially if folks try to use current_factor directly and get some huge number from it.

You are also using 2 atomics, which makes the current factor update non-atomic.

I would just recommend using a Mutex<f32> and doing a mul_add on that. You're not going to be holding it for long anyway, and we're bottlenecked by the GIL mutex anyway.

Author: f32 or f64? The default inflation factors in the execution config come as f64.

Member: either is fine. We don't need the precision of f64, but it doesn't hurt.
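
A minimal sketch of the suggested approach (illustrative; the merged code may differ): keep the factor and sample count behind a single Mutex so the running average is updated as one unit and readers never observe a torn pair.

```rust
use std::sync::Mutex;

// Illustrative sketch of a Mutex-guarded running-average inflation factor.
struct InflationFactor {
    state: Mutex<(f64, u64)>, // (current factor, number of samples)
}

impl InflationFactor {
    /// Fold one observation (estimated in-memory bytes / actual on-disk
    /// bytes) into the running mean under a single lock.
    fn record(&self, estimated_in_memory: u64, actual_on_disk: u64) {
        if actual_on_disk == 0 {
            return; // nothing written yet; avoid dividing by zero
        }
        let sample = estimated_in_memory as f64 / actual_on_disk as f64;
        let mut guard = self.state.lock().unwrap();
        let (factor, n) = *guard;
        let new_n = n + 1;
        // Incremental mean: m' = m + (x - m) / n'.
        *guard = (factor + (sample - factor) / new_n as f64, new_n);
    }

    fn current(&self) -> f64 {
        self.state.lock().unwrap().0
    }
}
```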

@colin-ho requested a review from samster25 on December 5, 2024

/// Close the file and return the result. The caller should NOT write to the file after calling this method.
fn close(&mut self) -> DaftResult<Self::Result>;

/// Return the current position of the file, if applicable.

Member: update this comment

@colin-ho changed the title from "[FEAT] Target parquet writes by size bytes instead of rows" to "feat(parquet): Target parquet writes by size bytes instead of rows" on Dec 6, 2024
github-actions bot added the feat label on Dec 6, 2024
@colin-ho merged commit 528b797 into main on Dec 6, 2024 (43 of 44 checks passed)
@colin-ho deleted the colin/swordfish-parquet-size-based-writes branch on December 6, 2024