feat(parquet): Target parquet writes by size bytes instead of rows #3457
Conversation
CodSpeed Performance Report: merging #3457 will not alter performance.
Codecov Report:
Additional details and impacted files:
@@ Coverage Diff @@
## main #3457 +/- ##
==========================================
+ Coverage 77.00% 77.60% +0.59%
==========================================
Files 696 706 +10
Lines 86039 86225 +186
==========================================
+ Hits 66256 66916 +660
+ Misses 19783 19309 -474
src/daft-writers/src/batch.rs
Outdated
.expect("Micropartitions in target batch writer must be loaded"); | ||
|
||
if let Some(leftovers) = self.leftovers.take() { | ||
input = MicroPartition::concat([leftovers, input])?.into(); |
If we had a bunch of small morsels, we could potentially perform this concat on every iteration. Ideally we would just keep a buffer and keep track of the number of bytes / rows that we have.
Makes sense, implemented a simple buffer for this.
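A minimal sketch of that buffering approach, with hypothetical names and Vec<u64> standing in for a micropartition, just to show the bookkeeping rather than Daft's actual types:

```rust
// Hypothetical buffer: small morsels are collected and only concatenated once
// the tracked row count reaches the target, instead of concatenating on every
// incoming morsel.
struct MorselBuffer {
    morsels: Vec<Vec<u64>>, // stand-in for buffered micropartitions
    buffered_rows: usize,
    target_rows: usize,
}

impl MorselBuffer {
    fn new(target_rows: usize) -> Self {
        Self { morsels: Vec::new(), buffered_rows: 0, target_rows }
    }

    /// Buffer a morsel; return one concatenated batch only when the target is reached.
    fn push(&mut self, morsel: Vec<u64>) -> Option<Vec<u64>> {
        self.buffered_rows += morsel.len();
        self.morsels.push(morsel);
        if self.buffered_rows >= self.target_rows {
            self.buffered_rows = 0;
            // One concat for the whole buffered run instead of one per morsel.
            Some(self.morsels.drain(..).flatten().collect())
        } else {
            None
        }
    }
}

fn main() {
    let mut buf = MorselBuffer::new(4);
    for morsel in [vec![1, 2], vec![3], vec![4, 5]] {
        if let Some(batch) = buf.push(morsel) {
            println!("flushing batch of {} rows", batch.len());
        }
    }
}
```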
src/daft-writers/src/lib.rs
Outdated
    cfg.parquet_target_row_group_size as f64,
    cfg.parquet_inflation_factor,
);
let (target_in_memory_file_size, target_in_memory_row_group_size) = |
Discussed offline, but we can expose a tell method on the writer trait and update these values as we write.
Nice suggestion, implemented. Works for S3 as well.

Results of doing adaptive inflation factor calculations when writing a scale factor 1 lineitem to S3:
- Native runner: 2 files. File 1 (371 MB): 1 row group of 79 MB and 4 row groups of 241 MB. File 2 (320 MB): 1 row group of 122 MB and 2 row groups of 242 MB.
- PyRunner: 8 files (due to scan task splitting), each with 2 row groups of 90 MB, except the last one, which has 1 row group of 12 MB.
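A rough sketch of the tell idea under discussion, with hypothetical trait and type names (the real writer trait and its Result type are not reproduced here): the caller compares the in-memory bytes it has submitted against the byte position the sink reports and derives the observed inflation factor as it writes.

```rust
use std::io::Write;

// Hypothetical trimmed-down writer trait: the sink reports how many bytes it
// has actually written so far.
trait BatchWriter {
    fn write(&mut self, data: &[u8]) -> std::io::Result<()>;

    /// Current position in bytes, if the sink can report it (e.g. a local file
    /// or an S3 multipart upload).
    fn tell(&self) -> Option<usize>;
}

// Toy in-memory sink used only to exercise the adaptive loop below.
struct VecWriter {
    buf: Vec<u8>,
}

impl BatchWriter for VecWriter {
    fn write(&mut self, data: &[u8]) -> std::io::Result<()> {
        self.buf.write_all(data)
    }

    fn tell(&self) -> Option<usize> {
        Some(self.buf.len())
    }
}

fn main() -> std::io::Result<()> {
    let mut writer = VecWriter { buf: Vec::new() };
    let mut submitted_in_memory_bytes = 0usize;
    for batch in [vec![0u8; 1024], vec![1u8; 2048]] {
        submitted_in_memory_bytes += batch.len();
        writer.write(&batch)?;
        if let Some(written) = writer.tell() {
            // Observed inflation factor: in-memory bytes per byte actually written.
            let inflation = submitted_in_memory_bytes as f64 / written as f64;
            println!("observed inflation factor: {inflation:.2}");
        }
    }
    Ok(())
}
```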
}
}
// Else, we need to split the table | ||
else { |
intent is weird
Could you clarify a bit? Is it this else case or the whole loop logic?
Whoops, made a typo! I meant to say indent is weird. The else case is on a new line rather than on the same line as the if's closing bracket.
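For illustration, a schematic of the two shapes being discussed (hypothetical function, not the actual code):

```rust
// Hypothetical function, only to show the two layouts being discussed.
fn write_or_split(fits_in_target: bool) {
    // Shape being flagged: the `else` lands on its own line after the comment,
    // detached from the if's closing brace:
    //
    //     }
    //     // Else, we need to split the table
    //     else {
    //
    // Usual rustfmt shape, with `else` on the same line as the closing brace:
    if fits_in_target {
        // write the micropartition as-is
    } else {
        // we need to split the table
    }
}

fn main() {
    write_or_split(true);
}
```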
src/daft-writers/src/lib.rs
Outdated
    estimate_in_memory_size_bytes as f64 / actual_on_disk_size_bytes as f64;
let new_num_samples = self.num_samples.fetch_add(1, Ordering::Relaxed) + 1;

let current_factor = |
This is pretty bootleg and will probably cause bugs in the future, especially if folks try to use current_factor directly and get some huge number from it. You are also using 2 atomics, which makes the current factor not atomic. I would just recommend using a Mutex<f32> and doing a mul_add on that. You're not going to be holding it for long anyway, and we're bottlenecked by the GIL mutex anyway.
f32 or f64? The default inflation factors in the execution config come as f64.
Either is fine. We don't need the precision of f64, but it doesn't hurt.
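A minimal sketch of the Mutex-based running estimate suggested above, with hypothetical names; f64 is used here since the config defaults are f64, though f32 would also work per the discussion:

```rust
use std::sync::Mutex;

// The factor and the sample count live behind one lock, so a reader can never
// observe a half-updated pair, unlike the two-atomics version.
struct InflationEstimate {
    state: Mutex<(f64, u64)>, // (current inflation factor, number of samples)
}

impl InflationEstimate {
    fn new(initial_factor: f64) -> Self {
        Self { state: Mutex::new((initial_factor, 0)) }
    }

    /// Fold one observation (in-memory bytes vs bytes actually on disk) into a running mean.
    fn record(&self, in_memory_size_bytes: u64, on_disk_size_bytes: u64) {
        let observed = in_memory_size_bytes as f64 / on_disk_size_bytes as f64;
        let mut guard = self.state.lock().unwrap();
        let (factor, samples) = *guard;
        let new_samples = samples + 1;
        // Incremental mean via mul_add: factor + (observed - factor) / new_samples.
        let new_factor = (observed - factor).mul_add(1.0 / new_samples as f64, factor);
        *guard = (new_factor, new_samples);
    }

    fn current_factor(&self) -> f64 {
        self.state.lock().unwrap().0
    }
}

fn main() {
    let est = InflationEstimate::new(3.0);
    est.record(1_000, 400); // observed factor 2.5
    est.record(1_000, 500); // observed factor 2.0
    println!("current inflation factor: {:.2}", est.current_factor());
}
```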
src/daft-writers/src/lib.rs
Outdated
/// Close the file and return the result. The caller should NOT write to the file after calling this method.
fn close(&mut self) -> DaftResult<Self::Result>;

/// Return the current position of the file, if applicable.
update this comment
Addresses #3443

Writes are currently buffered by morsel_size rows, even though the subsequent writers (particularly the parquet writer when writing row groups) do their own buffering. This PR targets writes by the size_bytes of each micropartition instead of trying to estimate the row size based on the schema. Estimating from the schema is a problem because we could unintentionally write very, very large row groups and consume a lot of memory doing so.

This is a before and after of the tensor[uint8] example in the tagged issue with the fixes implemented. (Running swordfish on a 128-CPU machine.)
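As a rough illustration of targeting by bytes, here is a sketch (hypothetical function name, not the PR's actual code) of deriving a per-row-group row count from a micropartition's size_bytes rather than a schema-based estimate:

```rust
// Hypothetical helper: decide how many rows of a loaded micropartition fit
// into the in-memory row group target, using the micropartition's own
// size_bytes instead of a schema-based per-row estimate.
fn rows_for_target(num_rows: usize, size_bytes: usize, target_row_group_size_bytes: usize) -> usize {
    if size_bytes == 0 || num_rows == 0 {
        return num_rows;
    }
    let avg_row_bytes = size_bytes as f64 / num_rows as f64;
    // Always make progress by emitting at least one row per split.
    ((target_row_group_size_bytes as f64 / avg_row_bytes).floor() as usize).max(1)
}

fn main() {
    // A 512 MiB micropartition with 1M rows and a 128 MiB in-memory row group
    // target splits into row groups of roughly 250k rows.
    let rows = rows_for_target(1_000_000, 512 << 20, 128 << 20);
    println!("rows per row group: {rows}");
}
```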