Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: more efficient parquet writer and more statistics #1397

Merged
merged 4 commits into from
May 28, 2023

Conversation

wjones127
Copy link
Collaborator

@wjones127 wjones127 commented May 27, 2023

Description

  • Removed the data copies in a tight loop, which were extremely bad for performance when writing files > 100MB.
  • Rewrote statistics handling to collect null values from metadata, just like min and max.
  • Added support for more types in statistics.

Related Issue(s)

Documentation

@github-actions github-actions bot added binding/rust Issues for the Rust crate rust labels May 27, 2023
@@ -324,40 +318,27 @@ impl PartitionWriter {
}

fn write_batch(&mut self, batch: &RecordBatch) -> DeltaResult<()> {
// copy current cursor bytes so we can recover from failures
// TODO is copying this something we should be doing?
let buffer_bytes = self.buffer.to_vec();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the offending line causing a lot of data copying.

@@ -528,7 +698,7 @@ mod tests {
("some_bool", ColumnCountStat::Value(v)) => assert_eq!(100, *v),
("some_string", ColumnCountStat::Value(v)) => assert_eq!(100, *v),
("some_list", ColumnCountStat::Value(v)) => assert_eq!(100, *v),
("some_nested_list", ColumnCountStat::Value(v)) => assert_eq!(0, *v),
("some_nested_list", ColumnCountStat::Value(v)) => assert_eq!(100, *v),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this test was incorrect, given the null values here:

"some_nested_list": [[42], null],

@wjones127 wjones127 marked this pull request as ready for review May 27, 2023 23:41
rtyler
rtyler previously approved these changes May 28, 2023
roeap
roeap previously approved these changes May 28, 2023
Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really looking forward to see this in action! Feels like this a great step forward for our write experience!

@@ -448,6 +615,8 @@ mod tests {
let mut null_count_keys = vec!["some_list", "some_nested_list"];
null_count_keys.extend_from_slice(min_max_keys.as_slice());

dbg!(&stats);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug artifact?

}

impl AddAssign for AggregatedStats {
fn add_assign(&mut self, rhs: Self) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

roeap
roeap previously approved these changes May 28, 2023
@wjones127 wjones127 merged commit b8bc0f7 into delta-io:main May 28, 2023
@wjones127 wjones127 deleted the writer-perf branch May 28, 2023 17:50
roeap pushed a commit to roeap/delta-rs that referenced this pull request Jun 2, 2023
# Description

* Removed the data copies in a tight loop, which were extremely bad for
performance when writing files > 100MB.
* Rewrote statistics handling to collect null values from metadata, just
like min and max.
* Added support for more types in statistics.

# Related Issue(s)

- closes delta-io#1394
- closes delta-io#1209 
- closes delta-io#1208


# Documentation

<!---
Share links to useful documentation
--->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate rust
Projects
None yet
3 participants