-
Notifications
You must be signed in to change notification settings - Fork 787
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix writing of invalid Parquet ColumnIndex when row group contains null pages #6319
Conversation
parquet/src/file/metadata/writer.rs
Outdated
// assert_eq!(left.offset_index_length(), right.offset_index_length()); | ||
// assert_eq!(left.column_index_length(), right.column_index_length()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With just the changes to this file (so the new test data) this fails on master
because of #6316, so I commented it out for now and if:
- Pass empty vectors as min/max for all null pages when building ColumnIndex #6316 gets merged first I can rebase and re-enable the code
- If this gets merged first then Pass empty vectors as min/max for all null pages when building ColumnIndex #6316 can uncomment and re-enable this code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re-enabled in 8eea1e5
let writer_props = writer_props_builder | ||
.set_max_row_group_size(4) | ||
.set_data_page_row_count_limit(2) | ||
.set_write_batch_size(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recognize that this is forcing things a bit. I leaned towards this approach because (1) it felt like a small step from where we are now and (2) I'm not that familiar with the lower level APIs that would be required if we wanted to build a ParquetMetadata
by hand and use that for tests instead of relying on file writing to build one for us. I think a reasonable piece of feedback to this PR would be that now is the time to refactor the tests to build the ParquetMetadata
by hand since we now care about getting pages of a certain size, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thik this is reasonable
// make sure each row group has 2 pages by checking the first column | ||
// page counts for each column for each row group, should all be the same and there should be | ||
// 12 pages in total across 6 row groups / 1 column | ||
let mut page_counts = vec![]; | ||
for row_group in column_indexes { | ||
for column in row_group { | ||
match column { | ||
Index::INT32(column_index) => { | ||
page_counts.push(column_index.indexes.len()); | ||
} | ||
_ => panic!("unexpected column index type"), | ||
} | ||
} | ||
} | ||
assert_eq!(page_counts, vec![2; 6]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless we refactor this test to create the metadata directly I think this is a good idea. Previously there was no assertion that we were creating more than 1 page or anything of the sort, since this is testing code that evidently diverges in code paths depending on how many pages there are, etc. it's good to at least have assertions that the data is laid out as we expect it to be.
parquet/src/column/writer/mod.rs
Outdated
vec![], | ||
vec![], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, that's reading ahead 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reverted it and pushed a single commit that also credits you 😄
…ll pages Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
cd93a1e
to
5a127c6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 (non-binding). I like the increased sanity checking in the tests. Having second thoughts on the fix, however, but it's just a style thing and not blocking.
@@ -450,19 +451,57 @@ mod tests { | |||
true, | |||
)])); | |||
|
|||
let a: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(2)])); | |||
// build row groups / pages that exercise different combinations of nulls and values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like these test additions, particularly that they build off of the tests added in #6197. They also seem much more thorough than what I proposed 😄 .
parquet/src/file/page_index/index.rs
Outdated
let mut min_values = vec![vec![]; self.indexes.len()]; | ||
let mut max_values = vec![vec![]; self.indexes.len()]; | ||
for (i, index) in self.indexes.iter().enumerate() { | ||
if let Some(min) = index.min_bytes() { | ||
min_values[i].extend_from_slice(min); | ||
} | ||
if let Some(max) = index.max_bytes() { | ||
max_values[i].extend_from_slice(max); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know I'm responsible for this, but now I wonder if simply changing the above to something like:
let min_values = self
.indexes
.iter()
.map(|x| x.min_bytes().unwrap_or(&[]).to_vec())
.collect::<Vec<_>>();
would be more rusty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup i think it would be, changed
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Thank you @etseidl! Can we merge this or do we need to loop in some more folks? |
I'm not a committer so I have no power here 😅. @alamb will hopefully take a look once his vacation is over. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @adriangb and @etseidl . I really appreciate the quick turnaround and help. Again sorry for the delay in reviewing
I just merged #6316 and re-enabled the tests in https://github.com/apache/arrow-rs/pull/6319/files#r1735056465
let writer_props = writer_props_builder | ||
.set_max_row_group_size(4) | ||
.set_data_page_row_count_limit(2) | ||
.set_write_batch_size(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thik this is reasonable
🔧 |
Fixes #6310. All credit goes to @etseidl.