-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor parquet row group pruning into a struct (use new statistics API, part 1) #10607
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewing with whitespace blind diff makes the changes in this PR more obvious I think
https://github.com/apache/datafusion/pull/10607/files?w=1
); | ||
let rg_metadata = file_metadata.row_groups(); | ||
// track which row groups to actually read | ||
let mut row_groups = RowGroupSet::new(rg_metadata.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whole point of this PR is to move the set of RowGroups that are still under consideration into a RowGroupSet
rather than a Vec<usize>
This makes it clearer what is going on in my opinion, as well as sets up up for vectorized pruning evaluation in the next PR
/// | ||
/// Updates this set to mark row groups that should not be scanned | ||
pub fn prune_by_range(&mut self, groups: &[RowGroupMetaData], range: &FileRange) { | ||
for (idx, metadata) in groups.iter().enumerate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the range filtering into its own function to make it clearer what is going on (it has nothing to do with statistics pruning)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 this looks more readable
&metrics | ||
), | ||
vec![1] | ||
let mut row_groups = RowGroupSet::new(2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the tests to use RowGroupSet
which I think actually makes them easier to read
@@ -1196,14 +1223,60 @@ mod tests { | |||
.await | |||
} | |||
|
|||
// What row groups are expected to be left after pruning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was finding vec![0]
hard to understand what was being tested, so I made this structure to encapsulate it
@Ted-Jiang I wonder if you might have time to review this PR at some point? |
Thanks ❤️ @alamb i will review this carefully later today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM add RowGroupSet
really make this pruning code more easy to read 👍 Thanks @alamb
Thank you very much for the review @Ted-Jiang 🙏 |
…API, part 1) (apache#10607) * Refactor parquet row group pruning into a struct * Port tests * improve docs * fix msrv
/// Prune remaining row groups to only those within the specified range. | ||
/// | ||
/// Updates this set to mark row groups that should not be scanned | ||
pub fn prune_by_range(&mut self, groups: &[RowGroupMetaData], range: &FileRange) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being late and nit-picking here.
There is a potential for misuse of this API if the number of row groups provided to RowGroupSet does not align with the passed groups. Should we first verify that the number of row groups matches, or do you believe the row group metadata array should be included as a field in RowGroupSet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being late and nit-picking here.
No that is great. Thank you . I appreciate the comments.
should we first verify that the number of row groups matches,
Yes this is a good idea. I will do so in the next PR
or do you believe the row group metadata array should be included as a field in RowGroupSet?
I actually tried this in an earlier version of this code. One challenge I found is that prune_by_bloom_filter
needs to have the builder
(as it may need to potentially fetch bloom filters) which also has the metadata. So if the RowGroupSet had the metadata then there was an opportunity for inconsistency (between the metadata in the RowGroupSet and the builder)
I do think there might be an opportunity for imporvement here though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a PR to add some asserts: #10641
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is a good idea. I will do so in the next PR
Thanks for the quick update. I agree we can at least add check first.
I actually tried this in an earlier version of this code. One challenge I found is that prune_by_bloom_filter needs to have the builder (as it may need to potentially fetch bloom filters) which also has the metadata. So if the RowGroupSet had the metadata then there was an opportunity for inconsistency (between the metadata in the RowGroupSet and the builder)
I do think there might be an opportunity for improvement here though
Then, I think we can leave it as it is for now. We can revise the impl later if some issues popped up.
…API, part 1) (apache#10607) * Refactor parquet row group pruning into a struct * Port tests * improve docs * fix msrv
Which issue does this PR close?
Part of #10453 and #9929
Rationale for this change
This PR prepares switching the parquet reader to use the new statistics extraction API added in #10537.
To avoid a massive, I broke it into its own PR for easier review. This PR refactors the code around, but doesn't change any behavior. The next PR will change the pruning API to use the new statistics extraction API.
I also think this change makes the code easier to understand and maintain as it uses a documented struct to represent the set of
RowGroup
s to scan rather than aVec<usize>
.What changes are included in this PR?
RowGroupSet
to encapsulate the set of row groups which is being prunedRowGroupSet
methodsRowGroupSet
Are these changes tested?
Yes, existing tests are updated to use the new structures
Are there any user-facing changes?
No these are all internal changes