Prune Parquet RowGroup in a single call to PruningPredicate::prune, update StatisticsExtractor API #10802

Conversation
```diff
 // Extract the min/max values for each row group from the statistics
-let row_counts = StatisticsConverter::row_counts(reader.metadata())?;
-let value_column_mins = StatisticsConverter::try_new(
+let converter = StatisticsConverter::try_new(
```
This is a pretty good example of how the statistics API changed. FYI @NGA-TRAN
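To make the change concrete, here is a minimal sketch of the new call pattern. The import path for `StatisticsConverter` and the `min_max_for_column` helper are illustrative assumptions, not code from this PR:

```rust
use arrow_array::ArrayRef;
use arrow_schema::Schema;
use datafusion::error::Result;
// Illustrative path: the converter was added in #10537; its exact
// re-export location may differ by DataFusion version.
use datafusion::datasource::physical_plan::parquet::StatisticsConverter;
use parquet::file::metadata::RowGroupMetaData;
use parquet::schema::types::SchemaDescriptor;

// Hypothetical helper showing the new shape: one converter per column,
// then one method call per statistic, each returning an array with one
// element per row group.
fn min_max_for_column(
    column: &str,
    arrow_schema: &Schema,
    parquet_schema: &SchemaDescriptor,
    row_groups: &[RowGroupMetaData],
) -> Result<(ArrayRef, ArrayRef)> {
    let converter = StatisticsConverter::try_new(column, arrow_schema, parquet_schema)?;
    let mins = converter.row_group_mins(row_groups.iter())?;
    let maxes = converter.row_group_maxes(row_groups.iter())?;
    Ok((mins, maxes))
}
```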
```diff
-Ok(values) => {
-    // NB: false means don't scan row group
-    if !values[0] {
+// Indexes of row groups still to scan
```
Here is the change to prune all row groups with one call to PruningPredicate::prune rather than one call per row group.
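For context, a minimal sketch of the resulting call shape, assuming `predicate` is the file's `PruningPredicate` and `stats` is a `PruningStatistics` implementation with one container per row group (both names are stand-ins):

```rust
// `prune` evaluates the predicate against every container at once and
// returns one bool per row group; false means the row group cannot
// match and does not need to be scanned.
let keep: Vec<bool> = predicate.prune(&stats)?;

// Indexes of row groups still to scan
let row_groups_to_scan: Vec<usize> = keep
    .into_iter()
    .enumerate()
    .filter_map(|(idx, keep)| keep.then_some(idx))
    .collect();
```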
```rust
let iter = metadatas
    .into_iter()
    .map(|x| x.column(parquet_index).statistics());
max_statistics(data_type, iter)
```
The min_statistics and max_statistics changes in another PR could still be used here...
Yes, indeed -- you are exactly correct. I purposely didn't change min_statistics and max_statistics as I knew you were working on them already
```diff
-    .extract(reader.metadata())?;
+    reader.parquet_schema(),
+)?;
+let row_counts = StatisticsConverter::row_group_row_counts(row_groups.iter())?;
```
This looks like a user-facing change. Is that ok at this stage?
Yes, it is a user-facing change, but we haven't released a version of DataFusion yet that had StatisticsConverter (it was only added a week or two ago), so this will not be an API change for anyone using released versions.
```diff
     .unwrap();

-let _ = StatisticsConverter::row_counts(reader.metadata()).unwrap();
+let _ = converter.row_group_mins(row_groups.iter()).unwrap();
```
This is clearer than using an enum IMO :)
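To make the contrast concrete, a hypothetical before/after; the enum-based call paraphrases the removed API (which selected the statistic via an enum with Min/Max/NullCount-style variants) rather than quoting it exactly:

```rust
// Before (paraphrased): the statistic was chosen by a runtime enum value,
// and a separate converter was built per (column, statistic) pair.
let value_column_mins = StatisticsConverter::try_new(
    "value",
    RequestedStatistics::Min, // hypothetical enum name
    reader.schema(),
)?
.extract(reader.metadata())?;

// After: the statistic is named by the method at the call site, and new
// kinds (e.g. page-level statistics) can be added as new methods.
let value_column_mins = converter.row_group_mins(row_groups.iter())?;
```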
Nice. Thanks Andrew
```rust
)?;
let row_counts = StatisticsConverter::row_group_row_counts(row_groups.iter())?;
let value_column_mins = converter.row_group_mins(row_groups.iter())?;
let value_column_maxes = converter.row_group_maxes(row_groups.iter())?;
```
❤️
```rust
    /// Null Count, returned as a [`UInt64Array`]
    NullCount,
}
```
I agree we do not need this. We store each min/max/row_count as an array instead.
I gave it a quick skim and it looks good in general 👍
Thank you @NGA-TRAN, @xinlifoobar, and @waynexia for the reviews
Which issue does this PR close?
Part of #10453 and #9929.
Follow-on to #10607.
Rationale for this change
The primary benefit of this PR is to start using the new API introduced in #10537 in the ParquetExec path. I plan a follow-on project to use the same basic API to extract and prune pages within row groups.

The current ParquetExec prunes one row group at a time, creating 1-row ArrayRefs for each min/max/count required. It would be better to create a single array with the data for multiple row groups and make a single call to the vectorized pruning that PruningPredicate does. We recently made a similar change in InfluxDB IOx and saw a significant performance improvement for queries that accessed many row groups.

I expect this to be a performance improvement, but I am not sure it will be measurable unless there is an extremely large number of row groups in a file.
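A rough sketch of the shape of this change; `one_row_stats_for` and `stats_for_all_row_groups` are made-up stand-ins for the statistics extraction code:

```rust
// Before: one prune() call per row group, each over 1-row min/max arrays.
let mut to_scan = vec![];
for (idx, row_group) in row_group_metadata.iter().enumerate() {
    let stats = one_row_stats_for(row_group); // 1-row ArrayRefs
    let keep = pruning_predicate.prune(&stats)?; // Vec<bool> of length 1
    if keep[0] {
        to_scan.push(idx);
    }
}

// After: one prune() call over arrays with one element per row group, so
// the predicate is evaluated once with vectorized arrow kernels.
let stats = stats_for_all_row_groups(&row_group_metadata); // N-row ArrayRefs
let keep = pruning_predicate.prune(&stats)?; // Vec<bool> of length N
let to_scan: Vec<usize> = keep
    .into_iter()
    .enumerate()
    .filter_map(|(idx, keep)| keep.then_some(idx))
    .collect();
```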
What changes are included in this PR?
- Call PruningPredicate::prune once per file (rather than once per row group)
- Use the StatisticsExtractor API introduced in "feat: API for collecting statistics/index for metadata of a parquet file + tests" (#10537)
- Change the StatisticsExtractor API so it extracts a specified set of row groups rather than all of them (see the sketch after this list)

The changes to the StatisticsExtractor API are to return min/max statistics from different functions rather than via an enum. This will allow the same basic API to extract min/max statistics for pages as well (page_mins(), page_maxs(), page_counts(), etc.).
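A short sketch of the row-group-scoped extraction described above, assuming (as the diff's `row_groups.iter()` calls suggest) that the `row_group_*` methods accept any iterator of row group metadata; `converter` and `row_groups` are carried over from the earlier snippets and the subset indexes are made up:

```rust
// Extract statistics only for a caller-chosen subset of row groups
// (e.g. those that survived an external index), not the whole file.
let candidate_indexes = [1usize, 3, 4];
let subset = || candidate_indexes.iter().map(|&i| &row_groups[i]);

let mins = converter.row_group_mins(subset())?;
let maxes = converter.row_group_maxes(subset())?;
// The hypothetical page-level analogues mentioned above (page_mins(),
// page_maxs(), page_counts()) would follow the same pattern.
```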
Are these changes tested?
Covered by existing CI tests.
I ran some benchmark tests and this doesn't seem to meaningfully change performance.
Are there any user-facing changes?
The StatisticsExtractor API has changed, but since this API has not yet been released, this is not strictly a breaking API change.