Efficiently and correctly Extract Page Index statistics into ArrayRefs
#10806
@alamb Do we already have a helper fn in place to write a parquet file with I'll keep on looking - but perhaps you have a quick pointer here, where to look? |
Thanks @marvinlanhenke 🙏
To write the relevant structures into Parquet, the statistics_enable field needs to be Page.
To read them back, the reader needs to be configured with with_page_index, I think.
Also, I have a proposed change to the Statistics code in #10802. If that gets merged, then the API for extracting the mins from data pages might look like:
// get relevant index statistics somehow
let data_page_statistics: Vec<&Statistics> = todo!();
let converter = StatisticsConverter::try_new(
    column_name,
    reader.schema(),
    reader.parquet_schema(),
);
// get mins from the ColumnIndex
let mins = converter.column_index_mins(data_page_statistics).unwrap(); |
The proposed API looks nice 👌 Until the merge I can use the time to explore and prototype. Thanks for the pointers |
@alamb This is what I originally had in mind for the converter method:
pub fn column_index_mins(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> {
    let data_type = self.arrow_field.data_type();
    let Some(parquet_column_index) = metadata.column_index() else {
        return Ok(self.make_null_array(data_type, metadata.row_groups()));
    };
    let Some(parquet_index) = self.parquet_index else {
        return Ok(self.make_null_array(data_type, metadata.row_groups()));
    };
    let row_group_page_indices = parquet_column_index
        .into_iter()
        .map(|x| x.get(parquet_index));
    min_page_statistics(Some(data_type), row_group_page_indices)
}
So we would simply create an iterator over all row groups' column indices, match the index and apply the statsfunc. Which works - all tests are passing. However, the API, or the integration with
Now, my API has to change. I'm wondering how specific it should be? Maybe I'm missing something, but I think it would help to maybe outline the scope of the refactor you had in mind. |
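For readers without the DataFusion sources at hand, the control flow of the method above can be sketched with plain standard-library types. Everything here is a hypothetical stand-in, not the real parquet or DataFusion API: `FileMetadata`, `ColumnPageIndex` and the free function `column_index_mins` are invented for illustration, and the result is a `Vec<Option<i64>>` instead of an `ArrayRef`. As in the snippet above, the fallback is one null per row group when the page index or the column is missing.

```rust
/// Hypothetical stand-in for one column's page index inside one row group.
#[derive(Debug, Clone, PartialEq)]
enum ColumnPageIndex {
    /// No page-level statistics were written for this column.
    Missing,
    /// One optional minimum per data page.
    Int64(Vec<Option<i64>>),
}

/// Hypothetical stand-in for the file metadata.
struct FileMetadata {
    num_row_groups: usize,
    /// Outer Vec: row groups, inner Vec: columns.
    /// `None` models a file that was read without the page index.
    column_index: Option<Vec<Vec<ColumnPageIndex>>>,
}

/// Collect the page-level minimums of one column across all row groups.
/// Falls back to one null per row group when no index is available.
fn column_index_mins(metadata: &FileMetadata, column: Option<usize>) -> Vec<Option<i64>> {
    let all_null = || -> Vec<Option<i64>> { vec![None; metadata.num_row_groups] };

    let Some(column_index) = metadata.column_index.as_ref() else {
        return all_null();
    };
    let Some(column) = column else {
        return all_null();
    };

    column_index
        .iter()
        // Pick this column's index out of every row group, then flatten
        // the per-page minimums into one output sequence.
        .flat_map(|row_group| match row_group.get(column) {
            Some(ColumnPageIndex::Int64(mins)) => mins.clone(),
            // Missing or unknown index: a single null for that row group.
            _ => vec![None],
        })
        .collect()
}
```

Note that this sketch walks every row group in one call, which is exactly the design question raised in this comment: the existing page filter works on one row group at a time.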
Thank you @marvinlanhenke -- excellent analysis.
Yes. It is my eventual goal for all of the code to convert To avoid a massive PR, however, I think it makes sense to add new code to
Indeed that is how it works today (one row group at a time). I eventually hope/plan to apply the same treatment to data page filtering as I did to row group filtering in #10802 (that is, make a single call to
Let me play around with some options and get back to you |
@marvinlanhenke -- I whipped up something (actually I had been playing with it yesterday) #10843 |
Thank you so much - I quickly skimmed the draft you uploaded (will take a closer look tomorrow). My main question should be answered - for now we are iterating over each row group one by one using a row group index. I also agree about the scope for now. I'll try to incorporate your suggestions and upload a draft myself, so we have something more concrete to reason about. |
Follow-on work tracked in #10922 |
Is your feature request related to a problem or challenge?
Related to #10453
There are at least two types of statistics stored in Parquet files
ColumnChunk level statistics (a min/max/null_count per column per row group): RowGroupMetadata --> ColumnChunkMetaData --> Option<&Statistics>
As part of #10453 we have pulled conversion of the ColumnChunk level statistics into StatisticsConverter and #10802 prunes the row groups using this API
It would be good to apply the same treatment to the statistics in the page index
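To make the difference between the two granularities concrete, here is a minimal sketch using only the standard library. The types `MinMax` and `ColumnChunk` and both functions are hypothetical simplifications, not the parquet crate's `Statistics` or `Index` types, and the predicate `value >= threshold` is just an example.

```rust
/// Hypothetical, simplified min/max statistics for an i64 column.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MinMax {
    min: i64,
    max: i64,
}

/// Hypothetical model of one column chunk (one column in one row group).
struct ColumnChunk {
    /// ColumnChunk level statistics: one min/max for the whole chunk.
    statistics: Option<MinMax>,
    /// Page index statistics: one min/max per data page, if known.
    pages: Vec<Option<MinMax>>,
}

/// ColumnChunk level: one minimum per row group, `None` where unknown.
fn row_group_mins(chunks: &[ColumnChunk]) -> Vec<Option<i64>> {
    chunks.iter().map(|c| c.statistics.map(|s| s.min)).collect()
}

/// Page level: which pages might hold a value `>= threshold`.
/// Pages without statistics must be kept, since nothing rules them out.
fn pages_to_scan(chunk: &ColumnChunk, threshold: i64) -> Vec<bool> {
    chunk
        .pages
        .iter()
        .map(|p| p.map_or(true, |s| s.max >= threshold))
        .collect()
}
```

With chunk-level statistics alone, a row group whose max is above the threshold has to be read in full; the page-level statistics allow individual pages inside that row group to be skipped.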
Describe the solution you'd like
ArrayRef from Index in page_filter (source link) to StatisticsConverter (source)
prune_pages_in_one_row_group (source) to use the new StatisticsExtractor code
Describe alternatives you've considered
Additional context
No response