Efficiently and correctly Extract Page Index statistics into `ArrayRef`s #10806

alamb · 2024-06-05T13:28:54Z

Is your feature request related to a problem or challenge?

Related to #10453

There are at least two types of statistics stored in Parquet files:

ColumnChunk level statistics (a min/max/null_count per column per row group): RowGroupMetaData --> ColumnChunkMetaData --> Option<&Statistics>
"Page Index" statistics (stored per page; there may be more than one page per column per row group): ColumnChunkMetaData --> read_columns_indexes --> Vec<Index>

As part of #10453 we have pulled conversion of the ColumnChunk level statistics into StatisticsConverter, and #10802 prunes the row groups using this API.

It would be good to apply the same treatment to the statistics in the page index.

Describe the solution you'd like

Add a clear API to efficiently extract page statistics outside of DataFusion
Ensure that API is well tested
Ensure the API is fast

Describe alternatives you've considered

Move / refactor the code to extract ArrayRef from Index in page_filter (source link) to StatisticsConverter (source)
Update the tests in arrow_statistics (source) to also verify that the page statistics are correct (I believe the page min/maxes should be the same as the row group min/maxes)
Update the parquet code prune_pages_in_one_row_group (source) to use the new StatisticsConverter code
Update the benchmark (source) for extracting page statistics and use that to ensure the statistics extraction code is reasonably performant
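One way to read the consistency check suggested for the tests is that folding the per-page statistics of a column chunk should reproduce the row group level statistics. The sketch below illustrates that idea with plain Rust types. `PageStats`, `ChunkStats`, `aggregate_pages`, and `pages_match_chunk` are hypothetical names invented for illustration, not types or functions from the parquet crate or DataFusion, and the sketch assumes every page carries statistics.

```rust
/// Hypothetical stand-in for one data page's statistics (not a parquet crate type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PageStats {
    pub min: i64,
    pub max: i64,
    pub null_count: u64,
}

/// Hypothetical stand-in for column chunk (row group level) statistics.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ChunkStats {
    pub min: i64,
    pub max: i64,
    pub null_count: u64,
}

/// Fold per-page statistics into the values the enclosing column chunk
/// would be expected to report. Returns `None` when there are no pages.
pub fn aggregate_pages(pages: &[PageStats]) -> Option<ChunkStats> {
    let first = pages.first()?;
    let init = ChunkStats { min: first.min, max: first.max, null_count: 0 };
    Some(pages.iter().fold(init, |acc, p| ChunkStats {
        min: acc.min.min(p.min),
        max: acc.max.max(p.max),
        null_count: acc.null_count + p.null_count,
    }))
}

/// True when the page-level statistics agree with the chunk-level ones.
pub fn pages_match_chunk(pages: &[PageStats], chunk: &ChunkStats) -> bool {
    aggregate_pages(pages).as_ref() == Some(chunk)
}
```

When a test file has exactly one page per row group, the fold is trivial and the page values equal the row group values directly.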

Additional context

No response


marvinlanhenke · 2024-06-07T14:11:25Z

@alamb
I was briefly looking at this, trying to understand what's needed here.

Do we already have a helper fn in place to write a parquet file with Page Index statistics? While I was "prototyping" I tried to get the metadata.column_index() by using the existing make_test_file_rg, but it seems page index stats are not written (None)?

I'll keep on looking - but perhaps you have a quick pointer here, where to look?

alamb · 2024-06-07T14:35:05Z

Thanks @marvinlanhenke 🙏

To write the relevant structures into Parquet, the statistics_enabled setting needs to be Page.

To read them back, I think the reader needs to be configured with with_page_index.

Also I have a proposed change to the Statistics code in #10802

If that gets merged, then the API for extracting the mins from data pages might look like

        // get relevant index statistics somehow
        let data_page_statistics: Vec<&Statistics> = todo!();
        let converter = StatisticsConverter::try_new(
            column_name,
            reader.schema(),
            reader.parquet_schema(),
        );
        // get mins from the ColumnIndex
        let mins = converter.column_index_mins(data_page_statistics).unwrap();

marvinlanhenke · 2024-06-07T15:23:40Z

The proposed API looks nice 👌 Until the merge I can use the time to explore and prototype. Thanks for the pointers.

marvinlanhenke · 2024-06-09T20:33:08Z

@alamb
I'm currently thinking about how to integrate StatisticsConverter with the existing code prune_pages_in_one_row_group.

This is what I originally had in mind for the converter method:

    pub fn column_index_mins(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> {
        let data_type = self.arrow_field.data_type();

        let Some(parquet_column_index) = metadata.column_index() else {
            return Ok(self.make_null_array(data_type, metadata.row_groups()));
        };

        let Some(parquet_index) = self.parquet_index else {
            return Ok(self.make_null_array(data_type, metadata.row_groups()));
        };

        let row_group_page_indices = parquet_column_index
            .into_iter()
            .map(|x| x.get(parquet_index));
        min_page_statistics(Some(data_type), row_group_page_indices)
    }

So we would simply create an iterator over the column indices of all row groups, pick out the index for the requested column, and apply the statistics function. This works: all tests are passing.
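The flattening step described here can be modelled without the parquet crate. In the sketch below, `PageMins` and the nested `Vec` layout are hypothetical stand-ins for the real column index, and a plain `Vec<Option<i64>>` stands in for the `ArrayRef` result. Emitting a single null entry for a row group whose column has no page index is an assumption of this sketch, not something confirmed in this thread.

```rust
/// Hypothetical stand-in for the page index of one column in one row group:
/// one optional minimum per data page (not a parquet crate type).
pub type PageMins = Vec<Option<i64>>;

/// Collect the per-page minimums of column `column_idx` across all row groups
/// into one flat vector, a stand-in for the `ArrayRef` a real method would return.
///
/// `column_index[rg][col]` is `None` when that column has no page index.
/// A missing index contributes a single null entry (an assumption of this sketch).
pub fn column_index_mins(
    column_index: &[Vec<Option<PageMins>>],
    column_idx: usize,
) -> Vec<Option<i64>> {
    let mut mins = Vec::new();
    for row_group in column_index {
        match row_group.get(column_idx) {
            // Page index present: append one entry per data page.
            Some(Some(pages)) => mins.extend(pages.iter().copied()),
            // Column out of range or no page index for this row group.
            _ => mins.push(None),
        }
    }
    mins
}
```

The result loses the row group boundaries, which is one reason the existing per-row-group pruning code does not map onto it directly.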

However, the API, or the integration with prune_pages_in_one_row_group feels kind of strange:

a lot of the work the StatisticsConverter does is already done here
we already iterate over each row_group individually, extracting a single Option<&Index> here and passing it into prune_pages_in_one_row_group

Now, my API has to change. I'm wondering how specific it should be?
If we pass &Index as a parameter, I can match the index and extract the statistic as done here. However, I'm not sure this is the way to go. We'd simply move the get_min_max_values_for_page_index method, and basically have no need for the StatisticsConverter?

Maybe I'm missing something, but I think it would help to maybe outline the scope of the refactor you had in mind.

alamb · 2024-06-09T21:01:56Z

Thank you @marvinlanhenke -- excellent analysis.

> a lot of the work the StatisticsConverter does is already done here

Yes. My eventual goal is for all of the code in page_filter.rs that converts Index to ArrayRef to be gone, so that page_filter.rs only calls StatisticsConverter.

To avoid a massive PR, however, I think it makes sense to add new code to StatisticsConverter for extracting page values, and then when it is complete enough switch page_filter.rs to use StatisticsConverter

> we already iterate over each row_group individually, extracting a single Option<&Index> here and passing it into prune_pages_in_one_row_group

Indeed, that is how it works today (one row group at a time). I eventually hope/plan to apply the same treatment to data page filtering as I did to row group filtering in #10802 (that is, make a single call to PruningPredicate::prune for all the remaining row groups).
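To make the difference between pruning one row group at a time and making a single call for all remaining row groups concrete, here is a rough model using only the standard library. It is not the PruningPredicate API: `prune_flat`, `prune_all`, and `RowGroupPageMaxes` are hypothetical names, and the predicate is fixed to `col > threshold` for simplicity.

```rust
/// Hypothetical per-page maximums for one column, grouped by row group.
/// `None` means the statistic is unknown for that page.
pub type RowGroupPageMaxes = Vec<Vec<Option<i64>>>;

/// Evaluate `col > threshold` against one flat array of page maximums.
/// Returns `true` for pages that must be kept. A page can be skipped only
/// when its max is known and `max <= threshold`.
pub fn prune_flat(maxes: &[Option<i64>], threshold: i64) -> Vec<bool> {
    maxes
        .iter()
        .map(|m| m.map_or(true, |max| max > threshold))
        .collect()
}

/// Single-call variant: flatten every row group, prune once, then split the
/// keep flags back into one vector per row group.
pub fn prune_all(groups: &RowGroupPageMaxes, threshold: i64) -> Vec<Vec<bool>> {
    let flat: Vec<Option<i64>> = groups.iter().flatten().copied().collect();
    let mut keep = prune_flat(&flat, threshold).into_iter();
    groups
        .iter()
        .map(|g| keep.by_ref().take(g.len()).collect())
        .collect()
}
```

Both routes give the same keep flags; the single-call form evaluates the predicate once over a larger array instead of once per row group.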

> Now, my API has to change. I'm wondering how specific it should be? If we pass &Index as a parameter, I can match the index and extract the statistic as done here. However, I'm not sure this is the way to go. We'd simply move the get_min_max_values_for_page_index method, and basically have no need for the StatisticsConverter?

Let me play around with some options and get back to you.

alamb · 2024-06-09T21:31:33Z

@marvinlanhenke -- I whipped up something (actually I had been playing with it yesterday) #10843

marvinlanhenke · 2024-06-09T22:24:06Z

Thank you so much - I quickly skimmed the draft you uploaded (will take a closer look tomorrow). My main question should be answered - for now we are iterating over each row group one by one using a row group index.

I also agree about the scope for now.
However, now I can see the overall picture / direction somewhat clearer, thanks for explaining that.

I'll try to incorporate your suggestions and upload a draft myself, so we have something more concrete to reason about.

alamb · 2024-06-17T15:49:56Z

Follow on work tracked in #10922

alamb added the enhancement New feature or request label Jun 5, 2024

alamb mentioned this issue Jun 5, 2024

[EPIC] Efficiently and correctly extract parquet statistics into ArrayRefs #10453

Closed

23 tasks

alamb changed the title ~~Efficiently and corerctly Extract Page Index statistics into ArrayRefs~~ Efficiently and correctly Extract Page Index statistics into ArrayRefs Jun 5, 2024

alamb mentioned this issue Jun 9, 2024

WIP Prototype DataPage extraction API #10843

Closed

marvinlanhenke mentioned this issue Jun 10, 2024

Initial Extract parquet data page statistics API #10852

Merged

This was referenced Jun 11, 2024

DataFusion weekly project plan (Andrew Lamb) - June 10, 2024 #10869

Closed

[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs #10922

Closed

alamb closed this as completed in #10852 Jun 15, 2024
