Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficiently and correctly Extract Page Index statistics into ArrayRefs #10806

Closed
Tracked by #10453
alamb opened this issue Jun 5, 2024 · 8 comments · Fixed by #10852
Closed
Tracked by #10453

Efficiently and correctly Extract Page Index statistics into ArrayRefs #10806

alamb opened this issue Jun 5, 2024 · 8 comments · Fixed by #10852
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 5, 2024

Is your feature request related to a problem or challenge?

Related to #10453

There are at least two types of statistics stored in Parquet files

  1. ColumnChunk level statistics (a min/max/null_count per column per row group): RowGroupMetadata --> ColumnChunkMetaData --> Option<&Statistics>
  2. "Page Index" statistics (stored per page, may be more than one page per column per row group): ColumnChunkMetaData --> read_columns_indexes --> Vec<Index>

As part of #10453 we have pulled conversion of the ColumnChunk level statistics into StatisticsConverter and #10802 prunes the row groups using this API

It would be good to apply the same treatment to the statistics in the page index

Describe the solution you'd like

  1. Add a clear API to efficiently extract page statistics outside of DataFusion
  2. Ensure that API is well tested
  3. Ensure the API is fast

Describe alternatives you've considered

  1. Move / refactor the code to extract ArrayRef from Index in page_filter (source link) to StatisticsConverter (source)
  2. Update the tests in arrow_statistics (source) to also verify that the page statistics are correct (I believe the page min/maxes should be the same as the row group min/maxes)
  3. Update the parquet code prune_pages_in_one_row_group (source) to use the new StatisticsExtractor code
  4. Update the benchmark (source) for extracting page statistics and use that to ensure the statistics extraction code is reasonably performant

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jun 5, 2024
@alamb alamb changed the title Efficiently and corerctly Extract Page Index statistics into ArrayRefs Efficiently and correctly Extract Page Index statistics into ArrayRefs Jun 5, 2024
@marvinlanhenke
Copy link
Contributor

marvinlanhenke commented Jun 7, 2024

@alamb
I was briefly looking at this, trying to understand whats needed here.

Do we already have a helper fn at place to write a parquet file with Page Index statistics? While I was "prototyping" I tried to get the metadata.column_index() by using the existing make_test_file_rg - but it seems page index stats are not written (None)?

I'll keep on looking - but perhaps you have a quick pointer here, where to look?

@alamb
Copy link
Contributor Author

alamb commented Jun 7, 2024

Thanks @marvinlanhenke 🙏

To write the relevant structues into Parquet, the statistics_enable field needs to be Page

To read them back, the reader needs to be configured with with_page_index I think

Also I have a proposed change to the Statistics code in #10802

If that gets merged, then the API for extracting the mins from data pages might look like

        // get relevant index statistics somehow
        let data_page_statatistics: Vec<&Statistics> = todo!();
        let converter = StatisticsConverter::try_new(
            column_name,
            reader.schema(),
            reader.parquet_schema(),
        );
        // get mins from the ColumnIndex
        let mins = converter.column_index_mins(data_page_statatistics).unwrap();

@marvinlanhenke
Copy link
Contributor

The proposed Api looks nice 👌Until the merge I can use the time to explore and prototype. Thanks for the pointers

@marvinlanhenke
Copy link
Contributor

marvinlanhenke commented Jun 9, 2024

@alamb
I'm currently thinking about how to integrate StatisticsConverter with the existing code prune_pages_in_one_row_group.

This is what I originally had in mind for the converter method:

    pub fn column_index_mins(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> {
        let data_type = self.arrow_field.data_type();

        let Some(parquet_column_index) = metadata.column_index() else {
            return Ok(self.make_null_array(data_type, metadata.row_groups()));
        };

        let Some(parquet_index) = self.parquet_index else {
            return Ok(self.make_null_array(data_type, metadata.row_groups()));
        };

        let row_group_page_indices = parquet_column_index
            .into_iter()
            .map(|x| x.get(parquet_index));
        min_page_statistics(Some(data_type), row_group_page_indices)
    }

So we would simply create an iterator for all row group's column indices, match the index and apply the statsfunc. Which works - all tests are passing.

However, the API, or the integration with prune_pages_in_one_row_group feels kind of strange:

  • a lot of work the StatisticConverter does is already done here
  • we already iterate over each row_group individually, extracting a single Option<&Index> here and passing it into prune_pages_per_one_row_group

Now, my API has to change. I'm wondering how specific it should be?
If we pass &Index as a parameter, I can match the index and extract the statistic as done here. However, I'm not sure this is the way to go. We'd simply move the get_min_max_values_for_page_index method, and basically have no need for the StatisticConverter?

Maybe I'm missing something, but I think it would help to maybe outline the scope of the refactor you had in mind.

@alamb
Copy link
Contributor Author

alamb commented Jun 9, 2024

Thank you @marvinlanhenke -- excellent analysis.

  • a lot of work the StatisticConverter does is already done here

Yes. It is my eventual goal for all of the code to convert Index to ArrayRef in page_filter.rs is gone and page_filter.rs only calls StatisticsConverter.

To avoid a massive PR, however, I think it makes sense to add new code to StatisticsConverter for extracting page values, and then when it is complete enough switch page_filter.rs to use StatisticsConverter

  • we already iterate over each row_group individually, extracting a single Option<&Index> here and passing it into prune_pages_per_one_row_group

Indeed that is how it works today (one row group at a time). I eventually hope/plan to apply the same treatment to data page filtering as I did to row group filtering in #10802 (that is, make a single call to PruningPredicate::prune for the all the remaining row groups.

Now, my API has to change. I'm wondering how specific it should be? If we pass &Index as a parameter, I can match the index and extract the statistic as done here. However, I'm not sure this is the way to go. We'd simply move the get_min_max_values_for_page_index method, and basically have no need for the StatisticConverter?

let me play around with some options and get back to you

@alamb
Copy link
Contributor Author

alamb commented Jun 9, 2024

@marvinlanhenke -- I whipped up something (actually I had been playing with it yesterday) #10843

@marvinlanhenke
Copy link
Contributor

marvinlanhenke commented Jun 9, 2024

Thank you so much - I quickly skimmed the draft you uploaded (will take a closer look tomorrow). My main question should be answered - for now we are iterating over each row group one by one using a row group index.

I also agree about the scope for now.
However, now I can see the overall picture / direction somewhat clearer, thanks for explaining that.

I'll try to incorporate your suggestions and upload a draft myself, so we have something more concrete to reason about.

@alamb
Copy link
Contributor Author

alamb commented Jun 17, 2024

Follow on work tracked in #10922

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants