GH-34949: [C++][Parquet] Enable page index by columns #35230
This looks good to me 👍
By the way, can we disable just the column index? I have a case where I don't want to collect the Column Index because statistics are not important to me (I'm sure I'll read the whole file); however, the offset index can still be used.
I have thought about this. However, it would be complex to control ColumnIndex and OffsetIndex separately for individual columns. What about splitting
Well, I think having just the Offset Index can optimize IO, but I don't know what we can do when we only have the Column Index.
Actually, we can achieve it by disabling statistics on all columns. Without column statistics, ColumnIndexes are dropped automatically.
Yes, but it seems a bit tricky here :)
Benchmark runs are scheduled for baseline = f01853d and contender = d2c4c21. d2c4c21 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
…5230) ### Rationale for this change Currently parquet writer only supports enabling page index for all columns. It would be good to enable/disable at the column level as sometimes it may not be useful for some columns but it pays to create them. ### What changes are included in this PR? Similar to `WriterProperties::Builder::enable_dictionary/disable_dictionary`, this patch adds `WriterProperties::Builder::enable_write_page_index/disable_write_page_index` and keep it backward compatible to enable/disable for all columns. ### Are these changes tested? Added `ParquetPageIndexRoundTripTest::EnablePerColumn` to cover the new settings. ### Are there any user-facing changes? Yes, users are now more flexible to enable/disable page index. * Closes: apache#34949 Authored-by: Gang Wu <ustcwg@gmail.com> Signed-off-by: Will Jones <willjones127@gmail.com>
### Rationale for this change

Reduces the time taken for `TypedColumnWriter::WriteBatch`, which regressed with #35230.

### What changes are included in this PR?

This change computes the value for `pages_change_on_record_boundaries` once when a `TypedColumnWriter` is constructed rather than on every call to `WriteBatch`.

### Are these changes tested?

This doesn't change behaviour so should be covered by existing tests.

### Are there any user-facing changes?

No

* Closes: #37453

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
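The follow-up fix described above amounts to hoisting a per-column lookup out of the hot path. The sketch below illustrates that pattern only; `Settings`, `ColumnWriterSketch`, and the `lookups` counter are hypothetical names, and the assumption that the flag is derived from a per-column page index lookup is ours, not something the commit message states.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical settings object (not Arrow's WriterProperties).
// Every query costs a hash-map probe, which the counter makes visible.
struct Settings {
  std::unordered_map<std::string, bool> page_index_by_column;
  mutable int lookups = 0;

  bool page_index_enabled(const std::string& column) const {
    ++lookups;
    auto it = page_index_by_column.find(column);
    return it != page_index_by_column.end() && it->second;
  }
};

// Hypothetical column writer that resolves the flag once, at construction.
class ColumnWriterSketch {
 public:
  ColumnWriterSketch(const Settings& settings, std::string column)
      : column_(std::move(column)),
        // Assumption: the flag depends only on settings that do not change
        // after construction, so caching it is safe.
        pages_change_on_record_boundaries_(
            settings.page_index_enabled(column_)) {}

  // The hot path reads the cached flag instead of querying the settings.
  bool WriteBatch(int /*num_values*/) {
    ++batches_written_;
    return pages_change_on_record_boundaries_;
  }

  bool pages_change_on_record_boundaries() const {
    return pages_change_on_record_boundaries_;
  }
  int batches_written() const { return batches_written_; }

 private:
  // Declared before the flag so it is initialized first.
  std::string column_;
  bool pages_change_on_record_boundaries_;
  int batches_written_ = 0;
};
```

With this shape, writing many batches triggers a single settings lookup per column writer, no matter how often `WriteBatch` is called.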
### Rationale for this change

Currently the Parquet writer only supports enabling the page index for all columns. It would be good to enable or disable it at the column level, because the index may not be useful for some columns yet still costs something to create.

### What changes are included in this PR?

Similar to `WriterProperties::Builder::enable_dictionary/disable_dictionary`, this patch adds `WriterProperties::Builder::enable_write_page_index/disable_write_page_index` and keeps it backward compatible with enabling/disabling for all columns.

### Are these changes tested?

Added `ParquetPageIndexRoundTripTest::EnablePerColumn` to cover the new settings.

### Are there any user-facing changes?

Yes, users now have more flexibility to enable/disable the page index.
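To make the "all columns, with per-column overrides" behaviour concrete, here is a minimal self-contained model of it. It is a sketch, not Arrow's implementation: `PageIndexSettings` is a hypothetical stand-in for `WriterProperties`, the string-path overloads and the `write_page_index` query are assumptions modelled on the description above, and the default of "disabled" is likewise assumed.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for writer properties; NOT the Arrow class.
class PageIndexSettings {
 public:
  class Builder {
   public:
    // No-argument forms set the default used by every column that has
    // no explicit override (the backward-compatible behaviour).
    Builder& enable_write_page_index() {
      default_enabled_ = true;
      return *this;
    }
    Builder& disable_write_page_index() {
      default_enabled_ = false;
      return *this;
    }

    // Per-column forms record an override for a single column path.
    Builder& enable_write_page_index(const std::string& path) {
      overrides_[path] = true;
      return *this;
    }
    Builder& disable_write_page_index(const std::string& path) {
      overrides_[path] = false;
      return *this;
    }

    PageIndexSettings build() const {
      return PageIndexSettings(default_enabled_, overrides_);
    }

   private:
    bool default_enabled_ = false;  // assumed default for this sketch
    std::unordered_map<std::string, bool> overrides_;
  };

  // A column-level override wins; otherwise fall back to the default.
  bool write_page_index(const std::string& path) const {
    auto it = overrides_.find(path);
    return it != overrides_.end() ? it->second : default_enabled_;
  }

 private:
  PageIndexSettings(bool default_enabled,
                    std::unordered_map<std::string, bool> overrides)
      : default_enabled_(default_enabled), overrides_(std::move(overrides)) {}

  bool default_enabled_;
  std::unordered_map<std::string, bool> overrides_;
};
```

For example, `PageIndexSettings::Builder().enable_write_page_index().disable_write_page_index("c1").build()` yields settings in which every column except `c1` writes a page index. Resolving the override at query time keeps the result independent of the order in which the default and the per-column calls were made.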