Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-34949: [C++][Parquet] Enable page index by columns #35230

Merged
merged 1 commit into from
Apr 19, 2023

Conversation

wgtmac
Copy link
Member

@wgtmac wgtmac commented Apr 19, 2023

Rationale for this change

Currently parquet writer only supports enabling page index for all columns. It would be good to enable/disable at the column level as sometimes it may not be useful for some columns but it pays to create them.

What changes are included in this PR?

Similar to WriterProperties::Builder::enable_dictionary/disable_dictionary, this patch adds WriterProperties::Builder::enable_write_page_index/disable_write_page_index and keep it backward compatible to enable/disable for all columns.

Are these changes tested?

Added ParquetPageIndexRoundTripTest::EnablePerColumn to cover the new settings.

Are there any user-facing changes?

Yes, users are now more flexible to enable/disable page index.

@wgtmac wgtmac requested a review from wjones127 as a code owner April 19, 2023 06:46
@github-actions
Copy link

Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me 👍

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Apr 19, 2023
@wjones127 wjones127 merged commit d2c4c21 into apache:main Apr 19, 2023
@mapleFU
Copy link
Member

mapleFU commented Apr 21, 2023

By the way, can we just disable the column index?

I've a case that I don't want to collect the Column Index, because statistics is not important for me. ( I'm sure I'll read the whole file), however, offset index can be used.

@wgtmac
Copy link
Member Author

wgtmac commented Apr 21, 2023

By the way, can we just disable the column index?

I've a case that I don't want to collect the Column Index, because statistics is not important for me. ( I'm sure I'll read the whole file), however, offset index can be used.

I have thought about this. However, it would be complex if we want to control ColumnIndex and OffsetIndex separately for individual columns. What about splitting WriterProperties::Builder::enable_write_page_index/disable_write_page_index into WriterProperties::Builder::enable_write_column_index/disable_write_column_index and WriterProperties::Builder::enable_write_offset_index/disable_write_offset_index?

@mapleFU
Copy link
Member

mapleFU commented Apr 21, 2023

Well, I think just having Offset Index can optimizing IO, but I don't know how can we do when we only have Column Index

@wgtmac
Copy link
Member Author

wgtmac commented Apr 21, 2023

Actually we can archive it by disabling statistics on all columns. Without column statistics, ColumnIndexes are dropped automatically.

@mapleFU
Copy link
Member

mapleFU commented Apr 21, 2023

Yes, but seems it's a bit trickey here :)
Maybe we can document it?

@ursabot
Copy link

ursabot commented Apr 21, 2023

Benchmark runs are scheduled for baseline = f01853d and contender = d2c4c21. d2c4c21 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️5.36% ⬆️0.26%] ursa-i9-9960x
[Finished ⬇️0.45% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] d2c4c212 ec2-t3-xlarge-us-east-2
[Failed] d2c4c212 test-mac-arm
[Finished] d2c4c212 ursa-i9-9960x
[Finished] d2c4c212 ursa-thinkcentre-m75q
[Finished] f01853d7 ec2-t3-xlarge-us-east-2
[Failed] f01853d7 test-mac-arm
[Finished] f01853d7 ursa-i9-9960x
[Finished] f01853d7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Apr 21, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
…5230)

### Rationale for this change

Currently parquet writer only supports enabling page index for all columns. It would be good to enable/disable at the column level as sometimes it may not be useful for some columns but it pays to create them.

### What changes are included in this PR?

Similar to `WriterProperties::Builder::enable_dictionary/disable_dictionary`, this patch adds `WriterProperties::Builder::enable_write_page_index/disable_write_page_index` and keep it backward compatible to enable/disable for all columns.

### Are these changes tested?

Added `ParquetPageIndexRoundTripTest::EnablePerColumn` to cover the new settings.

### Are there any user-facing changes?

Yes, users are now more flexible to enable/disable page index.
* Closes: apache#34949

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…5230)

### Rationale for this change

Currently parquet writer only supports enabling page index for all columns. It would be good to enable/disable at the column level as sometimes it may not be useful for some columns but it pays to create them.

### What changes are included in this PR?

Similar to `WriterProperties::Builder::enable_dictionary/disable_dictionary`, this patch adds `WriterProperties::Builder::enable_write_page_index/disable_write_page_index` and keep it backward compatible to enable/disable for all columns.

### Are these changes tested?

Added `ParquetPageIndexRoundTripTest::EnablePerColumn` to cover the new settings.

### Are there any user-facing changes?

Yes, users are now more flexible to enable/disable page index.
* Closes: apache#34949

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this pull request May 16, 2023
…5230)

### Rationale for this change

Currently parquet writer only supports enabling page index for all columns. It would be good to enable/disable at the column level as sometimes it may not be useful for some columns but it pays to create them.

### What changes are included in this PR?

Similar to `WriterProperties::Builder::enable_dictionary/disable_dictionary`, this patch adds `WriterProperties::Builder::enable_write_page_index/disable_write_page_index` and keep it backward compatible to enable/disable for all columns.

### Are these changes tested?

Added `ParquetPageIndexRoundTripTest::EnablePerColumn` to cover the new settings.

### Are there any user-facing changes?

Yes, users are now more flexible to enable/disable page index.
* Closes: apache#34949

Authored-by: Gang Wu <ustcwg@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
pitrou pushed a commit that referenced this pull request Aug 30, 2023
### Rationale for this change

Reduces the time taken for `TypedColumnWriter::WriteBatch`, which regressed with #35230 

### What changes are included in this PR?

This change computes the value for `pages_change_on_record_boundaries` once when a `TypedColumnWriter` is constructed rather than on every call to `WriteBatch`.

### Are these changes tested?

This doesn't change behaviour so should be covered by existing tests.

### Are there any user-facing changes?

No
* Closes: #37453

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…#37454)

### Rationale for this change

Reduces the time taken for `TypedColumnWriter::WriteBatch`, which regressed with apache#35230 

### What changes are included in this PR?

This change computes the value for `pages_change_on_record_boundaries` once when a `TypedColumnWriter` is constructed rather than on every call to `WriteBatch`.

### Are these changes tested?

This doesn't change behaviour so should be covered by existing tests.

### Are there any user-facing changes?

No
* Closes: apache#37453

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…#37454)

### Rationale for this change

Reduces the time taken for `TypedColumnWriter::WriteBatch`, which regressed with apache#35230 

### What changes are included in this PR?

This change computes the value for `pages_change_on_record_boundaries` once when a `TypedColumnWriter` is constructed rather than on every call to `WriteBatch`.

### Are these changes tested?

This doesn't change behaviour so should be covered by existing tests.

### Are there any user-facing changes?

No
* Closes: apache#37453

Authored-by: Adam Reeve <adreeve@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Parquet] Parquet writer should control set of columns to enable page index
4 participants