-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Parquet thrift generated structures #6045
Conversation
@@ -3417,10 +3565,15 @@ pub struct ColumnMetaData { | |||
/// Writers should write this field so readers can read the bloom filter | |||
/// in a single I/O. | |||
pub bloom_filter_length: Option<i32>, | |||
/// Optional statistics to help estimate total memory when converted to in-memory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 https://docs.rs/parquet/latest/parquet/format/struct.SortingColumn.html appears to be publically exposed (and thus this is technically a breaking API change).
Thus I think that means we couldn't merge this until August when master opens for breaking changes https://github.com/apache/arrow-rs/blob/master/CONTRIBUTING.md#breaking-changes
However, managing stacked PRs is going to get painful if we have to keep them open for an extended period of time.
THus, despite my comments on #5486 (review) to the contrary I think it would be ok and to have the whole feature in a single PR.
So perhaps we can close this PR.
Update: see below for an alternate suggestion
Sorry about that @etseidl . I am still getting used to the relatively new "breaking API" workflow in this repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative would be to make a feature branch that we could target / review PRs to, and then I could merge the feature branch to main when ready. @XiangpengHao, @Weijun-H, and I took this approach for StringView in DataFusion and it worked well in my opinion
What worked well, is that it enabled me to review the PRs faster (as smaller PRs require less contiguous time for review, and thus I am able to do them faster)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are willing to go this route, I made a branch to try: https://github.com/apache/arrow-rs/tree/parquet-size-statistics (we would change this PR to target that branch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like it! I'll retarget this PR. Thanks!
What do you think about a more substantial feature branch? #6050 |
Per discussion on #6050 I am going to merge this branch to the |
Thanks again for your patience @etseidl -- I have started merging things to |
* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` Signed-off-by: Bugen Zhao <i@bugenzhao.com> * fix example tests Signed-off-by: Bugen Zhao <i@bugenzhao.com> --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> * Remove `impl<T: AsRef<[u8]>> From<T> for Buffer` that easily accidentally copies data (#6043) * deprecate auto copy, ask explicit reference * update comments * make cargo doc happy * Make display of interval types more pretty (#6006) * improve dispaly for interval. * update test in pretty, and fix display problem. * tmp * fix tests in arrow-cast. * fix tests in pretty. * fix style. * Update snafu (#5930) * Update Parquet thrift generated structures (#6045) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * Revert "Revert "Write Bloom filters between row groups instead of the end (#…" (#5933) This reverts commit 22e0b44. * Revert "Update snafu (#5930)" (#6069) This reverts commit 756b1fb. * Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075) * Update pyo3 requirement from 0.21.1 to 0.22.1 Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.1) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * refactor: remove deprecated `FromPyArrow::from_pyarrow` "GIL Refs" are being phased out. * chore: update `pyo3` in integration tests --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * remove repeated codes to make the codes more concise. (#6080) * Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * add support for unencoded_byte_array_data_bytes * add comments * change sig of ColumnMetrics::update_variable_length_bytes() * rename ParquetOffsetIndex to OffsetSizeIndex * rename some functions * suggestion from review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add Default trait to ColumnMetrics as suggested in review * rename OffsetSizeIndex to OffsetIndexMetaData --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085) Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.2) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095) * deprecate read_page_locations * add to_thrift() to OffsetIndexMetaData * Update parquet/src/column/writer/mod.rs Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com> Co-authored-by: kamille <caoruiqiu.crq@antgroup.com> Co-authored-by: Jesse <github@jessebakker.com> Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> Co-authored-by: Marco Neumann <marco@crepererum.net> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…aData` (#6105) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` Signed-off-by: Bugen Zhao <i@bugenzhao.com> * fix example tests Signed-off-by: Bugen Zhao <i@bugenzhao.com> --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> * Remove `impl<T: AsRef<[u8]>> From<T> for Buffer` that easily accidentally copies data (#6043) * deprecate auto copy, ask explicit reference * update comments * make cargo doc happy * Make display of interval types more pretty (#6006) * improve dispaly for interval. * update test in pretty, and fix display problem. * tmp * fix tests in arrow-cast. * fix tests in pretty. * fix style. * Update snafu (#5930) * Update Parquet thrift generated structures (#6045) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * Revert "Revert "Write Bloom filters between row groups instead of the end (#…" (#5933) This reverts commit 22e0b44. * Revert "Update snafu (#5930)" (#6069) This reverts commit 756b1fb. * Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075) * Update pyo3 requirement from 0.21.1 to 0.22.1 Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.1) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * refactor: remove deprecated `FromPyArrow::from_pyarrow` "GIL Refs" are being phased out. * chore: update `pyo3` in integration tests --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * remove repeated codes to make the codes more concise. (#6080) * Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * add support for unencoded_byte_array_data_bytes * add comments * change sig of ColumnMetrics::update_variable_length_bytes() * rename ParquetOffsetIndex to OffsetSizeIndex * rename some functions * suggestion from review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add Default trait to ColumnMetrics as suggested in review * rename OffsetSizeIndex to OffsetIndexMetaData --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * deprecate read_page_locations * add level histograms to metadata * add to_thrift() to OffsetIndexMetaData * Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085) Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.2) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095) * deprecate read_page_locations * add to_thrift() to OffsetIndexMetaData * move valid test into ColumnIndexBuilder::append_histograms * move update_histogram() inside ColumnMetrics * Update parquet/src/column/writer/mod.rs Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> * Implement LevelHistograms as a struct * formatting * fix error in docs --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com> Co-authored-by: kamille <caoruiqiu.crq@antgroup.com> Co-authored-by: Jesse <github@jessebakker.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Marco Neumann <marco@crepererum.net> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…the footer (#6117) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` Signed-off-by: Bugen Zhao <i@bugenzhao.com> * fix example tests Signed-off-by: Bugen Zhao <i@bugenzhao.com> --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> * Remove `impl<T: AsRef<[u8]>> From<T> for Buffer` that easily accidentally copies data (#6043) * deprecate auto copy, ask explicit reference * update comments * make cargo doc happy * Make display of interval types more pretty (#6006) * improve dispaly for interval. * update test in pretty, and fix display problem. * tmp * fix tests in arrow-cast. * fix tests in pretty. * fix style. * Update snafu (#5930) * Update Parquet thrift generated structures (#6045) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * Revert "Revert "Write Bloom filters between row groups instead of the end (#…" (#5933) This reverts commit 22e0b44. * Revert "Update snafu (#5930)" (#6069) This reverts commit 756b1fb. * Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075) * Update pyo3 requirement from 0.21.1 to 0.22.1 Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.1) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * refactor: remove deprecated `FromPyArrow::from_pyarrow` "GIL Refs" are being phased out. * chore: update `pyo3` in integration tests --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * remove repeated codes to make the codes more concise. (#6080) * Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * add support for unencoded_byte_array_data_bytes * add comments * change sig of ColumnMetrics::update_variable_length_bytes() * rename ParquetOffsetIndex to OffsetSizeIndex * rename some functions * suggestion from review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add Default trait to ColumnMetrics as suggested in review * rename OffsetSizeIndex to OffsetIndexMetaData --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085) Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.2) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095) * deprecate read_page_locations * add to_thrift() to OffsetIndexMetaData * no longer write inline column metadata * Update parquet/src/column/writer/mod.rs Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> * suggestion from review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add some more documentation * remove write_metadata from PageWriter --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com> Co-authored-by: kamille <caoruiqiu.crq@antgroup.com> Co-authored-by: Jesse <github@jessebakker.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Marco Neumann <marco@crepererum.net> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041) * bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` Signed-off-by: Bugen Zhao <i@bugenzhao.com> * fix example tests Signed-off-by: Bugen Zhao <i@bugenzhao.com> --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> * Remove `impl<T: AsRef<[u8]>> From<T> for Buffer` that easily accidentally copies data (#6043) * deprecate auto copy, ask explicit reference * update comments * make cargo doc happy * Make display of interval types more pretty (#6006) * improve dispaly for interval. * update test in pretty, and fix display problem. * tmp * fix tests in arrow-cast. * fix tests in pretty. * fix style. * Update snafu (#5930) * Update Parquet thrift generated structures (#6045) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * Revert "Revert "Write Bloom filters between row groups instead of the end (#…" (#5933) This reverts commit 22e0b44. * Revert "Update snafu (#5930)" (#6069) This reverts commit 756b1fb. * Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075) * Update pyo3 requirement from 0.21.1 to 0.22.1 Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.1) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * refactor: remove deprecated `FromPyArrow::from_pyarrow` "GIL Refs" are being phased out. * chore: update `pyo3` in integration tests --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * remove repeated codes to make the codes more concise. (#6080) * Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068) * update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * add support for unencoded_byte_array_data_bytes * add comments * change sig of ColumnMetrics::update_variable_length_bytes() * rename ParquetOffsetIndex to OffsetSizeIndex * rename some functions * suggestion from review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add Default trait to ColumnMetrics as suggested in review * rename OffsetSizeIndex to OffsetIndexMetaData --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085) Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.2) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095) * deprecate read_page_locations * add to_thrift() to OffsetIndexMetaData * Update parquet/src/column/writer/mod.rs Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> * Upgrade protobuf definitions to flightsql 17.0 (#6133) * Update FlightSql.proto to version 17.0 Adds new message CommandStatementIngest and removes `experimental` from other messages. * Regenerate flight sql protocol This upgrades the file to version 17.0 of the protobuf definition. * Add `ParquetMetadataWriter` allow ad-hoc encoding of `ParquetMetadata` * fix loading in test by etseidl Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> * add rough equivalence test * one more check * make clippy happy * separate tests that require arrow into a separate module * add histograms to to_thrift() --------- Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com> Co-authored-by: kamille <caoruiqiu.crq@antgroup.com> Co-authored-by: Jesse <github@jessebakker.com> Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Marco Neumann <marco@crepererum.net> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Douglas Anderson <djanderson@users.noreply.github.com> Co-authored-by: Ed Seidl <etseidl@live.com>
Which issue does this PR close?
Related to #5022.
Rationale for this change
Per #5486 (review), splitting the
format.rs
changes out into a separate PR as a precursor to adding PARQUET-2261 support.What changes are included in this PR?
Updates the thrift revision to the latest in the parquet-format repository.
In addition to regenerating
format.rs
,None
is passed for optional SizeStatistics members that are not yet ready to be provided.Are there any user-facing changes?
No, as thrift structure are not exposed.Yes, the format module is public, and this makes breaking changes to that module.