Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add unencoded_byte_array_data_bytes to ParquetMetaData #6068

Merged
merged 19 commits into from
Jul 19, 2024

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jul 16, 2024

Note targets the 53.0.0-dev branch

Which issue does this PR close?

Related to #5022

Rationale for this change

First step in adding new statistics defined in PARQUET-2261.

What changes are included in this PR?

This PR makes modifications to collect the unencoded sizes of BYTE_ARRAY columns. Adds a new OffsetSizeIndex structure to represent the Thrift OffsetIndex struct, which comprises and array of PageLocations and an optional array of unencoded_byte_array_data_bytes. This also adds unencoded_byte_array_data_bytes to the ParquetMetaData struct (as a precursor to redefining ParquetOffsetIndex in terms of OffsetSizeIndex).

Are there any user-facing changes?

Yes.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 16, 2024
@alamb alamb added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Jul 17, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl -- I think this PR is quite close

I left some API and naming suggestions (specifically the struct name OffsetSizeIndex I found quite confusing)

I also think that a round trip test (that writes data to parquet, and then reads the file back in and verifies that the metadata is written correctly (e.g uncencoded_byte_array_data_bytes is present) would be important Is this something you plan to do?

parquet/src/column/writer/mod.rs Show resolved Hide resolved
@@ -956,6 +963,10 @@ pub(crate) mod private {
Ok(num_values)
}

fn variable_length_bytes(values: &[Self]) -> Option<i64> {
Some(values.iter().map(|x| x.len() as i64).sum())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked and this is a &[ByteArray] so it is summing up the lengths of the buffers, not the lengths of the individual strings 👍

parquet/src/file/metadata/mod.rs Show resolved Hide resolved
@@ -109,7 +146,13 @@ pub fn read_pages_locations<R: ChunkReader>(
.collect()
}

pub(crate) fn decode_offset_index(data: &[u8]) -> Result<Vec<PageLocation>, ParquetError> {
pub(crate) fn decode_offset_index(data: &[u8]) -> Result<OffsetSizeIndex, ParquetError> {
let mut prot = TCompactSliceInputProtocol::new(data);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions both seem to decode the OffsetIndex which I think means it happens twice 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is duplication, but I think read_pages_locations/decode_page_locations will now only be called in test code. My next PR is to switch over to read_offset_index everywhere and deprecate the versions that only read the PageLocations.

/// [`OffsetIndex`] information for a column chunk. Contains offsets and sizes for each page
/// in the chunk. Optionally stores fully decoded page sizes for BYTE_ARRAY columns.
#[derive(Debug, Clone, PartialEq)]
pub struct OffsetSizeIndex {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this called OffsetSizeIndex?

It seems like the corresponding thrift structure is called "OffsetIndex"

https://github.com/apache/parquet-format/blob/abb92e61e9756016c08401fef729d2ace187d184/src/main/thrift/parquet.thrift#L1031

What do you think about calling it ParquetOffsetIndex to follow the convention of ParquetMetaData (though I will freely admit the naming of these structures makes it very hard to reason about)

Copy link
Contributor Author

@etseidl etseidl Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was ParquetOffsetIndex in my original PR, but then that collides with the typedef in file/metadata/mod.rs. As I write this, I'm wondering if we need this struct at all. The current read_pages_locations returns the thrift struct. OffsetIndex has the same data members as OffsetSizeIndex, so is it worth defining a new struct? It seems there's precedent for using the thrift objects from format elsewhere in the metadata.

Failing that, how about OffsetIndexMetaData? That seems more in line with the naming of the rest of the footer objects in file::metadata.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OffsetIndex has the same data members as OffsetSizeIndex, so is it worth defining a new struct? It seems there's precedent for using the thrift objects from format elsewhere in the metadata.

I think we should define a new Rust struct so that we aren't tied to whatever thrift generates.

The existing naming of the metadata structs is terribly confusing in my mind, so whatever we can do to make it better (at least not worse) would be good

Let's go with OffsetIndexMetaData for now, and maybe we can deprecate the existing ParquetOffsetIndex typedef in file/metadata/mod.rs with a note that we eventually plan to rename OffsetIndexMetaData to ParquetOffsetIndex

///
/// [Column Index Documentation]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
/// [Page Index Documentation]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
pub fn read_pages_locations<R: ChunkReader>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder should we deprecate read_pages_location and instead encourage people to only use read_offset_indexes (which contain the page location too?) I think that would be less confusing to me as it took a while to realize that PageLocation is stored within the OffsetIndex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's in my master plan 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance you can file a ticket that tracks the work to do (one never knows, sometimes other people show up and help with things 🎣 )

etseidl and others added 2 commits July 17, 2024 14:37
@etseidl
Copy link
Contributor Author

etseidl commented Jul 17, 2024

I also think that a round trip test (that writes data to parquet, and then reads the file back in and verifies that the metadata is written correctly (e.g uncencoded_byte_array_data_bytes is present) would be important Is this something you plan to do?

Sorry if I'm missing something, but is this not what I'm doing here? https://github.com/etseidl/arrow-rs/blob/44d1a977658b2c5d30deb97b8d7b3cb10f5682d1/parquet/src/file/writer.rs#L1908-L1921

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @etseidl -- I think this is good enough to merge now

I do think we should rename OffsetSizeIndex before merge but I also think we could do so as a follow on PR.

cc @Ted-Jiang who I think added the initial offset/page index support

///
/// [Column Index Documentation]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
/// [Page Index Documentation]: https://github.com/apache/parquet-format/blob/master/PageIndex.md
pub fn read_pages_locations<R: ChunkReader>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance you can file a ticket that tracks the work to do (one never knows, sometimes other people show up and help with things 🎣 )

/// [`OffsetIndex`] information for a column chunk. Contains offsets and sizes for each page
/// in the chunk. Optionally stores fully decoded page sizes for BYTE_ARRAY columns.
#[derive(Debug, Clone, PartialEq)]
pub struct OffsetSizeIndex {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OffsetIndex has the same data members as OffsetSizeIndex, so is it worth defining a new struct? It seems there's precedent for using the thrift objects from format elsewhere in the metadata.

I think we should define a new Rust struct so that we aren't tied to whatever thrift generates.

The existing naming of the metadata structs is terribly confusing in my mind, so whatever we can do to make it better (at least not worse) would be good

Let's go with OffsetIndexMetaData for now, and maybe we can deprecate the existing ParquetOffsetIndex typedef in file/metadata/mod.rs with a note that we eventually plan to rename OffsetIndexMetaData to ParquetOffsetIndex

@alamb
Copy link
Contributor

alamb commented Jul 19, 2024

🚀

@alamb alamb merged commit e40b311 into apache:53.0.0-dev Jul 19, 2024
17 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 19, 2024

Thanks @etseidl

@etseidl
Copy link
Contributor Author

etseidl commented Jul 19, 2024

Thanks for your patience @alamb! I've renamed the offset index struct. Once this merges, I'll submit the next PR to deprecate read_pages_locations() and replace with read_offset_index() throughout.

Yes, I can submit a tracking issue for this, but the next step after the above mentioned PR is do then add the histograms to the column index, and then perhaps rename all of the various page index structures to be more consistent.

@alamb
Copy link
Contributor

alamb commented Jul 19, 2024

I think keeping the two projects separate makes a lot of sense

  1. "let's get the new structures in place"
  2. "let's rename the structures to be consistent"

If you don't get a chance I'll try and file a ticket to describing getting the naming sorted out

@etseidl
Copy link
Contributor Author

etseidl commented Jul 19, 2024

I think keeping the two projects separate makes a lot of sense

1. "let's get the new structures in place"

2. "let's rename the structures to be consistent"

Ok. Let's continue to use #5022 to track the new structures, and I'll submit a new issue describing the desire for more consistent naming. I can put a strawman in there and then others can opine as they find time.

alamb added a commit that referenced this pull request Jul 26, 2024
* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041)

* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight`

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* fix example tests

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* Remove `impl<T: AsRef<[u8]>> From<T> for Buffer`  that easily accidentally copies data (#6043)

* deprecate auto copy, ask explicit reference

* update comments

* make cargo doc happy

* Make display of interval types more pretty (#6006)

* improve dispaly for interval.

* update test in pretty, and fix display problem.

* tmp

* fix tests in arrow-cast.

* fix tests in pretty.

* fix style.

* Update snafu (#5930)

* Update Parquet thrift generated structures (#6045)

* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* Revert "Revert "Write Bloom filters between row groups instead of the end  (#…" (#5933)

This reverts commit 22e0b44.

* Revert "Update snafu (#5930)" (#6069)

This reverts commit 756b1fb.

* Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075)

* Update pyo3 requirement from 0.21.1 to 0.22.1

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.1)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* refactor: remove deprecated `FromPyArrow::from_pyarrow`

"GIL Refs" are being phased out.

* chore: update `pyo3` in integration tests

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* remove repeated codes to make the codes more concise. (#6080)

* Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068)

* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* add support for unencoded_byte_array_data_bytes

* add comments

* change sig of ColumnMetrics::update_variable_length_bytes()

* rename ParquetOffsetIndex to OffsetSizeIndex

* rename some functions

* suggestion from review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* add Default trait to ColumnMetrics as suggested in review

* rename OffsetSizeIndex to OffsetIndexMetaData

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085)

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095)

* deprecate read_page_locations

* add to_thrift() to OffsetIndexMetaData

* Update parquet/src/column/writer/mod.rs

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: kamille <caoruiqiu.crq@antgroup.com>
Co-authored-by: Jesse <github@jessebakker.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
alamb added a commit that referenced this pull request Jul 26, 2024
…aData` (#6105)

* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041)

* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight`

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* fix example tests

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* Remove `impl<T: AsRef<[u8]>> From<T> for Buffer`  that easily accidentally copies data (#6043)

* deprecate auto copy, ask explicit reference

* update comments

* make cargo doc happy

* Make display of interval types more pretty (#6006)

* improve dispaly for interval.

* update test in pretty, and fix display problem.

* tmp

* fix tests in arrow-cast.

* fix tests in pretty.

* fix style.

* Update snafu (#5930)

* Update Parquet thrift generated structures (#6045)

* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* Revert "Revert "Write Bloom filters between row groups instead of the end  (#…" (#5933)

This reverts commit 22e0b44.

* Revert "Update snafu (#5930)" (#6069)

This reverts commit 756b1fb.

* Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075)

* Update pyo3 requirement from 0.21.1 to 0.22.1

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.1)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* refactor: remove deprecated `FromPyArrow::from_pyarrow`

"GIL Refs" are being phased out.

* chore: update `pyo3` in integration tests

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* remove repeated codes to make the codes more concise. (#6080)

* Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068)

* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* add support for unencoded_byte_array_data_bytes

* add comments

* change sig of ColumnMetrics::update_variable_length_bytes()

* rename ParquetOffsetIndex to OffsetSizeIndex

* rename some functions

* suggestion from review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* add Default trait to ColumnMetrics as suggested in review

* rename OffsetSizeIndex to OffsetIndexMetaData

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* deprecate read_page_locations

* add level histograms to metadata

* add to_thrift() to OffsetIndexMetaData

* Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085)

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095)

* deprecate read_page_locations

* add to_thrift() to OffsetIndexMetaData

* move valid test into ColumnIndexBuilder::append_histograms

* move update_histogram() inside ColumnMetrics

* Update parquet/src/column/writer/mod.rs

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

* Implement LevelHistograms as a struct

* formatting

* fix error in docs

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: kamille <caoruiqiu.crq@antgroup.com>
Co-authored-by: Jesse <github@jessebakker.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@etseidl etseidl deleted the size_stats_oidx_lite branch July 26, 2024 20:20
alamb added a commit that referenced this pull request Aug 2, 2024
…the footer (#6117)

* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041)

* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight`

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* fix example tests

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* Remove `impl<T: AsRef<[u8]>> From<T> for Buffer`  that easily accidentally copies data (#6043)

* deprecate auto copy, ask explicit reference

* update comments

* make cargo doc happy

* Make display of interval types more pretty (#6006)

* improve dispaly for interval.

* update test in pretty, and fix display problem.

* tmp

* fix tests in arrow-cast.

* fix tests in pretty.

* fix style.

* Update snafu (#5930)

* Update Parquet thrift generated structures (#6045)

* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* Revert "Revert "Write Bloom filters between row groups instead of the end  (#…" (#5933)

This reverts commit 22e0b44.

* Revert "Update snafu (#5930)" (#6069)

This reverts commit 756b1fb.

* Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075)

* Update pyo3 requirement from 0.21.1 to 0.22.1

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.1)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* refactor: remove deprecated `FromPyArrow::from_pyarrow`

"GIL Refs" are being phased out.

* chore: update `pyo3` in integration tests

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* remove repeated codes to make the codes more concise. (#6080)

* Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068)

* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* add support for unencoded_byte_array_data_bytes

* add comments

* change sig of ColumnMetrics::update_variable_length_bytes()

* rename ParquetOffsetIndex to OffsetSizeIndex

* rename some functions

* suggestion from review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* add Default trait to ColumnMetrics as suggested in review

* rename OffsetSizeIndex to OffsetIndexMetaData

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085)

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095)

* deprecate read_page_locations

* add to_thrift() to OffsetIndexMetaData

* no longer write inline column metadata

* Update parquet/src/column/writer/mod.rs

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

* suggestion from review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* add some more documentation

* remove write_metadata from PageWriter

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: kamille <caoruiqiu.crq@antgroup.com>
Co-authored-by: Jesse <github@jessebakker.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
alamb added a commit that referenced this pull request Aug 6, 2024
* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight` (#6041)

* bump `tonic` to 0.12 and `prost` to 0.13 for `arrow-flight`

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* fix example tests

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* Remove `impl<T: AsRef<[u8]>> From<T> for Buffer`  that easily accidentally copies data (#6043)

* deprecate auto copy, ask explicit reference

* update comments

* make cargo doc happy

* Make display of interval types more pretty (#6006)

* improve dispaly for interval.

* update test in pretty, and fix display problem.

* tmp

* fix tests in arrow-cast.

* fix tests in pretty.

* fix style.

* Update snafu (#5930)

* Update Parquet thrift generated structures (#6045)

* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* Revert "Revert "Write Bloom filters between row groups instead of the end  (#…" (#5933)

This reverts commit 22e0b44.

* Revert "Update snafu (#5930)" (#6069)

This reverts commit 756b1fb.

* Update pyo3 requirement from 0.21.1 to 0.22.1 (fixed) (#6075)

* Update pyo3 requirement from 0.21.1 to 0.22.1

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.1)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* refactor: remove deprecated `FromPyArrow::from_pyarrow`

"GIL Refs" are being phased out.

* chore: update `pyo3` in integration tests

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* remove repeated codes to make the codes more concise. (#6080)

* Add `unencoded_byte_array_data_bytes` to `ParquetMetaData` (#6068)

* update to latest thrift (as of 11 Jul 2024) from parquet-format

* pass None for optional size statistics

* escape HTML tags

* don't need to escape brackets in arrays

* add support for unencoded_byte_array_data_bytes

* add comments

* change sig of ColumnMetrics::update_variable_length_bytes()

* rename ParquetOffsetIndex to OffsetSizeIndex

* rename some functions

* suggestion from review

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* add Default trait to ColumnMetrics as suggested in review

* rename OffsetSizeIndex to OffsetIndexMetaData

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update pyo3 requirement from 0.21.1 to 0.22.2 (#6085)

Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md)
- [Commits](PyO3/pyo3@v0.21.1...v0.22.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Deprecate read_page_locations() and simplify offset index in `ParquetMetaData` (#6095)

* deprecate read_page_locations

* add to_thrift() to OffsetIndexMetaData

* Update parquet/src/column/writer/mod.rs

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

* Upgrade protobuf definitions to flightsql 17.0 (#6133)

* Update FlightSql.proto to version 17.0

Adds new message CommandStatementIngest and removes `experimental` from
other messages.

* Regenerate flight sql protocol

This upgrades the file to version 17.0 of the protobuf definition.

* Add `ParquetMetadataWriter` allow ad-hoc encoding of `ParquetMetadata`

* fix loading in test by etseidl

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>

* add rough equivalence test

* one more check

* make clippy happy

* separate tests that require arrow into a separate module

* add histograms to to_thrift()

---------

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: kamille <caoruiqiu.crq@antgroup.com>
Co-authored-by: Jesse <github@jessebakker.com>
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Marco Neumann <marco@crepererum.net>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Douglas Anderson <djanderson@users.noreply.github.com>
Co-authored-by: Ed Seidl <etseidl@live.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants