-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for level histograms added in PARQUET-2261 to ParquetMetaData
#6105
Conversation
…ally copies data (apache#6043) * deprecate auto copy, ask explicit reference * update comments * make cargo doc happy
* improve dispaly for interval. * update test in pretty, and fix display problem. * tmp * fix tests in arrow-cast. * fix tests in pretty. * fix style.
* update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays
… end (#…" (apache#5933) This reverts commit 22e0b44.
This reverts commit 756b1fb.
* Update pyo3 requirement from 0.21.1 to 0.22.1 Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.1) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * refactor: remove deprecated `FromPyArrow::from_pyarrow` "GIL Refs" are being phased out. * chore: update `pyo3` in integration tests --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* update to latest thrift (as of 11 Jul 2024) from parquet-format * pass None for optional size statistics * escape HTML tags * don't need to escape brackets in arrays * add support for unencoded_byte_array_data_bytes * add comments * change sig of ColumnMetrics::update_variable_length_bytes() * rename ParquetOffsetIndex to OffsetSizeIndex * rename some functions * suggestion from review Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> * add Default trait to ColumnMetrics as suggested in review * rename OffsetSizeIndex to OffsetIndexMetaData --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Updates the requirements on [pyo3](https://github.com/pyo3/pyo3) to permit the latest version. - [Release notes](https://github.com/pyo3/pyo3/releases) - [Changelog](https://github.com/PyO3/pyo3/blob/v0.22.2/CHANGELOG.md) - [Commits](PyO3/pyo3@v0.21.1...v0.22.2) --- updated-dependencies: - dependency-name: pyo3 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…MetaData` (apache#6095) * deprecate read_page_locations * add to_thrift() to OffsetIndexMetaData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl -- I think this look really nice
The only thing I could think that we might want to explore is using a rust struct rather than a Option<Vec<i64>
so it could be easier to use / document. I'll make a PR to explore what that looks like
The clippy failure is likely due to the new rust version which was released today. I fixed it on main. Also since we have made a 52.2.0 release candidate I think we can move |
❤️ Yes, while I was reviewing my changes I was thinking the same thing. A histogram class that supports updates, unions, and concatenation. |
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Update for anyone following along: #6126 is the merge of the dev branch to main |
Merge went ok...once #6126 merges I'll rebase this on master. |
I merged the 53 dev branch ~and that seems to have closed this PR` -- any chance you can retarget main? Update: I restored the branch |
done
@alamb I too created a
Thank you for all the help and for the free Rust schooling 😄 |
Implement LevelHistograms as a struct
if let Some(ref rep_lvl_hist) = repetition_level_histogram { | ||
let hist = self.repetition_level_histograms.get_or_insert(Vec::new()); | ||
hist.reserve(rep_lvl_hist.len()); | ||
hist.extend(rep_lvl_hist.values()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb should we also replace the vectors in ColumnIndexBuilder
with LevelHistogram
? We could add an append()
method to LevelHistogram
to reduce some code duplication here.
Same question for PageIndex
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth with this -- I think I was thinking this was building the thrift structure so we should use the thrift representation (Vec<i64>
) but I may have gotten that wrong as I find the naming quite confusing
How about we explore doing it as a follow on PR?
Let's do this! |
That is half the fun of working on these projects -- I see a lot of cool things while reviewing / working on PRs |
Thanks again @etseidl -- I merged this to main and let's keep working to improve things there. |
Which issue does this PR close?
Closes #5022.
Rationale for this change
Final piece of the new
SizeStatistics
. Supercedes #5486.What changes are included in this PR?
Adds repetition and definition level histograms to
ParquetMetaData
and to thePageIndex
struct (representation of the thriftColumnIndex
).Are there any user-facing changes?
Yes.