Add size statistics to ParquetMetaData introduced in PARQUET-2261 #5486
This is my first foray into Rust programming, so I'm not sure everything is done as idiomatically as possible. Submitting this now to get early feedback on my approach. I'm also wondering how much testing to add, and whether this should have a configuration parameter to turn generating the statistics off.
I triggered the CI checks and will try to review this over the next few days if no one beats me to it.
Hey @etseidl, are you still working on this? Otherwise, I have some code in a fork and can try to move this along.
It seems I never looked at this PR, but I just skimmed it and it looks reasonable to me. I think the biggest thing it is currently lacking is tests.
Yeah, sorry, I got sidetracked by other work...and unit tests are the bane of my existence 😅. Let me see if I can get some added in the next few days.
@alamb I've added two tests so far. I still need to test the repetition level histogram, but I think I've pushed [...] I'm also not entirely sure how to test the new statistics in the page indexes. For now my story is if the [...]
still needs more documentation
I think using one ticket is just fine. No need to make new tickets.
Update: what do you think about doing incremental PRs to a feature branch? #6050
Sounds good to me. We can all use #6050 to discuss deconfliction in the event of conflicts. And we can afford to be a bit more freewheeling too 😄. For instance, I have a branch for changing the [...] Thank you for helping to coordinate all this ❤️
Superseded by #6105 and others.
🎉
Which issue does this PR close?
Closes #5022
Rationale for this change
Implements new page and column chunk size statistics introduced in PARQUET-2261
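As a rough illustration of what these statistics hold, here is a minimal sketch, not the code from this PR. The field names follow the `SizeStatistics` struct in `parquet.thrift`; the Rust struct and the helper functions `level_histogram` and `unencoded_byte_array_bytes` are hypothetical and only show how the values could be derived from a column's levels and byte-array values.

```rust
/// Hypothetical mirror of the Thrift `SizeStatistics` struct.
/// Field names follow parquet.thrift; this is not the arrow-rs type.
#[derive(Debug, Default, PartialEq)]
pub struct SizeStatistics {
    /// Total bytes of BYTE_ARRAY values before encoding or compression.
    pub unencoded_byte_array_data_bytes: Option<i64>,
    /// Entry `i` counts how many values have repetition level `i`.
    pub repetition_level_histogram: Option<Vec<i64>>,
    /// Entry `i` counts how many values have definition level `i`.
    pub definition_level_histogram: Option<Vec<i64>>,
}

/// Count occurrences of each level in `0..=max_level`.
/// Assumes every level is within that range; an out-of-range
/// level makes the index invalid and the function panics.
pub fn level_histogram(levels: &[i16], max_level: i16) -> Vec<i64> {
    let mut hist = vec![0i64; max_level as usize + 1];
    for &level in levels {
        hist[level as usize] += 1;
    }
    hist
}

/// Sum the raw lengths of the byte-array values (no length prefixes).
pub fn unencoded_byte_array_bytes(values: &[&[u8]]) -> i64 {
    values.iter().map(|v| v.len() as i64).sum()
}
```

For example, definition levels `[0, 1, 1, 2, 2, 2]` with a maximum level of 2 would give the histogram `[1, 2, 3]`.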
What changes are included in this PR?
Adds the necessary structures from the updated parquet.thrift, and adds the code necessary to populate them.
Are there any user-facing changes?
No