-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
parquet Statistics - deprecate has_*
APIs and add _opt
functions that return Option<T>
#6216
Conversation
I first re-named the existing method to `min_unchecked` and made it internal to the crate. I then added a `pub min(&self) -> Opiton<&T>` method. I figure we can first change the public API before deciding what to do about internal usage. Ref: apache#6093
I first re-named the existing method to `max_unchecked` and made it internal to the crate. I then added a `pub max(&self) -> Opiton<&T>` method. I figure we can first change the public API before deciding what to do about internal usage. Ref: apache#6093
2b21daf
to
c65a773
Compare
This removes ambiguity around whether the between all values are non-null or just that the null count stat is missing Ref: apache#6215
Changing null_count from u64 to Option<u64> increases the memory size and layout of the metadata. I included these tests as a separate commit to call extra attention to it.
has_min_max_set
and return Option<T>
for min
and max
has_*
APIs and return Option<T>
for statistics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lastly, I'd like to call attention to the change in memory size by switching from null_count: u64
to null_count: Option<u64>
.
I updated the memory size and layout tests to pass, but I'm unsure if those values were "sacred".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also ran into this weird API before (which I think is inspired by certain C++ interfaces) and I think this new API is better and more "Rusty".
I'm wondering if we need an API deprecation here or not. If we need one, we would need to use an approach like this:
That would be the smoothest path. However, the API breakage is rather simple and easy to fix. @alamb WDYT? |
@crepererum - I'm happy to go either way, just let me know. I went with the breaking change because I thought removing such a foot-gun is suitable for a breaking-change / major-release upgrade. Also, that's what the GH issues requested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Michael-J-Ward this is epic. Thank you @crepererum for the review
In my opinion we should make all the APIs consistent (aka change max_bytes
to return Option<&[u8]>
etc
Responding to @crepererum 's comment in #6216 (comment)
has_... and min/max/....: add deprecation note
add new Option APIs: min_opt/max_opt/...
That would be the smoothest path. However, the API breakage is rather simple and easy to fix. @alamb WDYT?
I would also be ok with either path as long as the API is consistent (e.g. min_bytes
should do the same thing as min
)
I would personally prefer erring on the "nicer experience for users of this crate" and thus go the backwards compatible route:
- Leave the existing functions as is but mark them deprecated
- Add new functions like
min_opt()
,max_opt()
, etc that returnOption<..>
That means users will be told what is going on with the deprecated APIs and how to fix it, even though the eventual API may not be as concise.
parquet/tests/arrow_writer_layout.rs
Outdated
@@ -189,7 +189,7 @@ fn test_primitive() { | |||
pages: (0..8) | |||
.map(|_| Page { | |||
rows: 250, | |||
page_header_size: 36, | |||
page_header_size: 38, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are these changes needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These sizes changed upon switching ValueStatistic::null_count
from u64
to Option<u64>
.
Although I'd expect such a change to require an extra bit, I was still concerned that these sizes might be set by the parquet spec, and so called it out.
If these sizes are sacred and can't be updated, I'd appreciate any pointers for implementing null_count_opt
without affecting them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scratch the above. The answer is in the linked comment.
When I initially converted null_count
to Option<64>
, the first test I updated test_memory_size
, and incorrectly assume the rest of the layout tests were downstream of that one.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Per PR review, we will deprecate the old API instead of introducing a brekaing change. Ref: apache#6216 (review)
This adds the API and migrates all of the test usage. The old APIs will be deprecated next.
The check is unnecessary now that the stats funcs return Option<T> when unset.
An internal version was also created because it is used so extensively in testing.
parquet/src/file/statistics.rs
Outdated
} else { | ||
None | ||
}, | ||
null_count: stats.null_count_opt().map(|value| value as i64), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb - this is why the arrow_writer_layout
tests changed.
The previous API treated null_count = 0
as None.
The new API treats null_count = 0
as Some(0)
.
I believe the new behavior is what is desired, but can easily revert to the old behavior with:
null_count: stats.null_count_opt().map(|value| value as i64).filter(|&x| x > 0),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update -- I think this code is related to reading the statistics that were in the parquet file rather than how they are written.
It seems like previously this code just set the statistics count to zero #6256 tracks this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the new behavior is desired, but I think it changes what values are written to parquet files (specifically the parquet metadata will now have the thrift equivalent of Some(0)
rather than the equivalent of None
. I filed #6256 to track
As this PR is already quite large, I think we should split it into two parts:
- The API changes
- The change for writing the metadata
I plan to update this PR to revert the changes to the metadata writing, and will then make a follow on PR to discuss / propose changing the statistics that are written to the file
This removes the assertion from any test that subsequently unwraps both min_opt and max_opt.
…th assertions on min_opt and max_opt This removes all use of Statistics::_internal_has_min_max_set from the code base, and so it is also removed.
@alamb and @crepererum - this PR is ready for another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this PR @Michael-J-Ward -- this improvement is a very long time coming.
I plan to split it into two parts -- I'll leave this PR with the API change and make a new PR with the change to the files that are written
has_*
APIs and return Option<T>
for statisticshas_*
APIs and add _opt
functions that return Option<T>
for statistics
has_*
APIs and add _opt
functions that return Option<T>
for statisticshas_*
APIs and add _opt
functions that return Option<T>
I pushed d4e650 to revert the writer behavior changes and added some comments -- let me know what you think |
I am going to merge this PR as the APIs code has been reviewed and the content of what is read/written to statistics is the same as on master. I will open follow on PRs for discussion and am happy to make other changes if desired |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @Michael-J-Ward -- epic
A small follow up from this PR #6259 |
Which issue does this PR close?
Closes #6093
Closes #6215
Rationale for this change
From the feature request for
min
andmax
From the feature request for
null_count
What changes are included in this PR?
APIs added
Statistics::{min,max}_bytes_opt
ValueStatistics::{min,max}_bytes_opt
ValueStatistics::{min,max}_opt
Statistics::null_count_opt
ValueStatistics::null_count_opt
APIs deprecated
Statistics::has_nulls
Statistics::null_count
ValueStatistics::null_count
Statistics::has_min_max_set
ValueStatistic::has_min_max_set
Statistics::{min,max}_bytes
ValueStatistics::{min,max}_bytes
ValueStatistics::{min,max}
Are there any user-facing changes?
Yes. All of the above changes are changes to the public APi.
Additionally,
null_count = 0
is now written to page statistics instead of being treated as None.