parquet Statistics - deprecate has_* APIs and add _opt functions that return Option<T> #6216

Merged
merged 24 commits into from
Aug 15, 2024

Conversation

Michael-J-Ward
Contributor

@Michael-J-Ward Michael-J-Ward commented Aug 9, 2024

Which issue does this PR close?

Closes #6093
Closes #6215

Rationale for this change

From the feature request for min and max

The fact that Parquet metadata statistics requires checking before access is confusing and error prone, for example #6092
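
To make the foot-gun concrete, here is a minimal sketch of the two access styles. `ToyStats` and `all_at_least` are hypothetical names invented for illustration; this is not the parquet crate's `ValueStatistics`, only a model of "check first, then access" versus "absence is part of the return type".

```rust
/// Hypothetical stand-in for a statistics struct (not the parquet crate's type).
pub struct ToyStats {
    min: Option<i32>,
    max: Option<i32>,
}

impl ToyStats {
    pub fn new(min: Option<i32>, max: Option<i32>) -> Self {
        Self { min, max }
    }

    /// Old style: callers must remember to call this before `min()`.
    pub fn has_min_max_set(&self) -> bool {
        self.min.is_some() && self.max.is_some()
    }

    /// Old style: panics if the value was never set.
    pub fn min(&self) -> &i32 {
        self.min.as_ref().expect("min not set")
    }

    /// New style: a missing value is visible in the signature.
    pub fn min_opt(&self) -> Option<&i32> {
        self.min.as_ref()
    }
}

/// Returns true only when the minimum is known and is >= `threshold`.
/// A missing minimum is treated as "cannot prove it", so the answer is false.
pub fn all_at_least(stats: &ToyStats, threshold: i32) -> bool {
    stats.min_opt().map_or(false, |min| *min >= threshold)
}
```

With the `Option` form the compiler forces the caller to decide what a missing statistic means, instead of relying on a separate check that is easy to forget.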

From the feature request for null_count

and the has_nulls() is based on null_count > 0, leading to ambiguity when null_count equals 0: either all values are non-null, or just that the null count stat is missing.
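
The ambiguity can be sketched as follows. `LegacyNullCount`, `NullInfo`, and `classify` are hypothetical names, and storing a missing count as 0 is an assumption made for the illustration, not a claim about the crate's exact internals.

```rust
/// Hypothetical model of the old representation.
/// Assumption: a missing null count ends up stored as 0.
pub struct LegacyNullCount {
    null_count: u64,
}

impl LegacyNullCount {
    pub fn from_file(value: Option<u64>) -> Self {
        Self { null_count: value.unwrap_or(0) }
    }

    /// Cannot tell "no nulls" apart from "count missing": both give false.
    pub fn has_nulls(&self) -> bool {
        self.null_count > 0
    }
}

/// The three states an `Option<u64>` null count can express.
#[derive(Debug, PartialEq)]
pub enum NullInfo {
    Unknown,
    NoNulls,
    HasNulls(u64),
}

pub fn classify(null_count_opt: Option<u64>) -> NullInfo {
    match null_count_opt {
        None => NullInfo::Unknown,
        Some(0) => NullInfo::NoNulls,
        Some(n) => NullInfo::HasNulls(n),
    }
}
```

In this model `has_nulls()` returns `false` for both a missing count and a count of zero, while `classify` keeps the two cases distinct.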

What changes are included in this PR?

APIs added

  • Statistics::{min,max}_bytes_opt
  • ValueStatistics::{min,max}_bytes_opt
  • ValueStatistics::{min,max}_opt
  • Statistics::null_count_opt
  • ValueStatistics::null_count_opt

APIs deprecated

  • Statistics::has_nulls
  • Statistics::null_count
  • ValueStatistics::null_count
  • Statistics::has_min_max_set
  • ValueStatistics::has_min_max_set
  • Statistics::{min,max}_bytes
  • ValueStatistics::{min,max}_bytes
  • ValueStatistics::{min,max}

Are there any user-facing changes?

Yes. All of the above changes are changes to the public API.

Additionally, null_count = 0 is now written to page statistics instead of being treated as None.

I first re-named the existing method to `min_unchecked` and made it
internal to the crate.

I then added a `pub min(&self) -> Option<&T>` method.

I figure we can first change the public API before deciding what to do
about internal usage.

Ref: apache#6093
I first re-named the existing method to `max_unchecked` and made it
internal to the crate.

I then added a `pub max(&self) -> Option<&T>` method.

I figure we can first change the public API before deciding what to do
about internal usage.

Ref: apache#6093
@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 9, 2024
This removes the ambiguity between all values being non-null and the null count stat simply being missing.

Ref: apache#6215
Changing null_count from u64 to Option<u64> increases the memory size and changes the layout of the metadata.

I included these tests as a separate commit to call extra attention to it.
@Michael-J-Ward Michael-J-Ward changed the title parquet Statistics - remove has_min_max_set and return Option<T> for min and max parquet Statistics - remove has_* APIs and return Option<T> for statistics Aug 9, 2024
Contributor Author

@Michael-J-Ward Michael-J-Ward left a comment

Lastly, I'd like to call attention to the change in memory size by switching from null_count: u64 to null_count: Option<u64>.

I updated the memory size and layout tests to pass, but I'm unsure if those values were "sacred".
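
For reference, the in-memory growth can be checked with `std::mem::size_of`. The sketch below uses two hypothetical structs, `Before` and `After`, that model only the field that changed; they are not the crate's actual statistics types.

```rust
use std::mem::size_of;

/// Hypothetical struct modelling only the old field.
pub struct Before {
    pub null_count: u64,
}

/// Hypothetical struct modelling only the new field.
pub struct After {
    pub null_count: Option<u64>,
}

/// Returns (size of the old shape, size of the new shape) in bytes.
pub fn sizes() -> (usize, usize) {
    (size_of::<Before>(), size_of::<After>())
}
```

`u64` has no spare bit patterns for the compiler to use as the `None` marker, so `Option<u64>` needs a separate discriminant plus padding. On typical 64-bit targets that means 16 bytes instead of 8.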

Contributor

@crepererum crepererum left a comment

I've also run into this weird API before (which I think is inspired by certain C++ interfaces) and I think this new API is better and more "Rusty".

@crepererum
Contributor

I'm wondering if we need an API deprecation here or not. If we need one, we would need to use an approach like this:

  • has_... and min/max/....: add deprecation note
  • add new Option APIs: min_opt/max_opt/...

That would be the smoothest path. However, the API breakage is rather simple and easy to fix. @alamb WDYT?

@Michael-J-Ward
Contributor Author

@crepererum - I'm happy to go either way, just let me know.

I went with the breaking change because I thought removing such a foot-gun is suitable for a breaking-change / major-release upgrade. Also, that's what the GH issues requested.

Contributor

@alamb alamb left a comment

Thank you @Michael-J-Ward this is epic. Thank you @crepererum for the review

In my opinion we should make all the APIs consistent (aka change max_bytes to return Option<&[u8]>, etc.)

Responding to @crepererum 's comment in #6216 (comment)

has_... and min/max/....: add deprecation note
add new Option APIs: min_opt/max_opt/...
That would be the smoothest path. However, the API breakage is rather simple and easy to fix. @alamb WDYT?

I would also be ok with either path as long as the API is consistent (e.g. min_bytes should do the same thing as min)

I would personally prefer erring on the "nicer experience for users of this crate" and thus go the backwards compatible route:

  1. Leave the existing functions as is but mark them deprecated
  2. Add new functions like min_opt(), max_opt(), etc that return Option<..>

That means users will be told what is going on with the deprecated APIs and how to fix it, even though the eventual API may not be as concise.
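
The backwards compatible route described above could look roughly like this. It is a sketch on a hypothetical `ToyValueStats` type; the deprecation note text and the helper names `max_or` and `legacy_max` are made up for illustration and are not taken from the crate.

```rust
/// Hypothetical statistics type used only to illustrate the migration path.
pub struct ToyValueStats {
    max: Option<i64>,
}

impl ToyValueStats {
    pub fn new(max: Option<i64>) -> Self {
        Self { max }
    }

    /// Step 1: the old accessor stays, but callers get a compiler warning
    /// that tells them what to use instead.
    #[deprecated(note = "use `max_opt` instead; this panics when max is unset")]
    pub fn max(&self) -> &i64 {
        self.max.as_ref().expect("max not set")
    }

    /// Step 2: the new accessor returns `None` when the value is unset.
    pub fn max_opt(&self) -> Option<&i64> {
        self.max.as_ref()
    }
}

/// A caller that has migrated to the new API.
pub fn max_or(stats: &ToyValueStats, fallback: i64) -> i64 {
    stats.max_opt().copied().unwrap_or(fallback)
}

/// A caller that has not migrated yet still compiles; the lint is silenced
/// here only to keep the example free of warnings.
#[allow(deprecated)]
pub fn legacy_max(stats: &ToyValueStats) -> i64 {
    *stats.max()
}
```

Existing code keeps building and sees the deprecation note at each call site, which is the "told what is going on and how to fix it" experience mentioned above.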

@@ -189,7 +189,7 @@ fn test_primitive() {
 pages: (0..8)
 .map(|_| Page {
 rows: 250,
-page_header_size: 36,
+page_header_size: 38,
Contributor

why are these changes needed?

Contributor Author

@Michael-J-Ward Michael-J-Ward Aug 14, 2024

These sizes changed upon switching ValueStatistics::null_count from u64 to Option<u64>.

Although I'd expect such a change to require an extra bit, I was still concerned that these sizes might be set by the parquet spec, and so called it out.

If these sizes are sacred and can't be updated, I'd appreciate any pointers for implementing null_count_opt without affecting them.

#6216 (review)

Contributor Author

Scratch the above. The answer is in the linked comment.

When I initially converted null_count to Option<u64>, the first test I updated was test_memory_size, and I incorrectly assumed the rest of the layout tests were downstream of that one.

#6216 (comment)

Michael-J-Ward and others added 11 commits August 13, 2024 17:14
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Per PR review, we will deprecate the old API instead of introducing a breaking change.

Ref: apache#6216 (review)
This adds the API and migrates all of the test usage.
The old APIs will be deprecated next.
The check is unnecessary now that the stats funcs return Option<T> when unset.
An internal version was also created because it is used so extensively in testing.
} else {
None
},
null_count: stats.null_count_opt().map(|value| value as i64),
Contributor Author

@alamb - this is why the arrow_writer_layout tests changed.

The previous API treated null_count = 0 as None.

The new API treats null_count = 0 as Some(0).

I believe the new behavior is what is desired, but can easily revert to the old behavior with:

        null_count: stats.null_count_opt().map(|value| value as i64).filter(|&x| x > 0),
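
The difference between the two conversions can be seen on a plain `Option<u64>`. In this sketch the function names `null_count_new` and `null_count_old` are hypothetical, and the input stands in for whatever `null_count_opt()` returns.

```rust
/// New behavior: a known count of zero is preserved as `Some(0)`.
pub fn null_count_new(null_count_opt: Option<u64>) -> Option<i64> {
    null_count_opt.map(|value| value as i64)
}

/// Old behavior: the extra `filter` turns a known zero back into `None`.
pub fn null_count_old(null_count_opt: Option<u64>) -> Option<i64> {
    null_count_opt
        .map(|value| value as i64)
        .filter(|&x| x > 0)
}
```

The two functions only disagree on `Some(0)`; a missing count stays `None` and a positive count is passed through in both.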

Contributor

Update -- I think this code is related to reading the statistics that were in the parquet file rather than how they are written.

It seems like previously this code just set the statistics count to zero; #6256 tracks this.

Contributor

I agree the new behavior is desired, but I think it changes what values are written to parquet files (specifically, the parquet metadata will now have the thrift equivalent of Some(0) rather than the equivalent of None). I filed #6256 to track this.

As this PR is already quite large, I think we should split it into two parts:

  1. The API changes
  2. The change for writing the metadata

I plan to update this PR to revert the changes to the metadata writing, and will then make a follow on PR to discuss / propose changing the statistics that are written to the file

This removes the assertion from any test that subsequently unwraps both
min_opt and max_opt.
…th assertions on min_opt and max_opt

This removes all use of Statistics::_internal_has_min_max_set from the code base, and so it is also removed.
@Michael-J-Ward
Contributor Author

@alamb and @crepererum - this PR is ready for another look.

Contributor

@alamb alamb left a comment

Thank you so much for this PR @Michael-J-Ward -- this improvement is a very long time coming.

I plan to split it into two parts -- I'll leave this PR with the API change and make a new PR with the change to the files that are written

@alamb alamb added the api-change Changes to the arrow API label Aug 15, 2024
@alamb alamb changed the title parquet Statistics - remove has_* APIs and return Option<T> for statistics parquet Statistics - deprecate has_* APIs and add _opt functions that return Option<T> for statistics Aug 15, 2024
@alamb alamb changed the title parquet Statistics - deprecate has_* APIs and add _opt functions that return Option<T> for statistics parquet Statistics - deprecate has_* APIs and add _opt functions that return Option<T> Aug 15, 2024
@alamb
Contributor

alamb commented Aug 15, 2024

I pushed d4e650 to revert the writer behavior changes and added some comments -- let me know what you think

@alamb
Contributor

alamb commented Aug 15, 2024

I am going to merge this PR as the API code has been reviewed and the content of what is read/written to statistics is the same as on master. I will open follow-on PRs for discussion and am happy to make other changes if desired.

Contributor

@alamb alamb left a comment

Thanks again @Michael-J-Ward -- epic

@alamb alamb merged commit 69b17ad into apache:master Aug 15, 2024
16 checks passed
@alamb
Contributor

alamb commented Aug 15, 2024

A small follow up from this PR #6259

Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
3 participants