-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: handle NaN as min/max stats in evaluators #2069
Conversation
return ROWS_CANNOT_MATCH; | ||
} | ||
|
||
if (lowerBounds != null && lowerBounds.containsKey(id)) { | ||
T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id)); | ||
|
||
int cmp = lit.comparator().compare(lower, lit.value()); | ||
if (cmp > 0) { | ||
|
||
// Due to the comparison implementation of ORC stats, for float/double columns in ORC files, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for noticing this. Instead of having this paragraph everywhere, can we abstract this to a method like checkNaNLowerBound
, and make this the documentation of that method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick review! Yeah I do realize that repeating the same comment over and over again is a bit annoying, but I wasn't sure where the right balance is. Since I'm hoping to check for isNaN
after comparing to avoid unnecessary checking in lt
/ltEq
, the only thing I can abstract out is NaNUtil.isNaN(lower)
, that we are essentially wrapping around a wrapper; and also I guess that might not help much with readability since the actual explanation in this case will be outside of the logic flow here, so the reader will have to jump around to understand the full intention. Maybe we can shorten this comment everywhere and have the full version at the start of the class? Do you/other people have any suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I did not see the comment, I think it is valuable enough to abstract out the method given the amount of comments, and have something like:
/*
* all the comments...
*/
private boolean isLowerBoundNaN(lower) {
return NaNUtil.isNaN(lower);
}
I am not good at naming, you can probably come up with a better name...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't that there is a need for an extra method that has just one method call. I'd probably do it like this:
T lower = Conversions.fromByteBuffer(ref.type(), lowerBounds.get(id));
if (NaNUtil.isNaN(lower)) {
// NaN indicates unreliable bounds. See the InclusiveMetricsEvaluator docs for more.
return ROWS_MIGHT_MATCH;
}
int cmp = lit.comparator().compare(lower, lit.value());
if (cmp > 0) {
return ROWS_CANNOT_MATCH;
}
The docs would go in the javadoc for the whole class, and each NaN check could simply refer back to it. I also moved the NaN check above the comparison to keep the logic simple: if the value is NaN, the bound is invalid.
I think in Java NaN is neither greater then, nor equal to, nor less than any numbers. That’s the IEEE754 semantics, at least. |
Thanks for the interests in this PR! I think in Java, in comparisons NaN is handled as greater than all floating point numbers including infinity, but you are right NaN shouldn't normally be comparable. |
Odd, maybe we're trying different things. What does this print for you?
I'm on OpenJDK 1.8, and this prints |
Yeah this prints
|
@jbapple @yyanyy this blog has a good summary: https://blog.csdn.net/jierui001/article/details/3278382 |
LGTM. One note I'll make: not all languages treat NaN like Java. For instance, in C++, there are roughly 2^52 different NaN values, and the totalOrder predicate sorts negative NaNs before negative infinity. As such, you could have columns with a NaN lower bound and a numeric upper bound. I'm not sure what the latest draft Iceberg specification says about this. |
Yeah that's why I was asking about the spec in #2055. I think What you describe is consistent with the current plan of -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN. |
|
||
shouldRead = new InclusiveMetricsEvaluator( | ||
SCHEMA, Expressions.in("some_nan_correct_bounds", 30F)).eval(FILE); | ||
Assert.assertFalse("Should not match: 30 not within bounds", shouldRead); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: 1F and 30F could be combined to one case, Expressions.in("some_nan_correct_bounds", 1F, 30F)
.
Assert.assertFalse("Should not match: all nan column doesn't contain number", shouldRead); | ||
|
||
shouldRead = new StrictMetricsEvaluator(SCHEMA, func.apply("some_nan_correct_bounds", 10F)).eval(FILE); | ||
Assert.assertFalse("Should not match: nan value exists", shouldRead); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case would return false
anyway because the correct upper bound is > 10. I think you should change this test case to 30F so that it matches the whole range, but does not match because there are some NaN values.
@yyanyy, the tests look good other than one case that I mentioned. I also commented on the implementation thread, but overall I think this is correct. Nice work! |
For ORC metrics, which currently don't populate I suppose when we go to deprecate the old This is all based off of my understanding of ORC's We could also potentially track NaN values in the various orc writers, like the parquet writers do. The relevant code for populating ORC metrics is here: iceberg/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java Lines 100 to 172 in 29915e3
|
Thanks for your interests in this PR! You are right that some of this PR will not be working without NaN counter for checking if a field contains partial/all NaN, and it's hard to get ORC Regarding populating NaN counter for ORC files, yes we are going to track NaN values in various ORC writers just as we do for parquet writers, and the change is currently in review in #1790. |
I have updated the PR based on comments. Thank you everyone for your interests in this PR! |
Thank you for fixing this, @yyanyy! Great work. Thanks to @jbapple, @jackye1995, and @kbendick for reviewing! |
This PR is based on the behavior we observed in here:
This fixes #1761.