-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: add isNaN and notNaN predicates #1747
Conversation
(not related to this change itself)
Here's an example for the explanation of "this may result in v2 returning more files than v1": say in v1 we consistently treat NaN as lower bound when there's any NaN value, and a file has stats distributed as below: Another change is "in". In v2 we may need to explicitly check if there's NaN value in Do statements I made above look right? Thanks! |
api/src/main/java/org/apache/iceberg/expressions/ExpressionVisitors.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/UnboundPredicate.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/BoundUnaryPredicate.java
Outdated
Show resolved
Hide resolved
|
||
// when there's no nanCounts information, but we already know the column only contains null, | ||
// it's guaranteed that there's no NaN value | ||
if (containsNullsOnly(id)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we define a similar containsNaNsOnly
method to use in notNaN
and for a similar use in isNull
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't define notNaN
originally as I could directly return ROWS_CANNOT_MATCH
when both nanCounts
and valueCounts
contain this column but numbers don't match, without going into the next block of logic (of checking upper == lower == NaN and null count == 0); but this advantage no longer exists since that block needs to be removed.
But I wasn't sure if we need it for isNull
: currently in isNull()
we are checking if nullCounts == 0
to return ROWS_CANNOT_MATCH
, and I guess the only chance where we rely on containsNaNsOnly
to return ROWS_CANNOT_MATCH
is nullCounts
for this column doesn't exist but nanCounts
does. I personally feel the chance of this happening would be small, do you think we will run into this case often?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the containsNaNsOnly
logic will not be very useful as Yan said, but I think it is also valuable to have that private method just for readability.
Then the question reduces to: do we need to consider the case that null value metrics do not exist but NaN metrics do. For now I think the answer is no, because in all metrics modes NaN and null counters either both exist or both not exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I'll create a containsNaNsOnly
for readability. Ryan, do you have comment on the other point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the reasoning. If we have NaN counts, then we should have null counts. No need to over-complicated the null logic with a check for when we don't have null counts but do have NaN counts. Good catch!
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
Outdated
Show resolved
Hide resolved
|
||
public static <T> UnboundPredicate<T> notNaN(UnboundTerm<T> expr) { | ||
return new UnboundPredicate<>(Expression.Operation.NOT_NAN, expr); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also need to update the equality predicate to catch NaN
and rewrite to isNaN
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I originally thought to update SparkFilters
to do the rewrite, but this is a much better place. Thanks for the suggestion!
Edit: what do you think about doing rewriting eq
within UnboundPredicate
? And for rewriting in
, I was thinking to let Expressions.in
to do the rewrite logic of or(isNaN, in)
/and(notNaN, notIn)
, but that means it will return Expression
instead of Predicate
; does that align with your thinking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not fully understand what you mean by "rewrite logic of or(isNaN, in)
/and(notNaN, notIn)
" when you talk about rewriting in
. Can you give some examples of what predicate are you trying to support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So now since we want to handle NaN in in
predicate, for query in(1,2, NaN)
to avoid checking for NaN in in
evaluation all the time we can transform that to in(1,2) or isNaN
, and notIn(1,2,NaN)
to notIn(1, 2) and notNaN
. The problem is where to do that, since in
and notIn
are both predicate, and if we are extending them we are transforming a predicate (simpler form) to an expression (complex form), and I think there's no such case in the current code base, and it would touch a lot of existing test cases for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so it's what I thought, just a bit confused by the notation.
So for eq
, what is the benefit of doing it in UnboundedPredicate
versus just rewriting it in the Expressions
?
For in
, I think it is a more complex question.We need to figure out:
- should syntax like
in(1,2,NaN)
be supported, given it can be written asis_nan or in(1,2)
on client side - if so,
Expressions.in
should returnExpression
as you said, which looks fine to me because the only callerSparkFilters.convert
also returns anExpression
in the end. - maybe we should tackle this in another PR to keep changes concise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick response! Yeah I think the amount of change to method return type/tests is not a concern now. I just wasn't entirely sure if rewriting eq
to isNan
in Expressions
will help with catching problems early (comparing to rewriting in UnboundPredicate
), since it seems to me that the related code will not have a chance to throw any exception until bind()
is called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it isn't much earlier in that case. Maybe that actually exposes a problem with rewriting, too.
Expressions.equal("c", Double.NaN)
if c
is not a floating point column would result in isNaN
, which should be rejected while binding expressions. You could argue that it should rewrite to alwaysFalse
instead following the same logic as Expressions.equal("intCol", Long.MAX_VALUE)
-- it can't be true.
I think that it would be better to be strict and reject binding in that case because something is clearly wrong. I think a lot of the time, that kind of error would happen when columns are misaligned or predicates are incorrectly converted.
If the result of those errors is just to fail in expression binding, then why rewrite at all? Maybe we should just reject NaN in any predicate and force people to explicitly use isNaN
and notNaN
. That way we do throw an exception much earlier in all cases. Plus, we wouldn't have to worry about confusion over whether NaN
is equal to itself: in Java, a Double
that holds NaN is equal to itself, but a primitive is not. 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, those are some good points! To make sure I understand correctly/know how to move forward, I have some questions:
- If I understand correctly, to reject NaN in any predicate sounds like we might go back to the idea of rewriting equals in
SparkFilters
(or in general, the integration point with engines during the query-to-expression translation); or maybe even earlier than that, to let engines to support syntax ofis NaN
? - Since to know if a query is eligible to be translated to
isNaN
there has to be some place that ensures the type has to be either double or float, and in iceberg code base we will only know this during binding; are we able to rely on engine to do this check before translating query toExpression
? - And seems like this may only impact
eq
as we decided to do input validation on otherlg/lteq/gt/gteq
andin
anyway? - And if we start to throw exceptions when the code passes in
NaN
toeq
, that may sound backward incompatible until the engine starts to rewrite NaN?
I guess the conversation is starting to get too detailed, if you wouldn't mind I'll try to follow up on Slack tomorrow and then post the conclusion here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, to reject NaN in any predicate sounds like we might go back to the idea of rewriting equals in SparkFilters
Yes. If the engine generally uses d = NaN
then we can convert that to isNaN
. But that would be engine-dependent and the Iceberg expression API would not support equals with NaN.
are we able to rely on engine to do this check before translating query to Expression?
I think so. Most engines will optimize the SQL expressions and handle this already. If not, then it would result in an exception from Iceberg to the user. I think that's okay, too, because as I said above, we want to fail if a NaN is used in an expression with a non-floating-point column, not rewrite to false.
And seems like this may only impact eq as we decided to do input validation on other lg/lteq/gt/gteq and in anyway?
Yes. This makes all of the handling in Expressions
consistent: always reject NaN values.
that may sound backward incompatible until the engine starts to rewrite NaN?
I'm not convinced either way. You could argue that d = NaN
is ambiguous and that rejecting it is now fixing a bug. That's certainly the case with d > NaN
, which is not defined. On the other hand, there was some bevhavior before that will now no longer work. So I'd be up for fixing this in Flink and Spark conversions as soon as we can.
Feel free to ping me on Slack!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanation! I think now I understand the full picture. I think I've addressed everything except for rewriting in SparkFilters
and other engines, which I think this PR is already too big so I'll submit a separate PR for it (likely next week).
// containsNull encodes whether at least one partition value is null, lowerBound is null if | ||
// all partition values are null. | ||
ByteBuffer lowerBound = stats.get(pos).lowerBound(); | ||
if (lowerBound == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be safe, I think this should validate that containsNull
is true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean check for both containsNull
and stats.get(pos).lowerBound() == null
are true? When would lowerBound
be null while the column doesn't contain null? I guess I'll also need to update notNull
for this too (since I copied the logic from there)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like #1803 is missing PartitionFieldSummary.containsNaN()
, or is it in some other PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should be doable, although I originally consider the scope of the NaN support to be only on manifest entry level, I wasn't sure if we want to extend it beyond that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure of a case where it would happen, but containsNull
is the source of truth for whether there are null values, not a missing bound value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I'll add containsNull
to both here and notNull
. And looks like we do want to update PartitionFieldSummary
, that I'll do in a separate pr.
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java
Outdated
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/ResidualEvaluator.java
Outdated
Show resolved
Hide resolved
api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java
Outdated
Show resolved
Hide resolved
@@ -214,6 +218,36 @@ public void testInTimestamp() { | |||
Assert.assertEquals("Residual should be alwaysFalse", alwaysFalse(), residual); | |||
} | |||
|
|||
@Test | |||
public void testInNaN() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need NaN cases for other evaluators?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean testing InNaN
case for other evaluators? Yeah I'll do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we decided that we don't accept NaN in in
, I guess this conversation is outdated. I've removed inNaN
test from here.
data/src/test/java/org/apache/iceberg/data/TestMetricsRowGroupFilter.java
Outdated
Show resolved
Hide resolved
parquet/src/main/java/org/apache/iceberg/parquet/ParquetDictionaryRowGroupFilter.java
Outdated
Show resolved
Hide resolved
I think we should not produce predicates that use |
Thank you for all the comments! I'll update Do you have comment on the case of "this may result in v2 returning more files than v1" when literal is not NaN but the data to be compared have NaN? We might need to accept that to keep behavior of comparing with NaN consistent across different files? |
I don't think this is a v2 problem, it is a bug in how we currently handle NaN right? |
5789abb
to
3fc48a0
Compare
api/src/main/java/org/apache/iceberg/expressions/ExpressionVisitors.java
Show resolved
Hide resolved
|
||
// when there's no nanCounts information, but we already know the column only contains null, | ||
// it's guaranteed that there's no NaN value | ||
if (containsNullsOnly(id)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the containsNaNsOnly
logic will not be very useful as Yan said, but I think it is also valuable to have that private method just for readability.
Then the question reduces to: do we need to consider the case that null value metrics do not exist but NaN metrics do. For now I think the answer is no, because in all metrics modes NaN and null counters either both exist or both not exist.
// containsNull encodes whether at least one partition value is null, lowerBound is null if | ||
// all partition values are null. | ||
ByteBuffer lowerBound = stats.get(pos).lowerBound(); | ||
if (lowerBound == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like #1803 is missing PartitionFieldSummary.containsNaN()
, or is it in some other PR?
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java
Show resolved
Hide resolved
api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java
Outdated
Show resolved
Hide resolved
|
||
public static <T> UnboundPredicate<T> notNaN(UnboundTerm<T> expr) { | ||
return new UnboundPredicate<>(Expression.Operation.NOT_NAN, expr); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not fully understand what you mean by "rewrite logic of or(isNaN, in)
/and(notNaN, notIn)
" when you talk about rewriting in
. Can you give some examples of what predicate are you trying to support?
return dictionary.stream().allMatch(NaNUtil::isNaN) ? ROWS_CANNOT_MATCH : ROWS_MIGHT_MATCH; | ||
} | ||
|
||
private <T> Comparator<T> comparatorForNaNPredicate(BoundReference<T> ref) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can also use Comparators.forType
. The dictionary cannot contain null values so there is no need to wrap for null handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the info! I'll update this.
e853e50
to
d5e6663
Compare
Thanks for pointing this out! After thinking about this I realized that my original concern probably shouldn't be a problem. My concern was that to make sure v2 could return exactly the same result as v1 when doing NaN comparison would require extra efforts, since the behavior of metrics evaluators now change. However, doing comparison with NaN is actually an invalid operation, and regardless of how each individual engine treats this (e.g. I think Spark consider NaN as Max, as for a column |
I plan to take another look at this tomorrow. |
public static <T> UnboundPredicate<T> lessThan(String name, T value) { | ||
validateInput("lessThan", value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An easier way to do this is to add the check in Literal.from
. That's where Iceberg enforces that the value cannot be null
. Since a literal is created for every value that is passed in, we would only need to change that one place instead of all of the factory methods here.
It also ensures that we don't add factory methods later and forget to add the check to them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I didn't notice Literals.from
was called within constructor of UnboundPredicate
when normal object is passed in. This is definitely much more cleaner! I have created #1892 to address this.
Thank you so much for your time reviewing this long PR!
Thanks @yyanyy! I only had one thing that I would change, but we can do that in a follow-up since this is such a big PR. Thank you for fixing this, great work! |
This changes adds isNaN/notNaN predicates. Metrics evaluators currently work against a null NaN counter, which will be populated in a later pr. The logic of updating
SparkFilters.convert
to direct queries toisNaN
/notNaN
methods will also be in a later pr.