
API: add isNaN and notNaN predicates #1747

Merged (6 commits) on Dec 6, 2020
Conversation


@yyanyy yyanyy commented Nov 10, 2020

This change adds isNaN/notNaN predicates. Metrics evaluators currently work against a null NaN counter, which will be populated in a later PR. Updating SparkFilters.convert to direct queries to the isNaN/notNaN methods will also come in a later PR.


yyanyy commented Nov 10, 2020

(not related to this change itself)
I was thinking about how we should change the metrics evaluators when we exclude NaN from upper/lower bounds. Here's a table summarizing the changes we have to make:

| predicate | strict evaluator | inclusive evaluator |
| --- | --- | --- |
| `... where id lteq/gteq NaN` | need to check if we are comparing with NaN, and check the NaN counter if min == max == null | need to check if we are comparing with NaN, and check the NaN counter |
| `... where id = V` (V != NaN), and column contains some NaN | if min == max == V, need to check the null/NaN counts to decide if ROWS_MUST_MATCH should be returned | no change |
| `... where id lt/lteq/gt/gteq V` (V != NaN), and column contains some NaN | if there is a NaN count, return ROWS_MIGHT_NOT_MATCH (this may result in v2 returning more files than v1) | if there is a NaN count, return ROWS_MIGHT_MATCH (this may result in v2 returning more files than v1) |
| `... where id lt/lteq/gt/gteq V` (V != NaN), and column contains only NaN | no change | if there are null/NaN counts, return ROWS_MIGHT_MATCH (this may result in v2 returning more files than v1) |

Here's an example explaining "this may result in v2 returning more files than v1": say in v1 we consistently treat NaN as the lower bound when there's any NaN value, and a file has stats distributed as below:

```
NaN-------<actual min>----<actual max>-----V-----
```

In v1, without a NaN counter, the query `where x > V` will not return this file since V is outside the bounds; in v2, however, we will return it, since we don't know how to compare with NaN.
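The inclusive-evaluator behavior described above can be sketched in plain Java. This is a standalone illustration, not the actual Iceberg evaluator; the `lower`, `upper`, and `nanCount` parameters are hypothetical stand-ins for per-file column metrics:

```java
// Standalone sketch: an inclusive metrics check for "col > V" once NaN is
// excluded from the lower/upper bounds. Returns true when the file might
// contain matching rows (ROWS_MIGHT_MATCH), false when it cannot.
public class GtMetricsSketch {
    public static boolean mightMatch(double lower, double upper, long nanCount, double v) {
        if (nanCount > 0) {
            // NaN values sit outside the bounds and compare as undefined,
            // so the file must be kept even when v is beyond the max
            return true;
        }
        // no NaN values: the upper bound is a true max, safe to prune
        return upper > v;
    }
}
```

With stats lower=1.0, upper=5.0 and v=10.0, the file is pruned only when the NaN count is zero, matching the "more files than v1" observation.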

Another change is `in`: in v2 we may need to explicitly check whether there's a NaN value in `literalSet` when comparing with the lower/upper bound.

Do the statements I made above look right? Thanks!


```java
// when there's no nanCounts information, but we already know the column only contains null,
// it's guaranteed that there's no NaN value
if (containsNullsOnly(id)) {
```
Contributor:

Should we define a similar containsNaNsOnly method to use in notNaN and for a similar use in isNull?

Contributor Author:

I didn't define it for notNaN originally because I could directly return ROWS_CANNOT_MATCH when both nanCounts and valueCounts contain this column but the numbers don't match, without going into the next block of logic (checking upper == lower == NaN and null count == 0); but that advantage no longer exists since that block needs to be removed.

But I wasn't sure if we need it for isNull: currently in isNull() we check whether nullCounts == 0 to return ROWS_CANNOT_MATCH, and I guess the only case where we'd rely on containsNaNsOnly to return ROWS_CANNOT_MATCH is when nullCounts for this column doesn't exist but nanCounts does. I personally feel the chance of this happening is small; do you think we will run into this case often?

Contributor:

I agree that the containsNaNsOnly logic will not be very useful as Yan said, but I think it is also valuable to have that private method just for readability.

Then the question reduces to: do we need to consider the case that null value metrics do not exist but NaN metrics do. For now I think the answer is no, because in all metrics modes NaN and null counters either both exist or both not exist.
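For illustration, here is a standalone sketch of what such a containsNaNsOnly helper could look like, assuming the per-column valueCounts/nanCounts map shape the metrics evaluators use (plain Java, not the actual Iceberg code):

```java
import java.util.Map;

public class NaNOnlyCheckSketch {
    // true when every value in column `id` is NaN: the NaN counter exists
    // for the column and equals the total value count
    public static boolean containsNaNsOnly(Map<Integer, Long> valueCounts,
                                           Map<Integer, Long> nanCounts, int id) {
        return valueCounts != null && nanCounts != null
            && nanCounts.containsKey(id)
            && nanCounts.get(id).equals(valueCounts.get(id));
    }
}
```

Requiring both counters to be present matches the observation above that NaN and null counters either both exist or both do not.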

Contributor Author:

Sounds good, I'll create a containsNaNsOnly for readability. Ryan, do you have comment on the other point?

Contributor:

I agree with the reasoning. If we have NaN counts, then we should have null counts. No need to over-complicate the null logic with a check for when we don't have null counts but do have NaN counts. Good catch!


```java
public static <T> UnboundPredicate<T> notNaN(UnboundTerm<T> expr) {
  return new UnboundPredicate<>(Expression.Operation.NOT_NAN, expr);
}
```
Contributor:

Do we also need to update the equality predicate to catch NaN and rewrite to isNaN?

Contributor Author (Nov 12, 2020):

I originally thought to update SparkFilters to do the rewrite, but this is a much better place. Thanks for the suggestion!

Edit: what do you think about rewriting eq within UnboundPredicate? And for rewriting in, I was thinking of letting Expressions.in do the rewrite logic of or(isNaN, in)/and(notNaN, notIn), but that means it will return Expression instead of Predicate; does that align with your thinking?

Contributor:

I do not fully understand what you mean by "rewrite logic of or(isNaN, in)/and(notNaN, notIn)" when you talk about rewriting in. Can you give some examples of what predicate are you trying to support?

Contributor Author:

Now that we want to handle NaN in the in predicate: for a query in(1, 2, NaN), to avoid checking for NaN during in evaluation all the time, we can transform it to in(1, 2) or isNaN, and notIn(1, 2, NaN) to notIn(1, 2) and notNaN. The problem is where to do that: since in and notIn are both predicates, extending them means transforming a predicate (a simpler form) into an expression (a more complex form). I don't think there's any such case in the current code base, and it would touch a lot of existing test cases.
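The transformation described here could be sketched as follows. This is a standalone illustration of the literal-partitioning step only; `InRewriteSketch` and its method names are hypothetical, and building the actual or(isNaN, in(...)) Expression tree would happen in Expressions.in:

```java
import java.util.List;
import java.util.stream.Collectors;

public class InRewriteSketch {
    // literals that go into the plain in(...) branch of the rewrite
    public static List<Double> nonNaNLiterals(List<Double> literals) {
        return literals.stream()
            .filter(v -> !v.isNaN())
            .collect(Collectors.toList());
    }

    // if any literal is NaN, the rewrite needs the or(isNaN, ...) branch
    public static boolean needsIsNaN(List<Double> literals) {
        return literals.stream().anyMatch(Double::isNaN);
    }
}
```

So in(1, 2, NaN) would partition into the literal set {1, 2} plus an isNaN branch, while in(1, 2) needs no rewrite.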

Contributor:

Okay so it's what I thought, just a bit confused by the notation.

So for eq, what is the benefit of doing it in UnboundPredicate versus just rewriting it in Expressions?

For in, I think it is a more complex question. We need to figure out:

  1. should syntax like in(1, 2, NaN) be supported, given it can be written as is_nan or in(1, 2) on the client side
  2. if so, Expressions.in should return Expression as you said, which looks fine to me because the only caller, SparkFilters.convert, also returns an Expression in the end
  3. maybe we should tackle this in another PR to keep changes concise

Contributor Author:

Thanks for the quick response! Yeah, I think the amount of change to method return types/tests is not a concern now. I just wasn't entirely sure whether rewriting eq to isNaN in Expressions would help catch problems early (compared to rewriting in UnboundPredicate), since it seems the related code won't have a chance to throw any exception until bind() is called?

Contributor:

Yeah, it isn't much earlier in that case. Maybe that actually exposes a problem with rewriting, too.

Expressions.equal("c", Double.NaN), if c is not a floating point column, would result in isNaN, which should be rejected while binding expressions. You could argue that it should rewrite to alwaysFalse instead, following the same logic as Expressions.equal("intCol", Long.MAX_VALUE) -- it can't be true.

I think that it would be better to be strict and reject binding in that case because something is clearly wrong. I think a lot of the time, that kind of error would happen when columns are misaligned or predicates are incorrectly converted.

If the result of those errors is just to fail in expression binding, then why rewrite at all? Maybe we should just reject NaN in any predicate and force people to explicitly use isNaN and notNaN. That way we do throw an exception much earlier in all cases. Plus, we wouldn't have to worry about confusion over whether NaN is equal to itself: in Java, a Double that holds NaN is equal to itself, but a primitive is not. 😕
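The boxed-versus-primitive quirk mentioned at the end is easy to verify in plain Java:

```java
// Double.equals treats NaN as equal to itself, but the primitive ==
// comparison follows IEEE 754 and does not.
public class NaNEqualityDemo {
    public static boolean primitiveEqual() {
        return Double.NaN == Double.NaN; // false
    }

    public static boolean boxedEqual() {
        return Double.valueOf(Double.NaN).equals(Double.valueOf(Double.NaN)); // true
    }
}
```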

Contributor Author:

Thanks, those are some good points! To make sure I understand correctly/know how to move forward, I have some questions:

  • If I understand correctly, rejecting NaN in any predicate sounds like going back to the idea of rewriting equals in SparkFilters (or in general, at the integration point with engines during query-to-expression translation); or maybe even earlier, letting engines support the is NaN syntax?
  • Since knowing whether a query is eligible to be translated to isNaN requires somewhere that ensures the type is either double or float, and in the Iceberg code base we only know this during binding, are we able to rely on the engine to do this check before translating the query to an Expression?
  • And it seems this may only impact eq, since we decided to do input validation on lt/lteq/gt/gteq and in anyway?
  • And if we start to throw exceptions when code passes NaN to eq, that may be backward incompatible until the engine starts to rewrite NaN?

I guess the conversation is starting to get too detailed, if you wouldn't mind I'll try to follow up on Slack tomorrow and then post the conclusion here?

Contributor:

> If I understand correctly, to reject NaN in any predicate sounds like we might go back to the idea of rewriting equals in SparkFilters

Yes. If the engine generally uses d = NaN then we can convert that to isNaN. But that would be engine-dependent and the Iceberg expression API would not support equals with NaN.

> are we able to rely on engine to do this check before translating query to Expression?

I think so. Most engines will optimize the SQL expressions and handle this already. If not, then it would result in an exception from Iceberg to the user. I think that's okay, too, because as I said above, we want to fail if a NaN is used in an expression with a non-floating-point column, not rewrite to false.

> And seems like this may only impact eq as we decided to do input validation on other lt/lteq/gt/gteq and in anyway?

Yes. This makes all of the handling in Expressions consistent: always reject NaN values.

> that may sound backward incompatible until the engine starts to rewrite NaN?

I'm not convinced either way. You could argue that d = NaN is ambiguous and that rejecting it is now fixing a bug. That's certainly the case with d > NaN, which is not defined. On the other hand, there was some behavior before that will now no longer work. So I'd be up for fixing this in the Flink and Spark conversions as soon as we can.

Feel free to ping me on Slack!

Contributor Author:

Thank you for the explanation! I think I now understand the full picture. I've addressed everything except rewriting in SparkFilters and the other engines; since this PR is already too big, I'll submit a separate PR for that (likely next week).

```java
// containsNull encodes whether at least one partition value is null, lowerBound is null if
// all partition values are null.
ByteBuffer lowerBound = stats.get(pos).lowerBound();
if (lowerBound == null) {
```
Contributor:

To be safe, I think this should validate that containsNull is true.

Contributor Author:

You mean check that both containsNull and stats.get(pos).lowerBound() == null are true? When would lowerBound be null while the column doesn't contain null? I guess I'll also need to update notNull for this (since I copied the logic from there)?

Contributor:

Seems like #1803 is missing PartitionFieldSummary.containsNaN(), or is it in some other PR?

Contributor Author:

That should be doable, although I originally considered the scope of NaN support to be only at the manifest entry level; I wasn't sure if we want to extend it beyond that?

Contributor:

I'm not sure of a case where it would happen, but containsNull is the source of truth for whether there are null values, not a missing bound value.
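A standalone sketch of the defensive check being suggested; `FieldSummary` here is a hypothetical stand-in for PartitionFieldSummary, not the real class:

```java
public class NullOnlyPartitionSketch {
    static class FieldSummary {
        final boolean containsNull;
        final byte[] lowerBound; // null when all partition values are null
        FieldSummary(boolean containsNull, byte[] lowerBound) {
            this.containsNull = containsNull;
            this.lowerBound = lowerBound;
        }
    }

    // require both signals to agree before concluding the partition holds
    // only nulls: containsNull is the source of truth, a missing bound is
    // only corroborating evidence
    public static boolean allValuesNull(FieldSummary summary) {
        return summary.lowerBound == null && summary.containsNull;
    }
}
```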

Contributor Author:

Sounds good, I'll add the containsNull check both here and in notNull. And it looks like we do want to update PartitionFieldSummary; I'll do that in a separate PR.

```java
@@ -214,6 +218,36 @@ public void testInTimestamp() {
    Assert.assertEquals("Residual should be alwaysFalse", alwaysFalse(), residual);
  }

  @Test
  public void testInNaN() {
```
Contributor:

Do we need NaN cases for other evaluators?

Contributor Author:

You mean testing the inNaN case for other evaluators? Yeah, I'll do that.

Contributor Author:

Since we decided not to accept NaN in in, I guess this conversation is outdated. I've removed the inNaN test from here.


rdblue commented Nov 11, 2020

> I was thinking how we should change metrics evaluators when we exclude NaN from upper/lower bounds. Here's a table . . .

I think we should not produce predicates that use NaN as a literal for comparison. We can easily rewrite equal and notEqual to isNaN and notNaN. We can also rewrite in and notIn to or(in(non-NaNs), isNaN) or and(notIn(non-NaNs), notNaN). Then inequalities would either be converted to alwaysFalse or throw an exception because we don't accept the predicate. I'd lean toward throwing an exception if someone uses floatCol < NaN.


yyanyy commented Nov 12, 2020

> I was thinking how we should change metrics evaluators when we exclude NaN from upper/lower bounds. Here's a table . . .
>
> I think we should not produce predicates that use NaN as a literal for comparison. We can easily rewrite equal and notEqual to isNaN and notNaN. We can also rewrite in and notIn to or(in(non-NaNs), isNaN) or and(notIn(non-NaNs), notNaN). Then inequalities would either be converted to alwaysFalse or throw an exception because we don't accept the predicate. I'd lean toward throwing an exception if someone uses floatCol < NaN.

Thank you for all the comments! I'll update Expressions to include rewritings in this PR.

Do you have comments on the case of "this may result in v2 returning more files than v1", where the literal is not NaN but the data being compared contains NaN? We might need to accept that to keep the behavior of comparing with NaN consistent across different files?


rdblue commented Nov 12, 2020

> Do you have comment on the case of "this may result in v2 returning more files than v1" when literal is not NaN but the data to be compared have NaN? We might need to accept that to keep behavior of comparing with NaN consistent across different files?

I don't think this is a v2 problem; it is a bug in how we currently handle NaN, right?






```java
  return dictionary.stream().allMatch(NaNUtil::isNaN) ? ROWS_CANNOT_MATCH : ROWS_MIGHT_MATCH;
}

private <T> Comparator<T> comparatorForNaNPredicate(BoundReference<T> ref) {
```
Contributor:

You can also use Comparators.forType. The dictionary cannot contain null values so there is no need to wrap for null handling.
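For illustration, the dictionary-based notNaN decision in the snippet above reduces to this standalone check (plain Java; the real code uses NaNUtil and the ROWS_CANNOT_MATCH/ROWS_MIGHT_MATCH constants):

```java
import java.util.List;

public class DictionaryNotNaNSketch {
    // true when the file might contain rows satisfying notNaN; false
    // (ROWS_CANNOT_MATCH) when every dictionary entry is NaN. Dictionaries
    // never contain null, so no null-safe wrapping is needed.
    public static boolean rowsMightMatch(List<Double> dictionary) {
        return !dictionary.stream().allMatch(v -> v.isNaN());
    }
}
```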

Contributor Author:

Thanks for the info! I'll update this.


yyanyy commented Nov 26, 2020

> Do you have comment on the case of "this may result in v2 returning more files than v1" when literal is not NaN but the data to be compared have NaN? We might need to accept that to keep behavior of comparing with NaN consistent across different files?
>
> I don't think this is a v2 problem, it is a bug in how we currently handle NaN right?

Thanks for pointing this out! After thinking about it, I realized my original concern probably isn't a problem. My concern was that making v2 return exactly the same results as v1 for NaN comparisons would require extra effort, since the behavior of the metrics evaluators changes. However, comparing with NaN is actually an invalid operation, and regardless of how each individual engine treats it (e.g. I think Spark considers NaN the max value, so for a column col containing NaNs, where col > 0 always returns the NaN records), that should be fixed on the engine side.

@yyanyy yyanyy mentioned this pull request Dec 2, 2020

rdblue commented Dec 2, 2020

I plan to take another look at this tomorrow.

```java
public static <T> UnboundPredicate<T> lessThan(String name, T value) {
  validateInput("lessThan", value);
```
Contributor:

An easier way to do this is to add the check in Literal.from. That's where Iceberg enforces that the value cannot be null. Since a literal is created for every value that is passed in, we would only need to change that one place instead of all of the factory methods here.

It also ensures that we don't add factory methods later and forget to add the check to them.
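A standalone sketch of such a centralized check; `rejectNaN` is a hypothetical name, and in the actual change the validation would live inside Literal.from rather than in each factory method:

```java
public class LiteralGuardSketch {
    // reject NaN once, where literals are created, instead of repeating the
    // check in every predicate factory method
    public static <T> T rejectNaN(T value) {
        boolean isNaN = (value instanceof Double && ((Double) value).isNaN())
            || (value instanceof Float && ((Float) value).isNaN());
        if (isNaN) {
            throw new IllegalArgumentException(
                "Cannot create a literal from NaN; use isNaN or notNaN predicates instead");
        }
        return value;
    }
}
```

Placing the check here means any future factory method that builds a literal picks up the validation for free.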

Contributor Author:

Thank you! I didn't notice Literals.from was called within the constructor of UnboundPredicate when a normal object is passed in. This is definitely much cleaner! I have created #1892 to address this.

Thank you so much for your time reviewing this long PR!

@rdblue rdblue merged commit fab4a5f into apache:master Dec 6, 2020

rdblue commented Dec 6, 2020

Thanks @yyanyy! I only had one thing that I would change, but we can do that in a follow-up since this is such a big PR. Thank you for fixing this, great work!
