
API: add isNaN and notNaN predicates #1747

Merged (6 commits) on Dec 6, 2020
Conversation


@yyanyy yyanyy commented Nov 10, 2020

This change adds isNaN/notNaN predicates. Metrics evaluators currently work against a null NaN counter, which will be populated in a later PR. Updating SparkFilters.convert to direct queries to the isNaN/notNaN methods will also come in a later PR.


yyanyy commented Nov 10, 2020

(not related to this change itself)
I was thinking about how we should change the metrics evaluators when we exclude NaN from upper/lower bounds. Here's a table summarizing the changes we have to make:

| predicate | strict evaluator | inclusive evaluator |
| --- | --- | --- |
| `... where id lteq/gteq NaN` | need to check if we are comparing with NaN, and check the NaN counter if min == max == null | need to check if we are comparing with NaN, and check the NaN counter |
| `... where id = V` (V != NaN), and column contains some NaN | if min == max == V, need to check the null/NaN counts to decide if ROWS_MUST_MATCH should be returned | no change |
| `... where id lt/lteq/gt/gteq V` (V != NaN), and column contains some NaN | if there is a NaN count, return ROWS_MIGHT_NOT_MATCH (this may result in v2 returning more files than v1) | if there is a NaN count, return ROWS_MIGHT_MATCH (this may result in v2 returning more files than v1) |
| `... where id lt/lteq/gt/gteq V` (V != NaN), and column contains only NaN | no change | if there are null/NaN counts, return ROWS_MIGHT_MATCH (this may result in v2 returning more files than v1) |

Here's an example explaining "this may result in v2 returning more files than v1": say in v1 we consistently treat NaN as the lower bound when there's any NaN value, and a file has stats distributed as below:

```
NaN-------<actual min>----<actual max>-----V-----
```

In v1, without a NaN counter, the query `where x > V` will not return this file since V is outside the bounds; in v2, however, we will return it, since we don't know how to compare with NaN.
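The inclusive-evaluator behavior described above can be sketched in plain Java. This is a standalone illustration, not the actual Iceberg evaluator; the `lower`, `upper`, and `nanCount` parameters are hypothetical stand-ins for per-file column metrics:

```java
// Standalone sketch: an inclusive metrics check for "col > V" once NaN is
// excluded from the lower/upper bounds. Returns true when the file might
// contain matching rows (ROWS_MIGHT_MATCH), false when it cannot.
public class GtMetricsSketch {
    public static boolean mightMatch(double lower, double upper, long nanCount, double v) {
        if (nanCount > 0) {
            // NaN values sit outside the bounds and compare as undefined,
            // so the file must be kept even when v is beyond the max
            return true;
        }
        // no NaN values: the upper bound is a true max, safe to prune
        return upper > v;
    }
}
```

With stats lower=1.0, upper=5.0 and v=10.0, the file is pruned only when the NaN count is zero, matching the "more files than v1" observation.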

Another change is `in`: in v2 we may need to explicitly check whether there's a NaN value in `literalSet` when comparing with the lower/upper bound.

Do the statements I made above look right? Thanks!


```java
// when there's no nanCounts information, but we already know the column only contains null,
// it's guaranteed that there's no NaN value
if (containsNullsOnly(id)) {
```
Contributor:

Should we define a similar containsNaNsOnly method to use in notNaN and for a similar use in isNull?

Contributor Author:

I didn't define it for notNaN originally because I could directly return ROWS_CANNOT_MATCH when both nanCounts and valueCounts contain this column but the numbers don't match, without going into the next block of logic (checking upper == lower == NaN and null count == 0); but that advantage no longer exists since that block needs to be removed.

But I wasn't sure if we need it for isNull: currently in isNull() we check whether nullCounts == 0 to return ROWS_CANNOT_MATCH, and I guess the only case where we'd rely on containsNaNsOnly to return ROWS_CANNOT_MATCH is when nullCounts for this column doesn't exist but nanCounts does. I personally feel the chance of this happening is small; do you think we will run into this case often?

Contributor:

I agree that the containsNaNsOnly logic will not be very useful as Yan said, but I think it is also valuable to have that private method just for readability.

Then the question reduces to: do we need to consider the case that null value metrics do not exist but NaN metrics do. For now I think the answer is no, because in all metrics modes NaN and null counters either both exist or both not exist.
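For illustration, here is a standalone sketch of what such a containsNaNsOnly helper could look like, assuming the per-column valueCounts/nanCounts map shape the metrics evaluators use (plain Java, not the actual Iceberg code):

```java
import java.util.Map;

public class NaNOnlyCheckSketch {
    // true when every value in column `id` is NaN: the NaN counter exists
    // for the column and equals the total value count
    public static boolean containsNaNsOnly(Map<Integer, Long> valueCounts,
                                           Map<Integer, Long> nanCounts, int id) {
        return valueCounts != null && nanCounts != null
            && nanCounts.containsKey(id)
            && nanCounts.get(id).equals(valueCounts.get(id));
    }
}
```

Requiring both counters to be present matches the observation above that NaN and null counters either both exist or both do not.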

Contributor Author:

Sounds good, I'll create a containsNaNsOnly for readability. Ryan, do you have comment on the other point?

Contributor:

I agree with the reasoning. If we have NaN counts, then we should have null counts. No need to over-complicate the null logic with a check for when we don't have null counts but do have NaN counts. Good catch!


```java
public static <T> UnboundPredicate<T> notNaN(UnboundTerm<T> expr) {
  return new UnboundPredicate<>(Expression.Operation.NOT_NAN, expr);
}
```
Contributor:

Do we also need to update the equality predicate to catch NaN and rewrite to isNaN?

Contributor Author (Nov 12, 2020):

I originally thought to update SparkFilters to do the rewrite, but this is a much better place. Thanks for the suggestion!

Edit: what do you think about rewriting eq within UnboundPredicate? And for rewriting in, I was thinking of letting Expressions.in do the rewrite logic of or(isNaN, in)/and(notNaN, notIn), but that means it will return Expression instead of Predicate; does that align with your thinking?

Contributor:

I do not fully understand what you mean by "rewrite logic of or(isNaN, in)/and(notNaN, notIn)" when you talk about rewriting in. Can you give some examples of what predicate are you trying to support?

Contributor Author:

Now that we want to handle NaN in the in predicate: for a query in(1, 2, NaN), to avoid checking for NaN during in evaluation all the time, we can transform it to in(1, 2) or isNaN, and notIn(1, 2, NaN) to notIn(1, 2) and notNaN. The problem is where to do that: since in and notIn are both predicates, extending them means transforming a predicate (a simpler form) into an expression (a more complex form). I don't think there's any such case in the current code base, and it would touch a lot of existing test cases.
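The transformation described here could be sketched as follows. This is a standalone illustration of the literal-partitioning step only; `InRewriteSketch` and its method names are hypothetical, and building the actual or(isNaN, in(...)) Expression tree would happen in Expressions.in:

```java
import java.util.List;
import java.util.stream.Collectors;

public class InRewriteSketch {
    // literals that go into the plain in(...) branch of the rewrite
    public static List<Double> nonNaNLiterals(List<Double> literals) {
        return literals.stream()
            .filter(v -> !v.isNaN())
            .collect(Collectors.toList());
    }

    // if any literal is NaN, the rewrite needs the or(isNaN, ...) branch
    public static boolean needsIsNaN(List<Double> literals) {
        return literals.stream().anyMatch(Double::isNaN);
    }
}
```

So in(1, 2, NaN) would partition into the literal set {1, 2} plus an isNaN branch, while in(1, 2) needs no rewrite.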

Contributor:

Okay so it's what I thought, just a bit confused by the notation.

So for eq, what is the benefit of doing it in UnboundPredicate versus just rewriting it in Expressions?

For in, I think it is a more complex question. We need to figure out:

  1. should syntax like in(1, 2, NaN) be supported, given it can be written as is_nan or in(1, 2) on the client side
  2. if so, Expressions.in should return Expression as you said, which looks fine to me because the only caller, SparkFilters.convert, also returns an Expression in the end
  3. maybe we should tackle this in another PR to keep changes concise

Contributor Author:

Thanks for the quick response! Yeah, I think the amount of change to method return types/tests is not a concern now. I just wasn't entirely sure whether rewriting eq to isNaN in Expressions would help catch problems early (compared to rewriting in UnboundPredicate), since it seems the related code won't have a chance to throw any exception until bind() is called?

Contributor:

Yeah, it isn't much earlier in that case. Maybe that actually exposes a problem with rewriting, too.

Expressions.equal("c", Double.NaN), if c is not a floating point column, would result in isNaN, which should be rejected while binding expressions. You could argue that it should rewrite to alwaysFalse instead, following the same logic as Expressions.equal("intCol", Long.MAX_VALUE) -- it can't be true.

I think that it would be better to be strict and reject binding in that case because something is clearly wrong. I think a lot of the time, that kind of error would happen when columns are misaligned or predicates are incorrectly converted.

If the result of those errors is just to fail in expression binding, then why rewrite at all? Maybe we should just reject NaN in any predicate and force people to explicitly use isNaN and notNaN. That way we do throw an exception much earlier in all cases. Plus, we wouldn't have to worry about confusion over whether NaN is equal to itself: in Java, a Double that holds NaN is equal to itself, but a primitive is not. 😕
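The boxed-versus-primitive quirk mentioned at the end is easy to verify in plain Java:

```java
// Double.equals treats NaN as equal to itself, but the primitive ==
// comparison follows IEEE 754 and does not.
public class NaNEqualityDemo {
    public static boolean primitiveEqual() {
        return Double.NaN == Double.NaN; // false
    }

    public static boolean boxedEqual() {
        return Double.valueOf(Double.NaN).equals(Double.valueOf(Double.NaN)); // true
    }
}
```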

Contributor Author:

Thanks, those are some good points! To make sure I understand correctly/know how to move forward, I have some questions:

  • If I understand correctly, rejecting NaN in any predicate sounds like going back to the idea of rewriting equals in SparkFilters (or in general, at the integration point with engines during query-to-expression translation); or maybe even earlier, letting engines support the is NaN syntax?
  • Since knowing whether a query is eligible to be translated to isNaN requires somewhere that ensures the type is either double or float, and in the Iceberg code base we only know this during binding, are we able to rely on the engine to do this check before translating the query to an Expression?
  • And it seems this may only impact eq, since we decided to do input validation on lt/lteq/gt/gteq and in anyway?
  • And if we start to throw exceptions when code passes NaN to eq, that may be backward incompatible until the engine starts to rewrite NaN?

I guess the conversation is starting to get too detailed, if you wouldn't mind I'll try to follow up on Slack tomorrow and then post the conclusion here?

Contributor:

> If I understand correctly, to reject NaN in any predicate sounds like we might go back to the idea of rewriting equals in SparkFilters

Yes. If the engine generally uses d = NaN then we can convert that to isNaN. But that would be engine-dependent and the Iceberg expression API would not support equals with NaN.

> are we able to rely on engine to do this check before translating query to Expression?

I think so. Most engines will optimize the SQL expressions and handle this already. If not, then it would result in an exception from Iceberg to the user. I think that's okay, too, because as I said above, we want to fail if a NaN is used in an expression with a non-floating-point column, not rewrite to false.

> And seems like this may only impact eq as we decided to do input validation on other lt/lteq/gt/gteq and in anyway?

Yes. This makes all of the handling in Expressions consistent: always reject NaN values.

> that may sound backward incompatible until the engine starts to rewrite NaN?

I'm not convinced either way. You could argue that d = NaN is ambiguous and that rejecting it is now fixing a bug. That's certainly the case with d > NaN, which is not defined. On the other hand, there was some behavior before that will now no longer work. So I'd be up for fixing this in the Flink and Spark conversions as soon as we can.

Feel free to ping me on Slack!

Contributor Author:

Thank you for the explanation! I think I now understand the full picture. I've addressed everything except rewriting in SparkFilters and the other engines; since this PR is already too big, I'll submit a separate PR for that (likely next week).

```java
// containsNull encodes whether at least one partition value is null, lowerBound is null if
// all partition values are null.
ByteBuffer lowerBound = stats.get(pos).lowerBound();
if (lowerBound == null) {
```
Contributor:

To be safe, I think this should validate that containsNull is true.

Contributor Author:

You mean check that both containsNull and stats.get(pos).lowerBound() == null are true? When would lowerBound be null while the column doesn't contain null? I guess I'll also need to update notNull for this (since I copied the logic from there)?

Contributor:

Seems like #1803 is missing PartitionFieldSummary.containsNaN(), or is it in some other PR?

Contributor Author:

That should be doable, although I originally considered the scope of NaN support to be only at the manifest entry level; I wasn't sure if we want to extend it beyond that?

Contributor:

I'm not sure of a case where it would happen, but containsNull is the source of truth for whether there are null values, not a missing bound value.
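A standalone sketch of the defensive check being suggested; `FieldSummary` here is a hypothetical stand-in for PartitionFieldSummary, not the real class:

```java
public class NullOnlyPartitionSketch {
    static class FieldSummary {
        final boolean containsNull;
        final byte[] lowerBound; // null when all partition values are null
        FieldSummary(boolean containsNull, byte[] lowerBound) {
            this.containsNull = containsNull;
            this.lowerBound = lowerBound;
        }
    }

    // require both signals to agree before concluding the partition holds
    // only nulls: containsNull is the source of truth, a missing bound is
    // only corroborating evidence
    public static boolean allValuesNull(FieldSummary summary) {
        return summary.lowerBound == null && summary.containsNull;
    }
}
```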

Contributor Author:

Sounds good, I'll add the containsNull check both here and in notNull. And it looks like we do want to update PartitionFieldSummary; I'll do that in a separate PR.

```java
@@ -214,6 +218,36 @@ public void testInTimestamp() {
    Assert.assertEquals("Residual should be alwaysFalse", alwaysFalse(), residual);
  }

  @Test
  public void testInNaN() {
```
Contributor:

Do we need NaN cases for other evaluators?

Contributor Author:

You mean testing the inNaN case for other evaluators? Yeah, I'll do that.

Contributor Author:

Since we decided not to accept NaN in in, I guess this conversation is outdated. I've removed the inNaN test from here.


rdblue commented Nov 11, 2020

> I was thinking how we should change metrics evaluators when we exclude NaN from upper/lower bounds. Here's a table . . .

I think we should not produce predicates that use NaN as a literal for comparison. We can easily rewrite equal and notEqual to isNaN and notNaN. We can also rewrite in and notIn to or(in(non-NaNs), isNaN) or and(notIn(non-NaNs), notNaN). Then inequalities would either be converted to alwaysFalse or throw an exception because we don't accept the predicate. I'd lean toward throwing an exception if someone uses floatCol < NaN.


yyanyy commented Nov 12, 2020

> I was thinking how we should change metrics evaluators when we exclude NaN from upper/lower bounds. Here's a table . . .
>
> I think we should not produce predicates that use NaN as a literal for comparison. We can easily rewrite equal and notEqual to isNaN and notNaN. We can also rewrite in and notIn to or(in(non-NaNs), isNaN) or and(notIn(non-NaNs), notNaN). Then inequalities would either be converted to alwaysFalse or throw an exception because we don't accept the predicate. I'd lean toward throwing an exception if someone uses floatCol < NaN.

Thank you for all the comments! I'll update Expressions to include rewritings in this PR.

Do you have comments on the case of "this may result in v2 returning more files than v1", where the literal is not NaN but the data being compared contains NaN? We might need to accept that to keep the behavior of comparing with NaN consistent across different files?


rdblue commented Nov 12, 2020

> Do you have comment on the case of "this may result in v2 returning more files than v1" when literal is not NaN but the data to be compared have NaN? We might need to accept that to keep behavior of comparing with NaN consistent across different files?

I don't think this is a v2 problem; it is a bug in how we currently handle NaN, right?






```java
  return dictionary.stream().allMatch(NaNUtil::isNaN) ? ROWS_CANNOT_MATCH : ROWS_MIGHT_MATCH;
}

private <T> Comparator<T> comparatorForNaNPredicate(BoundReference<T> ref) {
```
Contributor:

You can also use Comparators.forType. The dictionary cannot contain null values so there is no need to wrap for null handling.
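For illustration, the dictionary-based notNaN decision in the snippet above reduces to this standalone check (plain Java; the real code uses NaNUtil and the ROWS_CANNOT_MATCH/ROWS_MIGHT_MATCH constants):

```java
import java.util.List;

public class DictionaryNotNaNSketch {
    // true when the file might contain rows satisfying notNaN; false
    // (ROWS_CANNOT_MATCH) when every dictionary entry is NaN. Dictionaries
    // never contain null, so no null-safe wrapping is needed.
    public static boolean rowsMightMatch(List<Double> dictionary) {
        return !dictionary.stream().allMatch(v -> v.isNaN());
    }
}
```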

Contributor Author:

Thanks for the info! I'll update this.


yyanyy commented Nov 26, 2020

> Do you have comment on the case of "this may result in v2 returning more files than v1" when literal is not NaN but the data to be compared have NaN? We might need to accept that to keep behavior of comparing with NaN consistent across different files?
>
> I don't think this is a v2 problem, it is a bug in how we currently handle NaN right?

Thanks for pointing this out! After thinking about it, I realized my original concern probably isn't a problem. My concern was that making v2 return exactly the same results as v1 for NaN comparisons would require extra effort, since the behavior of the metrics evaluators changes. However, comparing with NaN is actually an invalid operation, and regardless of how each individual engine treats it (e.g. I think Spark considers NaN the max value, so for a column col containing NaNs, where col > 0 always returns the NaN records), that should be fixed on the engine side.

@yyanyy yyanyy mentioned this pull request Dec 2, 2020

rdblue commented Dec 2, 2020

I plan to take another look at this tomorrow.

```java
public static <T> UnboundPredicate<T> lessThan(String name, T value) {
  validateInput("lessThan", value);
```
Contributor:

An easier way to do this is to add the check in Literal.from. That's where Iceberg enforces that the value cannot be null. Since a literal is created for every value that is passed in, we would only need to change that one place instead of all of the factory methods here.

It also ensures that we don't add factory methods later and forget to add the check to them.
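A standalone sketch of such a centralized check; `rejectNaN` is a hypothetical name, and in the actual change the validation would live inside Literal.from rather than in each factory method:

```java
public class LiteralGuardSketch {
    // reject NaN once, where literals are created, instead of repeating the
    // check in every predicate factory method
    public static <T> T rejectNaN(T value) {
        boolean isNaN = (value instanceof Double && ((Double) value).isNaN())
            || (value instanceof Float && ((Float) value).isNaN());
        if (isNaN) {
            throw new IllegalArgumentException(
                "Cannot create a literal from NaN; use isNaN or notNaN predicates instead");
        }
        return value;
    }
}
```

Placing the check here means any future factory method that builds a literal picks up the validation for free.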

Contributor Author:

Thank you! I didn't notice Literals.from was called within the constructor of UnboundPredicate when a normal object is passed in. This is definitely much cleaner! I have created #1892 to address this.

Thank you so much for your time reviewing this long PR!

@rdblue rdblue merged commit fab4a5f into apache:master Dec 6, 2020

rdblue commented Dec 6, 2020

Thanks @yyanyy! I only had one thing that I would change, but we can do that in a follow-up since this is such a big PR. Thank you for fixing this, great work!
