
rewrite equals filters #1857

Merged: 4 commits merged into apache:master on Dec 10, 2020

Conversation

@yyanyy (Contributor) commented on Dec 2, 2020:

This PR is based on #1747. Currently it doesn't compile, but I have confirmed locally that the project builds and all tests pass after rebasing it on top of #1747; a few files from #1747 are required for the tests to work. I will rebase the change and mark it ready for review once #1747 is merged.

  } else {
    EqualNullSafe eq = (EqualNullSafe) filter;
    if (eq.value() == null) {
      return isNull(eq.attribute());
    } else {
-     return equal(eq.attribute(), convertLiteral(eq.value()));
+     return handleEqual(eq.attribute(), eq.value());
Contributor:

Why is this not directly inside Expressions.equal, so we can avoid duplication between Spark 2 and 3?

Contributor (author):

I thought the conclusion we reached in this thread was to reject NaN in any predicate and let SparkFilters do the rewrites?

Contributor:

Yes, I agree. Rewriting filters should be done in translation to Iceberg so that we have simpler behavior and strong assumptions.
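For context, a hypothetical illustration of what arrives at the translation layer, using Spark's public filter classes (the column name is made up):

```java
import org.apache.spark.sql.sources.EqualTo;

// Spark pushes "doubleVal = NaN" down as an EqualTo filter whose value is
// Double.NaN; the idea above is that SparkFilters rewrites it to Iceberg's
// isNaN predicate during translation instead of building an equality on NaN.
EqualTo pushed = new EqualTo("doubleVal", Double.NaN);
```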

@@ -49,8 +49,8 @@ public TestSelect(String catalogName, String implementation, Map<String, String>

Contributor:

I think we should also try to test this for Spark 2; maybe update some tests in TestReadProjection?

Contributor (author):

Yeah, I actually also spent some time on this but wasn't able to find a good place to add it in Spark 2, and later gave up, thinking that the added logic was relatively simple anyway. To me, TestReadProjection is more about testing projection, which is not what we are doing here. I guess I'll create a TestSelect in the Spark 2 test suite by duplicating this class, then.

Contributor (author):

Added a TestSelect in Spark 2 by basically duplicating logic from the same class in Spark 3; although, apart from basic sanity testing, I'm not sure how helpful the tests are, as some of the logic for examining pushed-down filters only exists in Spark 3...

@rdblue (Contributor) commented on Dec 6, 2020:

I merged #1747, so you can rebase this. Thanks!

@@ -177,4 +179,13 @@ private static Object convertLiteral(Object value) {
    }
    return value;
  }
+
+  private static Expression handleEqual(String attribute, Object value) {
+    Object literal = convertLiteral(value);
Contributor:

This should be moved into the else block, because Literal should not allow creating a NaN literal.
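A sketch of the shape being suggested, assuming Iceberg's NaNUtil.isNaN helper and the Expressions factory methods already used in this file; this is an illustration, not the exact merged code:

```java
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.util.NaNUtil;

import static org.apache.iceberg.expressions.Expressions.equal;
import static org.apache.iceberg.expressions.Expressions.isNaN;

// Rewrite "attribute = NaN" to isNaN(attribute), and only convert the
// literal in the non-NaN branch so a NaN literal is never created.
private static Expression handleEqual(String attribute, Object value) {
  if (NaNUtil.isNaN(value)) {
    return isNaN(attribute);
  } else {
    return equal(attribute, convertLiteral(value));
  }
}
```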

@yyanyy force-pushed the nan_expression_rewriting branch from 445ee6d to 730ce5b on Dec 8, 2020
@yyanyy marked this pull request as ready for review on Dec 8, 2020

Assert.assertEquals("Should create only one scan", 1, scanEventCount);
Assert.assertEquals("Should push down expected filter",
"(float IS NOT NULL AND float = NaN)",
Contributor:

Shouldn't this be is_nan(float) instead of = NaN?

Contributor (author):

This is because DescribeExpressionVisitor translates is_nan to = NaN here. Do you want me to change this to is_nan(float)?

Contributor:

Yes, I think so. The description shouldn't produce a predicate that we don't support!
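For illustration only, a minimal sketch of the kind of rendering change discussed here; the operation names mirror Iceberg's Expression.Operation, but this helper is hypothetical rather than the actual DescribeExpressionVisitor code:

```java
import org.apache.iceberg.expressions.Expression.Operation;

// Hypothetical describe helper: render NaN checks in the supported
// function-style form instead of the unsupported "= NaN" comparison.
static String describePredicate(Operation op, String column) {
  switch (op) {
    case IS_NULL:
      return column + " IS NULL";
    case IS_NAN:
      return "is_nan(" + column + ")";
    case NOT_NAN:
      return "not_nan(" + column + ")";
    default:
      throw new UnsupportedOperationException("Cannot describe operation: " + op);
  }
}
```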


Assert.assertEquals("Should return all expected rows", expected,
sql("SELECT * FROM table where doubleVal = double('NaN')"));
Assert.assertEquals("Should create only one scan", 1, scanEventCount);
Contributor:

Shouldn't this validate more than just the number of scans?

Contributor (author):

Yes, sorry, I forgot to revisit this after cleaning up other changes. Since in Spark 2 we don't have Spark3Util.describe(), I wasn't sure at which level we want to assert the expression so that we can still have test coverage without being too coupled to the internal implementation. Let me know what you think of the updated test!

Contributor:

Looks good!

  }

  private List<Record> sql(String str) {
    List<Row> rows = spark.sql(str).collectAsList();
Contributor:

This seems brittle because it uses types to place the results.

Other tests use StructProjection and StructLikeSet for similar validations. The incoming row is wrapped to be a StructLike and added to a StructLikeSet based on the expected schema. Then another StructLikeSet is created with the expected rows, which are projected using StructProjection and the expected schema. That is a cleaner way to do this, I think.
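A rough sketch of that pattern, assuming Iceberg's StructLikeSet and StructProjection utilities; the method and parameter names here are illustrative:

```java
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.util.StructLikeSet;
import org.apache.iceberg.util.StructProjection;

// Project each expected record to the result schema and collect into a
// StructLikeSet; building a second set from the actual rows the same way
// makes the comparison independent of column order and extra columns.
static StructLikeSet projectedSet(Schema tableSchema, Schema resultSchema, List<Record> records) {
  StructLikeSet set = StructLikeSet.create(resultSchema.asStruct());
  for (Record record : records) {
    // use a fresh projection per record so each set entry wraps its own row
    set.add(StructProjection.create(tableSchema, resultSchema).wrap(record));
  }
  return set;
}
```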

Contributor:

Looks like this uses a Java bean record class, so you could also rely on Spark to convert to your record class, and then use a special comparison function to only compare expected columns.
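A minimal sketch of that bean-encoder route, assuming the test's Record class is a standard Java bean:

```java
import java.util.List;
import org.apache.spark.sql.Encoders;

// Let Spark map result columns onto the Record bean by property name,
// instead of placing values by type.
private List<Record> sql(String str) {
  return spark.sql(str).as(Encoders.bean(Record.class)).collectAsList();
}
```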

@yyanyy (Contributor, author) commented on Dec 9, 2020:

Sounds good. I wanted to scope this Record class to only this class's use cases, but this is definitely not clean. I changed this to use Spark to convert to a Java bean, but encountered an issue similar to the one described in this post: when projecting a subset of columns, the conversion fails due to missing expected columns. Since in this class I'm only projecting one column with a primitive type, I convert the data frame to that specific class instead. Please let me know if you know a better way of doing this!
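A sketch of that single-column workaround, assuming the projected column is a double; the query string is illustrative:

```java
import java.util.List;
import org.apache.spark.sql.Encoders;

// With only one primitive column projected, decode it directly with a
// primitive encoder instead of going through the Record bean.
List<Double> values = spark.sql("SELECT doubleVal FROM table")
    .as(Encoders.DOUBLE())
    .collectAsList();
```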

@rdblue merged commit 04e73de into apache:master on Dec 10, 2020
@rdblue (Contributor) commented on Dec 10, 2020:

Thanks, @yyanyy! Looks good.
