
rewrite equals filters #1857

Merged: 4 commits merged into apache:master on Dec 10, 2020

Conversation

@yyanyy (Contributor) commented on Dec 2, 2020:

This PR is based on #1747. Currently it doesn't compile, but I have confirmed locally that the project builds and all tests pass after rebasing it on top of #1747; a few files from #1747 are required for the tests to work. I will rebase the change and mark it ready for review once #1747 is merged.

  } else {
    EqualNullSafe eq = (EqualNullSafe) filter;
    if (eq.value() == null) {
      return isNull(eq.attribute());
    } else {
-     return equal(eq.attribute(), convertLiteral(eq.value()));
+     return handleEqual(eq.attribute(), eq.value());
Contributor:

Why is this not directly inside Expressions.equal, so we can avoid duplication between Spark 2 and 3?

Contributor (author):

I thought the conclusion we reached in this thread was to reject NaN in any predicate and let SparkFilters do the rewrites?

Contributor:

Yes, I agree. Rewriting filters should be done in translation to Iceberg so that we have simpler behavior and strong assumptions.
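For context, a hypothetical illustration of what arrives at the translation layer, using Spark's public filter classes (the column name is made up):

```java
import org.apache.spark.sql.sources.EqualTo;

// Spark pushes "doubleVal = NaN" down as an EqualTo filter whose value is
// Double.NaN; the idea above is that SparkFilters rewrites it to Iceberg's
// isNaN predicate during translation instead of building an equality on NaN.
EqualTo pushed = new EqualTo("doubleVal", Double.NaN);
```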

@@ -49,8 +49,8 @@ public TestSelect(String catalogName, String implementation, Map<String, String>

Contributor:

I think we should also try to test this for Spark 2; maybe update some tests in TestReadProjection?

Contributor (author):

Yeah, I actually also spent some time on this but wasn't able to find a good place to add it in Spark 2, and later gave up, thinking that the added logic was relatively simple anyway. To me, TestReadProjection is more about testing projection, which is not what we are doing here. I guess I'll create a TestSelect in the Spark 2 test suite by duplicating this class, then.

Contributor (author):

Added a TestSelect in Spark 2 by basically duplicating logic from the same class in Spark 3; although, apart from basic sanity testing, I'm not sure how helpful the tests are, as some of the logic for examining pushed-down filters only exists in Spark 3...

@rdblue (Contributor) commented on Dec 6, 2020:

I merged #1747, so you can rebase this. Thanks!

@@ -177,4 +179,13 @@ private static Object convertLiteral(Object value) {
    }
    return value;
  }
+
+  private static Expression handleEqual(String attribute, Object value) {
+    Object literal = convertLiteral(value);
Contributor:

This should be moved into the else block, because Literal should not allow creating a NaN literal.
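A sketch of the shape being suggested, assuming Iceberg's NaNUtil.isNaN helper and the Expressions factory methods already used in this file; this is an illustration, not the exact merged code:

```java
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.util.NaNUtil;

import static org.apache.iceberg.expressions.Expressions.equal;
import static org.apache.iceberg.expressions.Expressions.isNaN;

// Rewrite "attribute = NaN" to isNaN(attribute), and only convert the
// literal in the non-NaN branch so a NaN literal is never created.
private static Expression handleEqual(String attribute, Object value) {
  if (NaNUtil.isNaN(value)) {
    return isNaN(attribute);
  } else {
    return equal(attribute, convertLiteral(value));
  }
}
```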

@yyanyy force-pushed the nan_expression_rewriting branch from 445ee6d to 730ce5b on Dec 8, 2020
@yyanyy marked this pull request as ready for review on Dec 8, 2020

Assert.assertEquals("Should create only one scan", 1, scanEventCount);
Assert.assertEquals("Should push down expected filter",
"(float IS NOT NULL AND float = NaN)",
Contributor:

Shouldn't this be is_nan(float) instead of = NaN?

Contributor (author):

This is because DescribeExpressionVisitor translates is_nan to = NaN here. Do you want me to change this to is_nan(float)?

Contributor:

Yes, I think so. The description shouldn't produce a predicate that we don't support!
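For illustration only, a minimal sketch of the kind of rendering change discussed here; the operation names mirror Iceberg's Expression.Operation, but this helper is hypothetical rather than the actual DescribeExpressionVisitor code:

```java
import org.apache.iceberg.expressions.Expression.Operation;

// Hypothetical describe helper: render NaN checks in the supported
// function-style form instead of the unsupported "= NaN" comparison.
static String describePredicate(Operation op, String column) {
  switch (op) {
    case IS_NULL:
      return column + " IS NULL";
    case IS_NAN:
      return "is_nan(" + column + ")";
    case NOT_NAN:
      return "not_nan(" + column + ")";
    default:
      throw new UnsupportedOperationException("Cannot describe operation: " + op);
  }
}
```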


Assert.assertEquals("Should return all expected rows", expected,
sql("SELECT * FROM table where doubleVal = double('NaN')"));
Assert.assertEquals("Should create only one scan", 1, scanEventCount);
Contributor:

Shouldn't this validate more than just the number of scans?

Contributor (author):

Yes, sorry, I forgot to revisit this after cleaning up other changes. Since in Spark 2 we don't have Spark3Util.describe(), I wasn't sure at which level we want to assert the expression so that we can still have test coverage without being too coupled to the internal implementation. Let me know what you think of the updated test!

Contributor:

Looks good!

  }

  private List<Record> sql(String str) {
    List<Row> rows = spark.sql(str).collectAsList();
Contributor:

This seems brittle because it uses types to place the results.

Other tests use StructProjection and StructLikeSet for similar validations. The incoming row is wrapped to be a StructLike and added to a StructLikeSet based on the expected schema. Then another StructLikeSet is created with the expected rows, which are projected using StructProjection and the expected schema. That is a cleaner way to do this, I think.
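A rough sketch of that pattern, assuming Iceberg's StructLikeSet and StructProjection utilities; the method and parameter names here are illustrative:

```java
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.util.StructLikeSet;
import org.apache.iceberg.util.StructProjection;

// Project each expected record to the result schema and collect into a
// StructLikeSet; building a second set from the actual rows the same way
// makes the comparison independent of column order and extra columns.
static StructLikeSet projectedSet(Schema tableSchema, Schema resultSchema, List<Record> records) {
  StructLikeSet set = StructLikeSet.create(resultSchema.asStruct());
  for (Record record : records) {
    // use a fresh projection per record so each set entry wraps its own row
    set.add(StructProjection.create(tableSchema, resultSchema).wrap(record));
  }
  return set;
}
```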

Contributor:

Looks like this uses a Java bean record class, so you could also rely on Spark to convert to your record class, and then use a special comparison function to only compare expected columns.
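A minimal sketch of that bean-encoder route, assuming the test's Record class is a standard Java bean:

```java
import java.util.List;
import org.apache.spark.sql.Encoders;

// Let Spark map result columns onto the Record bean by property name,
// instead of placing values by type.
private List<Record> sql(String str) {
  return spark.sql(str).as(Encoders.bean(Record.class)).collectAsList();
}
```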

@yyanyy (Contributor, author) commented on Dec 9, 2020:

Sounds good. I wanted to scope this Record class to only this class's use cases, but this is definitely not clean. I changed this to use Spark to convert to a Java bean, but encountered an issue similar to the one described in this post: when projecting a subset of columns, the conversion fails due to missing expected columns. Since in this class I'm only projecting one column with a primitive type, I convert the data frame to that specific class instead. Please let me know if you know a better way of doing this!
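A sketch of that single-column workaround, assuming the projected column is a double; the query string is illustrative:

```java
import java.util.List;
import org.apache.spark.sql.Encoders;

// With only one primitive column projected, decode it directly with a
// primitive encoder instead of going through the Record bean.
List<Double> values = spark.sql("SELECT doubleVal FROM table")
    .as(Encoders.DOUBLE())
    .collectAsList();
```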

@rdblue merged commit 04e73de into apache:master on Dec 10, 2020
@rdblue (Contributor) commented on Dec 10, 2020:

Thanks, @yyanyy! Looks good.
