[SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion #28733

gengliangwang · 2020-06-05T08:03:55Z

What changes were proposed in this pull request?

This PR adds a new rule that pushes predicates through a join by rewriting the join condition into CNF (conjunctive normal form). The following example walks through the steps of this rule:

Prepare Table:

CREATE TABLE x(a INT);
CREATE TABLE y(b INT);
...
SELECT * FROM x JOIN y ON ((a < 0 and a > b) or a > 10);

Convert the join condition to CNF:

(a < 0 or a > 10) and (a > b or a > 10)

Split conjunctive predicates

Predicates
(a < 0 or a > 10)
(a > b or a > 10)

Push predicate

Table	Predicate
x	(a < 0 or a > 10)
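
The steps above can be sketched on a toy expression tree. This is only an illustration, not the rule added by this PR: `Expr`, `Pred`, `toCnf`, and `show` are hypothetical names, column ownership is modeled with plain string sets, and the conversion is written recursively for brevity.

```scala
object CnfPushdownSketch {
  // Toy expression tree; a hypothetical stand-in for Catalyst's Expression.
  sealed trait Expr { def refs: Set[String] }
  case class Pred(sql: String, refs: Set[String]) extends Expr
  case class And(l: Expr, r: Expr) extends Expr { def refs: Set[String] = l.refs ++ r.refs }
  case class Or(l: Expr, r: Expr) extends Expr { def refs: Set[String] = l.refs ++ r.refs }

  // Returns the conjuncts of the CNF form; each element is a disjunction.
  def toCnf(e: Expr): Seq[Expr] = e match {
    case And(l, r) => toCnf(l) ++ toCnf(r)
    // Distribute OR over the conjuncts of both sides.
    case Or(l, r)  => for (x <- toCnf(l); y <- toCnf(r)) yield Or(x, y)
    case leaf      => Seq(leaf)
  }

  def show(e: Expr): String = e match {
    case Pred(sql, _) => sql
    case And(l, r)    => s"(${show(l)} and ${show(r)})"
    case Or(l, r)     => s"(${show(l)} or ${show(r)})"
  }

  // ((a < 0 and a > b) or a > 10), with `a` from table x and `b` from table y.
  val aLt0: Expr  = Pred("a < 0", Set("a"))
  val aGtB: Expr  = Pred("a > b", Set("a", "b"))
  val aGt10: Expr = Pred("a > 10", Set("a"))
  val condition: Expr = Or(And(aLt0, aGtB), aGt10)

  // Split into conjuncts, then keep those that only reference columns of x.
  val conjuncts: Seq[Expr] = toCnf(condition)
  val pushToX: Seq[Expr] = conjuncts.filter(_.refs.subsetOf(Set("a")))
}
```

In this sketch `conjuncts` holds the two disjunctions listed under "Predicates", and `pushToX` keeps only the one that references nothing but `a`.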

Why are the changes needed?

Improve query performance. PostgreSQL, Impala and Hive support this feature.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test and benchmark test.

SQL	Before this PR	After this PR
TPCDS 5T Q13	84s	21s
TPCDS 5T Q85	66s	34s
TPCH 1T Q19	37s	32s

gengliangwang · 2020-06-05T08:08:46Z

As discussed offline with @wangyum, I am taking over #28575 for the CNF implementation and config naming.

There have been earlier PRs for CNF conversion, such as #10444, #15558, and #28575. The common issue is that a recursive implementation can be slow, or even cause a stack overflow exception.

With this non-recursive implementation, the rule should be faster and more robust.
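
To illustrate what a non-recursive conversion can look like, here is a minimal sketch on a toy tree. It assumes a post-order listing built with an explicit stack, followed by a second pass over a result stack, which is how the diff in this PR appears to be organized. All names (`Expr`, `Leaf`, `postOrder`, `toCnf`) are hypothetical and none of this is Catalyst code.

```scala
object IterativeCnfSketch {
  // Toy expression tree; hypothetical stand-in for Catalyst expressions.
  sealed trait Expr
  case class Leaf(name: String) extends Expr
  case class And(l: Expr, r: Expr) extends Expr
  case class Or(l: Expr, r: Expr) extends Expr

  // Post-order listing built with an explicit stack instead of recursion.
  def postOrder(root: Expr): List[Expr] = {
    var pending: List[Expr] = List(root)
    var out: List[Expr] = Nil
    while (pending.nonEmpty) {
      val node = pending.head
      pending = pending.tail
      // Visiting root, right, left and prepending yields left, right, root.
      out = node :: out
      node match {
        case And(l, r) => pending = r :: l :: pending
        case Or(l, r)  => pending = r :: l :: pending
        case _         => ()
      }
    }
    out
  }

  // Evaluate the post-order listing with a result stack of conjunct lists.
  def toCnf(root: Expr): Seq[Expr] = {
    var results: List[Seq[Expr]] = Nil
    for (node <- postOrder(root)) node match {
      case _: And =>
        val right = results.head
        val left = results.tail.head
        results = (left ++ right) :: results.drop(2)
      case _: Or =>
        val right = results.head
        val left = results.tail.head
        val product: Seq[Expr] = for (x <- left; y <- right) yield Or(x, y)
        results = product :: results.drop(2)
      case leaf =>
        results = Seq(leaf) :: results
    }
    results.head
  }

  // (p and q) or r
  val example: Expr = Or(And(Leaf("p"), Leaf("q")), Leaf("r"))
}
```

Because both passes are loops over heap-allocated lists, the depth of the expression tree does not translate into JVM stack depth in this sketch.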

gengliangwang · 2020-06-05T08:15:22Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala

@@ -1230,4 +1237,134 @@ class FilterPushdownSuite extends PlanTest {

    comparePlans(Optimize.execute(query.analyze), expected)
  }
+
+  test("inner join: rewrite filter predicates to conjunctive normal form") {


Test cases are copied from #28575

gengliangwang · 2020-06-05T08:18:38Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/FilterPushdownSuite.scala

+          (testRelation.subquery('x), testRelation.subquery('y))
+        } else {
+          (testRelation.subquery('x),
+            testRelation.where(('c <= 5 || 'c < 1) && ('c <=5 || 'a > 2)).subquery('y))


@wangyum To keep it simple, this PR doesn't convert the pushed-down predicate to a shorter form.
We can have a follow-up PR if you would like that feature.

SparkQA · 2020-06-05T13:09:42Z

Test build #123556 has finished for PR 28733 at commit 729be0f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2020-06-05T14:25:11Z

...yst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushCNFPredicateThroughJoin.scala

+      node match {
+        case Not(a And b) => stack.push(Or(Not(a), Not(b)))
+        case Not(a Or b) => stack.push(And(Not(a), Not(b)))
+        case Not(Not(a)) => stack.push(a)


Do we need to handle these NOT cases? It seems that the NOT operator is removed by BooleanSimplification:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

Lines 391 to 394 in 240840f

case Not(a Or b) => And(Not(a), Not(b))

case Not(a And b) => Or(Not(a), Not(b))

case Not(Not(e)) => e

Keeping them here is OK, since handling them is also very straightforward.
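
For readers unfamiliar with these three cases, the rewrite can be sketched on a toy tree as follows. This is a recursive illustration with hypothetical names (`Expr`, `Leaf`, `pushNot`); it is neither the stack-based code in this PR nor the BooleanSimplification rule.

```scala
object PushNotSketch {
  // Toy expression tree; hypothetical stand-in for Catalyst expressions.
  sealed trait Expr
  case class Leaf(name: String) extends Expr
  case class Not(e: Expr) extends Expr
  case class And(l: Expr, r: Expr) extends Expr
  case class Or(l: Expr, r: Expr) extends Expr

  // Push NOT down to the leaves using De Morgan's laws and double negation.
  def pushNot(e: Expr): Expr = e match {
    case Not(And(a, b)) => Or(pushNot(Not(a)), pushNot(Not(b)))
    case Not(Or(a, b))  => And(pushNot(Not(a)), pushNot(Not(b)))
    case Not(Not(a))    => pushNot(a)
    case And(a, b)      => And(pushNot(a), pushNot(b))
    case Or(a, b)       => Or(pushNot(a), pushNot(b))
    case other          => other
  }

  // NOT(x AND NOT(y OR z))
  val example: Expr = Not(And(Leaf("x"), Not(Or(Leaf("y"), Leaf("z")))))
}
```

After this rewrite only leaves are negated, so a CNF conversion only has to deal with `And`, `Or`, and leaf-level predicates.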

SparkQA · 2020-06-05T22:00:37Z

Test build #123577 has finished for PR 28733 at commit a216cf8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-05T22:58:13Z

Test build #123578 has finished for PR 28733 at commit a9a5c0b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2020-06-06T00:29:23Z

Test failure from TPCDSQuerySuite:

18884 was not less than or equal to 8000 too long generated codes found in the WholeStageCodegenExec subtree (id=375762) and JIT optimization might not work:

I will update the PR to simplify the pushed-down predicates.

viirya · 2020-06-06T03:42:16Z

...yst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushCNFPredicateThroughJoin.scala

+   *         If the conversion repeatedly expands nondeterministic expressions, return Seq.empty.
+   *         Otherwise, return the converted result as sequence of disjunctive expressions.
+   */
+  protected def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {


Could you add tests for this method? We should have particular tests to verify the CNF conversion.

@viirya Thanks, I will add more test cases

SparkQA · 2020-06-06T06:59:46Z

Test build #123585 has finished for PR 28733 at commit cbc1220.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2020-06-07T10:52:38Z

...yst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushCNFPredicateThroughJoin.scala

+   */
+  protected def conjunctiveNormalForm(condition: Expression): Seq[Expression] = {
+    val postOrderNodes = postOrderTraversal(condition)
+    val resultStack = new scala.collection.mutable.Stack[Seq[Expression]]


scala.collection.mutable.Stack[Seq[Expression]] -> mutable.Stack[Seq[Expression]]?

wangyum · 2020-06-07T10:53:41Z

...yst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushCNFPredicateThroughJoin.scala

+    resultStack.top
+  }
+
+  private def aggregateExpressionsOfSameReference(expressions: Seq[Expression]): Seq[Expression] = {


aggregateExpressionsOfSameReference -> aggregateExpressionsOfSameQualifier?

wangyum · 2020-06-07T10:56:03Z

...yst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushCNFPredicateThroughJoin.scala

+
+  private def aggregateExpressionsOfSameReference(expressions: Seq[Expression]): Seq[Expression] = {
+    expressions.groupBy(_.references.map(_.qualifier)).map(_._2.reduceLeft(And)).toSeq
+  }


Add a new empty line below?

SparkQA · 2020-06-07T12:06:59Z

Test build #123601 has finished for PR 28733 at commit cbc1220.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2020-06-08T07:18:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+  private def aggregateExpressionsOfSameQualifiers(
+    expressions: Seq[Expression]): Seq[Expression] = {
+    expressions.groupBy(_.references.map(_.qualifier)).map(_._2.reduceLeft(And)).toSeq
+  }


For a test case such as dt = '1' OR (dt = '2' AND id = 1) passed to conjunctiveNormalForm, the result is still dt = '1' OR (dt = '2' AND id = 1).

Looking at the qualifiers used as the groupBy key, they are:

List(List(spark_catalog, default, t)) List(List(spark_catalog, default, t))

I think we can try

expressions.groupBy(_.references.flatMap(_.qualifier).toSet).map(_._2.reduceLeft(And)).toSeq

I will update this PR later

expressions.groupBy(_.references.flatMap(_.qualifier).toSet).map(_._2.reduceLeft(And)).toSeq

This does not work. Just use:

expressions.groupBy(_.references).map(_._2.reduceLeft(And)).toSeq

The qualifier is the table name which is able to be used for aggregating more expressions

The qualifier is the table name which is able to be used for aggregating more expressions

Got the point: you did this to split the condition across the join children, whereas I want to convert the scan predicate condition in order to optimize scan predicates.

I think this PR is complex enough. Let's keep this part in this way for now.

Yes, I will raise a PR for the other problem, based on your code with small changes, after your PR is merged.
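
The difference between the two grouping keys discussed in this thread can be reproduced with a toy model. The sketch below assumes each conjunct is just a SQL string plus the attributes it references; `Attr`, `Conjunct`, and `merge` are hypothetical names, not Catalyst classes.

```scala
object GroupByQualifierSketch {
  // Hypothetical stand-ins: an attribute with its table qualifier, and a
  // CNF conjunct described by its SQL text and referenced attributes.
  case class Attr(name: String, qualifier: Seq[String])
  case class Conjunct(sql: String, references: Set[Attr])

  val dt: Attr = Attr("dt", Seq("spark_catalog", "default", "t"))
  val id: Attr = Attr("id", Seq("spark_catalog", "default", "t"))

  // CNF of: dt = '1' OR (dt = '2' AND id = 1)
  val conjuncts: Seq[Conjunct] = Seq(
    Conjunct("(dt = '1' OR dt = '2')", Set(dt)),
    Conjunct("(dt = '1' OR id = 1)", Set(dt, id)))

  def merge(group: Seq[Conjunct]): String = group.map(_.sql).mkString(" AND ")

  // Grouping by qualifier: both conjuncts belong to table t, so they are
  // merged back into a single expression.
  val byQualifier: Seq[String] =
    conjuncts.groupBy(_.references.map(_.qualifier)).values.map(merge).toSeq

  // Grouping by the exact reference set keeps the dt-only conjunct separate.
  val byReferences: Seq[String] =
    conjuncts.groupBy(_.references).values.map(merge).toSeq
}
```

Grouping by qualifier suits splitting a condition across join children, where only the owning table matters. Grouping by references keeps the `dt`-only conjunct on its own, which is what a partition-pruning use case needs.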

dilipbiswal · 2020-06-08T08:26:31Z

...yst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/PushCNFPredicateThroughJoin.scala

+        val rightFilterConditions =
+          pushDownCandidates.filter(_.references.subsetOf(right.outputSet))
+
+        val newLeft =


@gengliangwang Question: can newLeft and newRight be declared lazy? It seems we only need to compute them conditionally, based on the join type.

@dilipbiswal sure, thanks for the suggestion.

SparkQA · 2020-06-08T11:39:06Z

Test build #123617 has finished for PR 28733 at commit 97c3414.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-08T11:50:17Z

Test build #123618 has finished for PR 28733 at commit 2976a60.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-09T15:13:08Z

Test build #123686 has finished for PR 28733 at commit f951463.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

wangyum · 2020-06-09T15:45:46Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+      }
+      resultStack.push(cnf)
+    }
+    assert(resultStack.length == 1,


Just logWarning()?

SparkQA · 2020-06-10T01:20:00Z

Test build #123705 has finished for PR 28733 at commit a0c7110.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2020-06-11T05:56:58Z

@wangyum @maropu @viirya @dilipbiswal @AngersZhuuuu @cloud-fan Thanks for the review. I think this PR is ready to be merged once the tests pass. Let me know if you have more comments.

SparkQA · 2020-06-11T07:05:02Z

Test build #123803 has finished for PR 28733 at commit b42ce1d.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2020-06-11T07:08:28Z

retest this please

maropu

Thanks for the updates, @gengliangwang. Looks okay to me.

maropu · 2020-06-11T08:03:38Z

(Just to check) By the way, it seems the previous work on CNF tried to implement this conversion as an independent rule for Filter plans (e.g., the @viirya one: https://github.com/apache/spark/pull/15558/files#diff-a1acb054bc8888376603ef510e6d0ee0R139). This PR, on the other hand, only targets join queries. Is this because the conversion has a severe trade-off between time complexity and performance gains, but join queries can still get a large performance improvement despite that? Is my understanding correct?

gengliangwang · 2020-06-11T08:06:43Z

@maropu Yes, pushing down predicates through join should be the major scenario.

wangyum

LGTM.

AngersZhuuuu · 2020-06-11T09:03:32Z

LGTM

SparkQA · 2020-06-11T13:44:18Z

Test build #123831 has finished for PR 28733 at commit b42ce1d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2020-06-11T17:12:28Z

Merging to master

### What changes were proposed in this pull request?

Spark can't push down a scan predicate condition that contains **Or**. For example, suppose the table `default.test` has the partition column `dt`, and we run the query:

```
select * from default.test where dt=20190625 or (dt = 20190626 and id in (1,2,3) )
```

In this case, Spark resolves the **Or** condition as a single expression, and since this expression references "id", it can't be pushed down.

Based on PR #28733, in this PR, for SQL like `select * from default.test` `where dt = 20190626 or (dt = 20190627 and xxx="a") `, the condition `dt = 20190626 or (dt = 20190627 and xxx="a" )` will be converted to CNF

```
(dt = 20190626 or dt = 20190627) and (dt = 20190626 or xxx = "a" )
```

and then the condition `dt = 20190626 or dt = 20190627` will be pushed down during partition pruning.

### Why are the changes needed?

Optimize partition pruning

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Added UT

Closes #28805 from AngersZhuuuu/cnf-for-partition-pruning.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

… Join/Partitions

### What changes were proposed in this pull request?

In #28733 and #28805, CNF conversion is used to push down disjunctive predicates through join and partition pruning.

It's a good improvement. However, converting all the predicates to CNF can lead to a very long result, even with grouping functions over expressions. For example, the following predicate

```
(p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2') OR (p0 = '3' AND p1 = '3') OR (p0 = '4' AND p1 = '4') OR
(p0 = '5' AND p1 = '5') OR (p0 = '6' AND p1 = '6') OR (p0 = '7' AND p1 = '7') OR (p0 = '8' AND p1 = '8') OR
(p0 = '9' AND p1 = '9') OR (p0 = '10' AND p1 = '10') OR (p0 = '11' AND p1 = '11') OR (p0 = '12' AND p1 = '12') OR
(p0 = '13' AND p1 = '13') OR (p0 = '14' AND p1 = '14') OR (p0 = '15' AND p1 = '15') OR (p0 = '16' AND p1 = '16') OR
(p0 = '17' AND p1 = '17') OR (p0 = '18' AND p1 = '18') OR (p0 = '19' AND p1 = '19') OR (p0 = '20' AND p1 = '20')
```

will be converted into a long query (130K characters) in the Hive metastore, and there will be an error:

```
javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME,A0.PART_ID,A0.PART_NAME AS NUCORDER0 FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0."NAME" = ? AND ((((((A0.PART_NAME LIKE '%/p1=1' ESCAPE '\' ) OR (A0.PART_NAME LIKE '%/p1=2' ESCAPE '\' )) OR (A0.PART_NAME LIKE '%/p1=3' ESCAPE '\' )) OR ((A0.PART_NAME LIKE '%/p1=4' ESCAPE '\' ) O ...
```

Essentially, we just need to traverse the predicate and extract the convertible sub-predicates, like what we did in #24598. There is no need to maintain the CNF result set.

### Why are the changes needed?

A better implementation for pushing down disjunctive and complex predicates. The pushed-down predicate is always equal to or shorter than the CNF result.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29101 from gengliangwang/pushJoin.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
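
The "traverse and extract" idea described in this commit message can be sketched on a toy tree. This is a guess at the general shape under stated assumptions, not the code merged in #29101: `Expr`, `Pred`, and `extract` are hypothetical names, and column sets are plain strings. The sketch assumes the original condition is still evaluated afterwards, so the extracted predicate only has to be implied by it.

```scala
object PartialPushdownSketch {
  // Toy expression tree; hypothetical stand-in for Catalyst expressions.
  sealed trait Expr { def refs: Set[String] }
  case class Pred(sql: String, refs: Set[String]) extends Expr
  case class And(l: Expr, r: Expr) extends Expr { def refs: Set[String] = l.refs ++ r.refs }
  case class Or(l: Expr, r: Expr) extends Expr { def refs: Set[String] = l.refs ++ r.refs }

  // Extract a predicate that is implied by `e` and only uses `output` columns.
  def extract(e: Expr, output: Set[String]): Option[Expr] = e match {
    case And(l, r) =>
      // Either side of an AND may be dropped: the result is weaker but valid.
      (extract(l, output), extract(r, output)) match {
        case (Some(a), Some(b)) => Some(And(a, b))
        case (a, b)             => a.orElse(b)
      }
    case Or(l, r) =>
      // Both sides of an OR must contribute, otherwise nothing can be kept.
      for (a <- extract(l, output); b <- extract(r, output)) yield Or(a, b)
    case leaf =>
      if (leaf.refs.subsetOf(output)) Some(leaf) else None
  }

  // dt = 20190626 or (dt = 20190627 and xxx = "a"), partition column: dt
  val d26: Expr = Pred("dt = 20190626", Set("dt"))
  val d27: Expr = Pred("dt = 20190627", Set("dt"))
  val xxx: Expr = Pred("xxx = \"a\"", Set("xxx"))
  val condition: Expr = Or(d26, And(d27, xxx))
  val pruning: Option[Expr] = extract(condition, Set("dt"))
}
```

Here `pruning` is `dt = 20190626 or dt = 20190627`, obtained in a single traversal without materializing any CNF conjuncts, so the extracted predicate is never larger than the input.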

probot-autolabeler bot added the SQL label Jun 5, 2020

gengliangwang requested review from cloud-fan and wangyum June 5, 2020 08:09

gengliangwang mentioned this pull request Jun 5, 2020

[SPARK-31705][SQL] Push predicate through join by rewriting join condition to conjunctive normal form #28575

Closed

gengliangwang commented Jun 5, 2020

wangyum reviewed Jun 5, 2020

viirya reviewed Jun 6, 2020

gengliangwang mentioned this pull request Jun 6, 2020

[WIP][SPARK-31919][SQL] Push down more predicates through Join #28741

Closed

gengliangwang closed this Jun 6, 2020

gengliangwang reopened this Jun 7, 2020

wangyum reviewed Jun 7, 2020

gengliangwang force-pushed the cnf branch from cbc1220 to 97c3414 Compare June 8, 2020 07:04

AngersZhuuuu reviewed Jun 8, 2020

dilipbiswal reviewed Jun 8, 2020

wangyum reviewed Jun 9, 2020

wangyum reviewed Jun 9, 2020

gengliangwang added 14 commits June 10, 2020 15:42

increase threshold and test case

7e4b019

fix test case; reduce threshold default value

0c92c1c

address comments and add test cases

7853f67

revise

84e89de

lazy val

0af4c48

update doc

95ee45e

remove assert

6bf4747

add back warning

fa03b00

address comments

c225f74

address comments

296068c

address comments

377f9d8

update test case

be79ab7

address comments

af018be

fix build

b42ce1d

gengliangwang force-pushed the cnf branch from 58e6fa5 to b42ce1d Compare June 11, 2020 00:53

maropu approved these changes Jun 11, 2020

wangyum approved these changes Jun 11, 2020

gengliangwang closed this in 11d3a74 Jun 11, 2020

AngersZhuuuu mentioned this pull request Jun 12, 2020

[SPARK-28169][SQL] Convert scan predicate condition to CNF #28805

Closed

gengliangwang mentioned this pull request Jul 14, 2020

[SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions #29101

Closed
