[SPARK-28169][SQL] Convert scan predicate condition to CNF #28805

AngersZhuuuu · 2020-06-12T03:33:59Z

What changes were proposed in this pull request?

Spark can't push down a scan predicate that contains an Or condition.
For example, suppose the table default.test has the partition column dt,
and we run the query:

select * from default.test 
where dt=20190625 or (dt = 20190626 and id in (1,2,3) )

In this case, Spark treats the whole Or condition as a single expression. Because that expression references "id", which is not a partition column, it can't be pushed down.

Based on PR #28733, this PR handles SQL like:
select * from default.test
where dt = 20190626 or (dt = 20190627 and xxx="a")

The condition dt = 20190626 or (dt = 20190627 and xxx="a") is converted to CNF:

(dt = 20190626 or dt = 20190627) and (dt = 20190626 or xxx = "a" )

Then the condition dt = 20190626 or dt = 20190627 can be pushed down for partition pruning.
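The conversion described above can be sketched on a toy predicate tree. This is only an illustration under stated assumptions: `Pred`, `Leaf`, `toCnf`, and `render` are hypothetical names, not Catalyst classes or the code in this PR, and the sketch has no size threshold or grouping step.

```scala
object CnfSketch {
  // Toy predicate tree; these names are illustrative, not Catalyst classes.
  sealed trait Pred
  final case class Leaf(sql: String) extends Pred
  final case class And(left: Pred, right: Pred) extends Pred
  final case class Or(left: Pred, right: Pred) extends Pred

  // Returns the CNF as a list of clauses; the full predicate is their conjunction.
  def toCnf(p: Pred): Seq[Pred] = p match {
    case And(l, r) => toCnf(l) ++ toCnf(r)
    case Or(l, r) =>
      // Distribute OR over AND: every clause on the left is OR-ed with every clause on the right.
      for (x <- toCnf(l); y <- toCnf(r)) yield Or(x, y)
    case leaf => Seq(leaf)
  }

  def render(p: Pred): String = p match {
    case Leaf(s) => s
    case And(l, r) => s"(${render(l)} AND ${render(r)})"
    case Or(l, r) => s"(${render(l)} OR ${render(r)})"
  }

  def main(args: Array[String]): Unit = {
    val cond = Or(Leaf("dt = 20190626"), And(Leaf("dt = 20190627"), Leaf("xxx = 'a'")))
    // prints:
    // (dt = 20190626 OR dt = 20190627)
    // (dt = 20190626 OR xxx = 'a')
    toCnf(cond).map(render).foreach(println)
  }
}
```

Only the first clause references dt alone, so it is the one that qualifies for partition pruning.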

Why are the changes needed?

Optimize partition pruning

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests.

SparkQA · 2020-06-12T04:28:29Z

Test build #123882 has finished for PR 28805 at commit b253af3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

SparkQA · 2020-06-12T07:05:01Z

Test build #123890 has finished for PR 28805 at commit 478a7a8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-12T11:21:15Z

Test build #123915 has finished for PR 28805 at commit 69f1763.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-12T11:31:03Z

Test build #123904 has finished for PR 28805 at commit 603660b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-12T18:57:41Z

Test build #123928 has finished for PR 28805 at commit 2f576fa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2020-06-13T00:09:11Z

cc @gengliangwang @maropu @cloud-fan @viirya

gengliangwang · 2020-06-19T05:30:12Z

In this case, Spark treats the whole Or condition as a single expression. Because that expression references "id", which is not a partition column, it can't be pushed down.

Sorry, could you explain more here?
The CNF process should break down dt = 20190626 and id in (1,2,3) into Seq((dt = 20190626), (id in (1,2,3))), and then these two sub-predicates will be processed in groupExpressionsByQualifier. What is the problem here?

AngersZhuuuu · 2020-06-19T05:39:02Z

The CNF process should break down dt = 20190626 and id in (1,2,3) into Seq((dt = 20190626), (id in (1,2,3))), and then these two sub-predicates will be processed in groupExpressionsByQualifier. What is the problem here?

In the current partition pruning, ScanOperation gets its predicates via splitConjunctivePredicates.
A predicate such as (dt = 1 or (dt = 2 and id = 3)) won't be separated, and since this expression references both id and dt, it won't be pushed down as a partition predicate. Spark will then scan all the data in the partitioned table.

object HiveTableScans extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case ScanOperation(projectList, predicates, relation: HiveTableRelation) =>
      // Filter out all predicates that only deal with partition keys, these are given to the
      // hive table scan operator to be used for partition pruning.
      val partitionKeyIds = AttributeSet(relation.partitionCols)
      val (pruningPredicates, otherPredicates) = predicates.partition { predicate =>
        !predicate.references.isEmpty &&
        predicate.references.subsetOf(partitionKeyIds)
      }

      pruneFilterProject(
        projectList,
        otherPredicates,
        identity[Seq[Expression]],
        HiveTableScanExec(_, relation, pruningPredicates)(sparkSession)) :: Nil
    case _ =>
      Nil
  }
}

With the conversion to CNF, (dt = 1 or (dt = 2 and id = 3)) becomes (dt = 1 or dt = 2) and (dt = 1 or id = 3). splitConjunctivePredicates can then split this into two expressions, (dt = 1 or dt = 2) and (dt = 1 or id = 3), and (dt = 1 or dt = 2) can be pushed down as a partition pruning predicate.
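To make the difference concrete, here is a minimal self-contained sketch of the split-then-partition step on a toy model. The types `Pred`, `Cmp`, `And`, `Or` and the helpers `splitConjuncts` and `partitionForPruning` are hypothetical stand-ins for illustration; they only imitate the reference check in the strategy quoted above and are not Spark APIs.

```scala
object PruningSplitSketch {
  // Toy predicates that track which columns they reference (illustrative only).
  sealed trait Pred { def references: Set[String] }
  final case class Cmp(column: String, value: Int) extends Pred {
    def references: Set[String] = Set(column)
  }
  final case class And(left: Pred, right: Pred) extends Pred {
    def references: Set[String] = left.references ++ right.references
  }
  final case class Or(left: Pred, right: Pred) extends Pred {
    def references: Set[String] = left.references ++ right.references
  }

  // Splits only on top-level AND; an OR is kept as a single conjunct.
  def splitConjuncts(p: Pred): Seq[Pred] = p match {
    case And(l, r) => splitConjuncts(l) ++ splitConjuncts(r)
    case other => Seq(other)
  }

  // A conjunct is usable for pruning only if it references partition columns alone.
  def partitionForPruning(
      conjuncts: Seq[Pred],
      partitionCols: Set[String]): (Seq[Pred], Seq[Pred]) =
    conjuncts.partition(c => c.references.nonEmpty && c.references.subsetOf(partitionCols))

  def main(args: Array[String]): Unit = {
    val original = Or(Cmp("dt", 1), And(Cmp("dt", 2), Cmp("id", 3)))
    val cnf = And(Or(Cmp("dt", 1), Cmp("dt", 2)), Or(Cmp("dt", 1), Cmp("id", 3)))
    println(partitionForPruning(splitConjuncts(original), Set("dt")))
    println(partitionForPruning(splitConjuncts(cnf), Set("dt")))
  }
}
```

In this sketch the original form yields no pruning predicate at all, while the CNF form yields (dt = 1 or dt = 2) as a pruning predicate and leaves (dt = 1 or id = 3) as a regular filter.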

SparkQA · 2020-06-19T07:05:02Z

Test build #124256 has finished for PR 28805 at commit e71c45c.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class TimeFormatters(date: DateFormatter, timestamp: TimestampFormatter)

AngersZhuuuu · 2020-06-19T07:11:35Z

retest this please

AngersZhuuuu · 2020-06-19T07:12:53Z

@dongjoon-hyun It seems Jenkins is wrong? I didn't add a class named TimeFormatters.

SparkQA · 2020-06-19T11:59:00Z

Test build #124270 has finished for PR 28805 at commit e71c45c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class TimeFormatters(date: DateFormatter, timestamp: TimestampFormatter)

SparkQA · 2020-06-30T23:00:47Z

Test build #124670 has finished for PR 28805 at commit 35b5813.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-07-01T00:39:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+   * Convert an expression to conjunctive normal form when pushing predicates through Join,
+   * when expand predicates, we can group by the qualifier avoiding generate unnecessary
+   * expression to control the length of final result since there are multiple tables.
+   * @param condition condition need to be convert


nit: convert -> converted

maropu · 2020-07-01T00:46:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+   * expression to control the length of final result since there are multiple tables.
+   * @param condition condition need to be convert
+   * @return expression seq in conjunctive normal form of input expression, if length exceeds
+   *         the threshold [[SQLConf.MAX_CNF_NODE_COUNT]] or length != 1, return empty Seq


nit: This @return says the same thing as line 211 in a different way?

   * @return the CNF result as sequence of disjunctive expressions. If the number of expressions
   *         exceeds threshold on converting `Or`, `Seq.empty` is returned.

maropu · 2020-07-01T00:46:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+   * [[splitConjunctivePredicates]] won't split [[Or]] expression.
+   * @param condition condition need to be convert
+   * @return expression seq in conjunctive normal form of input expression, if length exceeds
+   *         the threshold [[SQLConf.MAX_CNF_NODE_COUNT]] or length != 1, return empty Seq


maropu · 2020-07-01T00:52:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+  def conjunctiveNormalFormAndGroupExpsByQualifier(condition: Expression): Seq[Expression] = {
+    conjunctiveNormalForm(condition,
+      (expressions: Seq[Expression]) =>
+        expressions.groupBy(_.references.map(_.qualifier)).map(_._2.reduceLeft(And)).toSeq)


nit format:

conjunctiveNormalForm(condition, (expressions: Seq[Expression]) =>
  expressions.groupBy(_.references.map(_.qualifier)).map(_._2.reduceLeft(And)).toSeq)

maropu · 2020-07-01T00:52:42Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+   */
+  def conjunctiveNormalFormAndGroupExpsByReference(condition: Expression): Seq[Expression] = {
+    conjunctiveNormalForm(condition,
+      (expressions: Seq[Expression]) =>


nit format:

conjunctiveNormalForm(condition, (expressions: Seq[Expression]) =>
  expressions.groupBy(e => AttributeSet(e.references)).map(_._2.reduceLeft(And)).toSeq)
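For readers following the grouping lambda, here is a rough standalone sketch of what "group by reference, then reduce with And" does to a list of expressions. It assumes a toy model: `Pred`, `Cmp`, and `groupByReference` are hypothetical names, and a plain `Set[String]` of column names stands in for `AttributeSet`.

```scala
object GroupByReferenceSketch {
  // Toy predicates that track which columns they reference (illustrative only).
  sealed trait Pred { def references: Set[String] }
  final case class Cmp(column: String, value: Int) extends Pred {
    def references: Set[String] = Set(column)
  }
  final case class And(left: Pred, right: Pred) extends Pred {
    def references: Set[String] = left.references ++ right.references
  }

  // Expressions with the same reference set are merged into one AND-ed expression,
  // so a later cross product over the list has fewer elements to combine.
  def groupByReference(expressions: Seq[Pred]): Seq[Pred] =
    expressions.groupBy(_.references).map(_._2.reduceLeft[Pred](And(_, _))).toSeq

  def main(args: Array[String]): Unit = {
    val grouped = groupByReference(Seq(Cmp("dt", 1), Cmp("id", 3), Cmp("dt", 2)))
    // Three inputs collapse to two expressions; the order of the groups is not guaranteed.
    grouped.foreach(println)
  }
}
```

The order of the resulting groups comes from a `Map`, so nothing downstream should rely on it.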

maropu · 2020-07-01T01:01:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+  /**
+   * Convert an expression to conjunctive normal form when pushing predicates for partition pruning,
+   * when expand predicates, we can group by the reference avoiding generate unnecessary expression
+   * to control the length of final result since here we just have one table. In partition pruning


nit: How about rephrasing it like this?

   * Convert an expression to conjunctive normal form for predicate pushdown and partition pruning.
   * When expanding predicates, this method groups expressions by their references for reducing
   * the size of pushed down predicates and corresponding codegen. In partition pruning strategies,
   * ...

maropu · 2020-07-01T01:03:21Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/PruneHiveTablePartitionsSuite.scala


  object Optimize extends RuleExecutor[LogicalPlan] {
    val batches =
      Batch("PruneHiveTablePartitions", Once,
        EliminateSubqueryAliases, new PruneHiveTablePartitions(spark)) :: Nil
  }

-  test("SPARK-15616 statistics pruned after going throuhg PruneHiveTablePartitions") {
+  test("SPARK-15616 statistics pruned after going through PruneHiveTablePartitions") {


nit: SPARK-15616 -> SPARK-15616: (This is not related to this pr though)

maropu · 2020-07-01T01:04:16Z

LGTM except for the minor comments.

AngersZhuuuu · 2020-07-01T01:52:24Z

LGTM except for the minor comments.

All minor comments are addressed.

maropu

cc: @gengliangwang

SparkQA · 2020-07-01T02:54:18Z

Test build #124717 has finished for PR 28805 at commit 3df019a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2020-07-01T02:57:45Z

retest this please

gengliangwang · 2020-07-01T06:32:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+   * @return the CNF result as sequence of disjunctive expressions. If the number of expressions
+   *         exceeds threshold on converting `Or`, `Seq.empty` is returned.
+   */
+  def conjunctiveNormalFormAndGroupExpsByQualifier(condition: Expression): Seq[Expression] = {


On second thought, the method name conjunctiveNormalFormAndGroupExpsByQualifier is too long and the And is weird.
How about changing to CNFWithGroupExpressionsByQualifier?

gengliangwang · 2020-07-01T06:32:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+   * @return the CNF result as sequence of disjunctive expressions. If the number of expressions
+   *         exceeds threshold on converting `Or`, `Seq.empty` is returned.
+   */
+  def conjunctiveNormalFormAndGroupExpsByReference(condition: Expression): Seq[Expression] = {


How about changing to CNFWithGroupExpressionsByReference?

gengliangwang

LGTM except for one comment on method naming. Thanks for the work.

SparkQA · 2020-07-01T08:29:07Z

Test build #124732 has finished for PR 28805 at commit 1b8466e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2020-07-01T08:40:11Z

LGTM except for one comment on method naming. Thanks for the work.

Updated. It seems the latest Jenkins test failure is not related to my change?

maropu · 2020-07-01T08:45:14Z

Yeah, it looks like the failure is not related to this PR.

AngersZhuuuu · 2020-07-01T09:45:16Z

Yeah, it looks like the failure is not related to this PR.

OK, one question: can I see how Spark's Jenkins CI/CD is configured? I want our internal CI/CD pipeline to show unit test results the way this Jenkins does, and there are some parts I don't know how to configure.

SparkQA · 2020-07-01T09:58:03Z

Test build #124721 has finished for PR 28805 at commit 1b8466e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-07-01T12:00:13Z

thanks, merging to master!

SparkQA · 2020-07-01T16:24:29Z

Test build #124761 has finished for PR 28805 at commit e2777c9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

… Join/Partitions

### What changes were proposed in this pull request?

In #28733 and #28805, CNF conversion is used to push down disjunctive predicates through join and partitions pruning. It's a good improvement, however, converting all the predicates in CNF can lead to a very long result, even with grouping functions over expressions. For example, for the following predicate

```
(p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2') OR (p0 = '3' AND p1 = '3') OR (p0 = '4' AND p1 = '4') OR
(p0 = '5' AND p1 = '5') OR (p0 = '6' AND p1 = '6') OR (p0 = '7' AND p1 = '7') OR (p0 = '8' AND p1 = '8') OR
(p0 = '9' AND p1 = '9') OR (p0 = '10' AND p1 = '10') OR (p0 = '11' AND p1 = '11') OR (p0 = '12' AND p1 = '12') OR
(p0 = '13' AND p1 = '13') OR (p0 = '14' AND p1 = '14') OR (p0 = '15' AND p1 = '15') OR (p0 = '16' AND p1 = '16') OR
(p0 = '17' AND p1 = '17') OR (p0 = '18' AND p1 = '18') OR (p0 = '19' AND p1 = '19') OR (p0 = '20' AND p1 = '20')
```

will be converted into a long query(130K characters) in Hive metastore, and there will be error:

```
javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME,A0.PART_ID,A0.PART_NAME AS NUCORDER0 FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0."NAME" = ? AND ((((((A0.PART_NAME LIKE '%/p1=1' ESCAPE '\' ) OR (A0.PART_NAME LIKE '%/p1=2' ESCAPE '\' )) OR (A0.PART_NAME LIKE '%/p1=3' ESCAPE '\' )) OR ((A0.PART_NAME LIKE '%/p1=4' ESCAPE '\' ) O ...
```

Essentially, we just need to traverse predicate and extract the convertible sub-predicates like what we did in #24598. There is no need to maintain the CNF result set.

### Why are the changes needed?

A better implementation for pushing down disjunctive and complex predicates. The pushed down predicates is always equal or shorter than the CNF result.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29101 from gengliangwang/pushJoin.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
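The "traverse and extract" idea in the commit message above can be sketched as follows. This is a hedged illustration on a toy model, not the code from #29101: `Pred`, `Cmp`, and `extract` are hypothetical names, and `allowed` stands for the set of columns a pushed-down predicate may reference (for example, the partition columns).

```scala
object PartialPushdownSketch {
  // Toy predicates that track which columns they reference (illustrative only).
  sealed trait Pred { def references: Set[String] }
  final case class Cmp(column: String, value: String) extends Pred {
    def references: Set[String] = Set(column)
  }
  final case class And(left: Pred, right: Pred) extends Pred {
    def references: Set[String] = left.references ++ right.references
  }
  final case class Or(left: Pred, right: Pred) extends Pred {
    def references: Set[String] = left.references ++ right.references
  }

  // Returns a predicate over `allowed` columns that is implied by `p`, if one exists.
  def extract(p: Pred, allowed: Set[String]): Option[Pred] = p match {
    case And(l, r) =>
      // Either side of an AND may be dropped, which only weakens the predicate.
      (extract(l, allowed), extract(r, allowed)) match {
        case (Some(a), Some(b)) => Some(And(a, b))
        case (a, b) => a.orElse(b)
      }
    case Or(l, r) =>
      // Both sides of an OR must be convertible, otherwise nothing can be pushed down.
      for (a <- extract(l, allowed); b <- extract(r, allowed)) yield Or(a, b)
    case other =>
      if (other.references.subsetOf(allowed)) Some(other) else None
  }

  def main(args: Array[String]): Unit = {
    val cond = Or(Cmp("dt", "1"), And(Cmp("dt", "2"), Cmp("id", "3")))
    println(extract(cond, Set("dt")))
  }
}
```

In this sketch the extracted predicate is built in a single pass and never contains more leaves than the input, whereas distributing Or over And in a CNF conversion can multiply the number of clauses.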

AngersZhuuuu added 11 commits June 8, 2020 11:20

WIP

3356bac

save

346a1b4

Update HiveTableScanSuite.scala

250c7b3

save

15d62be

Merge branch 'master' into cnf-for-partition-pruning

39e85ad

save

d8f7c9e

Update SQLConf.scala

8856453

Update HiveTableScanSuite.scala

697a3a9

Update predicates.scala

7e8319e

empty safe

3734866

save

b253af3

probot-autolabeler bot added the SQL label Jun 12, 2020

fix bug

478a7a8

wangyum reviewed Jun 12, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

AngersZhuuuu added 2 commits June 12, 2020 15:34

save

603660b

wip

69f1763

fix return bug

2f576fa

Merge branch 'master' into cnf-for-partition-pruning

e71c45c

maropu reviewed Jul 1, 2020

AngersZhuuuu added 2 commits July 1, 2020 09:37

Update predicates.scala

3df019a

follow comment

1b8466e

maropu approved these changes Jul 1, 2020

gengliangwang reviewed Jul 1, 2020

gengliangwang approved these changes Jul 1, 2020

follow comment

e2777c9

cloud-fan closed this in 15fb5d7 Jul 1, 2020

This was referenced Jul 12, 2020

[SPARK-32284][SQL] Avoid expanding too many CNF predicates in partition pruning #29075

Closed

[SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions #29101

Closed
