[SPARK-34168] [SQL] Support DPP in AQE when the join is Broadcast hash join at the beginning #31258
Conversation
@cloud-fan Please help me review. Thanks.
Test build #134257 has finished for PR 31258 at commit
Hi, @JkSelf. Could you fix the Scala style?
…nd insert the DPP filter after the build side executed
@cloud-fan Updated based on the offline discussions. Please help review again. Thanks.
Kubernetes integration test starting
Kubernetes integration test status failure
cc @maryannxue
Test build #134321 has finished for PR 31258 at commit
SubqueryBroadcastExec(name, broadcastKeyIndex, buildKeys, exchange)

// Update the inputPlan and the currentPhysicalPlan of the adaptivePlan.
adaptivePlan.inputPlan = broadcastValues
Can we wrap the adaptivePlan with a subquery broadcast? Then we don't need to mutate adaptivePlan.inputPlan here and can keep inputPlan immutable.
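A hedged sketch of what the suggested wrapping might look like (purely illustrative, not the merged implementation; the names name, broadcastKeyIndex, buildKeys, and adaptivePlan are taken from the snippets above):

```scala
// Illustrative sketch only: instead of mutating adaptivePlan.inputPlan,
// pass the adaptive plan itself as the child of SubqueryBroadcastExec.
// All names below come from the surrounding snippets.
val wrapped = SubqueryBroadcastExec(name, broadcastKeyIndex, buildKeys, adaptivePlan)
// adaptivePlan.inputPlan stays immutable; the broadcast values are
// produced by executing the wrapped adaptive plan when the subquery runs.
```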
@@ -133,7 +133,7 @@ case class AdaptiveSparkPlanExec(
     inputPlan, queryStagePreparationRules, Some((planChangeLogger, "AQE Preparations")))
   }

-  @volatile private var currentPhysicalPlan = initialPlan
+  @volatile var currentPhysicalPlan = initialPlan
Exposing a mutable variable doesn't seem like a good idea.
Yes. Fixed.
@@ -101,7 +102,6 @@ case class InsertAdaptiveSparkPlan(
-  // TODO migrate dynamic-partition-pruning onto adaptive execution.
So DPP is now supported and this comment looks out of date?
Done.
Hi, @JkSelf.
Test build #134426 has finished for PR 31258 at commit
Test build #134634 has finished for PR 31258 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status success
Test build #134636 has finished for PR 31258 at commit
Test build #134859 has finished for PR 31258 at commit
Kubernetes integration test starting
 */
 case class ShuffleQueryStageExec(
     override val id: Int,
-    override val plan: SparkPlan) extends QueryStageExec {
+    override val plan: SparkPlan,
+    _canonicalized: SparkPlan) extends QueryStageExec {
We missed adding override def doCanonicalize(): SparkPlan = _canonicalized.
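A sketch of where the missing override would sit, assuming the constructor shown in the diff above (fragment only, not the full class body):

```scala
case class ShuffleQueryStageExec(
    override val id: Int,
    override val plan: SparkPlan,
    _canonicalized: SparkPlan) extends QueryStageExec {

  // Without this override, canonicalization falls back to the default
  // computed from `plan`, so a reused stage (whose plan is a
  // ReusedExchangeExec) would not compare equal to the original exchange.
  override def doCanonicalize(): SparkPlan = _canonicalized

  // ... rest of the class body unchanged ...
}
```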
Kubernetes integration test status failure
Test build #134871 has finished for PR 31258 at commit
Test build #134870 has finished for PR 31258 at commit
@JkSelf @cloud-fan This implementation cannot reuse:

SELECT count(*)
FROM (SELECT c.c_customer_sk, s.*
      FROM customer c
      JOIN store_sales s ON c.c_customer_sk = ss_customer_sk) t1
JOIN date_dim ON ss_sold_date_sk = d_date_sk AND d_year = 2002
@wangyum Yes. This is only the first PR, supporting the case where the join is already a broadcast hash join before the AQE rules are applied. We will support the case where the join starts as a sort merge join and is then converted to a broadcast hash join in follow-up PRs.
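For context (not part of the original discussion): both features must be enabled for this code path to apply at all. A minimal sketch using the standard Spark SQL configuration keys:

```scala
// Both AQE and DPP must be on for DPP to be planned inside an adaptive plan.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
```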
Kubernetes integration test starting
Kubernetes integration test status success
@@ -1345,7 +1371,9 @@ abstract class DynamicPartitionPruningSuiteBase
     }
   }

-  test("SPARK-32817: DPP throws error when the broadcast side is empty") {
+  test("SPARK-32817: DPP throws error when the broadcast side is empty",
+    DisableAdaptiveExecution("EliminateJoinToEmptyRelation " +
We can disable this rule by setting ADAPTIVE_OPTIMIZER_EXCLUDED_RULES.
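A hedged sketch of the suggestion, assuming ADAPTIVE_OPTIMIZER_EXCLUDED_RULES maps to the spark.sql.adaptive.optimizer.excludedRules SQL conf; the fully qualified rule name below is an assumption, not taken from this thread:

```scala
// Sketch: exclude only this AQE optimizer rule for the test instead of
// disabling adaptive execution entirely. The fully qualified rule name
// is assumed for illustration.
withSQLConf(
  SQLConf.ADAPTIVE_OPTIMIZER_EXCLUDED_RULES.key ->
    "org.apache.spark.sql.execution.adaptive.EliminateJoinToEmptyRelation") {
  // run the SPARK-32817 assertions with AQE still enabled
}
```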
Updated.
This is a good start!
Test build #134926 has finished for PR 31258 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #134971 has finished for PR 31258 at commit
@@ -165,7 +175,8 @@ case class ShuffleQueryStageExec(
   override def newReuseInstance(newStageId: Int, newOutput: Seq[Attribute]): QueryStageExec = {
     val reuse = ShuffleQueryStageExec(
       newStageId,
-      ReusedExchangeExec(newOutput, shuffle))
+      ReusedExchangeExec(newOutput, shuffle),
+      shuffle.canonicalized)
nit: this should be _canonicalized
@@ -229,7 +245,8 @@ case class BroadcastQueryStageExec(
   override def newReuseInstance(newStageId: Int, newOutput: Seq[Attribute]): QueryStageExec = {
     val reuse = BroadcastQueryStageExec(
       newStageId,
-      ReusedExchangeExec(newOutput, broadcast))
+      ReusedExchangeExec(newOutput, broadcast),
+      broadcast.canonicalized)
ditto
Thanks, merging to master!
Test build #135033 has finished for PR 31258 at commit
… join at the beginning

This PR is to enable AQE and DPP when the join is a broadcast hash join at the beginning, which can benefit from the performance improvements of DPP and AQE at the same time. This PR makes use of the result of the build side and then inserts the DPP filter into the probe side.

Improve performance.

No user-facing change.

Tested by adding a new unit test.

Closes apache#31258 from JkSelf/supportDPP1.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This PR enables AQE and DPP when the join is a broadcast hash join at the beginning, which lets a query benefit from the performance improvements of DPP and AQE at the same time. It makes use of the result of the build side and then inserts the DPP filter into the probe side.
Why are the changes needed?
Improve performance
Does this PR introduce any user-facing change?
No
How was this patch tested?
Adding a new unit test.