[SPARK-38959][SQL][FOLLOWUP] Do not optimize subqueries twice
### What changes were proposed in this pull request?

This is a followup of #38557. Some optimizer rules can't be applied twice (those in the `Once` batch), and running the rule `OptimizeSubqueries` twice breaks them, because the second run re-optimizes subqueries that were already optimized.

This PR partially reverts #38557: `OptimizeSubqueries` is removed from the `PartitionPruning` batch and is instead invoked from within `RowLevelOperationRuntimeGroupFiltering`, only on the subqueries that rule adds (see the sketch below). We don't fully revert #38557 because it's still beneficial to use an IN subquery directly instead of going through the DPP framework, as there is no join.
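
As a rough illustration of the pattern (a minimal sketch, not the actual Spark code; `InjectSubqueryRule` and `injectSubquery` are hypothetical names): the rule that creates new subqueries receives the subquery optimizer as a constructor argument and applies it only to the query it just rewrote, so the enclosing batch never re-optimizes pre-existing subqueries.

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule showing the shape of the fix: it owns the only call to
// the subquery optimizer, scoped to the query it has just rewritten.
class InjectSubqueryRule(optimizeSubqueries: Rule[LogicalPlan])
  extends Rule[LogicalPlan] {

  override def apply(plan: LogicalPlan): LogicalPlan = {
    val withNewSubqueries = injectSubquery(plan) // adds fresh, unoptimized subqueries
    optimizeSubqueries(withNewSubqueries)        // optimize only the rewritten query
  }

  // Placeholder standing in for the real group-filtering rewrite.
  private def injectSubquery(plan: LogicalPlan): LogicalPlan = plan
}
```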

### Why are the changes needed?

Fix the optimizer, which could otherwise break `Once`-batch rules by optimizing subqueries twice.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #38626 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan committed Nov 14, 2022
1 parent e873871 commit 632784d
Showing 4 changed files with 10 additions and 6 deletions.
SparkOptimizer.scala

```diff
@@ -51,8 +51,10 @@ class SparkOptimizer(
     Batch("Optimize Metadata Only Query", Once, OptimizeMetadataOnlyQuery(catalog)) :+
     Batch("PartitionPruning", Once,
       PartitionPruning,
-      RowLevelOperationRuntimeGroupFiltering,
-      OptimizeSubqueries) :+
+      // We can't run `OptimizeSubqueries` in this batch, as it will optimize the subqueries
+      // twice which may break some optimizer rules that can only be applied once. The rule below
+      // only invokes `OptimizeSubqueries` to optimize newly added subqueries.
+      new RowLevelOperationRuntimeGroupFiltering(OptimizeSubqueries)) :+
     Batch("InjectRuntimeFilter", FixedPoint(1),
       InjectRuntimeFilter) :+
     Batch("MergeScalarSubqueries", Once,
```
PlanAdaptiveDynamicPruningFilters.scala

```diff
@@ -32,7 +32,7 @@ import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, HashedRelati
 case class PlanAdaptiveDynamicPruningFilters(
     rootPlan: AdaptiveSparkPlanExec) extends Rule[SparkPlan] with AdaptiveSparkPlanHelper {
   def apply(plan: SparkPlan): SparkPlan = {
-    if (!conf.dynamicPartitionPruningEnabled && !conf.runtimeRowLevelOperationGroupFilterEnabled) {
+    if (!conf.dynamicPartitionPruningEnabled) {
       return plan
     }
```
PlanDynamicPruningFilters.scala

```diff
@@ -45,7 +45,7 @@ case class PlanDynamicPruningFilters(sparkSession: SparkSession) extends Rule[Sp
   }

   override def apply(plan: SparkPlan): SparkPlan = {
-    if (!conf.dynamicPartitionPruningEnabled && !conf.runtimeRowLevelOperationGroupFilterEnabled) {
+    if (!conf.dynamicPartitionPruningEnabled) {
       return plan
     }
```
RowLevelOperationRuntimeGroupFiltering.scala

```diff
@@ -37,7 +37,8 @@ import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2Implicits, Dat
  *
  * Note this rule only applies to group-based row-level operations.
  */
-object RowLevelOperationRuntimeGroupFiltering extends Rule[LogicalPlan] with PredicateHelper {
+class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[LogicalPlan])
+  extends Rule[LogicalPlan] with PredicateHelper {

   import DataSourceV2Implicits._

@@ -64,7 +65,8 @@ object RowLevelOperationRuntimeGroupFiltering extends Rule[LogicalPlan] with Pre
         Filter(dynamicPruningCond, r)
       }

-      replaceData.copy(query = newQuery)
+      // optimize subqueries to rewrite them as joins and trigger job planning
+      replaceData.copy(query = optimizeSubqueries(newQuery))
     }

   private def buildMatchingRowsPlan(
```
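
For context on why double-optimization is harmful, here is a contrived sketch (a hypothetical rule, not from Spark) of the kind of non-idempotent rule that belongs in a `Once` batch: applying it a second time stacks another `Limit` on top of the plan.

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Limit, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Contrived, non-idempotent rule: there is no guard against an existing
// Limit, so running it twice wraps the plan in two limits. Rules like this
// must live in a `Once` batch and must not see the plan a second time.
object AddDefaultLimit extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = Limit(Literal(10), plan)
}
```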
