[SPARK-38959][SQL][FOLLOWUP] Optimizer batch `PartitionPruning` should optimize subqueries #38557

cloud-fan · 2022-11-08T08:24:15Z

What changes were proposed in this pull request?

This is a followup to #36304 to simplify RowLevelOperationRuntimeGroupFiltering. It does 3 things:

1. Run `OptimizeSubqueries` in the batch `PartitionPruning`, so that `RowLevelOperationRuntimeGroupFiltering` does not need to invoke it manually.
2. Skip DPP subqueries in `OptimizeSubqueries`, to avoid the issue fixed by #33664 ([SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning).
3. `RowLevelOperationRuntimeGroupFiltering` creates `InSubquery` instead of `DynamicPruningSubquery`, so that it can be optimized by `OptimizeSubqueries` later. This also avoids the unnecessary planning overhead of `DynamicPruningSubquery`: there is no join, so we can only run it as a subquery.

Why are the changes needed?

code simplification

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

cloud-fan · 2022-11-08T08:26:22Z

cc @aokolnychyi @viirya @wangyum

aokolnychyi · 2022-11-08T19:07:36Z

...a/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala

-      DynamicPruningSubquery(key, buildQuery, buildKeys, index, onlyInBroadcast = false)
-    }
-    dynamicPruningSubqueries.reduce(And)
+    val buildQuery = Aggregate(buildKeys, buildKeys, matchingRowsPlan)


Are there any downsides of rewriting DynamicPruningSubquery into DynamicPruningExpression directly instead of relying on PlanDynamicPruningFilters and PlanAdaptiveDynamicPruningFilters?

I see some special branches for exchange reuse in those rules that would not apply now.

I don't see any downside. We can only reuse broadcast if the DPP filter is derived from a join, which doesn't apply here.

Got it. I was originally worried we could miss some future optimizations given that dynamic pruning for row-level operations would go through a different route compared to the normal DPP.

One alternative could be to extend DynamicPruningSubquery with a flag whether it should be optimized or not. Up to you, though.

My rationale is, what we really need is a subquery here. This is completely different from dynamic partition pruning. One limitation is DS v2 runtime filter pushdown only applies to DynamicPruningExpression. We can probably fix that and accept normal non-correlated subqueries as well.

Yeah, DS v2 runtime filtering framework is fairly limited at this point.
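
The difference discussed in this thread can be sketched with a toy expression model. This is only an illustration: the case classes below mimic the shape of the Catalyst nodes named above but are not their real signatures, and `DistinctKeys`, `oldFilter`, and `newFilter` are hypothetical names. The old route builds one DPP-style node per key and ANDs them together; the new route builds a single IN subquery over the distinct build keys and wraps it so that DS v2 runtime filtering still recognizes it.

```scala
// Toy model only: these types imitate the shape of Catalyst nodes, not their real APIs.
object GroupFilterSketch {
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  // Hypothetical stand-in for Aggregate(buildKeys, buildKeys, child): distinct key values.
  final case class DistinctKeys(keys: Seq[String], child: Plan) extends Plan

  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  // DPP-style node: one per key, carrying join-oriented planning hints.
  final case class DynamicPruningSubquery(
      key: Expr,
      buildQuery: Plan,
      buildKeys: Seq[String],
      index: Int,
      onlyInBroadcast: Boolean) extends Expr
  // Plain non-correlated IN subquery over all keys at once.
  final case class InSubquery(values: Seq[Expr], query: Plan) extends Expr
  // Marker wrapper that the runtime filter pushdown looks for.
  final case class DynamicPruningExpression(child: Expr) extends Expr

  // Old shape: one DPP subquery per pruning key, combined with And.
  // Assumes pruningKeys is non-empty (reduce throws on an empty Seq).
  def oldFilter(pruningKeys: Seq[Expr], buildKeys: Seq[String], buildQuery: Plan): Expr = {
    val perKey: Seq[Expr] = pruningKeys.zipWithIndex.map { case (key, index) =>
      DynamicPruningSubquery(key, buildQuery, buildKeys, index, onlyInBroadcast = false)
    }
    perKey.reduce[Expr]((l, r) => And(l, r))
  }

  // New shape: a single IN subquery over the distinct build keys, wrapped directly.
  def newFilter(pruningKeys: Seq[Expr], buildKeys: Seq[String], matchingRows: Plan): Expr =
    DynamicPruningExpression(InSubquery(pruningKeys, DistinctKeys(buildKeys, matchingRows)))
}
```

In this model the new filter contains no DPP node at all, which is why a generic subquery optimization pass can treat it like any other IN subquery.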

viirya · 2022-11-09T01:43:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

+      // Do not optimize DPP subquery, as it was created from optimized plan and we should not
+      // optimize it again, to save optimization time and avoid breaking broadcast/subquery reuse.
+      case d: DynamicPruningSubquery => d


This makes sense. Just wondering, is this particularly related to SPARK-38959?

Yes, because this PR adds OptimizeSubqueries to the batch PartitionPruning and we should not break #33664
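
The pattern in the diff above (leave DPP subqueries untouched, optimize every other subquery) can be illustrated with a small self-contained sketch. All types here are hypothetical stand-ins rather than Catalyst classes, and `optimizePlan` merely tags a plan as optimized instead of running a real optimizer.

```scala
// Toy model only: hypothetical stand-ins for Catalyst plans and expressions.
object SkipDppSketch {
  sealed trait Plan
  final case class Scan(table: String) extends Plan
  // Marker recording that a plan has been through the (pretend) optimizer.
  final case class Optimized(child: Plan) extends Plan

  sealed trait Expr
  final case class Literal(value: Int) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr
  final case class ScalarSubquery(plan: Plan) extends Expr
  final case class DynamicPruningSubquery(plan: Plan) extends Expr

  // Stand-in for running the full optimizer on a subquery plan.
  def optimizePlan(plan: Plan): Plan = Optimized(plan)

  // Optimize the plan of every subquery expression except DPP subqueries.
  def optimizeSubqueries(expr: Expr): Expr = expr match {
    // DPP subqueries were built from an already optimized plan: return them as-is.
    case d: DynamicPruningSubquery => d
    case ScalarSubquery(plan)      => ScalarSubquery(optimizePlan(plan))
    case And(l, r)                 => And(optimizeSubqueries(l), optimizeSubqueries(r))
    case other                     => other
  }
}
```

Placing the DPP case first matters in a match like this: it guarantees the node is returned before any more general subquery case could rewrite it.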

viirya · 2022-11-09T01:44:56Z

...a/org/apache/spark/sql/execution/dynamicpruning/RowLevelOperationRuntimeGroupFiltering.scala

@@ -66,7 +65,7 @@ case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[Logic
      }

      // optimize subqueries to rewrite them as joins and trigger job planning


This comment can be removed.

aokolnychyi

+1 to removing an explicit reference to OptimizeSubqueries. I am a bit worried we would plan dynamic pruning for row-level operations differently compared to regular DPP. However, that seems safe at this point.

Thanks for looking into this, @cloud-fan!

cloud-fan · 2022-11-10T05:45:27Z

thanks for review, merging to master!

dongjoon-hyun

+1, late LGTM. Thank you all.

### What changes were proposed in this pull request?

This is a followup of #38557. We found that some optimizer rules (those in the `Once` batch) can't be applied twice, but running the rule `OptimizeSubqueries` twice violates this, because each subquery then gets optimized twice. This PR partially reverts #38557 to still invoke `OptimizeSubqueries` in `RowLevelOperationRuntimeGroupFiltering`. We don't fully revert #38557 because it's still beneficial to use an IN subquery directly instead of the DPP framework, as there is no join.

### Why are the changes needed?

Fix the optimizer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #38626 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
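
A rough sketch of the problem described in this follow-up, under simplifying assumptions: a subquery is modeled as a name plus a counter of how often a once-only rule has rewritten it, and `optimizeSubqueries` stands in for running the optimizer on every subquery. None of these names come from Spark. If the subquery pass appears in two batches, every subquery goes through the once-only rule twice.

```scala
// Toy model only: counts how often a "run once" rule touches each subquery.
object OptimizeTwiceSketch {
  final case class Subquery(name: String, onceRuleApplications: Int = 0)
  final case class Query(subqueries: List[Subquery])

  // Stand-in for a rule that is only correct when applied a single time.
  def onceOnlyRule(s: Subquery): Subquery =
    s.copy(onceRuleApplications = s.onceRuleApplications + 1)

  // Stand-in for OptimizeSubqueries: runs the (pretend) optimizer on every subquery.
  def optimizeSubqueries(q: Query): Query =
    q.copy(subqueries = q.subqueries.map(onceOnlyRule))

  // Run each batch exactly once, in order.
  def runBatches(batches: Seq[Query => Query], q: Query): Query =
    batches.foldLeft(q)((acc, batch) => batch(acc))

  def main(args: Array[String]): Unit = {
    val query = Query(List(Subquery("s1")))
    // Subquery pass present in one batch only.
    val single = runBatches(Seq(optimizeSubqueries _), query)
    // Subquery pass present in two batches (e.g. a main batch and a pruning batch).
    val double = runBatches(Seq(optimizeSubqueries _, optimizeSubqueries _), query)
    println(single.subqueries.map(_.onceRuleApplications)) // List(1)
    println(double.subqueries.map(_.onceRuleApplications)) // List(2)
  }
}
```

Invoking the subquery pass only from the rule that actually creates a new subquery, as this follow-up does, keeps the count at one for each subquery in this model.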


optimizer batch PartitionPruning should optimize subqueries

5dd44f6

github-actions bot added the SQL label Nov 8, 2022

cloud-fan mentioned this pull request Nov 8, 2022

[SPARK-38959][SQL][FOLLOW-UP] Address feedback for RowLevelOperationRuntimeGroupFiltering #38526

Closed

aokolnychyi reviewed Nov 8, 2022

viirya reviewed Nov 9, 2022

Update RowLevelOperationRuntimeGroupFiltering.scala

1da0d48

wangyum approved these changes Nov 9, 2022

viirya approved these changes Nov 9, 2022

aokolnychyi approved these changes Nov 9, 2022

cloud-fan closed this in 865a3de Nov 10, 2022

dongjoon-hyun reviewed Nov 10, 2022

cloud-fan mentioned this pull request Nov 11, 2022

[SPARK-38959][SQL][FOLLOWUP] Do not optimize subqueries twice #38626

Closed
