-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-38959][SQL][FOLLOWUP] Optimizer batch PartitionPruning
should optimize subqueries
#38557
Conversation
DynamicPruningSubquery(key, buildQuery, buildKeys, index, onlyInBroadcast = false) | ||
} | ||
dynamicPruningSubqueries.reduce(And) | ||
val buildQuery = Aggregate(buildKeys, buildKeys, matchingRowsPlan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any downsides of rewriting DynamicPruningSubquery
into DynamicPruningExpression
directly instead of relying on PlanDynamicPruningFilters
and PlanAdaptiveDynamicPruningFilters
?
I see some special branches for exchange reuse in those rules that would not apply now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any downside. We can only reuse broadcast if the DPP filter is derived from a join, which doesn't apply here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I was originally worried we could miss some future optimizations given that dynamic pruning for row-level operations would go through a different route compared to the normal DPP.
One alternative could be to extend DynamicPruningSubquery
with a flag whether it should be optimized or not. Up to you, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My rationale is, what we really need is a subquery here. This is completely different from dynamic partition pruning. One limitation is DS v2 runtime filter pushdown only applies to DynamicPruningExpression
. We can probably fix that and accept normal non-correlated subqueries as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, DS v2 runtime filtering framework is fairly limited at this point.
// Do not optimize DPP subquery, as it was created from optimized plan and we should not | ||
// optimize it again, to save optimization time and avoid breaking broadcast/subquery reuse. | ||
case d: DynamicPruningSubquery => d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense. Just wondering that is this particularly related to SPARK-38959?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because this PR adds OptimizeSubqueries
to the batch PartitionPruning
and we should not break #33664
@@ -66,7 +65,7 @@ case class RowLevelOperationRuntimeGroupFiltering(optimizeSubqueries: Rule[Logic | |||
} | |||
|
|||
// optimize subqueries to rewrite them as joins and trigger job planning |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to removing an explicit reference to OptimizeSubqueries
. I am a bit worried we would plan dynamic pruning for row-level operations differently compared regular DPP. However, that seems safe at this point.
Thanks for looking into this, @cloud-fan!
thanks for review, merging to master! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, late LGTM. Thank you all.
### What changes were proposed in this pull request? This is a followup of #38557 . We found that some optimizer rules can't be applied twice (those in the `Once` batch), but running the rule `OptimizeSubqueries` twice breaks it as it optimizes subqueries twice. This PR partially reverts #38557 to still invoke `OptimizeSubqueries` in `RowLevelOperationRuntimeGroupFiltering`. We don't fully revert #38557 because it's still beneficial to use IN subquery directly instead of using DPP framework as there is no join. ### Why are the changes needed? Fix the optimizer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #38626 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…d optimize subqueries ### What changes were proposed in this pull request? This is a followup to apache#36304 to simplify `RowLevelOperationRuntimeGroupFiltering`. It does 3 things: 1. run `OptimizeSubqueries` in the batch `PartitionPruning`, so that `RowLevelOperationRuntimeGroupFiltering` does not need to invoke it manually. 2. skip dpp subquery in `OptimizeSubqueries`, to avoid the issue fixed by apache#33664 3. `RowLevelOperationRuntimeGroupFiltering` creates `InSubquery` instead of `DynamicPruningSubquery`, so that it can be optimized by `OptimizeSubqueries` later. This also avoids unnecessary planning overhead of `DynamicPruningSubquery`, as there is no join and we can only run it as a subquery. ### Why are the changes needed? code simplification ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes apache#38557 from cloud-fan/help. Lead-authored-by: Wenchen Fan <wenchen@databricks.com> Co-authored-by: Wenchen Fan <cloud0fan@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? This is a followup of apache#38557 . We found that some optimizer rules can't be applied twice (those in the `Once` batch), but running the rule `OptimizeSubqueries` twice breaks it as it optimizes subqueries twice. This PR partially reverts apache#38557 to still invoke `OptimizeSubqueries` in `RowLevelOperationRuntimeGroupFiltering`. We don't fully revert apache#38557 because it's still beneficial to use IN subquery directly instead of using DPP framework as there is no join. ### Why are the changes needed? Fix the optimizer. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes apache#38626 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This is a followup to #36304 to simplify
RowLevelOperationRuntimeGroupFiltering
. It does 3 things:OptimizeSubqueries
in the batchPartitionPruning
, so thatRowLevelOperationRuntimeGroupFiltering
does not need to invoke it manually.OptimizeSubqueries
, to avoid the issue fixed by [SPARK-36444][SQL] Remove OptimizeSubqueries from batch of PartitionPruning #33664RowLevelOperationRuntimeGroupFiltering
createsInSubquery
instead ofDynamicPruningSubquery
, so that it can be optimized byOptimizeSubqueries
later. This also avoids unnecessary planning overhead ofDynamicPruningSubquery
, as there is no join and we can only run it as a subquery.Why are the changes needed?
code simplification
Does this PR introduce any user-facing change?
no
How was this patch tested?
existing tests