-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-32201][SQL] More general skew join pattern matching #29021
Conversation
Test build #125187 has finished for PR 29021 at commit
|
sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala
Show resolved
Hide resolved
Also this PR can work with #28947 to match more pattern together. |
This is an interesting use case. We must be careful when dealing with it. The key of skew join handling is to split the skew partition into smaller parts. For |
Yes. You are correct. I have recognized this case. I will should skip aggregation :( |
Test build #125189 has finished for PR 29021 at commit
|
Test build #125215 has finished for PR 29021 at commit
|
Test build #125293 has finished for PR 29021 at commit
|
sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
Outdated
Show resolved
Hide resolved
Test build #125582 has finished for PR 29021 at commit
|
Test build #125667 has finished for PR 29021 at commit
|
Test build #125893 has finished for PR 29021 at commit
|
Test build #125950 has finished for PR 29021 at commit
|
Test build #125961 has finished for PR 29021 at commit
|
Test build #125968 has finished for PR 29021 at commit
|
Test build #125967 has finished for PR 29021 at commit
|
Gentle ping @cloud-fan |
@@ -142,6 +142,14 @@ object OptimizeLocalShuffleReader { | |||
def canUseLocalShuffleReader(plan: SparkPlan): Boolean = plan match { | |||
case s: ShuffleQueryStageExec => | |||
s.shuffle.canChangeNumPartitions | |||
// This CustomShuffleReaderExec used in skew side, its numPartitions increased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It means the rule of OptimizeLocalShuffleReader
is disabled when enable the rule of OptimizedSkwedJoin
rule ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly. In this more general skew join handling, we can match more patterns. For example, we can handle skew join like https://user-images.githubusercontent.com/1853780/87743215-01e9e780-c81b-11ea-97d9-f274b379912e.png. The number partitions of CustomShuffleReader
in the the BCJ (changed from SMJ by AE) after OptimizeLocalShuffleReader
is not equals to the anther side. So simply, I disable createLocalReader
.
@@ -340,3 +340,28 @@ case class BroadcastPartitioning(mode: BroadcastMode) extends Partitioning { | |||
case _ => false | |||
} | |||
} | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we need to add a new Partitioning
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CoalescedHashPartitioning
can satisfy the ClusteredDistribution
because a skew join may match the case which contains Aggregation (non-skew side). UnknownPartitioning
cannot satisfy ClusteredDistribution
and add an additional shuffle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @JkSelf I will provide another approach that removes this CoalescedHashPartitioning
and simplify the code. But current implementation with CoalescedHashPartitioning
might be more general for more cases.
Test build #126165 has finished for PR 29021 at commit
|
Test build #126238 has finished for PR 29021 at commit
|
retest this please |
Test build #126291 has finished for PR 29021 at commit
|
retest this please |
Test build #126315 has finished for PR 29021 at commit
|
Hi @cloud-fan @JkSelf , please help to review this PR. I am going to file a new PR for handling three tables SMJ skew which scope is beyond this PR. |
@@ -263,18 +299,31 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] { | |||
val shuffleStages = collectShuffleStages(plan) | |||
|
|||
if (shuffleStages.length == 2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not we break this limitation first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this PR is not to address the case which has multiple SMJ. We have another PR to change this limitation:
optimizeSingleStageSkewJoin
. This is the case one table is a bucket table and the SMJ is bucketing join with one side shuffle and skewingoptimizeThreeShuffleStageSkewJoin
. This is to address three tables SMJ (Two SMJs in one stage and no one can be changed to BCJ in AQE).
val right = rightOpt.get | ||
assert(left.partitionsWithSizes.length == right.partitionsWithSizes.length) | ||
val numPartitions = left.partitionsWithSizes.length | ||
// We use the median size of the original shuffle partitions to detect skewed partitions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is very hard to reason about. We need to clearly define:
- what nodes can appear between the shuffle stage and SMJ. As we discussed before, Agg can't appear at the skew side.
- how to estimate the size? Since there are nodes in the middle, the stats of the shuffle stage may not be accurate for the final join child. (e.g. Filter in the middle)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- what nodes can appear between the shuffle stage and SMJ. As we discussed before, Agg can't appear at the skew side.
In the canSplitLeftSide
and canSplitRightSide
, I added a allUnspecifiedDistribution(plan)
check. Current we only support the nodes with UnspecifiedDistribution
.
- how to estimate the size? Since there are nodes in the middle, the stats of the shuffle stage may not be accurate for the final join child. (e.g. Filter in the middle)
Filter should be pushdown to leaf, I didn't see this user case. Project may be a command case in the middle? Yes. the input size of shuffle stage may not be accurate. But the disadvantage is launching more tasks. I think the benefit from handling the skewing is more important than the disadvantage.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Current the AQE skew join handling logic is very specified.
It only can handle the pattern like this (2 tables):
We propose a more general skew Join pattern matching patch with less code changes.
In this patch, we can handle N-table join, join with aggregation, and so on.
PS: Here, N tables SMJ will be optimized to N-1 BCJ + 1 SMJ after AE. This PR won't handle the case N SMJ after AE. I will handle it in another PR.
Why are the changes needed?
In our production user cases, we found lots of slow jobs due to data skewing even we have enabled AQE skewed join. After investigated their patterns, we found current skewed join handle logic is so specified which can satisfied less production queries. The production queries are much more complicated than this pattern.
A straightforward case I will introduce here:
In above plan, there are 5 tables join case. This is not a simple case could be matched by the above pattern. But we still could see it is very similar with the pattern if we removed all the red boxes.
From the stage graph, the plan is much more straightforward:
The green boxes pattern is what we want to handle whatever red boxes exist or not.
Does this PR introduce any user-facing change?
No
How was this patch tested?
We give two unit tests.
Before:
After:
Before:
After: