[SPARK-44000][SQL] Add hint to disable broadcasting and replicating one side of join #41499
Thank you, @aokolnychyi!
sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala
+1, LGTM (only a minor comment). Thank you again, @aokolnychyi.
Also, cc @sunchao, @viirya, @huaxingao, too
@dongjoon-hyun, I'll submit a follow-up PR with the comments.
@@ -354,27 +367,29 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] {
       }

       def createCartesianProduct() = {
-        if (joinType.isInstanceOf[InnerLike]) {
+        if (joinType.isInstanceOf[InnerLike] && !hintToNotBroadcastAndReplicate(hint)) {
Is CartesianProduct related to broadcast?
It is related to replication. If I am correct, it would zip each partition on one side with each partition on the other side so the same target record would appear in multiple places.
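To make the replication concern concrete, here is a minimal sketch in plain Scala (no Spark dependency) of the pairing described above. `CartesianPairs` and its methods are hypothetical names used only for illustration; the sketch models the description in this comment and is not how Spark actually implements the cartesian product.

```scala
object CartesianPairs {
  // Every (leftPartition, rightPartition) index pair that a cartesian
  // product would have to process under the description above.
  def partitionPairs(numLeft: Int, numRight: Int): Seq[(Int, Int)] =
    for {
      l <- 0 until numLeft
      r <- 0 until numRight
    } yield (l, r)

  // Number of pairs in which a given left partition takes part, i.e. how
  // many times its records are read.
  def timesLeftPartitionIsRead(numLeft: Int, numRight: Int, left: Int): Int =
    partitionPairs(numLeft, numRight).count(_._1 == left)
}
```

With 2 left partitions and 3 right partitions there are 6 pairs, and each left partition takes part in 3 of them, so the same record can show up in several tasks.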
Oh, this is for replicating?
EDIT: saw your reply after adding this comment.
The new hint prohibits both broadcasting and replication so it applies to 3 cases:
- Broadcast hash
- Cartesian product
- BNLJ build side
It would be nice to add this to the PR description too.
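As a rough illustration of the three cases listed above, the sketch below models the gating in plain Scala. `Hint` is a simplified stand-in for Spark's `JoinHint` that only records which side carries the new hint, and `broadcastHashCandidates` encodes an assumption (the hinted side cannot be the broadcast side) rather than code taken from the PR. The cartesian-product and BNLJ checks follow the conditions shown in this PR's diff.

```scala
object NoBroadcastReplicateGates {
  // Simplified stand-in for Spark's JoinHint (hypothetical): only tracks
  // which side carries the no-broadcast-and-replicate hint.
  final case class Hint(onLeft: Boolean = false, onRight: Boolean = false)

  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  def hintToNotBroadcastAndReplicate(hint: Hint): Boolean =
    hint.onLeft || hint.onRight

  // 1. Broadcast hash: ASSUMPTION, the hinted side may not be broadcast.
  def broadcastHashCandidates(hint: Hint): Set[BuildSide] = {
    val left: Set[BuildSide] = if (hint.onLeft) Set.empty else Set(BuildLeft)
    val right: Set[BuildSide] = if (hint.onRight) Set.empty else Set(BuildRight)
    left ++ right
  }

  // 2. Cartesian product: same condition as in createCartesianProduct.
  def cartesianProductAllowed(isInnerLike: Boolean, hint: Hint): Boolean =
    isInnerLike && !hintToNotBroadcastAndReplicate(hint)

  // 3. BNLJ build side: same branching as getBroadcastNestedLoopJoinBuildSide.
  def bnljBuildSide(hint: Hint): Option[BuildSide] =
    if (hint.onLeft) Some(BuildRight)
    else if (hint.onRight) Some(BuildLeft)
    else None
}
```

For example, `bnljBuildSide(Hint(onLeft = true))` yields `Some(BuildRight)`, so the hinted left side is never the one being broadcast.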
Thank you, @viirya and @aokolnychyi. Now, pending CIs.
It seems that
When you have a chance, could you re-trigger the failed test pipeline?
Triggered again: https://github.com/aokolnychyi/spark/actions/runs/5202190900
Thank you so much!
Unfortunately, this PR passed
Thank you, @aokolnychyi and @viirya.
Thank you, @dongjoon-hyun @viirya! I've created #41509 to add comments.
…ne side of join
Closes apache#41499 from aokolnychyi/spark-44000.
Authored-by: aokolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
def getBroadcastNestedLoopJoinBuildSide(hint: JoinHint): Option[BuildSide] = {
  if (hintToNotBroadcastAndReplicateLeft(hint)) {
    Some(BuildRight)
  } else if (hintToNotBroadcastAndReplicateRight(hint)) {
    Some(BuildLeft)
  } else {
    None
  }
}
This logic is saying that if we hint to NOT broadcast left, we can broadcast right, and vice-versa. Shouldn't we check the hint on the other side as well?
i.e.
if (hintToNotBroadcastAndReplicateLeft(hint) && !hintToNotBroadcastAndReplicateRight(hint)) {
  Some(BuildRight)
} else if (hintToNotBroadcastAndReplicateRight(hint) && !hintToNotBroadcastAndReplicateLeft(hint)) {
  Some(BuildLeft)
} else {
  None
}
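To see where the two versions actually differ, here is a small self-contained sketch that enumerates all hint combinations. The Boolean parameters stand in for `hintToNotBroadcastAndReplicateLeft(hint)` and `hintToNotBroadcastAndReplicateRight(hint)`. `BnljBuildSideComparison`, `asInDiff`, and `asSuggested` are hypothetical names, and `BuildSide` is redefined locally so the snippet runs without Spark.

```scala
object BnljBuildSideComparison {
  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  // Branching as written in the diff under review.
  def asInDiff(noLeft: Boolean, noRight: Boolean): Option[BuildSide] =
    if (noLeft) Some(BuildRight)
    else if (noRight) Some(BuildLeft)
    else None

  // Branching as suggested in this comment.
  def asSuggested(noLeft: Boolean, noRight: Boolean): Option[BuildSide] =
    if (noLeft && !noRight) Some(BuildRight)
    else if (noRight && !noLeft) Some(BuildLeft)
    else None

  // All (noLeft, noRight) inputs on which the two versions disagree.
  val disagreements: Seq[(Boolean, Boolean)] =
    for {
      l <- Seq(false, true)
      r <- Seq(false, true)
      if asInDiff(l, r) != asSuggested(l, r)
    } yield (l, r)
}
```

The only disagreement is the case where both sides carry the hint: the version in the diff still picks `BuildRight`, while the suggested version returns `None`.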
cc @aokolnychyi
At the moment, the new hint can only be set on one side, never on both. BNLJ is the default join strategy, so having the no-broadcast-and-replicate hint on both sides would mean there is no applicable fallback join strategy to use. If we were to adapt the method above, we couldn't keep the existing default logic that picks the broadcast side based on size (that could cause a correctness problem). What about adding validation that the new hint is set on only one side?
> What about adding validation that the new hint is set on only one side?
Sounds good.
Cool, I will work on it. Created SPARK-44148.
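For readers following along, the validation agreed on in this thread could look roughly like the sketch below. This is only an illustration of the idea: `SideHints` and `validate` are hypothetical, and the actual change tracked in SPARK-44148 may be structured quite differently.

```scala
object HintValidation {
  // Hypothetical summary of which sides carry the no-broadcast-and-replicate hint.
  final case class SideHints(noBroadcastReplicateLeft: Boolean,
                             noBroadcastReplicateRight: Boolean)

  // Rejects the combination for which no fallback join strategy would remain.
  def validate(hints: SideHints): Either[String, SideHints] =
    if (hints.noBroadcastReplicateLeft && hints.noBroadcastReplicateRight) {
      Left("The no-broadcast-and-replicate hint can be set on only one side of a join")
    } else {
      Right(hints)
    }
}
```

Rejecting the both-sides case up front would let the build-side selection keep its current shape, since it could then assume that at most one side is hinted.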
…ne side of join
Closes apache#41499 from aokolnychyi/spark-44000.
Authored-by: aokolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit d88633a)
### What changes were proposed in this pull request?
This PR adds a new internal join hint to disable broadcasting and replicating one side of join.

### Why are the changes needed?
These changes are needed to disable broadcasting and replicating one side of join when it is not permitted, such as the cardinality check in MERGE operations in PR #41448.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This PR comes with tests. More tests are in #41448.