[SPARK-26138][SQL] Cross join requires push LocalLimit in LimitPushDown rule #23104
Please add some UTs in order to enforce this. Moreover, can't we push it on the right side as well?
OK, I will add some UTs.
In mathematics, the Cartesian product of two sets X and Y (also called the direct product, written X × Y) is the set of all ordered pairs whose first element is a member of X and whose second element is a member of Y. So a cross join must push the limit to the left side.
@guoxiaolongzte that still doesn't explain why we couldn't push to the right side too. I do believe that it is possible. If the right side contains more than N items, where N is the limit size, the output will contain the combinations of the first item from the left side and the first N items from the right side. If the right side contains fewer than N items, pushing the limit on its side has no effect on the result.
Yes, I tested it and understand now; you are right. @mgaido91
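To make the argument above concrete, here is a small Python sketch (not Spark code). It models both sides as ordered lists and LIMIT as "take the first N rows", which is an assumption: a real LIMIT without ORDER BY may return any N rows. The helper names `cross_join`, `limit` and `cross_join_with_pushed_limit` are made up for illustration.

```python
def cross_join(left, right):
    """Nested-loop cross join: every left row paired with every right row."""
    return [(l, r) for l in left for r in right]


def limit(rows, n):
    """Keep the first n rows (a deterministic stand-in for LIMIT n)."""
    return rows[:n]


def cross_join_with_pushed_limit(left, right, n):
    """LIMIT n over a cross join, with the same limit applied to both inputs first."""
    return limit(cross_join(limit(left, n), limit(right, n)), n)


left = ["a", "b", "c"]
long_right = [1, 2, 3, 4, 5]   # more rows than most limits below
short_right = [1, 2]           # fewer rows than most limits below

for n in range(8):
    for right in (long_right, short_right):
        plain = limit(cross_join(left, right), n)
        pushed = cross_join_with_pushed_limit(left, right, n)
        # Pushing the limit to both sides never changes the limited result here.
        assert plain == pushed
```

In this model, when the right side has at least N rows the first N output rows all come from the first left row, and when it has fewer than N rows the limit on the right is a no-op, which matches the two cases described above.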
Now this seems reasonable to me. cc @cloud-fan @dongjoon-hyun @gatorsmile Shall we trigger a build for this? Thanks.
@cloud-fan @dongjoon-hyun @gatorsmile
@@ -459,6 +459,7 @@ object LimitPushDown extends Rule[LogicalPlan] {
         val newJoin = joinType match {
           case RightOuter => join.copy(right = maybePushLocalLimit(exp, right))
           case LeftOuter => join.copy(left = maybePushLocalLimit(exp, left))
+          case Cross => join.copy(left = maybePushLocalLimit(exp, left), right = maybePushLocalLimit(exp, right))
how about inner join without condition?
maybe we can match InnerLike when condition is empty.
A = {(a, 0), (b, 1), (c, 2), (d, 0), (e, 1), (f, 2)}
B = {(e, 1), (f, 2)}
A inner join B limit 2
If the limit 2 is pushed down, the left side becomes {(a, 0), (b, 1)}, and {(a, 0), (b, 1)} inner join {(e, 1), (f, 2)} gives an empty result. But the real result is not empty.
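The counterexample above can be checked with a short Python sketch. The comment does not state the join condition; the sketch assumes the join compares the first component of each pair, since that is the reading under which the limited left side matches nothing. `inner_join` and `first` are illustrative helpers, and LIMIT is modeled as "take the first N rows".

```python
def inner_join(left, right, key):
    """Inner join keeping pairs whose join keys are equal."""
    return [(l, r) for l in left for r in right if key(l) == key(r)]


def first(row):
    """Assumed join key: the first component of a row."""
    return row[0]


A = [("a", 0), ("b", 1), ("c", 2), ("d", 0), ("e", 1), ("f", 2)]
B = [("e", 1), ("f", 2)]

# LIMIT 2 applied after the join, as the query asks.
expected = inner_join(A, B, first)[:2]

# LIMIT 2 pushed below the join onto the left side first.
pushed = inner_join(A[:2], B, first)[:2]

print(expected)  # the two matching pairs for keys "e" and "f"
print(pushed)    # no pairs: the limited left side only holds keys "a" and "b"
```

So with a join condition the limit can discard exactly the rows that would have matched, which is why the pushdown is only discussed for joins without a condition.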
inner join without condition is literally cross join.
When spark.sql.crossJoin.enabled=true is set, are inner join without condition, LeftOuter without condition, RightOuter without condition, and FullOuter without condition all literally cross joins?
@cloud-fan
@guoxiaolongzte nope.
I think that when spark.sql.crossJoin.enabled=true is set, for an Inner join, LeftOuter join, RightOuter join, or FullOuter join without condition, the limit should be pushed down on both sides, just like the cross join limit in this PR.
Is this correct?
@cloud-fan
Please give me some advice. Thank you.
if there is no join condition, I think join type doesn't matter and we can always push down limits. We may need to look into left anti join though.
There are two tables as follows:
CREATE TABLE test1 (id int, name int);
CREATE TABLE test2 (id int, name int);
test1 table data:
2,2
1,1
test2 table data:
2,2
3,3
4,4
Executing select * from test1 t1 left anti join test2 t2 on t1.id=t2.id limit 1; gives the result:
1,1
But:
If we push the limit 1 to the left side, the result is not correct: the left side keeps only 2,2, which matches a row in test2, so the result is empty.
If we push the limit 1 to the right side, the result can also be incorrect: if the limit keeps a row other than 2,2, then 2,2 is no longer filtered out and can be returned instead of 1,1.
So the limit should not be pushed down through a left anti join. The same logic applies to left semi join.
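The left anti join example above can be replayed with a small Python sketch over the same rows. `left_anti_join` is an illustrative helper, LIMIT is modeled as "take the first N rows", and the choice of which row a right-side limit keeps (3,3 here) is an assumption, since a real LIMIT may keep any row.

```python
def left_anti_join(left, right):
    """Keep left rows whose id (first column) has no match on the right."""
    right_ids = {row[0] for row in right}
    return [row for row in left if row[0] not in right_ids]


test1 = [(2, 2), (1, 1)]
test2 = [(2, 2), (3, 3), (4, 4)]

# LIMIT 1 applied after the join, as the query asks.
correct = left_anti_join(test1, test2)[:1]

# LIMIT 1 pushed onto the left side: only (2, 2) survives, and it has a match.
left_pushed = left_anti_join(test1[:1], test2)[:1]

# LIMIT 1 pushed onto the right side, assuming the limit happens to keep (3, 3):
# (2, 2) loses its match and is returned by mistake.
right_pushed = left_anti_join(test1, [(3, 3)])[:1]

print(correct)       # [(1, 1)]
print(left_pushed)   # []
print(right_pushed)  # [(2, 2)]
```

Both pushed variants differ from the correct result, which supports leaving the limit above a left anti join that has a condition.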
The title has a typo.
Sorry, it has been fixed.
@guoxiaolongzte good job
Could you give me some advice on this issue? @gatorsmile @cloud-fan
ok to test.
Test build #100164 has finished for PR 23104 at commit
Hi, @guoxiaolongzte . Could you run
@dongjoon-hyun
Test build #100273 has finished for PR 23104 at commit
I looked at it. This error is not caused by my pr.
@guoxiaolongzte can you address @cloud-fan 's comment? We need the same for InnerLike joins without conditions...
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
… empty

### What changes were proposed in this pull request?

This PR pushes down the limit through InnerLike joins when the condition is empty (original PR: #23104). For example:

```sql
CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b FROM range(2);
CREATE TABLE t2 using parquet AS SELECT id AS d FROM range(2);
SELECT * FROM t1 CROSS JOIN t2 LIMIT 10;
```

Before this PR:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- CollectLimit 10
   +- BroadcastNestedLoopJoin BuildRight, Cross
      :- FileScan parquet default.t1[a#5L,b#6L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint>
      +- BroadcastExchange IdentityBroadcastMode, [id=#43]
         +- FileScan parquet default.t2[d#7L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<d:bigint>
```

After this PR:

```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- CollectLimit 10
   +- BroadcastNestedLoopJoin BuildRight, Cross
      :- LocalLimit 10
      :  +- FileScan parquet default.t1[a#5L,b#6L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint>
      +- BroadcastExchange IdentityBroadcastMode, [id=#51]
         +- LocalLimit 10
            +- FileScan parquet default.t2[d#7L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<d:bigint>
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31567 from wangyum/SPARK-26138.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
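As a rough summary of the cases discussed in this thread, the following Python sketch models which join inputs receive a local limit. It is a toy decision table, not the Catalyst rule: the function name `sides_to_limit` and the string join types are made up for illustration, and it covers only the cases named in this conversation (LeftOuter, RightOuter, and InnerLike joins with an empty condition).

```python
def sides_to_limit(join_type, has_condition):
    """Return the join inputs that get a local limit in this toy model."""
    # InnerLike joins (Inner and Cross) qualify only when there is no condition;
    # a condition could discard the rows that the limit happened to keep.
    if join_type in ("Inner", "Cross") and not has_condition:
        return {"left", "right"}
    # Outer joins preserve every row of one side, so only that side is limited.
    if join_type == "LeftOuter":
        return {"left"}
    if join_type == "RightOuter":
        return {"right"}
    # Everything else (for example a conditioned inner join or a conditioned
    # left anti join) is left alone in this sketch.
    return set()


for case in [("Cross", False), ("Inner", False), ("Inner", True),
             ("LeftOuter", True), ("RightOuter", True), ("LeftAnti", True)]:
    print(case, sorted(sides_to_limit(*case)))
```

The conditioned inner join and the conditioned left anti join fall through to the empty set, matching the counterexamples given earlier in the review.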
### What changes were proposed in this pull request?

In LimitPushDown batch, cross join can push down the limit.

### How was this patch tested?

manual tests