
[SPARK-26138][SQL] Cross join requires push LocalLimit in LimitPushDown rule #23104

Closed
wants to merge 4 commits into from

Conversation

guoxiaolongzte

What changes were proposed in this pull request?

In the LimitPushDown batch, the limit can also be pushed down through a cross join.

How was this patch tested?

Manual tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Contributor

@mgaido91 mgaido91 left a comment

Please add some UTs in order to enforce this. Moreover, can't we push it on the right side as well?

@guoxiaolongzte
Author

OK, I will add some UTs.

@guoxiaolongzte
Author

guoxiaolongzte commented Nov 22, 2018

In mathematics, the Cartesian product of two sets X and Y, also known as the direct product and written X × Y, is the set of all possible ordered pairs whose first element is a member of X and whose second element is a member of Y. So a cross join must push the limit on the left side.
For example, A={a,b}, B={0,1,2}, then
A × B = {(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}
B×A={(0, a), (0, b), (1, a), (1, b), (2, a), (2, b)}

@mgaido91
Contributor

@guoxiaolongzte still that doesn't explain why we can push to the right side too. I believe it is possible. If the right side contains more than N items, where N is the limit size, the output will contain the combinations of the first item from the left side and the first N items from the right side. If the right side contains fewer than N items, pushing the limit on its side has no effect on the result.
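The reasoning above can be sanity-checked with a small simulation. This is a sketch in plain Python rather than Spark's Scala, modeling rows in scan order; the helper name `cross_join` is just for illustration:

```python
from itertools import product

def cross_join(left, right):
    # Every combination of a left row and a right row, in scan order.
    return [(l, r) for l, r in product(left, right)]

left = ["a", "b", "c"]
right = [0, 1, 2, 3]
n = 2  # the limit size

# Plain plan: cross join first, then take the first n rows.
plain = cross_join(left, right)[:n]

# Pushed-down plan: limit both sides to n rows first, then join and limit again.
pushed = cross_join(left[:n], right[:n])[:n]

assert plain == pushed  # both are [('a', 0), ('a', 1)]
```

Under this row-order model the two plans agree for any inputs, because the first n output pairs can only ever touch the first n rows of each side.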

@guoxiaolongzte
Author

Yes, I tested it and understand now; you are right. @mgaido91

Contributor

@mgaido91 mgaido91 left a comment

Now this seems reasonable to me. cc @cloud-fan @dongjoon-hyun @gatorsmile shall we trigger a build for this? Thanks.

@guoxiaolongzte
Author

@cloud-fan @dongjoon-hyun @gatorsmile
Could you help review the code?

@@ -459,6 +459,7 @@ object LimitPushDown extends Rule[LogicalPlan] {
val newJoin = joinType match {
case RightOuter => join.copy(right = maybePushLocalLimit(exp, right))
case LeftOuter => join.copy(left = maybePushLocalLimit(exp, left))
case Cross => join.copy(left = maybePushLocalLimit(exp, left), right = maybePushLocalLimit(exp, right))
Contributor

How about an inner join without a condition?

Contributor

Maybe we can match InnerLike when the condition is empty.

Author

@guoxiaolongzte guoxiaolongzte Nov 26, 2018

A = {(a, 0), (b, 1), (c, 2), (d, 0), (e, 1), (f, 2)}
B = {(e, 1), (f, 2)}

A inner join B limit 2
If we push limit 2 to the left side first, {(a, 0), (b, 1)} inner join {(e, 1), (f, 2)} gives an empty result. But the real result is not empty.
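This counterexample can be checked with a tiny simulation, sketched in plain Python rather than Spark code; the join condition here is plain tuple equality, just for illustration:

```python
from itertools import product

A = [("a", 0), ("b", 1), ("c", 2), ("d", 0), ("e", 1), ("f", 2)]
B = [("e", 1), ("f", 2)]

def inner_join(left, right):
    # Inner join with a real condition: keep only matching pairs.
    return [(l, r) for l, r in product(left, right) if l == r]

correct = inner_join(A, B)[:2]     # the (e, 1) and (f, 2) rows match: not empty
pushed = inner_join(A[:2], B)[:2]  # (a, 0) and (b, 1) match nothing: empty

assert correct != []
assert pushed == []
```

When the join has a condition, limiting an input can discard exactly the rows that would have matched, which is why the pushdown must be restricted to joins without a condition.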

Contributor

An inner join without a condition is literally a cross join.

Author

@guoxiaolongzte guoxiaolongzte Nov 26, 2018

When spark.sql.crossJoin.enabled=true is set, are inner join without condition, LeftOuter without condition, RightOuter without condition, and FullOuter without condition all literally cross joins?
@cloud-fan


Author

@guoxiaolongzte guoxiaolongzte Nov 26, 2018

I think that when spark.sql.crossJoin.enabled=true is set, for inner join without condition, LeftOuter join without condition, RightOuter join without condition, and FullOuter join without condition, the limit should be pushed down on both sides, just like the cross join limit in this PR.
Is this correct?

Author

@cloud-fan
Please give me some advice. Thank you.

Contributor

@cloud-fan cloud-fan Nov 27, 2018

If there is no join condition, I think the join type doesn't matter and we can always push down limits. We may need to look into left anti join, though.

Author

@guoxiaolongzte guoxiaolongzte Nov 28, 2018

There are two tables as follows:
CREATE TABLE test1(id int, name int);
CREATE TABLE test2(id int, name int);

test1 table data:
2,2
1,1

test2 table data:
2,2
3,3
4,4

Executing `SELECT * FROM test1 t1 LEFT ANTI JOIN test2 t2 ON t1.id = t2.id LIMIT 1;` returns:
1,1

But:
If we push the limit 1 to the left side, the result is incorrect: it is empty.
If we push the limit 1 to the right side, the result can also be incorrect.

So left anti join must not push down the limit. Similarly, left semi join follows the same logic.
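These failure modes can be modeled with a small sketch in plain Python rather than Spark code; `left_anti_join` is a hypothetical helper keyed on the first column, and the right-side case assumes one possible scan order:

```python
def left_anti_join(left, right):
    # Keep the left rows whose id has no match on the right.
    right_ids = {rid for rid, _ in right}
    return [row for row in left if row[0] not in right_ids]

test1 = [(2, 2), (1, 1)]
test2 = [(2, 2), (3, 3), (4, 4)]

correct = left_anti_join(test1, test2)[:1]
assert correct == [(1, 1)]

# Pushing limit 1 to the left keeps only (2, 2), which the anti join
# then filters out: the result is wrongly empty.
push_left = left_anti_join(test1[:1], test2)[:1]
assert push_left == []

# Pushing limit 1 to the right is also unsafe: if the limited scan happens
# to keep (3, 3), the row (2, 2) wrongly survives the anti join.
push_right = left_anti_join(test1, [(3, 3)])[:1]
assert push_right == [(2, 2)]
```

Limiting the left can drop rows that belong in the output, and limiting the right can drop the matches that should have filtered rows out, so neither side is safe for anti (or semi) joins.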

@gatorsmile
Member

The title has a typo.

@guoxiaolongzte guoxiaolongzte changed the title [SPARK-26138][SQL] LimitPushDown cross join requires maybeBushLocalLimit [SPARK-26138][SQL] Cross join requires push LocalLimit in LimitPushDown rule Nov 26, 2018
@guoxiaolongzte
Author

> The title has a typo.

Sorry, it has been fixed.

@liu-zhaokun
Contributor

@guoxiaolongzte good job

@guoxiaolongzte
Author

Could you give me some advice on this issue? @gatorsmile @cloud-fan

@dongjoon-hyun
Member

ok to test.

@SparkQA

SparkQA commented Dec 14, 2018

Test build #100164 has finished for PR 23104 at commit 588c151.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Hi, @guoxiaolongzte . Could you run `dev/scalastyle` and fix the issue?

@guoxiaolongzte
Author

@dongjoon-hyun
Thank you very much for your review.

@SparkQA

SparkQA commented Dec 18, 2018

Test build #100273 has finished for PR 23104 at commit e173962.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@guoxiaolongzte
Author

I looked into it. This error is not caused by my PR.

@mgaido91
Contributor

@guoxiaolongzte can you address @cloud-fan's comment? We need the same for InnerLike joins without conditions...

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 4, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

wangyum added a commit that referenced this pull request Feb 24, 2021
… empty

### What changes were proposed in this pull request?

This PR pushes down the limit through InnerLike when the condition is empty (original PR: #23104). For example:
```sql
CREATE TABLE t1 using parquet AS SELECT id AS a, id AS b FROM range(2);
CREATE TABLE t2 using parquet AS SELECT id AS d FROM range(2);
SELECT * FROM t1 CROSS JOIN t2 LIMIT 10;
```
Before this pr:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- CollectLimit 10
   +- BroadcastNestedLoopJoin BuildRight, Cross
      :- FileScan parquet default.t1[a#5L,b#6L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint>
      +- BroadcastExchange IdentityBroadcastMode, [id=#43]
         +- FileScan parquet default.t2[d#7L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<d:bigint>
```
After this pr:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- CollectLimit 10
   +- BroadcastNestedLoopJoin BuildRight, Cross
      :- LocalLimit 10
      :  +- FileScan parquet default.t1[a#5L,b#6L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint>
      +- BroadcastExchange IdentityBroadcastMode, [id=#51]
         +- LocalLimit 10
            +- FileScan parquet default.t2[d#7L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehous..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<d:bigint>
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31567 from wangyum/SPARK-26138.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>