-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-2213][SQL] Sort Merge Join #3173
Conversation
Conflicts: sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
Test build #23112 has started for PR 3173 at commit
|
Test build #23112 has finished for PR 3173 at commit
|
Test FAILed. |
val mergeJoin = joins.MergeJoin(leftKeys, rightKeys, planLater(left), planLater(right)) | ||
condition.map(Filter(_, mergeJoin)).getOrElse(mergeJoin) :: Nil | ||
|
||
case ExtractEquiJoinKeys(Inner, leftKeys, rightKeys, condition, left, right) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just passing through, but is this right? It appears to be the same case as above, so I'm not sure when it would be matched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, the hash inner join will not be matched. I am trying to resolve to merge join on inner join. I am still figuring out the best way to add merge join to query optimizer.
Test build #23113 has started for PR 3173 at commit
|
Test build #23113 has finished for PR 3173 at commit
|
Test FAILed. |
That's really nice to have the Sort-Merge-Join, as we did meet some of join queries couldn't run completely in real cases. One high level comment on this, can we also keep the BTW: do you have any performance comparison result can be shared with us? |
/cc @yhuai for the changes to our partitioning API. I also agree with @chenghao-intel that we probably want to keep ShuffleHashJoin and also that we need some performance comparison information. |
* [[Expression Expressions]] will be co-located. Based on the context, this | ||
* can mean such tuples are either co-located in the same partition or they will be contiguous | ||
* within a single partition. | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems you want to update this comment.
Thanks for working on this! However, since there are a few questions to address about how to select this operator, I suggest we move discussion to JIRA and close this issue for now. Please reopen once it is passing tests and ready for review. |
Hi, this looks great! Is there a reason why sort based join is not in spark core, only in spark SQL? |
@justinuang i think you are interesting in SPARK-5763. |
Thanks for the initial work from Ishiihara in #3173 This PR introduce a new join method of sort merge join, which firstly ensure that keys of same value are in the same partition, and inside each partition the Rows are sorted by key. Then we can run down both sides together, find matched rows using [sort merge join](http://en.wikipedia.org/wiki/Sort-merge_join). In this way, we don't have to store the whole hash table of one side as hash join, thus we have less memory usage. Also, this PR would benefit from #3438 , making the sorting phrase much more efficient. We introduced a new configuration of "spark.sql.planner.sortMergeJoin" to switch between this(`true`) and ShuffledHashJoin(`false`), probably we want the default value of it be `false` at first. Author: Daoyuan Wang <daoyuan.wang@intel.com> Author: Michael Armbrust <michael@databricks.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #5208 from adrian-wang/smj and squashes the following commits: 2493b9f [Daoyuan Wang] fix style 5049d88 [Daoyuan Wang] propagate rowOrdering for RangePartitioning f91a2ae [Daoyuan Wang] yin's comment: use external sort if option is enabled, add comments f515cd2 [Daoyuan Wang] yin's comment: outputOrdering, join suite refine ec8061b [Daoyuan Wang] minor change 413fd24 [Daoyuan Wang] Merge pull request #3 from marmbrus/pr/5208 952168a [Michael Armbrust] add type 5492884 [Michael Armbrust] copy when ordering 7ddd656 [Michael Armbrust] Cleanup addition of ordering requirements b198278 [Daoyuan Wang] inherit ordering in project c8e82a3 [Daoyuan Wang] fix style 6e897dd [Daoyuan Wang] hide boundReference from manually construct RowOrdering for key compare in smj 8681d73 [Daoyuan Wang] refactor Exchange and fix copy for sorting 2875ef2 [Daoyuan Wang] fix changed configuration 61d7f49 [Daoyuan Wang] add omitted comment 00a4430 [Daoyuan Wang] fix bug 078d69b [Daoyuan Wang] address comments: add comments, do sort in shuffle, and others 3af6ba5 [Daoyuan Wang] use buffer for only one side 171001f [Daoyuan Wang] change default outputordering 47455c9 [Daoyuan Wang] add apache license ... a28277f [Daoyuan Wang] fix style 645c70b [Daoyuan Wang] address comments using sort 068c35d [Daoyuan Wang] fix new style and add some tests 925203b [Daoyuan Wang] address comments 07ce92f [Daoyuan Wang] fix ArrayIndexOutOfBound 42fca0e [Daoyuan Wang] code clean e3ec096 [Daoyuan Wang] fix comment style.. 2edd235 [Daoyuan Wang] fix outputpartitioning 57baa40 [Daoyuan Wang] fix sort eval bug 303b6da [Daoyuan Wang] fix several errors 95db7ad [Daoyuan Wang] fix brackets for if-statement 4464f16 [Daoyuan Wang] fix error 880d8e9 [Daoyuan Wang] sort merge join for spark sql
This PR adds MergeJoin operator to Spark SQL. The semantics of MergeJoin operator is similar to Hive's Sort merge bucket join.
MergeJoin operator relies on SortBasedShuffle to create partitions that sorted by the join key. In each partition, we merge the two child iterators. The tricky part in merge step is handling duplicate join keys. To handle duplicate keys, we use a buffer to store all matching elements in right iterator for a certain join key. The buffer is used for generating join tuples when the join key of the next left element is the same as the current join key.
MergeJoin reduces extra memory consumption, in the current implementation, MergeJoin only needs memory that can hold elements of the key that has the most duplicates in right iterator.
For query optimization, we may resolve to MergeJoin when both relations are large and neither of the two can fit in memory. Currently, this heuristic is not added to optimizer. I would appreciate if you can add comments on how to resolve to MergeJoin in optimizer.
Currently, MergeJoin only supports inner join. However, it can be extended to support outer join. Will handle outer join in separate PRs.