[MINOR][DOC] Update the condition description of serialized shuffle #23228
Conversation
Test build #99703 has finished for PR 23228 at commit
Test build #4453 has finished for PR 23228 at commit
I believe the test failure can be ignored as it can't be related.
@@ -33,10 +33,10 @@ import org.apache.spark.shuffle._
 * Sort-based shuffle has two different write paths for producing its map output files:
 *
 * - Serialized sorting: used when all three of the following conditions hold:
-*   1. The shuffle dependency specifies no aggregation or output ordering.
+*   1. The shuffle dependency specifies no map-side combine.
Does this sound right @JoshRosen ?
looks right to me, according to
spark/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala
Line 195 in d5dadbf
} else if (dependency.mapSideCombine) {
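The check referenced above can be sketched as follows. This is a hedged, simplified reconstruction of the decision logic discussed in this PR, not the real `SortShuffleManager` API: `DepInfo` and its fields are stand-in names for illustration only.

```scala
// Simplified stand-in for the relevant ShuffleDependency state
// (hypothetical names, for illustration only).
case class DepInfo(
    supportsRelocation: Boolean, // serializer supports relocation of serialized objects
    mapSideCombine: Boolean,     // the condition this PR's doc fix is about
    numPartitions: Int)

object SerializedShuffleCheck {
  // Corresponds to 16777216 (2^24) in the discussion.
  val MaxSerializedModePartitions: Int = 1 << 24

  // Serialized (tungsten-sort) shuffle requires all three conditions.
  // Note the boundary: a partition count equal to the maximum still
  // qualifies, and only map-side combine (not reduce-side aggregation)
  // disqualifies it.
  def canUseSerializedShuffle(dep: DepInfo): Boolean =
    dep.supportsRelocation &&
      !dep.mapSideCombine &&
      dep.numPartitions <= MaxSerializedModePartitions

  def main(args: Array[String]): Unit = {
    // Reduce-side-only aggregation (mapSideCombine = false) still qualifies.
    println(canUseSerializedShuffle(DepInfo(true, false, 200)))
    // Exactly 16777216 partitions still qualifies (the boundary this PR clarifies).
    println(canUseSerializedShuffle(DepInfo(true, false, 1 << 24)))
    // Map-side combine rules serialized shuffle out.
    println(canUseSerializedShuffle(DepInfo(true, true, 200)))
  }
}
```

Running this prints `true`, `true`, `false`, matching the three cases the doc fix is about.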
LGTM, cc @jiangxb1987
Please update the title
I have updated it, thanks all.
retest this please
Test build #99892 has finished for PR 23228 at commit
retest this please
Test build #99902 has finished for PR 23228 at commit
thanks, merging to master!
## What changes were proposed in this pull request?

These three condition descriptions should be updated, following #23228:

- no Ordering is specified,
- no Aggregator is specified, and
- the number of partitions is less than `spark.shuffle.sort.bypassMergeThreshold`.

1. If the shuffle dependency specifies aggregation, but it only aggregates at the reduce side, BypassMergeSortShuffle can still be used.
2. If the number of output partitions equals `spark.shuffle.sort.bypassMergeThreshold` (e.g. 200), we can still use BypassMergeSortShuffle.

## How was this patch tested?

N/A

Closes #23281 from lcqzte10192193/wid-lcq-1211.

Authored-by: lichaoqun <li.chaoqun@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
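The bypass-merge-sort condition described in this follow-up can be sketched as below. This is a hedged simplification under assumed names (`ShuffleDep`, `BypassCheck` are illustrative, not Spark's real internals); the threshold default of 200 matches `spark.shuffle.sort.bypassMergeThreshold`.

```scala
// Simplified stand-in for the relevant dependency state
// (hypothetical names, for illustration only).
case class ShuffleDep(mapSideCombine: Boolean, numPartitions: Int)

object BypassCheck {
  // Default value of spark.shuffle.sort.bypassMergeThreshold.
  val BypassMergeThreshold: Int = 200

  // Per the PR discussion: a partition count equal to the threshold
  // still allows the bypass path, and only map-side combine (not
  // reduce-side aggregation) disqualifies it.
  def shouldBypassMergeSort(dep: ShuffleDep): Boolean =
    !dep.mapSideCombine && dep.numPartitions <= BypassMergeThreshold

  def main(args: Array[String]): Unit = {
    // Exactly at the threshold: bypass path still applies.
    println(shouldBypassMergeSort(ShuffleDep(mapSideCombine = false, numPartitions = 200)))
    // Map-side combine rules the bypass path out regardless of partition count.
    println(shouldBypassMergeSort(ShuffleDep(mapSideCombine = true, numPartitions = 10)))
  }
}
```

Running this prints `true` then `false`, illustrating why "less than" in the old doc text was misleading for the boundary case.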
## What changes were proposed in this pull request?

`1. The shuffle dependency specifies no aggregation or output ordering.`
If the shuffle dependency specifies aggregation, but it only aggregates at the reduce side, serialized shuffle can still be used.

`3. The shuffle produces fewer than 16777216 output partitions.`
If the number of output partitions is exactly 16777216, we can still use serialized shuffle.

See the method `canUseSerializedShuffle`.

## How was this patch tested?

N/A

Closes apache#23228 from 10110346/SerializedShuffle_doc.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>