[MINOR][DOC] Update the condition description of serialized shuffle #23228
Conversation
Test build #99703 has finished for PR 23228 at commit
Test build #4453 has finished for PR 23228 at commit
I believe the test failure can be ignored as it can't be related.
@@ -33,10 +33,10 @@ import org.apache.spark.shuffle._
 * Sort-based shuffle has two different write paths for producing its map output files:
 *
 * - Serialized sorting: used when all three of the following conditions hold:
-*   1. The shuffle dependency specifies no aggregation or output ordering.
+*   1. The shuffle dependency specifies no map-side combine.
Does this sound right @JoshRosen ?
looks right to me, according to
spark/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala
Line 195 in d5dadbf
} else if (dependency.mapSideCombine) {
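The check referenced above can be sketched as follows. This is a hedged, simplified reconstruction of the decision logic discussed in this PR, not the real `SortShuffleManager` API: `DepInfo` and its fields are stand-in names for illustration only.

```scala
// Simplified stand-in for the relevant ShuffleDependency state
// (hypothetical names, for illustration only).
case class DepInfo(
    supportsRelocation: Boolean, // serializer supports relocation of serialized objects
    mapSideCombine: Boolean,     // the condition this PR's doc fix is about
    numPartitions: Int)

object SerializedShuffleCheck {
  // Corresponds to 16777216 (2^24) in the discussion.
  val MaxSerializedModePartitions: Int = 1 << 24

  // Serialized (tungsten-sort) shuffle requires all three conditions.
  // Note the boundary: a partition count equal to the maximum still
  // qualifies, and only map-side combine (not reduce-side aggregation)
  // disqualifies it.
  def canUseSerializedShuffle(dep: DepInfo): Boolean =
    dep.supportsRelocation &&
      !dep.mapSideCombine &&
      dep.numPartitions <= MaxSerializedModePartitions

  def main(args: Array[String]): Unit = {
    // Reduce-side-only aggregation (mapSideCombine = false) still qualifies.
    println(canUseSerializedShuffle(DepInfo(true, false, 200)))
    // Exactly 16777216 partitions still qualifies (the boundary this PR clarifies).
    println(canUseSerializedShuffle(DepInfo(true, false, 1 << 24)))
    // Map-side combine rules serialized shuffle out.
    println(canUseSerializedShuffle(DepInfo(true, true, 200)))
  }
}
```

Running this prints `true`, `true`, `false`, matching the three cases the doc fix is about.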
LGTM, cc @jiangxb1987
Please update the title
I have updated it, thanks all.
retest this please
Test build #99892 has finished for PR 23228 at commit
retest this please
Test build #99902 has finished for PR 23228 at commit
thanks, merging to master!
## What changes were proposed in this pull request?

These three condition descriptions should be updated, following #23228:

- no Ordering is specified,
- no Aggregator is specified, and
- the number of partitions is less than `spark.shuffle.sort.bypassMergeThreshold`.

1. If the shuffle dependency specifies aggregation, but it only aggregates at the reduce side, BypassMergeSortShuffle can still be used.
2. If the number of output partitions equals `spark.shuffle.sort.bypassMergeThreshold` (e.g. 200), we can still use BypassMergeSortShuffle.

## How was this patch tested?

N/A

Closes #23281 from lcqzte10192193/wid-lcq-1211.

Authored-by: lichaoqun <li.chaoqun@zte.com.cn>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
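The bypass-merge-sort condition described in this follow-up can be sketched as below. This is a hedged simplification under assumed names (`ShuffleDep`, `BypassCheck` are illustrative, not Spark's real internals); the threshold default of 200 matches `spark.shuffle.sort.bypassMergeThreshold`.

```scala
// Simplified stand-in for the relevant dependency state
// (hypothetical names, for illustration only).
case class ShuffleDep(mapSideCombine: Boolean, numPartitions: Int)

object BypassCheck {
  // Default value of spark.shuffle.sort.bypassMergeThreshold.
  val BypassMergeThreshold: Int = 200

  // Per the PR discussion: a partition count equal to the threshold
  // still allows the bypass path, and only map-side combine (not
  // reduce-side aggregation) disqualifies it.
  def shouldBypassMergeSort(dep: ShuffleDep): Boolean =
    !dep.mapSideCombine && dep.numPartitions <= BypassMergeThreshold

  def main(args: Array[String]): Unit = {
    // Exactly at the threshold: bypass path still applies.
    println(shouldBypassMergeSort(ShuffleDep(mapSideCombine = false, numPartitions = 200)))
    // Map-side combine rules the bypass path out regardless of partition count.
    println(shouldBypassMergeSort(ShuffleDep(mapSideCombine = true, numPartitions = 10)))
  }
}
```

Running this prints `true` then `false`, illustrating why "less than" in the old doc text was misleading for the boundary case.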
## What changes were proposed in this pull request?

`1. The shuffle dependency specifies no aggregation or output ordering.`
If the shuffle dependency specifies aggregation, but it only aggregates at the reduce side, serialized shuffle can still be used.

`3. The shuffle produces fewer than 16777216 output partitions.`
If the number of output partitions is exactly 16777216, we can still use serialized shuffle.

See the method `canUseSerializedShuffle`.

## How was this patch tested?

N/A

Closes apache#23228 from 10110346/SerializedShuffle_doc.

Authored-by: liuxian <liu.xian3@zte.com.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>