[SPARK-8319] [CORE] [SQL] Update logic related to key orderings in shuffle dependencies #6773

JoshRosen · 2015-06-12T01:22:33Z

This patch updates two pieces of logic that are related to handling of keyOrderings in ShuffleDependencies:

- The Tungsten ShuffleManager falls back to regular SortShuffleManager whenever the shuffle dependency specifies a key ordering, but technically we only need to fall back when an aggregator is also specified. This patch updates the fallback logic to reflect this so that the Tungsten optimizations can apply to more workloads.
- The SQL Exchange operator performs defensive copying of shuffle inputs when a key ordering is specified, but this is unnecessary. The copying was added to guard against cases where ExternalSorter would buffer non-serialized records in memory. When ExternalSorter is configured without an aggregator, it uses the following logic to determine whether to buffer records in a serialized or deserialized format:
```scala
private val useSerializedPairBuffer =
  ordering.isEmpty &&
  conf.getBoolean("spark.shuffle.sort.serializeMapOutputs", true) &&
  ser.supportsRelocationOfSerializedObjects
```
The `newOrdering.isDefined` branch in `Exchange.needToCopyObjectsBeforeShuffle`, removed by this patch, is not necessary:
- It was checked even when sort-based shuffle was not in use, which was unnecessary because only SortShuffleManager performs map-side sorting.
- Map-side sorting during shuffle writing is only performed for shuffles that perform map-side aggregation as part of the shuffle (to see this, look at how SortShuffleWriter constructs ExternalSorter). Since SQL never pushes aggregation into Spark's shuffle, we can guarantee that both the aggregator and the ordering will be empty. Spark SQL also always uses serializers that support relocation, so sort-shuffle will use the serialized pair buffer unless the user has explicitly disabled it via the SparkConf feature flag. Therefore, I think my optimization in Exchange should be safe.
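
To make the first change concrete, here is a minimal sketch of the Tungsten fallback decision before and after the patch. `DepSketch`, `TungstenFallbackSketch` and both method names are hypothetical stand-ins rather than Spark's actual classes, and the model deliberately ignores any other conditions the real check may apply:

```scala
// Hypothetical, simplified model of a ShuffleDependency (not Spark's class).
final case class DepSketch(hasAggregator: Boolean, hasKeyOrdering: Boolean)

object TungstenFallbackSketch {
  // Assumed pre-patch behavior: fall back to SortShuffleManager whenever a
  // key ordering (or an aggregator) is specified.
  def fallsBackBefore(dep: DepSketch): Boolean =
    dep.hasAggregator || dep.hasKeyOrdering

  // Post-patch behavior as described above: only an aggregator forces the fallback.
  def fallsBackAfter(dep: DepSketch): Boolean =
    dep.hasAggregator
}
```

Under this model, a sort-only dependency such as `DepSketch(hasAggregator = false, hasKeyOrdering = true)` falls back under the old predicate but not under the new one.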

JoshRosen · 2015-06-12T01:22:52Z

/cc @yhuai for Exchange-related changes.

yhuai · 2015-06-12T03:13:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala

-      // If a new ordering is required, then records will be sorted with Spark's `ExternalSorter`,
-      // which requires a defensive copy.
-      true
-    } else if (sortBasedShuffleOn) {


Just trying to understand the context. Is this `if (newOrdering.nonEmpty)` part not necessary because of our recent change?

See the comment in the PR description: even if an ordering were defined, I don't think it would be used to sort on the map side, because we don't do map-side sorting in shuffle unless we specify an aggregator.
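
The reasoning in this reply can be sketched roughly as follows. This is a simplified model, not Spark's code: `SorterConfigSketch`, `MapSideSortSketch` and the method names are hypothetical, and the first method only approximates how SortShuffleWriter is described as constructing ExternalSorter (the key ordering is passed through only when map-side combine is requested):

```scala
// Hypothetical summary of what the sorter is configured with (not a Spark class).
final case class SorterConfigSketch(aggregatorDefined: Boolean, orderingDefined: Boolean)

object MapSideSortSketch {
  // Assumption: without map-side combine, the sorter gets neither an
  // aggregator nor an ordering, even if the dependency declares a key ordering.
  def sorterConfigFor(mapSideCombine: Boolean, keyOrderingDefined: Boolean): SorterConfigSketch =
    if (mapSideCombine) SorterConfigSketch(aggregatorDefined = true, orderingDefined = keyOrderingDefined)
    else SorterConfigSketch(aggregatorDefined = false, orderingDefined = false)

  // Mirrors the useSerializedPairBuffer condition quoted in the PR description,
  // which applies when the sorter has no aggregator.
  def usesSerializedPairBuffer(
      cfg: SorterConfigSketch,
      serializeMapOutputs: Boolean,
      serializerSupportsRelocation: Boolean): Boolean =
    !cfg.aggregatorDefined && !cfg.orderingDefined &&
      serializeMapOutputs && serializerSupportsRelocation
}
```

In this model, a SQL shuffle (no map-side combine) with a key ordering still ends up with an empty sorter ordering, so records are buffered in serialized form as long as the feature flag is on and the serializer supports relocation.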

SparkQA · 2015-06-12T03:43:56Z

Test build #34738 has finished for PR 6773 at commit 85a4628.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-06-12T05:55:16Z

Jenkins, retest this please.

SparkQA · 2015-06-12T08:13:05Z

Test build #34751 has finished for PR 6773 at commit 85a4628.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2015-06-12T23:00:56Z

ah, i see. Exchange related part looks good to me.

JoshRosen · 2015-06-13T17:53:47Z

This patch was pretty tricky to explain, so I've revised the description and have slightly extended the code comments. I think it should be good-to-go after the next Jenkins run, so I'll merge it unless there are any objections.

JoshRosen · 2015-06-13T18:08:23Z

sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala

@@ -108,9 +108,12 @@ case class Exchange(
        // both cases, we must copy.
        true
      }
-    } else {
+    } else if (shuffleManager.isInstanceOf[HashShuffleManager]) {


This check isn't fixing any bugs / correctness issues yet, but I thought that it might guard against hard-to-find future bugs if someone defines a new shuffle manager without updating the logic here. It's extremely unlikely that anyone would do this, but in light of proposals like ParquetShuffleManager it seemed like the safest option.
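
As a rough illustration of the defensive pattern described in this comment, the sketch below handles each known shuffle manager explicitly and defaults to copying for anything unrecognized. All type and method names are hypothetical stand-ins, not Spark's classes. The sort branch is collapsed into a single flag, and the hash branch returning `false` is an assumption (that hash-based shuffle does not buffer records in memory):

```scala
// Hypothetical stand-ins for shuffle manager implementations (not Spark's classes).
trait ManagerSketch
final class SortManagerSketch extends ManagerSketch
final class HashManagerSketch extends ManagerSketch
final class SomeFutureManagerSketch extends ManagerSketch

object CopyGuardSketch {
  def needToCopy(manager: ManagerSketch, sortShuffleBuffersSerialized: Boolean): Boolean =
    manager match {
      // Sort-based shuffle: copy only if records may be buffered deserialized.
      case _: SortManagerSketch => !sortShuffleBuffersSerialized
      // Assumption: hash-based shuffle does not buffer records, so no copy is needed.
      case _: HashManagerSketch => false
      // Catch-all: an unknown manager might buffer records, so copy to be safe.
      case _ => true
    }
}
```

The catch-all costs an extra copy for managers that would not need it, but it fails safe if a new implementation is added without updating this logic.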

SparkQA · 2015-06-13T20:21:00Z

Test build #34830 has finished for PR 6773 at commit 7a14129.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2015-06-13T23:13:37Z

Alright, I'm going to merge this into master.

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#6773 from JoshRosen/SPARK-8319 and squashes the following commits:

- 7a14129 [Josh Rosen] Revise comments; add handler to guard against future ShuffleManager implementations
- 07bb2c9 [Josh Rosen] Update comment to clarify circumstances under which shuffle operates on serialized records
- 269089a [Josh Rosen] Avoid unnecessary copy in SQL Exchange
- 34e526e [Josh Rosen] Enable Tungsten shuffle for non-agg shuffles w/ key orderings

yhuai reviewed Jun 12, 2015

JoshRosen added 4 commits June 13, 2015 10:49

- 34e526e Enable Tungsten shuffle for non-agg shuffles w/ key orderings
- 269089a Avoid unnecessary copy in SQL Exchange
- 07bb2c9 Update comment to clarify circumstances under which shuffle operates on serialized records
- 7a14129 Revise comments; add handler to guard against future ShuffleManager implementations

JoshRosen force-pushed the SPARK-8319 branch from 85a4628 to 7a14129 Compare June 13, 2015 18:06

JoshRosen reviewed Jun 13, 2015

asfgit closed this in af31335 Jun 13, 2015
