Broadcast task sql test [DO NOT MERGE] #1654

rxin · 2014-07-30T07:40:55Z

This is the same as #1498 but adds a dummy change to sql so I can trigger a SQL test on Jenkins.

This is not meant to be merged.

SparkQA · 2014-07-30T07:44:09Z

QA tests have started for PR 1654. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17432/consoleFull

…very task). Currently (as of Spark 1.0.1), Spark sends RDD object (which contains closures) using Akka along with the task itself to the executors. This is inefficient because all tasks in the same stage use the same RDD object, but we have to send RDD object multiple times to the executors. This is especially bad when a closure references some variable that is very large. The current design led to users having to explicitly broadcast large variables. The patch uses broadcast to send RDD objects and the closures to executors, and use Akka to only send a reference to the broadcast RDD/closure along with the partition specific information for the task. For those of you who know more about the internals, Spark already relies on broadcast to send the Hadoop JobConf every time it uses the Hadoop input, because the JobConf is large. The user-facing impact of the change include: 1. Users won't need to decide what to broadcast anymore, unless they would want to use a large object multiple times in different operations 2. Task size will get smaller, resulting in faster scheduling and higher task dispatch throughput. In addition, the change will simplify some internals of Spark, eliminating the need to maintain task caches and the complex logic to broadcast JobConf (which also led to a deadlock recently). A simple way to test this: ```scala val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a); sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x }.count ``` Numbers on 3 r3.8xlarge instances on EC2 ``` master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s ``` Author: Reynold Xin <rxin@apache.org> Closes apache#1452 from rxin/broadcast-task and squashes the following commits: 762e0be [Reynold Xin] Warn large broadcasts. ade6eac [Reynold Xin] Log broadcast size. c3b6f11 [Reynold Xin] Added a unit test for clean up. 754085f [Reynold Xin] Explain why broadcasting serialized copy of the task. 04b17f0 [Reynold Xin] [SPARK-2521] Broadcast RDD object once per TaskSet (instead of sending it for every task). (cherry picked from commit 7b8cd17) Signed-off-by: Reynold Xin <rxin@apache.org>

…eted.

…Binary.

SparkQA · 2014-07-30T08:33:05Z

QA results for PR 1654:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class Dependency[T] extends Serializable {
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17432/consoleFull

SparkQA · 2014-07-30T08:38:50Z

QA tests have started for PR 1654. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17436/consoleFull

SparkQA · 2014-07-30T08:39:38Z

QA results for PR 1654:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class Dependency[T] extends Serializable {
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17436/consoleFull

SparkQA · 2014-07-30T08:53:59Z

QA tests have started for PR 1654. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17438/consoleFull

SparkQA · 2014-07-30T10:03:51Z

QA results for PR 1654:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
abstract class Dependency[T] extends Serializable {
abstract class NarrowDependency[T](_rdd: RDD[T]) extends Dependency[T] {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17438/consoleFull

rxin added 15 commits July 30, 2014 01:30

Fixed unit test failures. One more to go.

bd11028

Don't cache the RDD broadcast variable.

c1e03a8

Fix TaskContextSuite.

fc64e01

Use HttpBroadcast.

87948d5

Use TorrentBroadcastFactory.

5980457

Check for NotSerializableException in submitMissingTasks.

023719d

Properly send SparkListenerStageSubmitted and SparkListenerStageCompl…

d057b27

…eted.

Fix broadcast tests.

1f861b2

Serialize the final task closure as well as ShuffleDependency in task…

3c8f496

…Binary.

Fixed the style violation.

f670886

Code review feedback.

f0fefdf

random sql change.

aab05e5

iii

6810286

wwwww.

40857a1

make sure diffstat makes a diff.

e40a9ac

ii

7127096

rxin closed this Jul 30, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Broadcast task sql test [DO NOT MERGE] #1654

Broadcast task sql test [DO NOT MERGE] #1654

rxin commented Jul 30, 2014

SparkQA commented Jul 30, 2014

SparkQA commented Jul 30, 2014

SparkQA commented Jul 30, 2014

SparkQA commented Jul 30, 2014

SparkQA commented Jul 30, 2014

SparkQA commented Jul 30, 2014

Broadcast task sql test [DO NOT MERGE] #1654

Broadcast task sql test [DO NOT MERGE] #1654

Conversation

rxin commented Jul 30, 2014

SparkQA commented Jul 30, 2014

SparkQA commented Jul 30, 2014

SparkQA commented Jul 30, 2014

SparkQA commented Jul 30, 2014

SparkQA commented Jul 30, 2014

SparkQA commented Jul 30, 2014