
[SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan. #6252

Closed

yhuai wants to merge 4 commits into apache:master from yhuai:broadcastHadoopConf

Conversation

@yhuai (Contributor) commented May 19, 2015

https://issues.apache.org/jira/browse/SPARK-7713

I tested the performance with the following code:

```scala
import sqlContext._
import sqlContext.implicits._

// Write 5,000 partitions, each containing 1,000 rows.
(1 to 5000).foreach { i =>
  (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
}

sqlContext.sql("""
CREATE TEMPORARY TABLE partitionedParquet
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/tmp/partitioned'
)""")

table("partitionedParquet").explain(true)
```

On current master, `explain` takes 40s on my laptop. With this PR, `explain` takes 14s.
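For readers following along, here is a minimal sketch of the mechanism behind the speedup (simplified; the real code lives in `SqlNewHadoopRDD`): the Hadoop `Configuration` is serialized once into a broadcast variable, wrapped in Spark's `SerializableWritable`, rather than being shipped separately for every partition scan.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.broadcast.Broadcast

// Sketch only: broadcast the Hadoop Configuration once, wrapped in
// SerializableWritable (Configuration is a Writable, not Serializable),
// instead of shipping a fresh copy with every partition's task closure.
def broadcastHadoopConf(
    sc: SparkContext,
    conf: Configuration): Broadcast[SerializableWritable[Configuration]] = {
  sc.broadcast(new SerializableWritable(conf))
}
```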


@SparkQA commented May 19, 2015

Test build #33045 has started for PR 6252 at commit 9e7c3cd.

```diff
@@ -118,7 +120,7 @@ private[sql] class ParquetRelation2(
     private val maybeDataSchema: Option[StructType],
     private val maybePartitionSpec: Option[PartitionSpec],
     parameters: Map[String, String])(
-    val sqlContext: SQLContext)
+    @transient val sqlContext: SQLContext)
```
Contributor commented:

Why do we want to have @transient here? ParquetRelation2 is not serializable and shouldn't be serialized.
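For context, a minimal, self-contained illustration (not Spark code) of what `@transient` buys: the annotated field is skipped by Java serialization, so a non-serializable member does not drag the whole object down when it gets captured in something that is serialized.

```scala
import java.io._

// `context` stands in for a non-serializable member such as a SQLContext.
class Holder(@transient val context: AnyRef, val path: String) extends Serializable

object TransientDemo extends App {
  val out = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(out)
  // Serializing succeeds even though `new Object` is not Serializable,
  // because the @transient field is skipped.
  oos.writeObject(new Holder(new Object, "/tmp/partitioned"))
  oos.close()

  val in = new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
  val restored = in.readObject().asInstanceOf[Holder]
  println(restored.context) // null: the @transient field was not serialized
  println(restored.path)    // /tmp/partitioned
}
```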


@SparkQA commented May 19, 2015

Test build #33052 has started for PR 6252 at commit 88708e5.

@SparkQA commented May 19, 2015

Test build #33045 has finished for PR 6252 at commit 9e7c3cd.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SqlNewHadoopRDD[K, V](

@AmplabJenkins: Merged build finished. Test FAILed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33045/

@SparkQA commented May 19, 2015

Test build #33052 has finished for PR 6252 at commit 88708e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SqlNewHadoopRDD[K, V](

@AmplabJenkins: Merged build finished. Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33052/


```scala
if (initLocalJobFuncOpt.isDefined) {
  sc.clean(initLocalJobFuncOpt.get)
}
```
Contributor commented:

This is a potential performance optimization. Since we provide the closure internally, we don't actually have to clean it. If we create many of these RDDs, the cleaning time might add up; skipping it could buy us a few seconds (same reasoning as in SPARK-7718, or #6256).

By the way, this is definitely not critical for the release. We can fix it separately if you prefer.
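A hypothetical sketch of that optimization (the `ClosureGuard` helper and `userSupplied` flag are assumptions, not Spark API): only user-provided closures would go through the ClosureCleaner, while closures Spark constructs itself would pass through untouched.

```scala
// Placed in a Spark-internal package so that the private[spark]
// SparkContext.clean method is visible.
package org.apache.spark.rdd

import org.apache.spark.SparkContext

private[spark] object ClosureGuard {
  // Clean (and serializability-check) only closures coming from user code;
  // closures Spark builds internally are already known to be clean.
  def maybeClean[F <: AnyRef](sc: SparkContext, f: F, userSupplied: Boolean): F = {
    if (userSupplied) sc.clean(f) else f
  }
}
```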


@SparkQA commented May 19, 2015

Test build #33098 has started for PR 6252 at commit f0f5a3b.

@SparkQA commented May 19, 2015

Test build #33098 has finished for PR 6252 at commit f0f5a3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins: Merged build finished. Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33098/

```scala
 * @param valueClass Class of the value associated with the inputFormatClass.
 * @param conf The Hadoop configuration.
 */
@DeveloperApi
```
Contributor commented:

Minor nit: we might want to remove @DeveloperApi and all of the documentation here, and just note that this is a clone-and-modify of NewHadoopRDD whose functionality will probably be folded into core in a future release.

Contributor commented:

Yeah, unless there's a need, I would prefer to make this private (since it's basically a copy of another class).

@JoshRosen (Contributor) commented:

General comment: the broadcast object will be shared by all tasks, so if the initJob function modifies the conf, you might run into trouble. Just wanted to make sure you're aware of the shared state here in case you're mutating it in certain ways.

@yhuai (Contributor, Author) commented May 20, 2015

Added a comment explaining that `new Job` in `SqlNewHadoopRDD.getJob` makes a copy of the conf.
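The shape of that pattern, as a hedged sketch addressing Josh's shared-state concern (simplified from what the PR does; `confBroadcast` is a broadcast `Configuration` like the one sketched earlier): constructing a Hadoop `Job` from a `Configuration` clones the configuration, so per-use mutations never touch the shared broadcast value.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SerializableWritable
import org.apache.spark.broadcast.Broadcast

// Sketch: getJob-style accessor over a shared broadcast Configuration.
def getJob(confBroadcast: Broadcast[SerializableWritable[Configuration]]): Job = {
  val conf: Configuration = confBroadcast.value.value
  // `new Job(conf)` copies `conf` internally, so callers that mutate the
  // job's configuration do not affect other tasks sharing the broadcast.
  new Job(conf)
}
```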


@SparkQA commented May 20, 2015

Test build #33124 has started for PR 6252 at commit 6fa73df.

@SparkQA commented May 20, 2015

Test build #33124 has finished for PR 6252 at commit 6fa73df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins: Merged build finished. Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33124/

@yhuai (Contributor, Author) commented May 20, 2015

Thanks for reviewing! I am merging this to master and branch-1.4.

asfgit pushed a commit that referenced this pull request on May 20, 2015:

[SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan.

https://issues.apache.org/jira/browse/SPARK-7713

I tested the performance with the following code:
```scala
import sqlContext._
import sqlContext.implicits._

(1 to 5000).foreach { i =>
  val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
}

sqlContext.sql("""
CREATE TEMPORARY TABLE partitionedParquet
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/tmp/partitioned'
)""")

table("partitionedParquet").explain(true)
```

In our master `explain` takes 40s in my laptop. With this PR, `explain` takes 14s.

Author: Yin Huai <yhuai@databricks.com>

Closes #6252 from yhuai/broadcastHadoopConf and squashes the following commits:

6fa73df [Yin Huai] Address comments of Josh and Andrew.
807fbf9 [Yin Huai] Make the new buildScan and SqlNewHadoopRDD private sql.
e393555 [Yin Huai] Cheng's comments.
2eb53bb [Yin Huai] Use a shared broadcast Hadoop Configuration for partitioned HadoopFsRelations.

(cherry picked from commit b631bf7)
Signed-off-by: Yin Huai <yhuai@databricks.com>
asfgit closed this in b631bf7 on May 20, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request on May 28, 2015 and again on Jun 12, 2015: [SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan. (same commit message as above)

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request on Jun 19, 2015: [SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan. (same commit message as above)