
[SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan. #6252

Closed

yhuai wants to merge 4 commits into apache:master from yhuai:broadcastHadoopConf

Conversation

@yhuai (Contributor) commented May 19, 2015

https://issues.apache.org/jira/browse/SPARK-7713

I tested the performance with the following code:

```scala
import sqlContext._
import sqlContext.implicits._

// Write 5,000 partitions, each containing 1,000 rows.
(1 to 5000).foreach { i =>
  (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
}

sqlContext.sql("""
CREATE TEMPORARY TABLE partitionedParquet
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/tmp/partitioned'
)""")

table("partitionedParquet").explain(true)
```

On current master, `explain` takes 40s on my laptop. With this PR, `explain` takes 14s.
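For readers following along, here is a minimal sketch of the mechanism behind the speedup (simplified; the real code lives in `SqlNewHadoopRDD`): the Hadoop `Configuration` is serialized once into a broadcast variable, wrapped in Spark's `SerializableWritable`, rather than being shipped separately for every partition scan.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.broadcast.Broadcast

// Sketch only: broadcast the Hadoop Configuration once, wrapped in
// SerializableWritable (Configuration is a Writable, not Serializable),
// instead of shipping a fresh copy with every partition's task closure.
def broadcastHadoopConf(
    sc: SparkContext,
    conf: Configuration): Broadcast[SerializableWritable[Configuration]] = {
  sc.broadcast(new SerializableWritable(conf))
}
```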


@SparkQA commented May 19, 2015

Test build #33045 has started for PR 6252 at commit 9e7c3cd.

```diff
@@ -118,7 +120,7 @@ private[sql] class ParquetRelation2(
     private val maybeDataSchema: Option[StructType],
     private val maybePartitionSpec: Option[PartitionSpec],
     parameters: Map[String, String])(
-    val sqlContext: SQLContext)
+    @transient val sqlContext: SQLContext)
```
Contributor commented:

Why do we want to have @transient here? ParquetRelation2 is not serializable and shouldn't be serialized.
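For context, a minimal, self-contained illustration (not Spark code) of what `@transient` buys: the annotated field is skipped by Java serialization, so a non-serializable member does not drag the whole object down when it gets captured in something that is serialized.

```scala
import java.io._

// `context` stands in for a non-serializable member such as a SQLContext.
class Holder(@transient val context: AnyRef, val path: String) extends Serializable

object TransientDemo extends App {
  val out = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(out)
  // Serializing succeeds even though `new Object` is not Serializable,
  // because the @transient field is skipped.
  oos.writeObject(new Holder(new Object, "/tmp/partitioned"))
  oos.close()

  val in = new ObjectInputStream(new ByteArrayInputStream(out.toByteArray))
  val restored = in.readObject().asInstanceOf[Holder]
  println(restored.context) // null: the @transient field was not serialized
  println(restored.path)    // /tmp/partitioned
}
```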


@SparkQA commented May 19, 2015

Test build #33052 has started for PR 6252 at commit 88708e5.

@SparkQA commented May 19, 2015

Test build #33045 has finished for PR 6252 at commit 9e7c3cd.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SqlNewHadoopRDD[K, V](

@AmplabJenkins: Merged build finished. Test FAILed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33045/

@SparkQA commented May 19, 2015

Test build #33052 has finished for PR 6252 at commit 88708e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SqlNewHadoopRDD[K, V](

@AmplabJenkins: Merged build finished. Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33052/


```scala
if (initLocalJobFuncOpt.isDefined) {
  sc.clean(initLocalJobFuncOpt.get)
}
```
Contributor commented:

This is a potential performance optimization. Since we provide the closure internally, we don't actually have to clean it. If we create many of these RDDs, the cleaning time might add up; skipping it could buy us a few seconds (same reasoning as in SPARK-7718, or #6256).

By the way, this is definitely not critical for the release. We can fix it separately if you prefer.
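A hypothetical sketch of that optimization (the `ClosureGuard` helper and `userSupplied` flag are assumptions, not Spark API): only user-provided closures would go through the ClosureCleaner, while closures Spark constructs itself would pass through untouched.

```scala
// Placed in a Spark-internal package so that the private[spark]
// SparkContext.clean method is visible.
package org.apache.spark.rdd

import org.apache.spark.SparkContext

private[spark] object ClosureGuard {
  // Clean (and serializability-check) only closures coming from user code;
  // closures Spark builds internally are already known to be clean.
  def maybeClean[F <: AnyRef](sc: SparkContext, f: F, userSupplied: Boolean): F = {
    if (userSupplied) sc.clean(f) else f
  }
}
```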


@SparkQA commented May 19, 2015

Test build #33098 has started for PR 6252 at commit f0f5a3b.

@SparkQA commented May 19, 2015

Test build #33098 has finished for PR 6252 at commit f0f5a3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins: Merged build finished. Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33098/

```scala
 * @param valueClass Class of the value associated with the inputFormatClass.
 * @param conf The Hadoop configuration.
 */
@DeveloperApi
```
Contributor commented:

Minor nit: we might want to remove @DeveloperApi and all of the documentation here, and just note that this is a clone-and-modify of NewHadoopRDD whose functionality will probably be folded into core in a future release.

Contributor commented:

Yeah, unless there's a need, I would prefer to make this private (since it's basically a copy of another class).

@JoshRosen (Contributor) commented:

General comment: the broadcast object will be shared by all tasks, so if the initJob function modifies the conf, you might run into trouble. Just wanted to make sure you're aware of the shared state here in case you're mutating it in certain ways.

@yhuai (Contributor, Author) commented May 20, 2015

Added a comment explaining that `new Job` in `SqlNewHadoopRDD.getJob` makes a copy of the conf.
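The shape of that pattern, as a hedged sketch addressing Josh's shared-state concern (simplified from what the PR does; `confBroadcast` is a broadcast `Configuration` like the one sketched earlier): constructing a Hadoop `Job` from a `Configuration` clones the configuration, so per-use mutations never touch the shared broadcast value.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SerializableWritable
import org.apache.spark.broadcast.Broadcast

// Sketch: getJob-style accessor over a shared broadcast Configuration.
def getJob(confBroadcast: Broadcast[SerializableWritable[Configuration]]): Job = {
  val conf: Configuration = confBroadcast.value.value
  // `new Job(conf)` copies `conf` internally, so callers that mutate the
  // job's configuration do not affect other tasks sharing the broadcast.
  new Job(conf)
}
```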


@SparkQA commented May 20, 2015

Test build #33124 has started for PR 6252 at commit 6fa73df.

@SparkQA commented May 20, 2015

Test build #33124 has finished for PR 6252 at commit 6fa73df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins: Merged build finished. Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33124/

@yhuai (Contributor, Author) commented May 20, 2015

Thanks for reviewing! I am merging this to master and branch-1.4.

asfgit pushed a commit that referenced this pull request on May 20, 2015:

[SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan.

https://issues.apache.org/jira/browse/SPARK-7713

I tested the performance with the following code:
```scala
import sqlContext._
import sqlContext.implicits._

(1 to 5000).foreach { i =>
  val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
}

sqlContext.sql("""
CREATE TEMPORARY TABLE partitionedParquet
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/tmp/partitioned'
)""")

table("partitionedParquet").explain(true)
```

In our master `explain` takes 40s in my laptop. With this PR, `explain` takes 14s.

Author: Yin Huai <yhuai@databricks.com>

Closes #6252 from yhuai/broadcastHadoopConf and squashes the following commits:

6fa73df [Yin Huai] Address comments of Josh and Andrew.
807fbf9 [Yin Huai] Make the new buildScan and SqlNewHadoopRDD private sql.
e393555 [Yin Huai] Cheng's comments.
2eb53bb [Yin Huai] Use a shared broadcast Hadoop Configuration for partitioned HadoopFsRelations.

(cherry picked from commit b631bf7)
Signed-off-by: Yin Huai <yhuai@databricks.com>
asfgit closed this in b631bf7 on May 20, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request on May 28, 2015 and again on Jun 12, 2015: [SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan. (same commit message as above)

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request on Jun 19, 2015: [SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan. (same commit message as above)