
[SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBenchmark benchmark results #21625

Closed · wants to merge 2 commits

Conversation

@maropu (Member) commented Jun 24, 2018

What changes were proposed in this pull request?

This PR corrects the default configuration (`spark.master=local[1]`) for benchmarks. It also updates the performance results measured on an AWS r3.xlarge instance.
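
For context, here is a minimal sketch (not from the PR itself) of why the old `setIfMissing` default never took effect when the benchmark was launched through spark-submit, which always injects a value for `spark.master`; the `local[*]` stand-in value is assumed for illustration:

```scala
import org.apache.spark.SparkConf

// Stand-in for the value spark-submit injects (assumed for illustration).
val conf = new SparkConf(loadDefaults = false)
  .set("spark.master", "local[*]")

// setIfMissing is a no-op once the key exists, so the old default was
// silently ignored and the benchmark could run on more than one core.
conf.setIfMissing("spark.master", "local[1]")
println(conf.get("spark.master")) // local[*]

// set always overrides, which is what this PR switches to.
conf.set("spark.master", "local[1]")
println(conf.get("spark.master")) // local[1]
```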

How was this patch tested?

N/A

@maropu (Member, Author) commented Jun 24, 2018

This PR is a follow-up of #21288 (comment).

```
/*
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Partitioned Table:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------
```

A reviewer (Member) commented on the results block above:

Seems this part was missed in the update.

maropu (Author): oh, thanks. I'll update soon.

maropu (Author): Oh, I hit a bug in CSV parsing while updating this benchmark...

```
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:12:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5)
java.lang.NullPointerException
        at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:197)
        at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190)
        at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
        at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
        at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
        ...
```

@HyukjinKwon (Member) commented Jun 25, 2018

@maropu, if the JIRA blocks this PR and it looks like it will take a while to fix, please feel free to set the configuration to false within this benchmark and proceed. Technically, that looks like what the benchmark originally covered at the time it was merged in. Setting it to true can be done separately in the JIRA you opened.
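
A sketch of that stopgap (hypothetical: the flag name is taken from the repro in the next comment, and treating it as the configuration being discussed here is an assumption; as that comment shows, this path hit a different bug):

```scala
import org.apache.spark.SparkConf

// Hypothetical stopgap inside the benchmark's conf: disable CSV column
// pruning until the JIRA is resolved.
val conf = new SparkConf()
  .setAppName("DataSourceReadBenchmark")
  .set("spark.master", "local[1]")
  .set("spark.sql.csv.parser.columnPruning.enabled", "false")
```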

maropu (Author): Yea, I thought I would do that first, but I couldn't because I hit another bug when column pruning is disabled...

```
./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false

scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
        at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
        ...
```

maropu (Author): @HyukjinKwon I'm currently fixing this. But it seems this bug is similar to SPARK-24645, so would it be better to merge this fix into SPARK-24645?

maropu (Author): Anyway, I updated the results by applying #21631.

@SparkQA commented Jun 24, 2018

Test build #92264 has finished for PR 21625 at commit 2352820.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
@@ -39,9 +39,11 @@ import org.apache.spark.util.{Benchmark, Utils}
 object DataSourceReadBenchmark {
   val conf = new SparkConf()
     .setAppName("DataSourceReadBenchmark")
-    .setIfMissing("spark.master", "local[1]")
+    // Since `spark.master` always exists, overrides this value
+    .set("spark.master", "local[1]")
```
A reviewer (Member) commented:

Thank you for fixing this and updating the result, @maropu.

@SparkQA commented Jun 25, 2018

Test build #92294 has finished for PR 21625 at commit 4e76ffd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang (Member) reviewed:

LGTM

@HyukjinKwon (Member):

LGTM too

Merged to master.

@asfgit closed this in 1c9acc2 on Jun 28, 2018.