[SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBenchmark benchmark results #21625
Conversation
This PR is a follow-up of #21288 (comment).
/*
Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Partitioned Table:                  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------
Seems this one was missed in the update.
Oh, thanks. I'll update soon.
Oh, I hit a bug in CSV parsing when updating this benchmark...
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:12:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5)
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:197)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
...
I filed a JIRA: https://issues.apache.org/jira/browse/SPARK-24645
@maropu, if the JIRA blocks this PR and it looks like it will take a while to fix, please feel free to set the configuration to false within this benchmark and proceed. Technically, that looks like what the benchmark originally covered at the time it was merged in. Setting it to true can be done separately in the JIRA you opened.
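For reference, a minimal sketch of that suggested workaround, assuming the flag from the spark-shell repro below (spark.sql.csv.parser.columnPruning.enabled) can simply be pinned on the benchmark's SparkConf:

import org.apache.spark.SparkConf

// Sketch only: keep CSV column pruning disabled inside the benchmark until
// SPARK-24645 is fixed. The flag name comes from the repro below; pinning it
// on this SparkConf is an assumption about where the benchmark reads it.
val conf = new SparkConf()
  .setAppName("DataSourceReadBenchmark")
  .set("spark.sql.csv.parser.columnPruning.enabled", "false")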
Yea, I thought I would do so first, but I couldn't because I hit another bug when column pruning is disabled...;
./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
...
@HyukjinKwon I'm currently fixing this. But it seems this bug is similar to SPARK-24645, so would it be better to merge this fix into SPARK-24645?
Anyway, I updated the results by applying #21631.
Test build #92264 has finished for PR 21625 at commit
@@ -39,9 +39,11 @@ import org.apache.spark.util.{Benchmark, Utils}
 object DataSourceReadBenchmark {
   val conf = new SparkConf()
     .setAppName("DataSourceReadBenchmark")
-    .setIfMissing("spark.master", "local[1]")
+    // Since `spark.master` always exists, overrides this value
+    .set("spark.master", "local[1]")
Thank you for fixing this and updating the result, @maropu.
Test build #92294 has finished for PR 21625 at commit
LGTM
LGTM too.
Merged to master.
What changes were proposed in this pull request?
This PR corrected the default configuration (`spark.master=local[1]`) for benchmarks. Also, this updated the performance results on an AWS `r3.xlarge`.
How was this patch tested?
N/A