-
Notifications
You must be signed in to change notification settings - Fork 169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: Improve count aggregate performance #784
Conversation
Average of 3 runs, main branch versus this PR. This shows a 15.5% speedup. Command used for both runs: $SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.instances=1 \
--conf spark.executor.memory=32G \
--conf spark.executor.cores=8 \
--conf spark.cores.max=8 \
--conf spark.eventLog.enabled=true \
--jars $COMET_JAR \
--conf spark.driver.extraClassPath=$COMET_JAR \
--conf spark.executor.extraClassPath=$COMET_JAR \
--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
--conf spark.comet.enabled=true \
--conf spark.comet.exec.enabled=true \
--conf spark.comet.exec.all.enabled=true \
--conf spark.comet.cast.allowIncompatible=true \
--conf spark.comet.shuffle.enforceMode.enabled=true \
--conf spark.comet.exec.shuffle.enabled=true \
--conf spark.comet.exec.shuffle.mode=auto \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
tpcbench.py \
--benchmark tpcds \
--data /mnt/bigdata/tpcds/sf100/ \
--queries ../../tpcds/queries-spark \
--iterations 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the PR @andygrove
.iter() | ||
.map(|child| self.create_expr(child, schema.clone())) | ||
.collect::<Result<Vec<_>, _>>()?; | ||
if expr.children.iter().len() == 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think we can also do this for multiple child expressions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I have extended this approach for the multiple argument case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks okay. Actually it is how Spark count
does internally:
/* count = */ If(nullableChildren.map(IsNull).reduce(Or), count, count + 1L)
* Workaround for COUNT performance * add comments * remove benchmark results * fix regression * revert change to datafusion version * Revert change to Cargo.lock * fix * unify code for single and multiple arguments * clippy
Which issue does this PR close?
Closes #744
Rationale for this change
For some reason,
COUNT
is really slow when used fromComet
, butSUM
is fast, so let's translateCOUNT(expr)
toSUM(IF(expr IS NULL, 0, 1))
until we can get to the bottom of the real issue.edit: It turns out that Spark also implements
COUNT
this way, so I think this closes the issue.What changes are included in this PR?
How are these changes tested?