[SPARK-31187][SQL] Sort the whole-stage codegen debug output by codegenStageId #27955

rednaxelafx · 2020-03-19T06:32:11Z

What changes were proposed in this pull request?

Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to help with debugging. The generated code can be obtained through `df.queryExecution.debug.codegen` or the SQL `EXPLAIN CODEGEN` statement.

The generated code is currently printed in no particular order, which makes it harder to match each dump to its stage when debugging. This PR makes a minor improvement: the codegen dump is sorted by `codegenStageId`, ascending.
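The ordering described above can be sketched in plain Scala. This is only an illustration of the idea, not the code from the patch: `StageDump` and `sortByStageId` are hypothetical names invented here, and the real Spark types involved are different.

```scala
// Hypothetical stand-in for one entry of the codegen dump; the real Spark
// types differ. Only the sort-by-stage-id idea is taken from the PR text.
final case class StageDump(codegenStageId: Int, subtree: String, code: String)

object SortCodegenDump {
  // Return the entries ordered by ascending codegenStageId.
  // sortBy is a stable sort, so entries with equal ids keep their relative order.
  def sortByStageId(dumps: Seq[StageDump]): Seq[StageDump] =
    dumps.sortBy(_.codegenStageId)

  def main(args: Array[String]): Unit = {
    // Entries as they might be collected while walking the plan, in no fixed order.
    val unordered = Seq(
      StageDump(2, "*(2) HashAggregate(...)", "/* generated code for stage 2 */"),
      StageDump(1, "*(1) HashAggregate(...)", "/* generated code for stage 1 */")
    )
    sortByStageId(unordered).foreach(d => println(s"${d.codegenStageId}: ${d.subtree}"))
  }
}
```

Sorting a copy of the collected entries keeps the change local to the dump path, which fits the PR's point that the number of stages per query is small.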

After this change, the following query:

```scala
spark.range(10).agg(sum('id)).queryExecution.debug.codegen
```

will always dump the generated code in a natural, stable order. A version of this example with shorter output is:

```
spark.range(10).agg(sum('id)).queryExecution.debug.codegenToSeq.map(_._1).foreach(println)
*(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
+- *(1) Range (0, 10, step=1, splits=16)

*(2) HashAggregate(keys=[], functions=[sum(id#8L)], output=[sum(id)#12L])
+- Exchange SinglePartition, true, [id=#30]
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L])
      +- *(1) Range (0, 10, step=1, splits=16)
```
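One way to check the ordering by hand is to read the `*(N)` prefix off the first line of each subtree string and confirm the ids never decrease. The sketch below does that in plain Scala. `StageOrderCheck`, `stageId`, and `isAscending` are hypothetical helper names, and the sketch assumes each subtree string starts with a `*(N)` prefix as in the output shown above.

```scala
object StageOrderCheck {
  // Matches the "*(N)" prefix at the very start of a subtree string.
  private val StagePrefix = """^\*\((\d+)\)""".r

  // Extract the stage id from the root line of a subtree, if the prefix is present.
  def stageId(subtree: String): Option[Int] =
    StagePrefix.findFirstMatchIn(subtree).map(_.group(1).toInt)

  // True when the stage ids of the given subtrees are in non-decreasing order.
  // Subtrees without a recognizable prefix are ignored.
  def isAscending(subtrees: Seq[String]): Boolean = {
    val ids = subtrees.flatMap(stageId)
    ids.zip(ids.drop(1)).forall { case (a, b) => a <= b }
  }
}
```

With a Spark session at hand, the sequence passed to `isAscending` could be the result of `codegenToSeq.map(_._1)` from the example above.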

The number of codegen stages within a single SQL query tends to be very small, most likely fewer than 50, so the overhead of the added sort should not be significant.

Why are the changes needed?

Minor improvement to aid WSCG debugging.

Does this PR introduce any user-facing change?

No user-facing change for end-users; minor change for developers who debug WSCG generated code.

How was this patch tested?

Manually tested the output; all other tests still pass.

gengliangwang

LGTM

SparkQA · 2020-03-19T07:05:02Z

Test build #120026 has finished for PR 27955 at commit c64b778.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2020-03-19T07:09:23Z

retest this please

SparkQA · 2020-03-19T11:46:20Z

Test build #120029 has finished for PR 27955 at commit c64b778.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Closes #27955 from rednaxelafx/codegen.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit a177628)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

maropu · 2020-03-19T11:54:28Z

Thanks! Merged to master/3.0.

maropu · 2020-03-19T11:57:15Z

Since I think this PR improves debuggability, I merged this into 3.0, too.

Sort the whole-stage codegen debug output by codegenStageId

c64b778

gengliangwang approved these changes Mar 19, 2020

cloud-fan approved these changes Mar 19, 2020

maropu approved these changes Mar 19, 2020

kiszk approved these changes Mar 19, 2020

HyukjinKwon approved these changes Mar 19, 2020

maropu closed this in a177628 Mar 19, 2020
