[SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv) #39719

NarekDW · 2023-01-24T15:12:16Z

What changes were proposed in this pull request?

This PR implements the doGenCode function in the StructsToCsv class instead of having it extend the CodegenFallback trait, which improves performance.

Why are the changes needed?

Generating code for to_csv, instead of falling back to interpreted evaluation, should improve performance.

Does this PR introduce any user-facing change?

No

How was this patch tested?

An additional test case was added to the org.apache.spark.sql.CsvFunctionsSuite class.

MaxGekk

Does CSVBenchmark show any performance improvements?

NarekDW · 2023-02-02T14:18:59Z

@MaxGekk
This is from master branch:

This is from current branch:

On my local machine, to_csv-related operations in CSVBenchmark are about 20% faster on average with this change.

MaxGekk · 2023-02-05T07:55:31Z

@NarekDW Could you regenerate benchmark results using GitHub actions, see
https://spark.apache.org/developer-tools.html (Running benchmarks in your forked repository)
and update:

CSVBenchmark-results.txt
CSVBenchmark-jdk11-results.txt
CSVBenchmark-jdk17-results.txt

in your PR.

NarekDW · 2023-02-12T20:38:40Z

@MaxGekk sorry for the late response. I've added benchmark results from GitHub Actions.
I've added 2 commits: the second-to-last commit updates the benchmark results from the master branch, and the last commit adds the benchmark results from the current branch.

Execution links for master branch:
Java 8
Java 11
Java 17

Execution links for current branch:
Java 8
Java 11
Java 17

P.S. Scala 2.12 was used for all executions.

NarekDW · 2023-04-21T10:43:03Z

@jaceklaskowski thank you for the review. @MaxGekk just a reminder.

MaxGekk · 2023-04-22T09:43:58Z

sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala

@@ -577,4 +578,11 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
      $"csv", schema_of_csv("1,2\n2"), Map.empty[String, String].asJava))
    checkAnswer(actual, Row(Row(1, "2\n2")))
  }
+
+  test("StructsToCsv should not generate codes beyond 64KB") {


Could you clarify this test title, please? I don't see anything related to 64KB in the test.

Doesn't the test just duplicate the existing one: to_csv - struct?

Sure. checkEvaluation runs this test in both modes, interpreted and codegen. In codegen mode, if a generated Java method exceeds 64 KB, it cannot be compiled and the test fails, because the JVM limits the bytecode of a single method to 64 KB.
It doesn't duplicate the to_csv - struct test case: the purpose of this test is to generate a big StructType (with 5000 literals in the current case) and evaluate it through the StructsToCsv expression, to make sure the generated Java code doesn't exceed the 64 KB limit.
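
To make the concern above concrete, here is a toy sketch in plain Scala of why very wide structs can produce oversized methods and how chunking statements into helper methods keeps each one bounded. It is not Spark's code generator: every name in it (SplitCodegenSketch, fieldStatements, split, render, MaxMethodChars) is hypothetical, and source characters are used only as a crude stand-in for bytecode size.

```scala
// Toy illustration only: NOT Spark's code generator. All names are hypothetical.
object SplitCodegenSketch {
  // The JVM requires each method's bytecode to stay under 64 KB. This sketch
  // uses a much smaller budget measured in source characters as a rough proxy.
  val MaxMethodChars: Int = 1024

  /** Emits one Java-like statement per struct field. */
  def fieldStatements(numFields: Int): Seq[String] =
    (0 until numFields).map(i => s"values[$i] = field$i(row);")

  /**
   * Greedily packs statements, in order, into chunks whose total length stays
   * within `budget`. A single statement longer than `budget` still gets its
   * own chunk, so the bound assumes each statement fits on its own.
   */
  def split(statements: Seq[String], budget: Int): Seq[Seq[String]] =
    statements.foldLeft(Vector.empty[Vector[String]]) { (chunks, stmt) =>
      chunks.lastOption match {
        case Some(last) if last.map(_.length).sum + stmt.length <= budget =>
          chunks.init :+ (last :+ stmt)
        case _ =>
          chunks :+ Vector(stmt)
      }
    }

  /** Renders each chunk as the source of its own helper method. */
  def render(chunks: Seq[Seq[String]]): Seq[String] =
    chunks.zipWithIndex.map { case (chunk, i) =>
      val body = chunk.mkString("\n  ")
      s"private void writeFields_$i(Object row) {\n  $body\n}"
    }

  def main(args: Array[String]): Unit = {
    val statements = fieldStatements(5000) // same width as the test discussed above
    val chunks = split(statements, MaxMethodChars)
    val methods = render(chunks)
    println(s"${statements.size} statements split into ${methods.size} helper methods")
    println(methods.head)
  }
}
```

Emitting all 5000 statements into one method would make that method grow linearly with the number of fields; splitting keeps each helper under the chosen budget regardless of struct width, which is the property a test with a very wide struct is meant to guard.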

MaxGekk · 2023-06-28T09:40:04Z

@NarekDW Could you rebase this PR on the recent master, please.

NarekDW · 2023-06-28T17:11:04Z

@MaxGekk sure, I've rebased, but there were conflicts in benchmarks. It will take some time to regenerate them.

MaxGekk

LGTM, waiting for benchmark results.

NarekDW · 2023-07-02T08:09:24Z

@MaxGekk sorry for the delay, the benchmark results are updated.

MaxGekk · 2023-07-03T07:12:31Z

+1, LGTM. Merging to master.
Thank you, @NarekDW, and @HyukjinKwon and @jaceklaskowski for the review.

github-actions bot added the SQL label Jan 24, 2023

NarekDW mentioned this pull request Jan 24, 2023

[SPARK-42169] Implement code generation for to_csv function (StructsToCsv) #39097

Closed

NarekDW changed the title ~~[SPARK-42169] Implement code generation for to_csv function (StructsToCsv)~~ [SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv) Jan 24, 2023

MaxGekk reviewed Jan 31, 2023

NarekDW force-pushed the SPARK-42169 branch from 146d80f to 28eea47 Compare February 2, 2023 15:58

jaceklaskowski approved these changes Apr 20, 2023

NarekDW changed the title ~~[SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv)~~ [WIP][SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv) Apr 20, 2023

NarekDW changed the title ~~[WIP][SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv)~~ [SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv) Apr 21, 2023

MaxGekk reviewed Apr 22, 2023

NarekDW added 6 commits June 28, 2023 21:03

make to_csv function deterministic

fcab121

remove redundant Serializable definition from StructsToCsv case class

1440aca

remove redundant test case

be3810a

add test case to check generated code doesn't exceed the limit by size

7d44e88

minor scala style fix

767a22e

move test case from CsvFunctionsSuite to CsvExpressionsSuite

e3b1b52

NarekDW force-pushed the SPARK-42169 branch from 234d88b to e3b1b52 Compare June 28, 2023 17:07

MaxGekk reviewed Jul 1, 2023

update benchmarks

b8f94e3

HyukjinKwon approved these changes Jul 3, 2023

MaxGekk approved these changes Jul 3, 2023

MaxGekk closed this in 45ae9c5 Jul 3, 2023
