[SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv) #39719

Closed
wants to merge 7 commits into from

Contributor

@NarekDW NarekDW commented Jan 24, 2023

What changes were proposed in this pull request?

This PR implements the doGenCode function in the StructsToCsv class instead of having the class extend the CodegenFallback trait, which improves performance.
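
For readers unfamiliar with the function, here is a minimal sketch of what to_csv does at the DataFrame level (StructsToCsv is the expression behind it). The object name, session settings, and column names are illustrative assumptions, and the snippet assumes Spark SQL is on the classpath. It only shows the function's behavior and makes no claim about which evaluation path Spark picks for such a small local dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_csv}

// Hypothetical helper object, not part of the PR.
object ToCsvSketch {
  // Builds a two-row DataFrame and serializes each row's struct to a CSV line.
  def csvLines(): Seq[String] = {
    val spark = SparkSession.builder()
      .master("local[1]")        // assumption: a local, single-threaded session
      .appName("to-csv-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
    val lines = df
      .select(to_csv(struct(col("id"), col("name"))).as("csv"))
      .as[String]
      .collect()
      .toSeq

    spark.stop()
    lines
  }
}
```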

Why are the changes needed?

It improves performance: with code generation, to_csv no longer falls back to interpreted evaluation.

Does this PR introduce any user-facing change?

No

How was this patch tested?

An additional test case was added to the org.apache.spark.sql.CsvFunctionsSuite class.

@github-actions github-actions bot added the SQL label Jan 24, 2023
@NarekDW NarekDW changed the title [SPARK-42169] Implement code generation for to_csv function (StructsToCsv) [SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv) Jan 24, 2023
Member

@MaxGekk MaxGekk left a comment

Does CSVBenchmark show any performance improvements?

@NarekDW
Contributor Author

NarekDW commented Feb 2, 2023

@MaxGekk
This is from master branch:

image

This is from current branch:

image

On my local machine, to_csv-related operations in CSVBenchmark are about 20% faster on average with this change.

@MaxGekk
Member

MaxGekk commented Feb 5, 2023

@NarekDW Could you regenerate benchmark results using GitHub actions, see
https://spark.apache.org/developer-tools.html (Running benchmarks in your forked repository)
and update:

  • CSVBenchmark-results.txt
  • CSVBenchmark-jdk11-results.txt
  • CSVBenchmark-jdk17-results.txt

in your PR.

@NarekDW
Contributor Author

NarekDW commented Feb 12, 2023

@MaxGekk sorry for the late response. I've added benchmark results from GitHub Actions.
I've added two commits: the second-to-last one updates the benchmark results from the master branch, and the last one adds the benchmark results from the current branch.

Execution links for master branch:
Java 8
Java 11
Java 17

Execution links for current branch:
Java 8
Java 11
Java 17

P.S. Scala 2.12 was used for all executions.

@NarekDW NarekDW changed the title [SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv) [WIP][SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv) Apr 20, 2023
@NarekDW NarekDW changed the title [WIP][SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv) [SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv) Apr 21, 2023
@NarekDW
Contributor Author

NarekDW commented Apr 21, 2023

@jaceklaskowski thank you for the review. @MaxGekk just a reminder.

@@ -577,4 +578,11 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
      $"csv", schema_of_csv("1,2\n2"), Map.empty[String, String].asJava))
    checkAnswer(actual, Row(Row(1, "2\n2")))
  }

  test("StructsToCsv should not generate codes beyond 64KB") {
Member

Could you clarify this test title, please? I don't see anything related to 64KB in the test.

Doesn't the test just duplicate the existing one, to_csv - struct?

Contributor Author

Sure. checkEvaluation runs this test in both modes, with and without codegen. In codegen mode, if a generated Java method is larger than 64 KB, it cannot be compiled and the test fails, because Java limits the size of a method to 64 KB.
It doesn't duplicate the to_csv - struct test case: the purpose of this test is to build a large StructType (with 5000 literals in the current case) and run it through the StructsToCsv expression to make sure that the generated Java code doesn't exceed the 64 KB limit.
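
The idea described above can be sketched roughly as follows. This is not the test from the PR: it is a standalone approximation that evaluates the same expression once through the interpreted path and once through a generated projection, so that an oversized generated method would surface as a compilation error. It relies on Catalyst-internal classes (CreateStruct, Literal, StructsToCsv, GenerateUnsafeProjection), whose signatures may differ between Spark versions. The object and method names are hypothetical, and the explicit "UTC" time zone is an assumption made because the expression is built by hand rather than going through the analyzer.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{CreateStruct, Literal, StructsToCsv}
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection

// Hypothetical helper object, not part of the PR.
object WideStructToCsvSketch {
  // Returns (interpreted result, codegen result) for to_csv over a wide struct of literals.
  def evaluateBothWays(fieldCount: Int = 5000): (String, String) = {
    val values = 1 to fieldCount
    // Assumption: the time zone is set explicitly since no analyzer rule fills it in here.
    val expr = StructsToCsv(Map.empty, CreateStruct(values.map(Literal(_))), Some("UTC"))

    // Interpreted path: calls eval directly, no Java source is generated.
    val interpreted = expr.eval(InternalRow.empty).toString

    // Codegen path: generates and compiles Java source for the expression,
    // so a method over the JVM size limit would fail at this point.
    val projection = GenerateUnsafeProjection.generate(Seq(expr))
    val generated = projection(InternalRow.empty).getUTF8String(0).toString

    (interpreted, generated)
  }
}
```

Comparing the two results is a cheap extra check that both evaluation paths agree on the produced CSV line.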

@MaxGekk
Member

MaxGekk commented Jun 28, 2023

@NarekDW Could you rebase this PR on the recent master, please?

@NarekDW
Contributor Author

NarekDW commented Jun 28, 2023

@MaxGekk sure, I've rebased, but there were conflicts in benchmarks. It will take some time to regenerate them.

Member

@MaxGekk MaxGekk left a comment

LGTM, waiting for benchmark results.

@NarekDW
Contributor Author

NarekDW commented Jul 2, 2023

@MaxGekk sorry for the delay; the benchmark results are updated.

@MaxGekk
Member

MaxGekk commented Jul 3, 2023

+1, LGTM. Merging to master.
Thank you, @NarekDW, and @HyukjinKwon and @jaceklaskowski for the review.

@MaxGekk MaxGekk closed this in 45ae9c5 Jul 3, 2023