-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-42169] [SQL] Implement code generation for to_csv function (StructsToCsv) #39719
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does CSVBenchmark
show any performance improvements?
@MaxGekk This is from current branch: On my local machine - |
@NarekDW Could you regenerate benchmark results using GitHub actions, see
in your PR. |
@MaxGekk sorry for the late response. I've added benchmark results from GitHub actions. Execution links for master branch: Execution links for current branch: P.S. Scala 2.12 was used for all executions. |
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala
Outdated
Show resolved
Hide resolved
@jaceklaskowski thank you for the review. @MaxGekk just a reminder. |
@@ -577,4 +578,11 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession { | |||
$"csv", schema_of_csv("1,2\n2"), Map.empty[String, String].asJava)) | |||
checkAnswer(actual, Row(Row(1, "2\n2"))) | |||
} | |||
|
|||
test("StructsToCsv should not generate codes beyond 64KB") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify this test title, please. I don't see anything related to 64KB in the test.
Doesn't the test just duplicates existing one: to_csv - struct
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, checkEvaluation
will execute this test in both modes without codegen and with codegen modes. In case of codegen mode, if the generated java code (method) will be larger than 64 kb in size it won't be able to compile it, and the test will fail, as Java has a 64kb limit on the size of methods.
It doesn't duplicate to_csv - struct
test case, because the purpose of this test - is to generate big StructType(with 5000 literals in current case) and test it on StructsToCsv
expression to be sure that generated java code doesn't exceed the limit of 64 kb in size.
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala
Outdated
Show resolved
Hide resolved
@NarekDW Could you rebase this PR on the recent master, please. |
@MaxGekk sure, I've rebased, but there were conflicts in benchmarks. It will take some time to regenerate them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, waiting for benchmark results.
@MaxGekk sorry for delay, benchmark results are updated. |
+1, LGTM. Merging to master. |
What changes were proposed in this pull request?
This PR enhances
StructsToCsv
class withdoGenCode
function instead of extending it fromCodegenFallback
trait (performance improvement).Why are the changes needed?
It will improve performance.
Does this PR introduce any user-facing change?
No
How was this patch tested?
an additional test case were added to
org.apache.spark.sql.CsvFunctionsSuite
class.