fix: Optimize read_side_padding #772

kazuyukitanimura · 2024-08-03T21:27:38Z

Which issue does this PR close?

Rationale for this change

This PR improves the performance of read_side_padding, which is used for columns with a CHAR() schema.

What changes are included in this PR?

Optimized spark_read_side_padding

How are these changes tested?

Added tests

kazuyukitanimura · 2024-08-03T21:29:04Z

Before: [benchmark result image]

After: [benchmark result image]

comphead · 2024-08-03T23:13:38Z

spark/src/test/resources/tpcds-micro-benchmarks/char_type.sql

@@ -0,0 +1,7 @@
+SELECT


How is this test related to rpad? 🤔

They are related as their schema types are CHAR()

comphead

lgtm thanks @kazuyukitanimura and the benchmark results are promising

viirya · 2024-08-04T00:41:08Z

native/spark-expr/src/scalar_funcs.rs

+                if length <= char_len {
+                    builder.append_value(string);


If the required length is less than the string's length, don't we need to take a substring of it? Spark's RPad does that.

Current implementation already has this issue.

On line 389 there is an existing comment:

/// Similar to DataFusion `rpad`, but not to truncate when the string is already longer than length

Perhaps I should rename this method, since it is not used for rpad.
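For readers following the thread, the behavior described above (pad with spaces up to the declared length, but never truncate a longer string) can be sketched roughly as follows. This is a minimal illustration only: `read_side_pad` is a hypothetical name and signature, and the code is not the actual `spark_read_side_padding` implementation from this PR.

```rust
/// Hypothetical helper illustrating the semantics discussed in this thread.
/// It is NOT the Comet implementation, only a sketch of the behavior.
fn read_side_pad(s: &str, length: usize) -> String {
    // Measure the string in chars (Unicode scalar values), not bytes.
    let char_len = s.chars().count();
    if length <= char_len {
        // Already at least `length` chars long: return it unchanged.
        // Unlike rpad, no truncation happens here.
        s.to_string()
    } else {
        // Append spaces until the string is `length` chars long.
        let missing = length - char_len;
        let mut out = String::with_capacity(s.len() + missing);
        out.push_str(s);
        out.extend(std::iter::repeat(' ').take(missing));
        out
    }
}
```

Under these assumptions, `read_side_pad("ab", 4)` yields `"ab  "`, while `read_side_pad("abcdef", 4)` returns `"abcdef"` untouched, which is the difference from `rpad` raised in the review.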

viirya · 2024-08-04T00:53:07Z

native/spark-expr/src/scalar_funcs.rs

+                // It looks Spark's UTF8String is closer to chars rather than graphemes
+                // https://stackoverflow.com/a/46290728


Can you add a unit test for that?
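To make the chars-versus-graphemes distinction concrete, here is a small sketch of the kind of case such a test might cover. It assumes, as the code comment suggests, that length is measured in chars; `pad_needed` is a hypothetical helper introduced only for this illustration and does not appear in the PR.

```rust
/// Hypothetical helper: number of padding spaces needed to reach `length`
/// when the string is measured in chars (Unicode scalar values).
fn pad_needed(s: &str, length: usize) -> usize {
    // saturating_sub returns 0 when the string is already long enough.
    length.saturating_sub(s.chars().count())
}

/// "e" followed by U+0301 (combining acute accent): one grapheme, two chars.
const DECOMPOSED_E_ACUTE: &str = "e\u{0301}";

/// U+00E9 (precomposed "é"): one grapheme, one char.
const PRECOMPOSED_E_ACUTE: &str = "\u{00E9}";
```

Counting by chars, `pad_needed(DECOMPOSED_E_ACUTE, 3)` is 1 and `pad_needed(PRECOMPOSED_E_ACUTE, 3)` is 2, even though both strings render as a single visible character. A grapheme-based count would give 2 in both cases, so a test with a combining character distinguishes the two interpretations.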

spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala

andygrove

LGTM. I left a comment about expanding the test

kazuyukitanimura · 2024-08-08T16:48:35Z

Merged, Thanks @comphead @viirya @andygrove

## Which issue does this PR close?

## Rationale for this change

This PR improves read_side_padding that is used for CHAR() schema

## What changes are included in this PR?

Optimized spark_read_side_padding

## How are these changes tested?

Added tests

(cherry picked from commit 457d9d1)

kazuyukitanimura added 2 commits August 3, 2024 02:59

fix: Optimize rpad

ed1a846

fix: Optimize rpad

efc6286

kazuyukitanimura marked this pull request as ready for review August 3, 2024 21:30

fix: Optimize rpad

567b3ec

kazuyukitanimura requested review from viirya, andygrove, comphead and huaxingao August 3, 2024 23:07

comphead reviewed Aug 3, 2024

comphead approved these changes Aug 3, 2024

viirya reviewed Aug 4, 2024

kazuyukitanimura added 3 commits August 7, 2024 14:19

address review comments

4643405

address review comments

f5d128c

Merge remote-tracking branch 'upstream/main' into optimize-rpad

d4b0c66

andygrove reviewed Aug 7, 2024

spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala

andygrove approved these changes Aug 7, 2024

kazuyukitanimura changed the title ~~fix: Optimize rpad~~ fix: Optimize ead_side_padding" Aug 7, 2024

kazuyukitanimura changed the title ~~fix: Optimize ead_side_padding"~~ fix: Optimize read_side_padding" Aug 7, 2024

kazuyukitanimura added 2 commits August 7, 2024 17:01

address review comments

d647fe4

address review comments

a5f75a1

kazuyukitanimura merged commit 457d9d1 into apache:main Aug 8, 2024
74 checks passed

kazuyukitanimura changed the title ~~fix: Optimize read_side_padding"~~ fix: Optimize read_side_padding Aug 8, 2024

andygrove added the performance label Aug 23, 2024
