fix: Optimize read_side_padding #772
@@ -0,0 +1,7 @@
SELECT
How is this test related to rpad? 🤔
They are related as their schema types are CHAR()
LGTM, thanks @kazuyukitanimura. The benchmark results are promising.
if length <= char_len {
    builder.append_value(string);
If the required length is less than the string's length, don't we need to take a substring of it? Spark's RPad does that.
Current implementation already has this issue.
On line 389 there is an existing comment:
/// Similar to DataFusion `rpad`, but not to truncate when the string is already longer than length
Perhaps I should rename this method, since it is not used for rpad.
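To make the distinction in this thread concrete, here is a minimal sketch in plain Rust, assuming simple `&str` inputs rather than Arrow arrays. The names `read_side_pad` and `rpad_truncating` are hypothetical and are not the functions in this PR; the sketch only illustrates the behavior described above, where read-side padding leaves an over-long string untouched while an rpad-style function truncates it.

```rust
/// Hypothetical helper (not the PR's code): pads `s` with trailing spaces up to
/// `length` characters, and returns it unchanged if it is already that long or longer.
fn read_side_pad(s: &str, length: usize) -> String {
    // Count Rust `char`s (Unicode scalar values), not bytes.
    let char_len = s.chars().count();
    if length <= char_len {
        // Already long enough: keep the string as-is, with no truncation.
        return s.to_string();
    }
    let missing = length - char_len;
    let mut out = String::with_capacity(s.len() + missing);
    out.push_str(s);
    out.extend(std::iter::repeat(' ').take(missing));
    out
}

/// Hypothetical rpad-style variant for contrast: truncates to `length`
/// characters when the input is longer, otherwise pads like `read_side_pad`.
fn rpad_truncating(s: &str, length: usize) -> String {
    if length < s.chars().count() {
        return s.chars().take(length).collect();
    }
    read_side_pad(s, length)
}
```

With these definitions, `read_side_pad("abcdef", 4)` keeps all six characters, whereas `rpad_truncating("abcdef", 4)` keeps only the first four.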
// It looks Spark's UTF8String is closer to chars rather than graphemes
// https://stackoverflow.com/a/46290728
Can you add a unit test for that?
added
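As an illustration of why the chars-versus-graphemes choice matters for padding, the sketch below (plain Rust, standard library only) counts `char`s, which is the interpretation the code comment above suggests is closer to Spark's UTF8String. The helper `pad_spaces_needed` is a hypothetical name introduced here, not part of the PR or its tests.

```rust
/// Hypothetical helper: how many trailing spaces are needed to reach `length`
/// when the string length is measured in Rust `char`s (Unicode scalar values).
fn pad_spaces_needed(s: &str, length: usize) -> usize {
    length.saturating_sub(s.chars().count())
}

/// Returns (byte length, char count) so the two measures can be compared.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For example, `"e\u{301}"` (the letter e followed by a combining acute accent) renders as a single grapheme but is two `char`s and three UTF-8 bytes, so padding it to three characters adds one space under char counting, while a grapheme-based count would add two.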
LGTM. I left a comment about expanding the test
Merged. Thanks @comphead @viirya @andygrove
## Which issue does this PR close?

## Rationale for this change

This PR improves read_side_padding that is used for CHAR() schema

## What changes are included in this PR?

Optimized spark_read_side_padding

## How are these changes tested?

Added tests

(cherry picked from commit 457d9d1)