Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Optimize read_side_padding #772

Merged
merged 8 commits into from
Aug 8, 2024

Conversation

kazuyukitanimura
Copy link
Contributor

@kazuyukitanimura kazuyukitanimura commented Aug 3, 2024

Which issue does this PR close?

Rationale for this change

This PR improves read_side_padding that is used for CHAR() schema

What changes are included in this PR?

Optimized spark_read_side_padding

How are these changes tested?

Added tests

@kazuyukitanimura
Copy link
Contributor Author

Before

Screenshot 2024-08-03 at 3 43 38 AM

After

Screenshot 2024-08-03 at 3 43 54 AM

@kazuyukitanimura kazuyukitanimura marked this pull request as ready for review August 3, 2024 21:30
@@ -0,0 +1,7 @@
SELECT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how this test related to rpad? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are related as their schema types are CHAR()

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @kazuyukitanimura and the benchmark results are promising

Comment on lines +425 to +426
if length <= char_len {
builder.append_value(string);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the required len is less than string's length, don't we need to take substring of it? Spark RPad does it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current implementation already has this issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the line 389 there is an existing comment

/// Similar to DataFusion `rpad`, but not to truncate when the string is already longer than length

Perhaps I should change the name of this method, this is not used for rpad

Comment on lines +422 to +423
// It looks Spark's UTF8String is closer to chars rather than graphemes
// https://stackoverflow.com/a/46290728
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add an unit test for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I left a comment about expanding the test

@kazuyukitanimura kazuyukitanimura changed the title fix: Optimize rpad fix: Optimize ead_side_padding" Aug 7, 2024
@kazuyukitanimura kazuyukitanimura changed the title fix: Optimize ead_side_padding" fix: Optimize read_side_padding" Aug 7, 2024
@kazuyukitanimura kazuyukitanimura merged commit 457d9d1 into apache:main Aug 8, 2024
74 checks passed
@kazuyukitanimura
Copy link
Contributor Author

Merged, Thanks @comphead @viirya @andygrove

@kazuyukitanimura kazuyukitanimura changed the title fix: Optimize read_side_padding" fix: Optimize read_side_padding Aug 8, 2024
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
## Which issue does this PR close?

## Rationale for this change

This PR improves read_side_padding that is used for CHAR() schema

## What changes are included in this PR?

Optimized spark_read_side_padding

## How are these changes tested?

Added tests

(cherry picked from commit 457d9d1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants