Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48576][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE #46924

Closed
wants to merge 5 commits into from

Conversation

uros-db
Copy link
Contributor

@uros-db uros-db commented Jun 10, 2024

What changes were proposed in this pull request?

Renaming UTF8_BINARY_LCASE collation to UTF8_LCASE.

Why are the changes needed?

As part of the collation effort in Spark, we've moved away from byte-by-byte logic towards character-by-character logic, so what we used to call UTF8_BINARY_LCASE is now more precisely UTF8_LCASE. For example, string searching in UTF8_LCASE now works on character-level (rather than on byte-level), which is reflected in this PRs: #46511, #46589, #46682, #46761, #46762. In addition, string comparison also works on character-level now, as per the changes introduced in this PR: #46700.

Does this PR introduce any user-facing change?

Yes, what was previously named UTF8_BINARY_LCASE collation, will from now on be named UTF8_LCASE.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@uros-db uros-db changed the title [WIP][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE [SPARK-48576][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE Jun 10, 2024
@superdiaodiao
Copy link
Contributor

Hi~
May I ask why we need to rename it?

@uros-db
Copy link
Contributor Author

uros-db commented Jun 10, 2024

of course, I'll update the PR description with more details soon

but shortly: as part of the collation effort in Spark, we've moved away from byte-by-byte logic towards code point per code point logic, so what we used to call UTF8_BINARY_LCASE is now UTF8_LCASE, as this describes more precisely what is going on

here's a couple PRs regarding these changes:
#46700
#46761
#46762
#46682
#46589
#46682

@superdiaodiao
Copy link
Contributor

of...

got it, thanks

@uros-db
Copy link
Contributor Author

uros-db commented Jun 10, 2024

all checks good, but several conflicts popping up

adding @mkaravel @dbatomic @cloud-fan for review

Copy link
Contributor

@mkaravel mkaravel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but one of the tests is failing.

@uros-db
Copy link
Contributor Author

uros-db commented Jun 11, 2024

adding @dbatomic and @cloud-fan to review

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in aad6771 Jun 11, 2024
cloud-fan pushed a commit that referenced this pull request Jun 21, 2024
### What changes were proposed in this pull request?
Re-running the collation benchmark with two modifications:

- UTF8_BINARY_LCASE has been renamed to UTF8_LCASE in #46924
- UTF8_BINARY should appear first in the collation benchmark results, so performance is relative to it

### Why are the changes needed?
We've changed the meaning of LCASE collation in Spark, and also modified how equality checks / hashing/ expressions work with this collation, so we need to re-run the benchmarks and identify areas of improvement.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Rxisting tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47030 from uros-db/collation-benchmarks.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?
Re-running the collation benchmark with two modifications:

- UTF8_BINARY_LCASE has been renamed to UTF8_LCASE in apache#46924
- UTF8_BINARY should appear first in the collation benchmark results, so performance is relative to it

### Why are the changes needed?
We've changed the meaning of LCASE collation in Spark, and also modified how equality checks / hashing/ expressions work with this collation, so we need to re-run the benchmarks and identify areas of improvement.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Rxisting tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47030 from uros-db/collation-benchmarks.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants