[SPARK-48576][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE #46924

uros-db · 2024-06-10T05:27:34Z

What changes were proposed in this pull request?

Renaming UTF8_BINARY_LCASE collation to UTF8_LCASE.

Why are the changes needed?

As part of the collation effort in Spark, we've moved away from byte-by-byte logic towards character-by-character logic, so what we used to call UTF8_BINARY_LCASE is now more precisely UTF8_LCASE. For example, string searching in UTF8_LCASE now works at the character level (rather than at the byte level), which is reflected in these PRs: #46511, #46589, #46682, #46761, #46762. In addition, string comparison now also works at the character level, as per the changes introduced in #46700.
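To illustrate the byte-level vs. character-level distinction, here is a minimal Python sketch. It is not Spark's implementation: the helper names `find_bytewise_lcase` and `find_codepoint_lcase` are hypothetical, and the example only assumes standard Python string behavior. It uses U+0130 (capital I with dot above), whose lowercase form is longer in UTF-8 than the original, so an offset found in the lowercased bytes no longer points into the original string.

```python
def find_bytewise_lcase(text: str, pattern: str) -> int:
    """Lowercase both strings as a whole, then search the UTF-8 bytes.

    Returns a byte offset into the *lowercased* text, or -1 if not found.
    """
    lowered_text = text.lower().encode("utf-8")
    lowered_pattern = pattern.lower().encode("utf-8")
    return lowered_text.find(lowered_pattern)


def find_codepoint_lcase(text: str, pattern: str) -> int:
    """Compare one code point at a time, lowercasing each side as we go.

    Returns a code point index into the *original* text, or -1 if not found.
    """
    lowered_pattern = [ch.lower() for ch in pattern]
    n, m = len(text), len(lowered_pattern)
    for start in range(n - m + 1):
        if all(text[start + k].lower() == lowered_pattern[k] for k in range(m)):
            return start
    return -1


text = "\u0130x"  # U+0130 followed by 'x'; 3 bytes in UTF-8

# Lowercasing U+0130 yields 'i' + U+0307 (3 bytes instead of 2), so the byte
# offset of 'x' in the lowercased text falls outside the original 3 bytes.
print(find_bytewise_lcase(text, "X"))   # 3
print(find_codepoint_lcase(text, "X"))  # 1
```

The code point version reports a position that is valid in the original string, which is the kind of behavior the character-level logic is meant to guarantee.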

Does this PR introduce any user-facing change?

Yes. The collation previously named UTF8_BINARY_LCASE is now named UTF8_LCASE.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

superdiaodiao · 2024-06-10T08:02:48Z

Hi~
May I ask why we need to rename it?

uros-db · 2024-06-10T08:09:32Z

of course, I'll update the PR description with more details soon

but in short: as part of the collation effort in Spark, we've moved away from byte-by-byte logic towards code point by code point logic, so what we used to call UTF8_BINARY_LCASE is now UTF8_LCASE, as this describes more precisely what is going on

here's a couple PRs regarding these changes:
#46700
#46761
#46762
#46682
#46589

superdiaodiao · 2024-06-10T08:16:52Z

of...

got it, thanks

uros-db · 2024-06-10T17:37:51Z

all checks good, but several conflicts popping up

adding @mkaravel @dbatomic @cloud-fan for review

mkaravel

LGTM, but one of the tests is failing.

uros-db · 2024-06-11T10:38:54Z

adding @dbatomic and @cloud-fan to review

cloud-fan · 2024-06-11T17:25:52Z

thanks, merging to master!

### What changes were proposed in this pull request?

Re-running the collation benchmark with two modifications:

- UTF8_BINARY_LCASE has been renamed to UTF8_LCASE in #46924
- UTF8_BINARY should appear first in the collation benchmark results, so performance is relative to it

### Why are the changes needed?

We've changed the meaning of LCASE collation in Spark, and also modified how equality checks / hashing / expressions work with this collation, so we need to re-run the benchmarks and identify areas of improvement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47030 from uros-db/collation-benchmarks.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Initial commit

eef8759

github-actions bot added SQL STRUCTURED STREAMING PYTHON CONNECT labels Jun 10, 2024

uros-db changed the title ~~[WIP][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE~~ [SPARK-48576][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE Jun 10, 2024

uros-db added 3 commits June 10, 2024 11:25

Fix CollationFactory

bfdd18b

Fix CollationFactory

4956121

Fix golden files

d89d395

Merge branch 'master' into rename-lcase

fbbc956

mkaravel approved these changes Jun 11, 2024

cloud-fan approved these changes Jun 11, 2024

cloud-fan closed this in aad6771 Jun 11, 2024

uros-db mentioned this pull request Jun 11, 2024

[SPARK-48576][SQL][FOLLOWUP] Rename UTF8_BINARY_LCASE to UTF8_LCASE #46939

Closed

uros-db mentioned this pull request Jun 19, 2024

[SQL][TEST] Re-run collation benchmark #47030

Closed
