-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48576][SQL] Rename UTF8_BINARY_LCASE to UTF8_LCASE #46924
Conversation
Hi~ |
of course, I'll update the PR description with more details soon but shortly: as part of the collation effort in Spark, we've moved away from byte-by-byte logic towards code point per code point logic, so what we used to call UTF8_BINARY_LCASE is now UTF8_LCASE, as this describes more precisely what is going on here's a couple PRs regarding these changes: |
got it, thanks |
all checks good, but several conflicts popping up adding @mkaravel @dbatomic @cloud-fan for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but one of the tests is failing.
adding @dbatomic and @cloud-fan to review |
thanks, merging to master! |
### What changes were proposed in this pull request? Re-running the collation benchmark with two modifications: - UTF8_BINARY_LCASE has been renamed to UTF8_LCASE in #46924 - UTF8_BINARY should appear first in the collation benchmark results, so performance is relative to it ### Why are the changes needed? We've changed the meaning of LCASE collation in Spark, and also modified how equality checks / hashing/ expressions work with this collation, so we need to re-run the benchmarks and identify areas of improvement. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Rxisting tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #47030 from uros-db/collation-benchmarks. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? Re-running the collation benchmark with two modifications: - UTF8_BINARY_LCASE has been renamed to UTF8_LCASE in apache#46924 - UTF8_BINARY should appear first in the collation benchmark results, so performance is relative to it ### Why are the changes needed? We've changed the meaning of LCASE collation in Spark, and also modified how equality checks / hashing/ expressions work with this collation, so we need to re-run the benchmarks and identify areas of improvement. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Rxisting tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#47030 from uros-db/collation-benchmarks. Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Renaming
UTF8_BINARY_LCASE
collation toUTF8_LCASE
.Why are the changes needed?
As part of the collation effort in Spark, we've moved away from byte-by-byte logic towards character-by-character logic, so what we used to call
UTF8_BINARY_LCASE
is now more preciselyUTF8_LCASE
. For example, string searching in UTF8_LCASE now works on character-level (rather than on byte-level), which is reflected in this PRs: #46511, #46589, #46682, #46761, #46762. In addition, string comparison also works on character-level now, as per the changes introduced in this PR: #46700.Does this PR introduce any user-facing change?
Yes, what was previously named
UTF8_BINARY_LCASE
collation, will from now on be namedUTF8_LCASE
.How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No.