[SPARK-48681][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings #47043

uros-db · 2024-06-20T11:19:54Z

What changes were proposed in this pull request?

Update Lower & Upper Spark expressions to use ICU case mappings for UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the ICU_CASE_MAPPINGS_ENABLED flag in SQLConf, which is true by default.

Why are the changes needed?

To keep collations consistent: all collations should use ICU-based case mappings, including the UTF8_BINARY collation.

Does this PR introduce any user-facing change?

Yes, the behaviour of lower & upper string functions for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the ICU_CASE_MAPPINGS_ENABLED flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.
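To make the "JVM-based" side of that comparison concrete, here is a minimal sketch of what plain JVM case mapping does on two edge cases. It is an illustration only, not the Spark implementation: the class and method names are hypothetical, and the use of `Locale.ROOT` is an assumption made so the output does not depend on the machine's default locale. The PR does not list which inputs differ under ICU, so the sketch makes no claim about the ICU results.

```java
import java.util.Locale;

// Hypothetical demo class; not part of Spark or of this PR.
final class JvmCaseMappingDemo {

    // JVM case mappings, pinned to Locale.ROOT (an assumption for reproducibility).
    static String jvmUpper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    static String jvmLower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Render a string as code points so combining marks are visible.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // One-to-many mapping: U+00DF (sharp s) uppercases to "SS".
        System.out.println(jvmUpper("stra\u00DFe"));          // STRASSE
        // U+0130 (capital I with dot above) lowercases to 'i' + combining dot above.
        System.out.println(codePoints(jvmLower("\u0130")));   // U+0069 U+0307
    }
}
```

Characters like these, where one code point maps to several or the mapping is context-dependent, are a reasonable place to start when comparing the two settings of the flag on real data.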

How was this patch tested?

Existing tests, with extended CollationSupport unit tests for Lower/Upper to verify both ICU and JVM behaviour.

Was this patch authored or co-authored using generative AI tooling?

No.

mkaravel

LGTM.

common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java

mkaravel · 2024-06-21T07:07:51Z

Please also update the PR description and explain why we are making this change.

cloud-fan · 2024-06-24T08:19:33Z

thanks, merging to master!

### What changes were proposed in this pull request?

Update the `InitCap` Spark expression to use ICU case mappings for the UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the `ICU_CASE_MAPPINGS_ENABLED` flag in SQLConf, which is true by default.

Note: the same flag is used for the `Lower` & `Upper` expressions, with changes introduced in #47043.

### Why are the changes needed?

To keep collations consistent: all collations should use ICU-based case mappings, including the UTF8_BINARY collation.

### Does this PR introduce _any_ user-facing change?

Yes, the behaviour of the `initcap` string function for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the `ICU_CASE_MAPPINGS_ENABLED` flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

### How was this patch tested?

Existing tests, with extended `CollationSupport` unit tests for InitCap to verify both ICU and JVM behaviour.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47100 from uros-db/change-initcap.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>


… strings

### What changes were proposed in this pull request?

Update the `Lower` & `Upper` Spark expressions to use ICU case mappings for the UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the `ICU_CASE_MAPPINGS_ENABLED` flag in SQLConf, which is `true` by default.

### Why are the changes needed?

To keep collations consistent: all collations should use ICU-based case mappings, including the UTF8_BINARY collation.

### Does this PR introduce _any_ user-facing change?

Yes, the behaviour of the `lower` & `upper` string functions for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the `ICU_CASE_MAPPINGS_ENABLED` flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

### How was this patch tested?

Existing tests, with extended `CollationSupport` unit tests for Lower/Upper to verify both ICU and JVM behaviour.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47043 from uros-db/change-lower-upper.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>


github-actions bot added the SQL label Jun 20, 2024

uros-db changed the title ~~[WIP][SQL] Use ICU for Lower/Upper for JVM version 17+~~ [WIP][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings Jun 20, 2024

Use ICU

194177d

uros-db force-pushed the change-lower-upper branch from ca93b88 to 194177d Compare June 20, 2024 15:11

uros-db changed the title ~~[WIP][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings~~ [SPARK-48681][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings Jun 21, 2024

mkaravel approved these changes Jun 21, 2024


common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java (outdated)

Add some comments

e170e45

cloud-fan approved these changes Jun 24, 2024


cloud-fan closed this in a7dc020 Jun 24, 2024

uros-db mentioned this pull request Jun 26, 2024

[SPARK-48682][SQL] Use ICU in InitCap expression for UTF8_BINARY strings #47100

Closed
