Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48681][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings #47043

Closed
wants to merge 2 commits into from

Conversation

uros-db
Copy link
Contributor

@uros-db uros-db commented Jun 20, 2024

What changes were proposed in this pull request?

Update Lower & Upper Spark expressions to use ICU case mappings for UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the ICU_CASE_MAPPINGS_ENABLED flag in SQLConf, which is true by default.

Why are the changes needed?

To keep the consistency between collations - all collations shouls use ICU-based case mappings, including the UTF8_BINARY collation.

Does this PR introduce any user-facing change?

Yes, the behaviour of lower & upper string functions for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the ICU_CASE_MAPPINGS_ENABLED flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

How was this patch tested?

Existing tests, with extended CollationSupport unit tests for Lower/Upper to verify both ICU and JVM behaviour.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 20, 2024
@uros-db uros-db changed the title [WIP][SQL] Use ICU for Lower/Upper for JVM version 17+ [WIP][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings Jun 20, 2024
@uros-db uros-db changed the title [WIP][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings [SPARK-48681][SQL] Use ICU in Lower/Upper expressions for UTF8_BINARY strings Jun 21, 2024
Copy link
Contributor

@mkaravel mkaravel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@mkaravel
Copy link
Contributor

Please also update the PR description and explain why we are making this change.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in a7dc020 Jun 24, 2024
cloud-fan pushed a commit that referenced this pull request Jun 28, 2024
### What changes were proposed in this pull request?
Update `InitCap` Spark expressions to use ICU case mappings for UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the `ICU_CASE_MAPPINGS_ENABLED` flag in SQLConf, which is true by default. Note: the same flag is used for `Lower` & `Upper` expressions, with changes introduced in: #47043.

### Why are the changes needed?
To keep the consistency between collations - all collations shouls use ICU-based case mappings, including the UTF8_BINARY collation.

### Does this PR introduce _any_ user-facing change?
Yes, the behaviour of `initcap` string function for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the `ICU_CASE_MAPPINGS_ENABLED` flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

### How was this patch tested?
Existing tests, with extended `CollationSupport` unit tests for InitCap to verify both ICU and JVM behaviour.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47100 from uros-db/change-initcap.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
asl3 pushed a commit to asl3/spark that referenced this pull request Jul 1, 2024
### What changes were proposed in this pull request?
Update `InitCap` Spark expressions to use ICU case mappings for UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the `ICU_CASE_MAPPINGS_ENABLED` flag in SQLConf, which is true by default. Note: the same flag is used for `Lower` & `Upper` expressions, with changes introduced in: apache#47043.

### Why are the changes needed?
To keep the consistency between collations - all collations shouls use ICU-based case mappings, including the UTF8_BINARY collation.

### Does this PR introduce _any_ user-facing change?
Yes, the behaviour of `initcap` string function for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the `ICU_CASE_MAPPINGS_ENABLED` flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

### How was this patch tested?
Existing tests, with extended `CollationSupport` unit tests for InitCap to verify both ICU and JVM behaviour.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47100 from uros-db/change-initcap.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
… strings

### What changes were proposed in this pull request?
Update `Lower` & `Upper` Spark expressions to use ICU case mappings for UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the `ICU_CASE_MAPPINGS_ENABLED` flag in SQLConf, which is `true` by default.

### Why are the changes needed?
To keep the consistency between collations - all collations shouls use ICU-based case mappings, including the UTF8_BINARY collation.

### Does this PR introduce _any_ user-facing change?
Yes, the behaviour of `lower` & `upper` string functions for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the `ICU_CASE_MAPPINGS_ENABLED` flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

### How was this patch tested?
Existing tests, with extended `CollationSupport` unit tests for Lower/Upper to verify both ICU and JVM behaviour.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47043 from uros-db/change-lower-upper.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?
Update `InitCap` Spark expressions to use ICU case mappings for UTF8_BINARY collation, instead of the currently used JVM case mappings. This behaviour is put under the `ICU_CASE_MAPPINGS_ENABLED` flag in SQLConf, which is true by default. Note: the same flag is used for `Lower` & `Upper` expressions, with changes introduced in: apache#47043.

### Why are the changes needed?
To keep the consistency between collations - all collations shouls use ICU-based case mappings, including the UTF8_BINARY collation.

### Does this PR introduce _any_ user-facing change?
Yes, the behaviour of `initcap` string function for UTF8_BINARY will now rely on ICU-based case mappings. However, by turning the `ICU_CASE_MAPPINGS_ENABLED` flag off, users can get the old JVM-based case mappings. Note that the difference between the two is really subtle.

### How was this patch tested?
Existing tests, with extended `CollationSupport` unit tests for InitCap to verify both ICU and JVM behaviour.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47100 from uros-db/change-initcap.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants