[SPARK-47414][SQL] Lowercase collation support for regexp expressions #46077

Closed
wants to merge 20 commits into from

@uros-db uros-db commented Apr 16, 2024

What changes were proposed in this pull request?

Introduce collation awareness for regexp expressions: like, ilike, like all, not like all, like any, not like any, rlike, split, regexp_replace, regexp_extract, regexp_extract_all, regexp_count, regexp_substr, regexp_instr. Note: collation support is only enabled for binary (UTF8_BINARY, UNICODE) and lowercase (UTF8_BINARY_LCASE) collations.
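
To illustrate the intended semantics, here is a minimal sketch in plain Python (not the Spark implementation). It assumes that the lowercase collation can be modeled as case-insensitive regex matching and that the binary collations match byte-for-byte as before. The helper names `collation_regex_flags` and `rlike` are hypothetical and exist only for this example.

```python
import re

# Assumption: only these collations are supported, as stated in the PR description.
BINARY_COLLATIONS = {"UTF8_BINARY", "UNICODE"}
LOWERCASE_COLLATION = "UTF8_BINARY_LCASE"


def collation_regex_flags(collation: str) -> int:
    """Return the regex flags used to model the given collation (hypothetical helper)."""
    if collation == LOWERCASE_COLLATION:
        # Lowercase collation is modeled here as case-insensitive matching.
        return re.IGNORECASE
    if collation in BINARY_COLLATIONS:
        # Binary collations keep the default, case-sensitive behavior.
        return 0
    raise ValueError(f"regexp expressions do not support collation {collation!r}")


def rlike(text: str, pattern: str, collation: str = "UTF8_BINARY") -> bool:
    """Model of an rlike-style check: True if the pattern matches anywhere in text."""
    return re.search(pattern, text, collation_regex_flags(collation)) is not None
```

Under this model, `rlike("Spark SQL", "spark")` is false for `UTF8_BINARY` and true for `UTF8_BINARY_LCASE`; any other collation name is rejected.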

Why are the changes needed?

Add collation support for built-in regexp functions in Spark.

Does this PR introduce any user-facing change?

Yes, users should now be able to use collated strings within arguments for built-in regexp functions: like, ilike, like all, not like all, like any, not like any, rlike, split, regexp_replace, regexp_extract, regexp_extract_all, regexp_count, regexp_substr, regexp_instr.

How was this patch tested?

Unit tests for regexp expressions and end-to-end SQL tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Apr 16, 2024
@uros-db uros-db changed the title [DRAFT][SPARK-47414][SQL] Lowercase collation support for regexp expressions [SPARK-47414][SQL] Lowercase collation support for regexp expressions Apr 19, 2024

@mihailom-db mihailom-db left a comment

LGTM

@nikolamand-db nikolamand-db left a comment

Minor concerns, otherwise looks good.

@cloud-fan

thanks, merging to master!

@cloud-fan cloud-fan closed this in b4624bf Apr 25, 2024
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024

Closes apache#46077 from uros-db/SPARK-47414.

Authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>