[SPARK-47476][SQL] Support REPLACE function to work with collated strings #45704
Hi, @miland-db and @cloud-fan . I saw a series of
SQL tag is sufficient, but I don't mind people adding more grouping if the number of PRs is large enough.
sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
Thank you, @cloud-fan. Then, let's not use this. I don't think this is a permanent grouping.
Let's improve the PR's title since it is too generic.
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
heads up: we've done some major code restructuring in #45978, so please sync these changes before moving on. @miland-db, you'll likely need to rewrite the code in this PR, so please follow the guidelines outlined in https://issues.apache.org/jira/browse/SPARK-47410
# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java # sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
just flagging this PR will need a fix for the ICU implementation
(you already added some tests for this that are failing)
# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java # common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
lgtm, @cloud-fan ready for review
the Spark Connect test failure is flaky and unrelated here. I'm merging it to master, thanks!
…ings

### What changes were proposed in this pull request?
Extend built-in string functions to support non-binary, non-lowercase collation for: replace.

### Why are the changes needed?
Update collation support for built-in string functions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes, users should now be able to use COLLATE within arguments for built-in string function REPLACE in Spark SQL queries, using non-binary collations such as UNICODE_CI.

### How was this patch tested?
Unit tests for queries using StringReplace (`CollationStringExpressionsSuite.scala`).

### Was this patch authored or co-authored using generative AI tooling?
No

### Algorithm explanation
- StringSearch.next() returns position of the first character of `search` string in the `source` source. We need to convert this position to position in bytes so we can perform replace operation correctly.
- For UTF8_BINARY_LCASE collation there is no corresponding collator so we have to implement custom logic (`lowercaseReplace`). It is done by performing matching on **lowercase strings** (`source & search`) and using that information to do operations on the **original** `source` string. String building is performed in the same way as for other non-binary collations. Similar logic can be found in existing `int find(UTF8String str, int start)` & `int indexOf(UTF8String v, int start)` methods.

Closes apache#45704 from miland-db/miland-db/string-replace.

Lead-authored-by: Milan Dankovic <milan.dankovic@databricks.com>
Co-authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Extend built-in string functions to support non-binary, non-lowercase collation for: replace.
Why are the changes needed?
Update collation support for built-in string functions in Spark.
Does this PR introduce any user-facing change?
Yes, users should now be able to use COLLATE within arguments for built-in string function REPLACE in Spark SQL queries, using non-binary collations such as UNICODE_CI.
How was this patch tested?
Unit tests for queries using StringReplace (`CollationStringExpressionsSuite.scala`).
Was this patch authored or co-authored using generative AI tooling?
No
Algorithm explanation
- `StringSearch.next()` returns the position of the first character of the `search` string in the `source` string. We need to convert this position to a position in bytes so we can perform the replace operation correctly.
- For the UTF8_BINARY_LCASE collation there is no corresponding collator, so we have to implement custom logic (`lowercaseReplace`). It is done by performing matching on lowercase strings (`source` & `search`) and using that information to do operations on the original `source` string. String building is performed in the same way as for other non-binary collations. Similar logic can be found in the existing `int find(UTF8String str, int start)` & `int indexOf(UTF8String v, int start)` methods.
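The two ideas in the algorithm explanation (match on lowercased text while editing the original, and turn a character position into a byte position) can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code from this PR: the class `CollatedReplaceSketch` and all of its method names are hypothetical, it works on `java.lang.String` code points rather than on `UTF8String` bytes, and it lowercases one code point at a time with `Character.toLowerCase(int)` so that positions in the lowercased text line up with positions in the original. Returning the source unchanged for an empty search string is also an assumption of the sketch. The ICU `StringSearch` path is not shown because ICU4J is not part of the JDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the implementation from this PR.
final class CollatedReplaceSketch {

    // Lowercase code point by code point, so index i in the result
    // corresponds to index i in the original code point array.
    static int[] lowerCodePoints(String s) {
        return s.codePoints().map(cp -> Character.toLowerCase(cp)).toArray();
    }

    // True if pattern occurs in text starting at code point index start.
    static boolean matchesAt(int[] text, int[] pattern, int start) {
        if (start + pattern.length > text.length) {
            return false;
        }
        for (int j = 0; j < pattern.length; j++) {
            if (text[start + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }

    // Finds matches on the lowercased copies, but builds the result
    // from the ORIGINAL source so unmatched text keeps its case.
    static String lowercaseReplace(String source, String search, String replace) {
        int[] original = source.codePoints().toArray();
        int[] lowerSource = lowerCodePoints(source);
        int[] lowerSearch = lowerCodePoints(search);
        if (lowerSearch.length == 0) {
            return source; // assumption: empty search leaves the source as-is
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < original.length) {
            if (matchesAt(lowerSource, lowerSearch, i)) {
                out.append(replace);
                i += lowerSearch.length; // skip the matched span
            } else {
                out.appendCodePoint(original[i]);
                i++;
            }
        }
        return out.toString();
    }

    // Converts a char (UTF-16 code unit) index into a UTF-8 byte offset.
    // Assumes charIndex does not fall in the middle of a surrogate pair.
    static int charIndexToByteOffset(String s, int charIndex) {
        return s.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // prints Flink SQL and Flink sql
        System.out.println(lowercaseReplace("Spark SQL and spark sql", "SPARK", "Flink"));
        // prints 3 ("h" is 1 byte, "\u00e9" is 2 bytes in UTF-8)
        System.out.println(charIndexToByteOffset("h\u00e9llo", 2));
    }
}
```

Lowercasing per code point is a deliberate simplification: `String.toLowerCase` can change the length of a string for some characters, which would break the direct index mapping between the lowercased text and the original.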