Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47357][SQL] Add support for Upper, Lower, InitCap (all collations) #46008

Closed
wants to merge 4 commits into from

Conversation

mihailom-db
Copy link
Contributor

What changes were proposed in this pull request?

Addition of InitCap support and tests for Upper, Lower, InitCap.

Why are the changes needed?

We need to support functions with collations.

Does this PR introduce any user-facing change?

Yes, we enabled more functions.

How was this patch tested?

Test in CollationStringExpressionsSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Apr 11, 2024
@@ -89,6 +89,45 @@ class CollationStringExpressionsSuite
testRepeat("UNICODE_CI", 3, "abc", 2)
}

test("UPPER check output type on collated string") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's more readable to combine multiple simple checks for expressions like this into one test function like here. In this case we could also combine testUpper, testLower, testInitCap into one function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agreed. Will refactor a bit.

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
testLower("UNICODE_CI", 3, "abc")
}

test("INITCAP check output type on collated string") {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uros-db @nikolamand-db @dbatomic @stefankandic @stevomitric @cloud-fan @MaxGekk I am a bit sceptical about the recent change we introduced in this Suite with UTF8String refactoring. Why are we opting to e2e tests for each expression? Correct me if I am wrong, but these expressions are proven to work e2e with existing tests. Shouldn't we write only unit tests for each expression to test whether these "black boxes" work correctly under different types of collations and reserve e2e tests for multiple stuff, like correctly working from parsing through analysis to execution.

Copy link
Contributor

@uros-db uros-db Apr 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line of reasoning is to cover both unit tests and e2e tests:

  1. first use CollationSupportSuite.java with a lot of rigorous tests to see whether collation-aware implementation introduced in CollationSupport.scala is correct; these tests should only cover "brand new" stuff - i.e. execBinary/execLowercase/execICU behvaiour
  2. then use CollationStringExpressionsSuite.scala with just a couple of SQL tests to verify whether full queries with collation-aware expressions in stringExpressions.scala work as expected; these tests should also cover expression return datatype checks, and test whether casting and collation mismatch is handled properly

I tried to boil this down in: https://issues.apache.org/jira/browse/SPARK-47410

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good practice to always have end-to-end tests for each function. The test coverage doesn't need to be very high (the unit tests should cover all the corner cases), but we should have one.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f3a6ca9 Apr 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants