-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48779][SQL][TESTS] Improve collation support testing - add golden files #47828
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some changes are needed to the tests
also, let's limit the number of tests per function to 2-3, for example:
- without explicit collate expressions, e.g.
contains(t1.utf8_lcase, '...')
- with explicit collate expressions, e.g.
contains('...' collate utf8_lcase, t1.utf8_binary)
- verify collation mismatch, e.g.
contains(t1.utf8_lcase, t1.utf8_binary)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the goal of this PR is to improve the collation support testing area, so let's try to put an effort into making testing coverage as complete as possible
here is one proposal, but please take some time to think about it more carefully and feel free to come up with examples of your own - the core idea is to keep things simple, make tests complete, and not to overkill because they are e2e sql tests
- for the purpose of the examples included below, suppose the following setup:
create table t (str string, str_binary string collate utf8_binary, str_lcase string collate utf8_lcase);
insert into t values ("abc", "b", "B")
so here's what we may want to consider:
- two collated columns with the same collation -> works as expected (using the implicit collation)
example:
select contains(str, str_binary) from t1; -- returns: true
- two collated columns with different collations -> collation mismatch error (implicit)
example:
select contains(str, str_lcase) from t1; -- COLLATION_MISMATCH.IMPLICIT
- two collated columns with different collations + explicit "collate" clauses (different collations) -> collation mismatch error (explicit)
example:
select contains(str collate utf8_lcase, str_lcase collate utf8_binary) from t1; -- COLLATION_MISMATCH.EXPLICIT
- two collated columns with different collations + explicit "collate" clauses (same collations) -> works as expected (using the explicit collation)
example:
select contains(str collate utf8_lcase, str_lcase) from t1; -- returns: true
- one collated column and one regular string literal -> works as expected (using the implicit collation)
example:
select contains("ABC", str_binary) from t1; -- returns: false
- one collated column and one collated string literal -> works as expected (using the explicit collation)
example:
select contains("abc" collate utf8_lcase, str_lcase) from t1; -- returns: true
note that this is by no means exhaustive, but can help with coming up with a final approach
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viktorluc-db Could you resolve conflicts, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viktorluc-db let's revive this PR
@MaxGekk could you take a look? |
+1, LGTM. Merging to master. |
…den files ### What changes were proposed in this pull request? Moving certain tests to collations.sql and generating golden files, as said in the PR apache#47620. ### Why are the changes needed? For testing. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? These are tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#47828 from viktorluc-db/GoldenFiles. Authored-by: viktorluc-db <viktor.lucic@databricks.com> Signed-off-by: Max Gekk <max.gekk@gmail.com>
What changes were proposed in this pull request?
Moving certain tests to collations.sql and generating golden files, as said in the PR #47620.
Why are the changes needed?
For testing.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
These are tests.
Was this patch authored or co-authored using generative AI tooling?
No.