Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48779][SQL][TESTS] Improve collation support testing - add golden files #47828

Closed
wants to merge 10 commits into from

Conversation

viktorluc-db
Copy link
Contributor

What changes were proposed in this pull request?

Moving certain tests to collations.sql and generating golden files, as said in the PR #47620.

Why are the changes needed?

For testing.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

These are tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Aug 21, 2024
Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some changes are needed to the tests

also, let's limit the number of tests per function to 2-3, for example:

  1. without explicit collate expressions, e.g. contains(t1.utf8_lcase, '...')
  2. with explicit collate expressions, e.g. contains('...' collate utf8_lcase, t1.utf8_binary)
  3. verify collation mismatch, e.g. contains(t1.utf8_lcase, t1.utf8_binary)

Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the goal of this PR is to improve the collation support testing area, so let's try to put an effort into making testing coverage as complete as possible

here is one proposal, but please take some time to think about it more carefully and feel free to come up with examples of your own - the core idea is to keep things simple, make tests complete, and not to overkill because they are e2e sql tests

  • for the purpose of the examples included below, suppose the following setup:
create table t (str string, str_binary string collate utf8_binary, str_lcase string collate utf8_lcase);
insert into t values ("abc", "b", "B")

so here's what we may want to consider:

  1. two collated columns with the same collation -> works as expected (using the implicit collation)

example:

select contains(str, str_binary) from t1; -- returns: true
  1. two collated columns with different collations -> collation mismatch error (implicit)

example:

select contains(str, str_lcase) from t1; -- COLLATION_MISMATCH.IMPLICIT
  1. two collated columns with different collations + explicit "collate" clauses (different collations) -> collation mismatch error (explicit)
    example:
select contains(str collate utf8_lcase, str_lcase collate utf8_binary) from t1; -- COLLATION_MISMATCH.EXPLICIT
  1. two collated columns with different collations + explicit "collate" clauses (same collations) -> works as expected (using the explicit collation)
    example:
select contains(str collate utf8_lcase, str_lcase) from t1; -- returns: true
  1. one collated column and one regular string literal -> works as expected (using the implicit collation)
    example:
select contains("ABC", str_binary) from t1; -- returns: false
  1. one collated column and one collated string literal -> works as expected (using the explicit collation)

example:

select contains("abc" collate utf8_lcase, str_lcase) from t1; -- returns: true

note that this is by no means exhaustive, but can help with coming up with a final approach

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viktorluc-db Could you resolve conflicts, please.

Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viktorluc-db let's revive this PR

@viktorluc-db
Copy link
Contributor Author

@MaxGekk could you take a look?

@MaxGekk
Copy link
Member

MaxGekk commented Sep 13, 2024

+1, LGTM. Merging to master.
Thank you, @viktorluc-db and @uros-db for review.

@MaxGekk MaxGekk closed this in 08a26bb Sep 13, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…den files

### What changes were proposed in this pull request?

Moving certain tests to collations.sql and generating golden files, as said in the PR apache#47620.

### Why are the changes needed?

For testing.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

These are tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47828 from viktorluc-db/GoldenFiles.

Authored-by: viktorluc-db <viktor.lucic@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants