Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49911][SQL] Fix semantic of support binary equality #48472

Conversation

jovanpavl-db
Copy link
Contributor

What changes were proposed in this pull request?

With introduction of trim collation, what was known as supportsBinaryEquality changes, it is now split in isUtf8BinaryType and usesTrimCollation so that it has correct semantics.

Why are the changes needed?

With introduction of trim collation, what was known as supportsBinaryEquality changes, it is now split in isUtf8BinaryType and usesTrimCollation so that it has correct semantics.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Everything is covered with existing tests, no new functionality is added.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Oct 15, 2024
Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so let me check if I understand correctly

we keep supportsBinaryEquality, which will be:

  • true for UTF8_BINARY
  • false for UTF8_BINARY_RTRIM

and we also introduce isUtf8BinaryType, which will be:

  • true for UTF8_BINARY
  • true for UTF8_BINARY_RTRIM

then, we refactor the execution-related code path branches to rely on isUtf8BinaryType, rather than supportsBinaryEquality?

@jovanpavl-db
Copy link
Contributor Author

jovanpavl-db commented Oct 15, 2024

so let me check if I understand correctly

we keep supportsBinaryEquality, which will be:

  • true for UTF8_BINARY
  • false for UTF8_BINARY_RTRIM

and we also introduce isUtf8BinaryType, which will be:

  • true for UTF8_BINARY
  • true for UTF8_BINARY_RTRIM

then, we refactor the execution-related code path branches to rely on isUtf8BinaryType, rather than supportsBinaryEquality?

Everything is correct except the last sentence. We will keep relying on supportsBinaryEquality in most cases (see for example hashing and aggregate). We will use isUtf8BinaryType when we don't want to have different behaviour for UTF8_BINARY_RTRIM and UTF8_BINARY collations. For example take a look at getCollationKey in CollationFactory.
That's why we need both.
I would just like to highlight also semantic value of this change- we are not lying in code (with naming) that UTF_BINARY_RTRIM supports binary equality - it simply does not you can't just take bytes and compare them to get correct result. On the other hand we would like to know type of the collation in some case (for example when using hashing function after trimming) so here isUtf8BinaryType comes to play.

Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, this makes sense - I'm on board with the approach

one small ask for you tho - let's hold off all unnecessary changes in this PR (for example: modifying the branching pattern in CollationSupport classes). I don't like the idea of modifying some code (that the execution flow can't reach anyways at this time, because analysis will block it) and not adding the appropriate tests

please opt for one of the following:

  1. undo all needless & untested changes here, and only keep the minimally required scaffolding for further implementation
  2. keep the supportsBinaryEquality -> isUtf8BinaryType changes, but also apply the proper space trimming policy, whitelist the expression and add corresponding tests

so let's just avoid half-baked code, it can be confusing both for you and the reviewers

@jovanpavl-db
Copy link
Contributor Author

I see, this makes sense - I'm on board with the approach

one small ask for you tho - let's hold off all unnecessary changes in this PR (for example: modifying the branching pattern in CollationSupport classes). I don't like the idea of modifying some code (that the execution flow can't reach anyways at this time, because analysis will block it) and not adding the appropriate tests

please opt for one of the following:

  1. undo all needless & untested changes here, and only keep the minimally required scaffolding for further implementation
  2. keep the supportsBinaryEquality -> isUtf8BinaryType changes, but also apply the proper space trimming policy, whitelist the expression and add corresponding tests

so let's just avoid half-baked code, it can be confusing both for you and the reviewers

Agree, makes sense. Let's go with 1, i.e we want have any changes without tests. Will (today) follow up with prs that will have tests and use that code path.

@jovanpavl-db jovanpavl-db changed the title Fix semantic of support binary equality [SPARK-49911][SQL] Fix semantic of support binary equality Oct 15, 2024
@MaxGekk
Copy link
Member

MaxGekk commented Oct 15, 2024

+1, LGTM. Merging to master.
Thank you, @jovanpavl-db and @uros-db for review.

@MaxGekk MaxGekk closed this in 0b48828 Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants