Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47296][SQL][COLLATION] Fail unsupported functions for non-binary collations #45422

Closed
wants to merge 29 commits into from

Conversation

uros-db
Copy link
Contributor

@uros-db uros-db commented Mar 7, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Currently, all StringType arguments passed to built-in string functions in Spark SQL get treated as binary strings. This behaviour is incorrect for almost all collationIds except the default (0), and we should instead warn the user if they try to use an unsupported collation for the given function. Over time, we should implement the appropriate support for these (function, collation) pairs, but until then - we should have a way to fail unsupported statements in query analysis.

Does this PR introduce any user-facing change?

Yes, users will now get appropriate errors when they try to use an unsupported collation with a given string function.

How was this patch tested?

Tests in CollationSuite to check if these functions work for binary collations and throw exceptions for others.

Was this patch authored or co-authored using generative AI tooling?

Yes.

@github-actions github-actions bot added the SQL label Mar 7, 2024
@cloud-fan
Copy link
Contributor

Without updating StringType.acceptsType, I'm not confident to find out all functions that expect StringType but do not support collation.

@uros-db
Copy link
Contributor Author

uros-db commented Mar 12, 2024

@cloud-fan yes, that is a problem... should we settle only on string functions for now? I think these functions that are meant to work with Strings are more sensitive to this error

on a more important note, even if we were to update StringType.acceptsType, it would still not solve the problem of collation support level (introduced in this PR) that would prevent passing correctly matched arguments to a function that simply does not (yet) support that particular collation - we will be needing this much in the near future, and while we're at it - overriding checkInputTypes seems to solve both

while type coercion is a separate effort, and will probably cover other parts of the codebase, what do we think about implementing this for now? @dbatomic

a bit more context for readers: for now, everything in the codebase that supports StringType will take StringType(#) (any collation) and treat it as the default collation (UTF8_BINARY); this is especially problematic for string functions that essentially return incorrect results without warning

@cloud-fan
Copy link
Contributor

I don't think it's safe to only handle expressions in regexpExpressions.scala. For example, Substring is not there. I don't know how to collect all functions that take StringType, unless we check all expressions one by one.

I'd prefer only updating functions that support collation to have more fine-grained collation check, which shouldn't be many right now.

@uros-db
Copy link
Contributor Author

uros-db commented Mar 13, 2024

@cloud-fan that makes a lot of sense, now new case classes should handle this:

  1. StringType no longer accepts all collationIds, but only the default collationId (0) - i.e. UTF8_BINARY
  2. StringTypeBinary is added to allow binary collations (for now: UTF8_BINARY and UNICODE), but at this time we need this for full lockdown because casting is not ready (casting is a separate effort, and when it's done, we can have StringType accept all binary collations directly; for now, it's incorrect)
  3. StringTypeBinaryLcase is added to allow binary & lowercase (UTF8_BINARY_LCASE, UTF8_BINARY, UNICODE) - this class is important because some expressions will support binary & lowercase, but not other collations at a given time
  4. StringTypeAllCollations is added to allow all collations (for now this is supported only in StringPredicate expressions: Contains, StartsWith, EndsWith) - note that these expressions handle all collations, but can't guarantee that all string arguments have exactly the same collation type, so we still need checkCollationCompatibility in CollationTypeConstraints; once casting is ready, we will delete this

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly LGTM

@uros-db uros-db requested a review from cloud-fan March 15, 2024 10:51
@@ -40,6 +40,7 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
* equality and hashing).
*/
def isBinaryCollation: Boolean = CollationFactory.fetchCollation(collationId).isBinaryCollation
def isLowercaseCollation: Boolean = collationId == CollationFactory.LOWERCASE_COLLATION_ID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove even this guy and push the check into StringTypeBinaryLcase?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is possible. StringTypeBinaryLcase does not extend StringType, and the point of this function for now is to call it on StringType object in acceptsType to check if we should let the function proceed with that input.

@dbatomic
Copy link
Contributor

LGTM

As a follow up we should revisit error messages. IMO it is weird to expose message with "string_any_collation" type to customer. But I think that we can do that as a follow up.

// Cast any atomic type to string.
case (any: AtomicType, StringType) if any != StringType => StringType
case (any: AtomicType, _: StringTypeCollated) if any != StringType => StringType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code can be more readable if we call StringTypeCollated#defaultConcreteType, which is StringType(0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what you meant?

@@ -205,6 +205,11 @@ object AnsiTypeCoercion extends TypeCoercionBase {
case (StringType, AnyTimestampType) =>
Some(AnyTimestampType.defaultConcreteType)

// If a function expects a StringType, no StringType instance should be implicitly cast to
// StringType with a collation that's not accepted (aka. lockdown unsupported collations).
case (StringType, StringType) => None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this case match covered by the first case match case _ if expectedType.acceptsType(inType) => Some(inType)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be case (_: StringType, StringType) ...

@@ -215,6 +220,10 @@ object AnsiTypeCoercion extends TypeCoercionBase {
None
}

// "canANSIStoreAssign" doesn't account for targets extending StringTypeCollated, but
// ANSIStoreAssign is generally expected to return "true" for (AtomicType, StringType)
case (_: AtomicType, _: StringTypeCollated) => Some(StringType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this correct? StringType does not satisfy StringTypeCollated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is. canANSIStoreAssign has a rule for casting AtomicType to StringType, but since StringTypeCollated does not extend StringType, but only AbstractDataType, this cast rule will not be picked up. But I would say this rule has to be improved to check for all canANsiStoreAssign rules.

@@ -994,8 +994,12 @@ object TypeCoercion extends TypeCoercionBase {
case (StringType, datetime: DatetimeType) => datetime
case (StringType, AnyTimestampType) => AnyTimestampType.defaultConcreteType
case (StringType, BinaryType) => BinaryType
case (st: StringType, StringType) => st
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reading the code around here, I think null means no implicit cast.

// If a function expects a StringType, no StringType instance should be implicitly cast to
// StringType with a collation that's not accepted (aka. lockdown unsupported collations).
case (StringType, StringType) => None
case (StringType, _: StringTypeCollated) => None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case should be put before case (StringType, a: AtomicType) =>, otherwise it's useless

@@ -994,8 +994,12 @@ object TypeCoercion extends TypeCoercionBase {
case (StringType, datetime: DatetimeType) => datetime
case (StringType, AnyTimestampType) => AnyTimestampType.defaultConcreteType
case (StringType, BinaryType) => BinaryType
case (st: StringType, StringType) => st
case (st: StringType, _: StringTypeCollated) => st
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be covered by the last default case match?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and no. The following two lines would have made a cast, but I changed them so they doesn't.

@mihailom-db
Copy link
Contributor

LGTM

As a follow up we should revisit error messages. IMO it is weird to expose message with "string_any_collation" type to customer. But I think that we can do that as a follow up.

Could this be an onboarding task?

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 8762e25 Mar 20, 2024
sweisdb pushed a commit to sweisdb/spark that referenced this pull request Apr 1, 2024
…ry collations

### What changes were proposed in this pull request?

### Why are the changes needed?
Currently, all `StringType` arguments passed to built-in string functions in Spark SQL get treated as binary strings. This behaviour is incorrect for almost all collationIds except the default (0), and we should instead warn the user if they try to use an unsupported collation for the given function. Over time, we should implement the appropriate support for these (function, collation) pairs, but until then - we should have a way to fail unsupported statements in query analysis.

### Does this PR introduce _any_ user-facing change?
Yes, users will now get appropriate errors when they try to use an unsupported collation with a given string function.

### How was this patch tested?
Tests in CollationSuite to check if these functions work for binary collations and throw exceptions for others.

### Was this patch authored or co-authored using generative AI tooling?
Yes.

Closes apache#45422 from uros-db/regexp-functions.

Lead-authored-by: Uros Bojanic <157381213+uros-db@users.noreply.github.com>
Co-authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Apr 8, 2024
…ypeCollated

### What changes were proposed in this pull request?
Renaming simpleString in StringTypeAnyCollation. This PR should only be merged after #45383 is merged.

### Why are the changes needed?
[SPARK-47296](#45422) introduced a change to fail all unsupported functions. Because of this change expected inputTypes in ExpectsInputTypes had to be changed. This change introduced a change on user side which will print "STRING_ANY_COLLATION" in places where before we printed "STRING" when an error occurred. Concretely if we get an input of Int where StringTypeAnyCollation was expected, we will throw this faulty message for users.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Existing tests were changed back to "STRING" notation.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45694 from mihailom-db/SPARK-47504.

Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants