[SPARK-47210][SQL] Addition of implicit casting without indeterminate support #45383

mihailom-db · 2024-03-05T08:04:06Z

What changes were proposed in this pull request?

This PR adds automatic casting and collations resolution as per PGSQL behaviour:

Collations set on the metadata level are implicit
Collations set using the COLLATE expression are explicit
When there is a combination of expressions of multiple collations the output will be:

if there are explicit collations and all of them are equal then that collation will be the output
if there are multiple different explicit collations COLLATION_MISMATCH.EXPLICIT will be thrown
if there are no explicit collations and only a single type of non default collation, that one will be used
if there are no explicit collations and multiple non-default implicit ones COLLATION_MISMATCH.IMPLICIT will be thrown

INDETERMINATE_COLLATION should only be thrown on comparison operations and memory storing of data, and we should be able to combine different implicit collations for certain operations like concat and possible others in the future.
This is why we have to add another predefined collation id named INDETERMINATE_COLLATION_ID which means that the result is a combination of conflicting non-default implicit collations. Right now it would an id of -1 so it fail if it ever goes to the CollatorFactory.

Why are the changes needed?

We need to be able to compare columns and values with different collations and set a way of explicitly changing the collation we want to use.

Does this PR introduce any user-facing change?

Yes. We add 3 new errors and enable collation casting.

How was this patch tested?

Tests in CollationSuite were done to check code validity.

Was this patch authored or co-authored using generative AI tooling?

No.

common/utils/src/main/resources/error/error-classes.json

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

common/utils/src/main/resources/error/error-classes.json

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

cloud-fan · 2024-04-02T12:44:47Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

+    case sc@(_: In
+             | _: InSubquery
+             | _: CreateArray
+             | _: ComplexTypeMergingExpression


let's narrow down the scope. Not all ComplexTypeMergingExpression expressions require all its inputs to be the same type. Please follow existing implicit cast rules and match If, CaseWhen, etc. explicitly.

cloud-fan · 2024-04-02T12:45:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

+ *  This rule is used for managing expressions that have possible implicit casts from different
+ *  types in ImplicitTypeCasts rule.
+ */
+object PostCollationTypeCasts extends CollationTypeCasts {


why does this rule need two phases?

Catalyst rules are executed with a loop, it's OK to wait for the next iteration and adjust the implicit casts.

Some expressions might be cast from different types, e.g. (NullType or IntegerType) to StringType, and we want to make sure these are picked up and accordingly cast to the proper collation for expressions that lie into that category (i.e. expressions that extend ImpicitInputTypeCast and ExpectsInputType, as well as BinaryExpressions). It is either this or fine-picking these expressions in ImplicitTypeCasts, but this approach seemed more appropriate. What do you think? (We only do this for ArrayJoin, Substring and BinaryExpressions, as these are the only supported ones for collations so far)

Can't PreCollationTypeCasts alone do the work?

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

cloud-fan · 2024-04-02T12:49:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

          // If we cannot do the implicit cast, just use the original input.
-          implicitCast(in, expected).getOrElse(in)
+          case (in, expected) => implicitCast(in, expected).getOrElse(in)


please revert these cosmetic changes.

cloud-fan · 2024-04-02T12:49:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

        }
        e.withNewChildren(children)

      case e: ExpectsInputTypes if e.inputTypes.nonEmpty =>
        // Convert NullType into some specific target type for ExpectsInputTypes that don't do
        // general implicit casting.
-        val children: Seq[Expression] = e.children.zip(e.inputTypes).map { case (in, expected) =>
+        val children: Seq[Expression] =
+          e.children.zip(e.inputTypes).map { case (in, expected) =>


cloud-fan · 2024-04-03T00:58:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

+    case sc@(_: In
+             | _: InSubquery
+             | _: CreateArray
+             | _: If


We should add more case matches as some expressions do not require all its inputs to be the same type

case if: If => if.withNewChildren(if.predicate +: collateToSingleType(if.trueValue, if.falseValue))

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

cloud-fan · 2024-04-03T08:25:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

+      val newValues = collateToSingleType(
+        caseWhenExpr.branches.map(b => b._2) ++ caseWhenExpr.elseValue)
+      caseWhenExpr.withNewChildren(
+        interleave(Seq.empty, caseWhenExpr.branches.map(b => b._1), newValues))


The code looks a bit complicated now. Can we follow the existing rule?

object CaseWhenCoercion extends TypeCoercionRule { override val transform: PartialFunction[Expression, Expression] = { case c: CaseWhen if c.childrenResolved && !haveSameType(c.inputTypesForMerging) => val maybeCommonType = findWiderCommonType(c.inputTypesForMerging) maybeCommonType.map { commonType => val newBranches = c.branches.map { case (condition, value) => (condition, castIfNotSameType(value, commonType)) } val newElseValue = c.elseValue.map(castIfNotSameType(_, commonType)) CaseWhen(newBranches, newElseValue) }.getOrElse(c) } }

Actually this exposes some gaps in this new rule

We should add a trigger condition and only enter the branch if !haveSameType(...)

We should not blindly add cast but use castIfNotSameType

We actually do not blindly Cast in CollationTypeCasts. We only cast if we get different collations, as all other cases go to null branch in castStringType. Maybe it is better to keep this check internal to this rule, as we will add the priority flag and we will need to handle casting of priority in this rule as well later. Am adding the haveSameType for now to make it check the input types, but will change this code in the following PR to include priorities in the internal implementation of haveSameType check.

cloud-fan · 2024-04-03T08:28:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala

+    case substrExpr: Substring =>
+      // This case is necessary for changing Substring input to implicit collation
+      substrExpr.withNewChildren(
+        collateToSingleType(Seq(substrExpr.str)) :+ substrExpr.pos :+ substrExpr.len)


I don't get it. Why do we find the common collation for a single type?

For now we do not need it, I can revert this change and leave it for default collation handling. Because if we add a flag for collation priority, we will need to change the priority of the input to implicit, if it was default before.

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala

cloud-fan · 2024-04-03T23:49:33Z

the yarn and docker test failures are unrelated, thanks, merging to master!

…ypeCollated ### What changes were proposed in this pull request? Renaming simpleString in StringTypeAnyCollation. This PR should only be merged after #45383 is merged. ### Why are the changes needed? [SPARK-47296](#45422) introduced a change to fail all unsupported functions. Because of this change expected inputTypes in ExpectsInputTypes had to be changed. This change introduced a change on user side which will print "STRING_ANY_COLLATION" in places where before we printed "STRING" when an error occurred. Concretely if we get an input of Int where StringTypeAnyCollation was expected, we will throw this faulty message for users. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Existing tests were changed back to "STRING" notation. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45694 from mihailom-db/SPARK-47504. Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

MaxGekk · 2024-08-14T07:06:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala

+    )
+  }
+
+  def indeterminateCollationError(): Throwable = {


Is it used somewhere or supposed to use? BTW, this PR #47740 is going to remove it.

github-actions bot added the SQL label Mar 5, 2024

Implicit casting on collated expressions

b34544a

mihailom-db force-pushed the SPARK-47210 branch from 16460cf to b34544a Compare March 5, 2024 08:07

Fix doc files

fdbfa44

github-actions bot added the DOCS label Mar 5, 2024

Fix contains, startWith, endWith tests

ce9b027

mihailom-db changed the title ~~[SPARK-47210][SQL][COLLATION] Implicit casting on collated expressions~~ [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions Mar 5, 2024

Fix imports

e537190

stefankandic reviewed Mar 5, 2024

View reviewed changes

common/utils/src/main/resources/error/error-classes.json Outdated Show resolved Hide resolved

stefankandic reviewed Mar 5, 2024

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala Outdated Show resolved Hide resolved