Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47210][SQL] Addition of implicit casting without indeterminate support #45383

Closed
wants to merge 80 commits into from

Conversation

mihailom-db
Copy link
Contributor

@mihailom-db mihailom-db commented Mar 5, 2024

What changes were proposed in this pull request?

This PR adds automatic casting and collations resolution as per PGSQL behaviour:

  1. Collations set on the metadata level are implicit
  2. Collations set using the COLLATE expression are explicit
  3. When there is a combination of expressions of multiple collations the output will be:
  • if there are explicit collations and all of them are equal then that collation will be the output
  • if there are multiple different explicit collations COLLATION_MISMATCH.EXPLICIT will be thrown
  • if there are no explicit collations and only a single type of non default collation, that one will be used
  • if there are no explicit collations and multiple non-default implicit ones COLLATION_MISMATCH.IMPLICIT will be thrown

INDETERMINATE_COLLATION should only be thrown on comparison operations and memory storing of data, and we should be able to combine different implicit collations for certain operations like concat and possible others in the future.
This is why we have to add another predefined collation id named INDETERMINATE_COLLATION_ID which means that the result is a combination of conflicting non-default implicit collations. Right now it would an id of -1 so it fail if it ever goes to the CollatorFactory.

Why are the changes needed?

We need to be able to compare columns and values with different collations and set a way of explicitly changing the collation we want to use.

Does this PR introduce any user-facing change?

Yes. We add 3 new errors and enable collation casting.

How was this patch tested?

Tests in CollationSuite were done to check code validity.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Mar 5, 2024
@github-actions github-actions bot added the DOCS label Mar 5, 2024
@mihailom-db mihailom-db changed the title [SPARK-47210][SQL][COLLATION] Implicit casting on collated expressions [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions Mar 5, 2024
@mihailom-db mihailom-db changed the title [SPARK-47210][SQL][COLLATION][WIP] Implicit casting on collated expressions [SPARK-47210][SQL][COLLATION] Implicit casting on collated expressions Mar 7, 2024
case sc@(_: In
| _: InSubquery
| _: CreateArray
| _: ComplexTypeMergingExpression
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's narrow down the scope. Not all ComplexTypeMergingExpression expressions require all its inputs to be the same type. Please follow existing implicit cast rules and match If, CaseWhen, etc. explicitly.

* This rule is used for managing expressions that have possible implicit casts from different
* types in ImplicitTypeCasts rule.
*/
object PostCollationTypeCasts extends CollationTypeCasts {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this rule need two phases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catalyst rules are executed with a loop, it's OK to wait for the next iteration and adjust the implicit casts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some expressions might be cast from different types, e.g. (NullType or IntegerType) to StringType, and we want to make sure these are picked up and accordingly cast to the proper collation for expressions that lie into that category (i.e. expressions that extend ImpicitInputTypeCast and ExpectsInputType, as well as BinaryExpressions). It is either this or fine-picking these expressions in ImplicitTypeCasts, but this approach seemed more appropriate. What do you think? (We only do this for ArrayJoin, Substring and BinaryExpressions, as these are the only supported ones for collations so far)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't PreCollationTypeCasts alone do the work?

// If we cannot do the implicit cast, just use the original input.
implicitCast(in, expected).getOrElse(in)
case (in, expected) => implicitCast(in, expected).getOrElse(in)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert these cosmetic changes.

}
e.withNewChildren(children)

case e: ExpectsInputTypes if e.inputTypes.nonEmpty =>
// Convert NullType into some specific target type for ExpectsInputTypes that don't do
// general implicit casting.
val children: Seq[Expression] = e.children.zip(e.inputTypes).map { case (in, expected) =>
val children: Seq[Expression] =
e.children.zip(e.inputTypes).map { case (in, expected) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

case sc@(_: In
| _: InSubquery
| _: CreateArray
| _: If
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add more case matches as some expressions do not require all its inputs to be the same type

case if: If =>
  if.withNewChildren(if.predicate +: collateToSingleType(if.trueValue, if.falseValue))

val newValues = collateToSingleType(
caseWhenExpr.branches.map(b => b._2) ++ caseWhenExpr.elseValue)
caseWhenExpr.withNewChildren(
interleave(Seq.empty, caseWhenExpr.branches.map(b => b._1), newValues))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks a bit complicated now. Can we follow the existing rule?

  object CaseWhenCoercion extends TypeCoercionRule {
    override val transform: PartialFunction[Expression, Expression] = {
      case c: CaseWhen if c.childrenResolved && !haveSameType(c.inputTypesForMerging) =>
        val maybeCommonType = findWiderCommonType(c.inputTypesForMerging)
        maybeCommonType.map { commonType =>
          val newBranches = c.branches.map { case (condition, value) =>
            (condition, castIfNotSameType(value, commonType))
          }
          val newElseValue = c.elseValue.map(castIfNotSameType(_, commonType))
          CaseWhen(newBranches, newElseValue)
        }.getOrElse(c)
    }
  }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this exposes some gaps in this new rule

  1. We should add a trigger condition and only enter the branch if !haveSameType(...)
  2. We should not blindly add cast but use castIfNotSameType

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually do not blindly Cast in CollationTypeCasts. We only cast if we get different collations, as all other cases go to null branch in castStringType. Maybe it is better to keep this check internal to this rule, as we will add the priority flag and we will need to handle casting of priority in this rule as well later. Am adding the haveSameType for now to make it check the input types, but will change this code in the following PR to include priorities in the internal implementation of haveSameType check.

case substrExpr: Substring =>
// This case is necessary for changing Substring input to implicit collation
substrExpr.withNewChildren(
collateToSingleType(Seq(substrExpr.str)) :+ substrExpr.pos :+ substrExpr.len)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it. Why do we find the common collation for a single type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we do not need it, I can revert this change and leave it for default collation handling. Because if we add a flag for collation priority, we will need to change the priority of the input to implicit, if it was default before.

@cloud-fan
Copy link
Contributor

the yarn and docker test failures are unrelated, thanks, merging to master!

@cloud-fan cloud-fan closed this in e3aab8c Apr 3, 2024
cloud-fan pushed a commit that referenced this pull request Apr 8, 2024
…ypeCollated

### What changes were proposed in this pull request?
Renaming simpleString in StringTypeAnyCollation. This PR should only be merged after #45383 is merged.

### Why are the changes needed?
[SPARK-47296](#45422) introduced a change to fail all unsupported functions. Because of this change expected inputTypes in ExpectsInputTypes had to be changed. This change introduced a change on user side which will print "STRING_ANY_COLLATION" in places where before we printed "STRING" when an error occurred. Concretely if we get an input of Int where StringTypeAnyCollation was expected, we will throw this faulty message for users.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Existing tests were changed back to "STRING" notation.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45694 from mihailom-db/SPARK-47504.

Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
)
}

def indeterminateCollationError(): Throwable = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it used somewhere or supposed to use? BTW, this PR #47740 is going to remove it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants