-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-47210][SQL] Addition of implicit casting without indeterminate support #45383
Conversation
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
Outdated
Show resolved
Hide resolved
case sc@(_: In | ||
| _: InSubquery | ||
| _: CreateArray | ||
| _: ComplexTypeMergingExpression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's narrow down the scope. Not all ComplexTypeMergingExpression
expressions require all its inputs to be the same type. Please follow existing implicit cast rules and match If, CaseWhen, etc. explicitly.
* This rule is used for managing expressions that have possible implicit casts from different | ||
* types in ImplicitTypeCasts rule. | ||
*/ | ||
object PostCollationTypeCasts extends CollationTypeCasts { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this rule need two phases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catalyst rules are executed with a loop, it's OK to wait for the next iteration and adjust the implicit casts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some expressions might be cast from different types, e.g. (NullType or IntegerType) to StringType, and we want to make sure these are picked up and accordingly cast to the proper collation for expressions that lie into that category (i.e. expressions that extend ImpicitInputTypeCast and ExpectsInputType, as well as BinaryExpressions). It is either this or fine-picking these expressions in ImplicitTypeCasts, but this approach seemed more appropriate. What do you think? (We only do this for ArrayJoin, Substring and BinaryExpressions, as these are the only supported ones for collations so far)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't PreCollationTypeCasts
alone do the work?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
Show resolved
Hide resolved
// If we cannot do the implicit cast, just use the original input. | ||
implicitCast(in, expected).getOrElse(in) | ||
case (in, expected) => implicitCast(in, expected).getOrElse(in) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please revert these cosmetic changes.
} | ||
e.withNewChildren(children) | ||
|
||
case e: ExpectsInputTypes if e.inputTypes.nonEmpty => | ||
// Convert NullType into some specific target type for ExpectsInputTypes that don't do | ||
// general implicit casting. | ||
val children: Seq[Expression] = e.children.zip(e.inputTypes).map { case (in, expected) => | ||
val children: Seq[Expression] = | ||
e.children.zip(e.inputTypes).map { case (in, expected) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
case sc@(_: In | ||
| _: InSubquery | ||
| _: CreateArray | ||
| _: If |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add more case matches as some expressions do not require all its inputs to be the same type
case if: If =>
if.withNewChildren(if.predicate +: collateToSingleType(if.trueValue, if.falseValue))
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
Outdated
Show resolved
Hide resolved
val newValues = collateToSingleType( | ||
caseWhenExpr.branches.map(b => b._2) ++ caseWhenExpr.elseValue) | ||
caseWhenExpr.withNewChildren( | ||
interleave(Seq.empty, caseWhenExpr.branches.map(b => b._1), newValues)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks a bit complicated now. Can we follow the existing rule?
object CaseWhenCoercion extends TypeCoercionRule {
override val transform: PartialFunction[Expression, Expression] = {
case c: CaseWhen if c.childrenResolved && !haveSameType(c.inputTypesForMerging) =>
val maybeCommonType = findWiderCommonType(c.inputTypesForMerging)
maybeCommonType.map { commonType =>
val newBranches = c.branches.map { case (condition, value) =>
(condition, castIfNotSameType(value, commonType))
}
val newElseValue = c.elseValue.map(castIfNotSameType(_, commonType))
CaseWhen(newBranches, newElseValue)
}.getOrElse(c)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually this exposes some gaps in this new rule
- We should add a trigger condition and only enter the branch if
!haveSameType(...)
- We should not blindly add cast but use
castIfNotSameType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually do not blindly Cast in CollationTypeCasts. We only cast if we get different collations, as all other cases go to null branch in castStringType. Maybe it is better to keep this check internal to this rule, as we will add the priority flag and we will need to handle casting of priority in this rule as well later. Am adding the haveSameType for now to make it check the input types, but will change this code in the following PR to include priorities in the internal implementation of haveSameType check.
case substrExpr: Substring => | ||
// This case is necessary for changing Substring input to implicit collation | ||
substrExpr.withNewChildren( | ||
collateToSingleType(Seq(substrExpr.str)) :+ substrExpr.pos :+ substrExpr.len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get it. Why do we find the common collation for a single type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now we do not need it, I can revert this change and leave it for default collation handling. Because if we add a flag for collation priority, we will need to change the priority of the input to implicit, if it was default before.
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collationExpressions.scala
Outdated
Show resolved
Hide resolved
the yarn and docker test failures are unrelated, thanks, merging to master! |
…ypeCollated ### What changes were proposed in this pull request? Renaming simpleString in StringTypeAnyCollation. This PR should only be merged after #45383 is merged. ### Why are the changes needed? [SPARK-47296](#45422) introduced a change to fail all unsupported functions. Because of this change expected inputTypes in ExpectsInputTypes had to be changed. This change introduced a change on user side which will print "STRING_ANY_COLLATION" in places where before we printed "STRING" when an error occurred. Concretely if we get an input of Int where StringTypeAnyCollation was expected, we will throw this faulty message for users. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Existing tests were changed back to "STRING" notation. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45694 from mihailom-db/SPARK-47504. Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
) | ||
} | ||
|
||
def indeterminateCollationError(): Throwable = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it used somewhere or supposed to use? BTW, this PR #47740 is going to remove it.
What changes were proposed in this pull request?
This PR adds automatic casting and collations resolution as per
PGSQL
behaviour:COLLATE
expression are explicitCOLLATION_MISMATCH.EXPLICIT
will be thrownCOLLATION_MISMATCH.IMPLICIT
will be thrownINDETERMINATE_COLLATION should only be thrown on comparison operations and memory storing of data, and we should be able to combine different implicit collations for certain operations like concat and possible others in the future.
This is why we have to add another predefined collation id named INDETERMINATE_COLLATION_ID which means that the result is a combination of conflicting non-default implicit collations. Right now it would an id of -1 so it fail if it ever goes to the CollatorFactory.
Why are the changes needed?
We need to be able to compare columns and values with different collations and set a way of explicitly changing the collation we want to use.
Does this PR introduce any user-facing change?
Yes. We add 3 new errors and enable collation casting.
How was this patch tested?
Tests in
CollationSuite
were done to check code validity.Was this patch authored or co-authored using generative AI tooling?
No.