[SPARK-47692][SQL] Fix default StringType meaning in implicit casting #45819

Closed
112 commits
b34544a
Implicit casting on collated expressions
mihailom-db Mar 5, 2024
fdbfa44
Fix doc files
mihailom-db Mar 5, 2024
ce9b027
Fix contains, startWith, endWith tests
mihailom-db Mar 5, 2024
e537190
Fix imports
mihailom-db Mar 5, 2024
b5a79c1
Fix docs and incorporate changes
mihailom-db Mar 6, 2024
8321d0c
Fix tests in CollationSuite
mihailom-db Mar 6, 2024
d178233
Add test and incorporate changes
mihailom-db Mar 7, 2024
a4b9be7
Fix golden files
mihailom-db Mar 7, 2024
a6e7662
Incorporate StringType in findWiderCommonType
mihailom-db Mar 8, 2024
e1d7ad5
Merge branch 'master' into SPARK-47210
mihailom-db Mar 8, 2024
b3b1356
Fix ArrayType(StringType, _) casting in findWiderCommonType
mihailom-db Mar 11, 2024
7773d13
Fix type mismatch error
mihailom-db Mar 11, 2024
198a728
Merge branch 'apache:master' into SPARK-47210
mihailom-db Mar 11, 2024
255b1ab
Incorporate changes and fix errors
mihailom-db Mar 11, 2024
9ce417f
Merge branch 'master' into SPARK-47210
mihailom-db Mar 12, 2024
50f3aa2
Fix errors
mihailom-db Mar 12, 2024
ca0c84d
Rework casting
mihailom-db Mar 13, 2024
880a1b1
Merge branch 'master' into SPARK-47210
mihailom-db Mar 13, 2024
56d6c7c
Fix failing tests
mihailom-db Mar 14, 2024
94e5259
Fix array cast errors
mihailom-db Mar 14, 2024
ccb52ba
Fix additional errors
mihailom-db Mar 14, 2024
9b1387b
Fix explicit collation search
mihailom-db Mar 17, 2024
c9974e1
Fix scala style errors
mihailom-db Mar 18, 2024
fca9a65
Add support for ImplicitCastInputTypes
mihailom-db Mar 18, 2024
660d664
Fix accidental change in license header
mihailom-db Mar 18, 2024
c8edd93
Fix null casting
mihailom-db Mar 19, 2024
a91490b
Fix failing tests
mihailom-db Mar 19, 2024
49a8d61
Move implicit casting when strings present
mihailom-db Mar 19, 2024
4c4cd84
Fix unintentional changes
mihailom-db Mar 19, 2024
66122a6
improve types.py
mihailom-db Mar 20, 2024
50f46e4
Refactor code
mihailom-db Mar 21, 2024
cc86a87
Merge branch 'master' into SPARK-47210
mihailom-db Mar 21, 2024
c01e80c
Fix imports and failing tests
mihailom-db Mar 21, 2024
cc797a2
Disable casting of StructTypes
mihailom-db Mar 21, 2024
5d001ee
Fix imports
mihailom-db Mar 21, 2024
c68fc7d
Fix concat tests
mihailom-db Mar 21, 2024
1c926ab
Fix unnecessary repetition
mihailom-db Mar 21, 2024
dec39bf
Remove Elt test
mihailom-db Mar 21, 2024
e808446
Remove tests for Repeat
mihailom-db Mar 21, 2024
ca1a23a
Merge branch 'master' into SPARK-47210
mihailom-db Mar 21, 2024
116931c
Merge branch 'apache:master' into SPARK-47210
mihailom-db Mar 22, 2024
af487a2
Fix failing tests
mihailom-db Mar 22, 2024
4ba7055
Fix nullability for StringType->StringType
mihailom-db Mar 22, 2024
e490e42
Improve comments and switch tests from E2E to unit tests
mihailom-db Mar 24, 2024
00e88e7
Add new tests and remove compatibility test
mihailom-db Mar 25, 2024
85b4d16
Fix conflict resolution mistake
mihailom-db Mar 25, 2024
30f7225
Merge branch 'apache:master' into SPARK-47210
mihailom-db Mar 25, 2024
e89a354
Add indeterminate collation tests
mihailom-db Mar 26, 2024
788dc06
Fix test
mihailom-db Mar 26, 2024
75c0140
Block Alias on Indeterminate
mihailom-db Mar 27, 2024
2918413
Merge remote-tracking branch 'upstream/master' into SPARK-47210
mihailom-db Mar 28, 2024
f6ed55a
Remove introduction of indeterminate collation
mihailom-db Mar 28, 2024
98960c0
Fix import problem
mihailom-db Mar 28, 2024
de623c8
Fix failing tests
mihailom-db Mar 28, 2024
a92b4e1
Fix pyspark error
mihailom-db Mar 28, 2024
f7f3011
Merge branch 'apache:master' into SPARK-47210
mihailom-db Mar 28, 2024
f67808e
Fix errors
mihailom-db Mar 29, 2024
815ce42
Fix schema error
mihailom-db Mar 29, 2024
7fca38a
Merge remote-tracking branch 'upstream/master' into SPARK-47210
mihailom-db Mar 29, 2024
b19b0eb
Fix collated tests
mihailom-db Mar 29, 2024
a111f03
Add isExplicit flag
mihailom-db Mar 29, 2024
55bdd9b
Fix import error
mihailom-db Mar 29, 2024
a7228be
Fix imports in TypeCoercion
mihailom-db Mar 31, 2024
27a72c6
Merge remote-tracking branch 'upstream/master' into SPARK-47210
mihailom-db Apr 1, 2024
18ada04
Add support for explicit propagation in arrays
mihailom-db Apr 1, 2024
38670af
Fix tests to follow recent changes
mihailom-db Apr 1, 2024
01d891e
Incorporate changes
mihailom-db Apr 1, 2024
c5daf86
Fix error
mihailom-db Apr 1, 2024
9ac5678
Change var to val in StringType
mihailom-db Apr 1, 2024
0f1757d
Fix import style
mihailom-db Apr 1, 2024
506c8c0
Revert explicit flag addition
mihailom-db Apr 1, 2024
f743cf8
Narrow down expressions casting
mihailom-db Apr 2, 2024
2ad32c0
Merge branch 'SPARK-47210' into SPARK-47692
mihailom-db Apr 2, 2024
3f46919
Add priority flag
mihailom-db Apr 2, 2024
4f8fe1d
Incorporate minor changes
mihailom-db Apr 2, 2024
52bf4dc
Incorporate changes
mihailom-db Apr 2, 2024
7cbeafe
Special case expressions
mihailom-db Apr 3, 2024
3e92e92
Return new line
mihailom-db Apr 3, 2024
b23e106
Remove indentation cosmetic
mihailom-db Apr 3, 2024
880ebed
Add more cosmetic changes
mihailom-db Apr 3, 2024
c5200fb
Merge branch 'SPARK-47210' into SPARK-47692
mihailom-db Apr 3, 2024
1578b68
Merge remote-tracking branch 'upstream/master' into SPARK-47692
mihailom-db Apr 3, 2024
00bd361
Propagate default collation
mihailom-db Apr 3, 2024
f96ecd9
Incorporate changes
mihailom-db Apr 3, 2024
a1c6f8b
Merge branch 'SPARK-47210' into SPARK-47692
mihailom-db Apr 3, 2024
5002028
Fix priority casting
mihailom-db Apr 3, 2024
35cfeb2
Merge remote-tracking branch 'upstream/master' into SPARK-47692
mihailom-db Apr 4, 2024
a96f3aa
Fix cosmetics in StringType
mihailom-db Apr 4, 2024
3f35fd7
Comment out Substring casting
mihailom-db Apr 5, 2024
f601f8f
Fix substring error
mihailom-db Apr 5, 2024
736c931
Fix import error
mihailom-db Apr 5, 2024
b2dfdab
Merge branch 'master' into SPARK-47692
mihailom-db Apr 8, 2024
b439ad1
Add support for parameter markers
mihailom-db Apr 8, 2024
7dba64b
Merge remote-tracking branch 'upstream/master' into SPARK-47692
mihailom-db Apr 8, 2024
b0d4365
Merge remote-tracking branch 'origin/SPARK-47692' into SPARK-47692
mihailom-db Apr 8, 2024
008a795
Improve test
mihailom-db Apr 11, 2024
f5f45c9
Merge remote-tracking branch 'upstream/master' into SPARK-47692
mihailom-db Apr 11, 2024
d4b72cf
Resolve conflicts
mihailom-db Apr 11, 2024
a8ad4ae
Merge remote-tracking branch 'upstream/master' into SPARK-47692
mihailom-db Apr 12, 2024
de3c660
Fix test
mihailom-db Apr 12, 2024
fa8f7ed
Merge remote-tracking branch 'upstream/master' into SPARK-47692
mihailom-db Apr 12, 2024
5b62bf8
Merge remote-tracking branch 'upstream/master' into SPARK-47692
mihailom-db Apr 19, 2024
c4a61a2
Rework default collation meaning
mihailom-db Apr 19, 2024
5864a9a
Revert unnecessary changes
mihailom-db Apr 19, 2024
5cd6da3
Remove more unrelated changes
mihailom-db Apr 19, 2024
5daff51
Improve tests
mihailom-db Apr 22, 2024
5be0a16
Remove unnecessary test
mihailom-db Apr 22, 2024
6a6175d
Fix imports
mihailom-db Apr 22, 2024
ff58fa0
Remove incorrect test
mihailom-db Apr 22, 2024
e5856fd
Merge remote-tracking branch 'upstream/master' into SPARK-47692
mihailom-db Apr 23, 2024
5e378b0
Fix Cast meaning in collations casting
mihailom-db Apr 23, 2024
a7d2481
Fix casting
mihailom-db Apr 23, 2024
@@ -17,13 +17,14 @@

 package org.apache.spark.sql.internal.types

+import org.apache.spark.sql.internal.SqlApiConf
 import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}

 /**
  * StringTypeCollated is an abstract class for StringType with collation support.
  */
 abstract class AbstractStringType extends AbstractDataType {
-  override private[sql] def defaultConcreteType: DataType = StringType
+  override private[sql] def defaultConcreteType: DataType = SqlApiConf.get.defaultStringType
   override private[sql] def simpleString: String = "string"
 }
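
As a rough illustration of why this hunk swaps the hard-coded `StringType` for a configuration lookup, here is a minimal, Spark-free sketch. `StrType`, `SessionConf`, `AbstractStr`, and `AnyCollationStr` are hypothetical stand-ins for `StringType`, `SqlApiConf`, and `AbstractStringType`; they are assumptions made for the example, not the real classes.

```scala
// Hypothetical stand-ins; none of these are Spark classes.
final case class StrType(collation: String)

object SessionConf {
  // Assumed default; mutable only to mimic a session-level setting.
  @volatile var defaultCollation: String = "UTF8_BINARY"
  def defaultStringType: StrType = StrType(defaultCollation)
}

abstract class AbstractStr {
  // A def, so the configured default is re-read on every call
  // instead of being pinned to one collation.
  def defaultConcreteType: StrType = SessionConf.defaultStringType
}

object AnyCollationStr extends AbstractStr
```

With this shape, changing `SessionConf.defaultCollation` changes what `AnyCollationStr.defaultConcreteType` returns on the next call, which is the behaviour the new line appears to be after.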

@@ -22,7 +22,7 @@ import javax.annotation.Nullable
 import scala.annotation.tailrec

 import org.apache.spark.sql.catalyst.analysis.TypeCoercion.{hasStringType, haveSameType}
-import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Elt, Expression, Greatest, If, In, InSubquery, Least, Overlay, StringLPad, StringRPad}
+import org.apache.spark.sql.catalyst.expressions.{ArrayJoin, BinaryExpression, CaseWhen, Cast, Coalesce, Collate, Concat, ConcatWs, CreateArray, Elt, Expression, Greatest, If, In, InSubquery, Least, Literal, Overlay, StringLPad, StringRPad}
 import org.apache.spark.sql.errors.QueryCompilationErrors
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.{ArrayType, DataType, StringType}
@@ -48,9 +48,9 @@ object CollationTypeCasts extends TypeCoercionRule {
     case eltExpr: Elt =>
       eltExpr.withNewChildren(eltExpr.children.head +: collateToSingleType(eltExpr.children.tail))

-    case overlay: Overlay =>
-      overlay.withNewChildren(collateToSingleType(Seq(overlay.input, overlay.replace))
-        ++ Seq(overlay.pos, overlay.len))
+    case overlayExpr: Overlay =>
+      overlayExpr.withNewChildren(collateToSingleType(Seq(overlayExpr.input, overlayExpr.replace))
+        ++ Seq(overlayExpr.pos, overlayExpr.len))

     case stringPadExpr @ (_: StringRPad | _: StringLPad) =>
       val Seq(str, len, pad) = stringPadExpr.children
@@ -108,7 +108,12 @@ object CollationTypeCasts extends TypeCoercionRule {
    * complex DataTypes with collated StringTypes (e.g. ArrayType)
    */
   def getOutputCollation(expr: Seq[Expression]): StringType = {
-    val explicitTypes = expr.filter(_.isInstanceOf[Collate])
+    val explicitTypes = expr.filter {
+        case _: Collate => true
+        case cast: Cast if cast.getTagValue(Cast.USER_SPECIFIED_CAST).isDefined =>
+          cast.dataType.isInstanceOf[StringType]
+        case _ => false
+      }
       .map(_.dataType.asInstanceOf[StringType].collationId)
       .distinct

@@ -123,17 +128,22 @@ object CollationTypeCasts extends TypeCoercionRule {
         )
       // Only implicit or default collations present
       case 0 =>
-        val implicitTypes = expr.map(_.dataType)
+        val implicitTypes = expr.filter {
+            case Literal(_, _: StringType) => false
+            case cast: Cast if cast.getTagValue(Cast.USER_SPECIFIED_CAST).isEmpty =>
+              cast.child.dataType.isInstanceOf[StringType]
+            case _ => true
+          }
+          .map(_.dataType)
           .filter(hasStringType)
-          .map(extractStringType)
-          .filter(dt => dt.collationId != SQLConf.get.defaultStringType.collationId)
-          .distinctBy(_.collationId)
+          .map(extractStringType(_).collationId)
+          .distinct

         if (implicitTypes.length > 1) {
           throw QueryCompilationErrors.implicitCollationMismatchError()
         }
         else {
-          implicitTypes.headOption.getOrElse(SQLConf.get.defaultStringType)
+          implicitTypes.headOption.map(StringType(_)).getOrElse(SQLConf.get.defaultStringType)
         }
     }
   }
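
The precedence that `getOutputCollation` seems to implement after this change (explicit collations first, then implicit ones, then the session default, with string literals no longer counting as implicit) can be sketched without Spark. Everything below is a toy model: `Expr`, `ExplicitCollate`, `Column`, `StringLiteral`, and `outputCollation` are hypothetical names, and the handling of conflicting explicit collations is an assumption, since the hunks above only show part of that branch.

```scala
// Toy model only; these types are hypothetical and are not Spark's Expression classes.
sealed trait Expr { def collation: String }
// Stands in for Collate or a user-specified Cast to a string type.
final case class ExplicitCollate(collation: String) extends Expr
// Stands in for any non-literal string expression, e.g. a column reference.
final case class Column(collation: String) extends Expr
// Stands in for a string Literal, which the new filter skips.
final case class StringLiteral(collation: String) extends Expr

object CollationResolution {
  /** Returns the chosen collation, or Left(message) on a conflict. */
  def outputCollation(exprs: Seq[Expr], default: String): Either[String, String] = {
    val explicitCollations = exprs.collect { case ExplicitCollate(c) => c }.distinct
    explicitCollations match {
      case Seq(single) => Right(single)
      case Seq() =>
        // Literals are ignored here, so they can never cause a mismatch.
        val implicitCollations = exprs.collect { case Column(c) => c }.distinct
        implicitCollations match {
          case Seq() => Right(default)
          case Seq(single) => Right(single)
          case many => Left(s"implicit collation mismatch: ${many.mkString(", ")}")
        }
      // Assumption: more than one distinct explicit collation is an error.
      case many => Left(s"explicit collation mismatch: ${many.mkString(", ")}")
    }
  }
}
```

Under this model, two literals with different collations resolve to whatever default is passed in, which mirrors the expectations in the "Parameter markers with variable mapping" test in this PR.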
@@ -998,9 +998,10 @@ object TypeCoercion extends TypeCoercionBase {
         case (_: StringType, AnyTimestampType) => AnyTimestampType.defaultConcreteType
         case (_: StringType, BinaryType) => BinaryType
         // Cast any atomic type to string.
-        case (any: AtomicType, _: StringType) if !any.isInstanceOf[StringType] => StringType
+        case (any: AtomicType, st: StringType) if !any.isInstanceOf[StringType] => st
         case (any: AtomicType, st: AbstractStringType)
-          if !any.isInstanceOf[StringType] => st.defaultConcreteType
+          if !any.isInstanceOf[StringType] =>
+          st.defaultConcreteType

         // When we reach here, input type is not acceptable for any types in this type collection,
         // try to find the first one we can implicitly cast.
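
The effect of the `TypeCoercion` hunk can be shown on a small model: before the change, casting a non-string atomic type to a string target always produced the plain `StringType`, and afterwards the target's own collation is kept. The sketch below is Spark-free; `DType`, `IntT`, `StrT`, and both `castTarget*` helpers are hypothetical names chosen for the example, and `UTF8_BINARY` is assumed as the collation of the plain `StringType`.

```scala
// Hypothetical miniature type system; not Spark's DataType hierarchy.
sealed trait DType
case object IntT extends DType
final case class StrT(collation: String) extends DType

object ImplicitCastSketch {
  // Assumption: the plain string type carries the UTF8_BINARY collation.
  val PlainStr: StrT = StrT("UTF8_BINARY")

  // Old behaviour: the expected collation is discarded.
  def castTargetBefore(from: DType, expected: DType): Option[DType] =
    (from, expected) match {
      case (f, _: StrT) if !f.isInstanceOf[StrT] => Some(PlainStr)
      case _ => None
    }

  // New behaviour: the expected string type is returned as-is.
  def castTargetAfter(from: DType, expected: DType): Option[DType] =
    (from, expected) match {
      case (f, st: StrT) if !f.isInstanceOf[StrT] => Some(st)
      case _ => None
    }
}
```

For an integer input and an expected `StrT("UNICODE")`, the old helper yields the plain string type while the new one yields `StrT("UNICODE")`; string inputs are left to other rules in both versions.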
66 changes: 63 additions & 3 deletions sql/core/src/test/scala/org/apache/spark/sql/CollationSuite.scala
@@ -21,6 +21,7 @@ import scala.jdk.CollectionConverters.MapHasAsJava

 import org.apache.spark.SparkException
 import org.apache.spark.sql.catalyst.ExtendedAnalysisException
+import org.apache.spark.sql.catalyst.expressions.Literal
 import org.apache.spark.sql.catalyst.util.CollationFactory
 import org.apache.spark.sql.connector.{DatasourceV2SQLBase, FakeV2ProviderWithCustomSchema}
 import org.apache.spark.sql.connector.catalog.{Identifier, InMemoryTable}
@@ -412,7 +413,7 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
     }
   }

-  test("implicit casting of collated strings") {
+  test("SPARK-47210: Implicit casting of collated strings") {
     val tableName = "parquet_dummy_implicit_cast_t22"
     withTable(tableName) {
       spark.sql(
@@ -566,7 +567,66 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
     }
   }

-  test("cast of default collated strings in IN expression") {
+  test("SPARK-47692: Parameter marker with EXECUTE IMMEDIATE implicit casting") {
+    sql(s"DECLARE stmtStr1 = 'SELECT collation(:var1 || :var2)';")
+    sql(s"DECLARE stmtStr2 = 'SELECT collation(:var1 || (\\\'a\\\' COLLATE UNICODE))';")
+
+    checkAnswer(
+      sql(
+        """EXECUTE IMMEDIATE stmtStr1 USING
+          | 'a' AS var1,
+          | 'b' AS var2;""".stripMargin),
+      Seq(Row("UTF8_BINARY"))
+    )
+
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(
+        sql(
+          """EXECUTE IMMEDIATE stmtStr1 USING
+            | 'a' AS var1,
+            | 'b' AS var2;""".stripMargin),
+        Seq(Row("UNICODE"))
+      )
+    }
+
+    checkAnswer(
+      sql(
+        """EXECUTE IMMEDIATE stmtStr2 USING
+          | 'a' AS var1;""".stripMargin),
+      Seq(Row("UNICODE"))
+    )
+
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(
+        sql(
+          """EXECUTE IMMEDIATE stmtStr2 USING
+            | 'a' AS var1;""".stripMargin),
+        Seq(Row("UNICODE"))
+      )
+    }
+  }
+
+  test("SPARK-47692: Parameter markers with variable mapping") {
+    checkAnswer(
+      spark.sql(
+        "SELECT collation(:var1 || :var2)",
+        Map("var1" -> Literal.create('a', StringType("UTF8_BINARY")),
+          "var2" -> Literal.create('b', StringType("UNICODE")))),
+      Seq(Row("UTF8_BINARY"))
+    )
+
+    withSQLConf(SqlApiConf.DEFAULT_COLLATION -> "UNICODE") {
+      checkAnswer(
+        spark.sql(
+          "SELECT collation(:var1 || :var2)",
+          Map("var1" -> Literal.create('a', StringType("UTF8_BINARY")),
+            "var2" -> Literal.create('b', StringType("UNICODE")))),
+        Seq(Row("UNICODE"))
+      )
+    }
+  }
+
+  test("SPARK-47210: Cast of default collated strings in IN expression") {
     val tableName = "t1"
     withTable(tableName) {
       spark.sql(
@@ -591,7 +651,7 @@ class CollationSuite extends DatasourceV2SQLBase with AdaptiveSparkPlanHelper {
   }

   // TODO(SPARK-47210): Add indeterminate support
-  test("indeterminate collation checks") {
+  test("SPARK-47210: Indeterminate collation checks") {
     val tableName = "t1"
     val newTableName = "t2"
     withTable(tableName) {