[SPARK-49334][SQL] str_to_map should check whether the collation values of all parameter types are the same #47825

Closed
wants to merge 7 commits into from
Changes from 2 commits
@@ -23,7 +23,7 @@ import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.analysis.{Resolver, TypeCheckResult, TypeCoercion, UnresolvedAttribute, UnresolvedExtractValue}
import org.apache.spark.sql.catalyst.analysis.FunctionRegistry.{FUNC_ALIAS, FunctionBuilder}
- import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.DataTypeMismatch
+ import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{DataTypeMismatch, TypeCheckSuccess}
import org.apache.spark.sql.catalyst.expressions.Cast._
import org.apache.spark.sql.catalyst.expressions.codegen._
import org.apache.spark.sql.catalyst.expressions.codegen.Block._
@@ -565,11 +565,12 @@ case class StringToMap(text: Expression, pairDelim: Expression, keyValueDelim: E
extends TernaryExpression with ExpectsInputTypes with NullIntolerant {

def this(child: Expression, pairDelim: Expression) = {
- this(child, pairDelim, Literal(":"))
+ this(child, pairDelim, Literal(UTF8String.fromString(":"), child.dataType))
}

def this(child: Expression) = {
- this(child, Literal(","), Literal(":"))
+ this(child, Literal(UTF8String.fromString(","), child.dataType),
+   Literal(UTF8String.fromString(":"), child.dataType))
}

override def stateful: Boolean = true
@@ -583,6 +584,23 @@ case class StringToMap(text: Expression, pairDelim: Expression, keyValueDelim: E

override def dataType: DataType = MapType(first.dataType, first.dataType)

+ override def checkInputDataTypes(): TypeCheckResult = {
+   val defaultCheck = super.checkInputDataTypes()
+   if (defaultCheck.isFailure) {
+     defaultCheck
+   } else if (!TypeCoercion.haveSameType(children.map(_.dataType))) {
+     DataTypeMismatch(
+       errorSubClass = "DATA_DIFF_TYPES",
+       messageParameters = Map(
+         "functionName" -> toSQLId(prettyName),
+         "dataType" -> children.map(_.dataType).map(toSQLType).mkString("[", ", ", "]")
+       )
+     )
+   } else {
+     TypeCheckSuccess
+   }
+ }
Contributor

this is not how we usually enforce collation type coercion

please see CollationTypeCasts.scala

Contributor Author

Let me take a closer look, thank you!
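
CollationTypeCasts.scala is not shown on this page, so the following is only a toy model of the general idea the reviewer points to: instead of rejecting mixed collations inside the expression, a coercion step picks one collation and rewrites every argument to it. All names here (`Arg`, `Strength`, `commonCollation`, `coerce`) are hypothetical, and the precedence rule (explicit over implicit over default, with `UTF8_BINARY` as the default) is an assumption, not a quotation of Spark's code.

```scala
// Toy model only: every type and name here is hypothetical, not a Spark class.
object CollationCoercionSketch {
  // How an argument got its collation, from weakest to strongest.
  sealed trait Strength
  case object Default extends Strength  // e.g. a plain string literal
  case object Implicit extends Strength // e.g. a column declared with a collation
  case object Explicit extends Strength // e.g. an argument wrapped in COLLATE

  final case class Arg(name: String, collation: String, strength: Strength)

  // Assumed default collation for arguments that carry no other information.
  val DefaultCollation = "UTF8_BINARY"

  // Pick a single collation for all arguments, or explain why none exists.
  def commonCollation(args: Seq[Arg]): Either[String, String] = {
    def distinctOf(s: Strength): Seq[String] =
      args.filter(_.strength == s).map(_.collation).distinct

    val explicitOnes = distinctOf(Explicit)
    val implicitOnes = distinctOf(Implicit)

    if (explicitOnes.size > 1) {
      Left("explicit collation mismatch: " + explicitOnes.mkString(", "))
    } else if (explicitOnes.size == 1) {
      Right(explicitOnes.head)
    } else if (implicitOnes.size > 1) {
      Left("implicit collation mismatch: " + implicitOnes.mkString(", "))
    } else {
      Right(implicitOnes.headOption.getOrElse(DefaultCollation))
    }
  }

  // The "cast" step: rewrite every argument to the chosen collation.
  def coerce(args: Seq[Arg]): Either[String, Seq[Arg]] =
    commonCollation(args).map(c => args.map(_.copy(collation = c)))

  def main(argv: Array[String]): Unit = {
    // A collated column with default delimiters: the literals adopt UTF8_LCASE.
    println(coerce(Seq(
      Arg("text", "UTF8_LCASE", Implicit),
      Arg("pairDelim", DefaultCollation, Default),
      Arg("keyValueDelim", DefaultCollation, Default))))

    // Columns with different collations cannot be reconciled in this model.
    println(commonCollation(Seq(
      Arg("text", "UTF8_BINARY", Implicit),
      Arg("pairDelim", "UTF8_LCASE", Implicit),
      Arg("keyValueDelim", "UTF8_BINARY", Implicit))))
  }
}
```

Under this model the default delimiters would not need to carry `child.dataType` themselves, because the coercion step would assign them the collation of `text`.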


private lazy val mapBuilder = new ArrayBasedMapBuilder(first.dataType, first.dataType)

private final lazy val collationId: Int = text.dataType.asInstanceOf[StringType].collationId
@@ -996,6 +996,30 @@ class CollationSQLExpressionsSuite
assert(sql(query).schema.fields.head.dataType.sameType(dataType))
}
})

+ val tableName = "t_diff_collation"
+ withTable(tableName) {
+   sql(s"CREATE TABLE $tableName (" +
+     s"text STRING COLLATE UTF8_BINARY, " +
+     s"pairDelim STRING COLLATE UTF8_LCASE, " +
+     s"keyValueDelim STRING COLLATE UTF8_BINARY) " +
+     s"USING parquet")
Contributor

also, there is an ongoing effort to move such tests to the collations.sql golden file

please see: #47828 and https://issues.apache.org/jira/browse/SPARK-48779

Contributor Author

Updated, I have moved the related tests to the collations.sql file.

+   checkError(
+     exception = intercept[AnalysisException] {
+       sql(s"SELECT str_to_map(text, pairDelim, keyValueDelim) from $tableName")
+     },
+     errorClass = "DATATYPE_MISMATCH.DATA_DIFF_TYPES",
+     sqlState = "42K09",
+     parameters = Map(
+       "functionName" -> "`str_to_map`",
+       "dataType" -> "[\"STRING\", \"STRING COLLATE UTF8_LCASE\", \"STRING\"]",
+       "sqlExpr" -> "\"str_to_map(text, pairDelim, keyValueDelim)\""),
+     context = ExpectedContext(
+       fragment = "str_to_map(text, pairDelim, keyValueDelim)",
+       start = 7,
+       stop = 48)
+   )
+ }
Contributor

@uros-db Aug 21, 2024

we shouldn't use DATATYPE_MISMATCH.DATA_DIFF_TYPES for collation match verification

please see COLLATION_MISMATCH.EXPLICIT, there should be many tests across the codebase marked with "// Collation mismatch"
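
The `start = 7` and `stop = 48` values in the `ExpectedContext` of the test above can be checked by hand. The sketch below assumes the offsets are zero-based and that `stop` is inclusive, which is consistent with the numbers in the test; the helper `offsets` is a hypothetical name and not part of Spark's test utilities.

```scala
object ContextOffsetsSketch {
  // Zero-based start and inclusive stop of `fragment` within `query`,
  // or None when the fragment does not occur in the query.
  def offsets(query: String, fragment: String): Option[(Int, Int)] = {
    val start = query.indexOf(fragment)
    if (start < 0) None else Some((start, start + fragment.length - 1))
  }

  def main(args: Array[String]): Unit = {
    val tableName = "t_diff_collation"
    val fragment = "str_to_map(text, pairDelim, keyValueDelim)"
    val query = s"SELECT $fragment from $tableName"
    // "SELECT " is 7 characters and the fragment is 42 characters long.
    println(offsets(query, fragment)) // prints Some((7,48))
  }
}
```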

}

test("Support RaiseError misc expression with collation") {