Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-47296][SQL][COLLATION] Fail unsupported functions for non-binary collations #45422

Closed
wants to merge 29 commits into from
Closed
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
29 commits
Select commit Hold shift + click to select a range
e0485fc
Fail all 10 regexp expressions for non-binary collations
uros-db Mar 7, 2024
355eee9
Reformat code
uros-db Mar 8, 2024
cde9eff
Refactoring and better test coverage
uros-db Mar 12, 2024
616335b
Lockdown using StringType
uros-db Mar 13, 2024
e6bec45
Merge branch 'apache:master' into regexp-functions
uros-db Mar 13, 2024
77e606c
Lockdown for regexpExpressions
uros-db Mar 13, 2024
ee72ed3
Renaming and small fixes
uros-db Mar 13, 2024
8bc10f5
Code style fixes
uros-db Mar 13, 2024
8ee0f58
Move StringType lockdown condition to case object
uros-db Mar 13, 2024
d3a9e70
Fix checkInputDataTypes
uros-db Mar 14, 2024
c11c458
Implicit cast for new StringTypes
uros-db Mar 14, 2024
4778951
Implicit cast for new StringTypes
uros-db Mar 14, 2024
1582f24
Separate suite for stringExpressions
uros-db Mar 14, 2024
8d75846
Collation test fix
uros-db Mar 14, 2024
63caef6
Reformat
uros-db Mar 15, 2024
38c3906
Remove extra changes
uros-db Mar 15, 2024
1d60cef
Rewrite collation mismatch check
uros-db Mar 15, 2024
2f5aba9
Improve lockdown handling
uros-db Mar 15, 2024
b9b185d
Add ANSI type coercion
uros-db Mar 15, 2024
3fed6cb
Better ANSI type coercion explanation
uros-db Mar 15, 2024
638c6d4
Fix type coercion
uros-db Mar 15, 2024
af7611f
Fix ANSI type coercion
uros-db Mar 17, 2024
ceabb4a
Merge branch 'apache:master' into regexp-functions
uros-db Mar 17, 2024
0f4a6b1
Fix failing tests
mihailom-db Mar 19, 2024
9b24fba
Incorporate requested changes
mihailom-db Mar 19, 2024
831197a
Merge branch 'apache:master' into regexp-functions
mihailom-db Mar 20, 2024
99b36b0
Refactor code and fix comments
mihailom-db Mar 20, 2024
9051923
Fix newlines
mihailom-db Mar 20, 2024
6986b8b
Add back newlines
mihailom-db Mar 20, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
* equality and hashing).
*/
def isBinaryCollation: Boolean = CollationFactory.fetchCollation(collationId).isBinaryCollation
def isLowercaseCollation: Boolean = collationId == CollationFactory.LOWERCASE_COLLATION_ID
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove even this guy and push the check into StringTypeBinaryLcase?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is possible. StringTypeBinaryLcase does not extend StringType, and the point of this function for now is to call it on StringType object in acceptsType to check if we should let the function proceed with that input.


/**
* Type name that is shown to the customer.
Expand All @@ -54,8 +55,6 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa

override def hashCode(): Int = collationId.hashCode()

override private[sql] def acceptsType(other: DataType): Boolean = other.isInstanceOf[StringType]

/**
* The default size of a value of the StringType is 20 bytes.
*/
Expand All @@ -65,6 +64,8 @@ class StringType private(val collationId: Int) extends AtomicType with Serializa
}

/**
* Use StringType for expressions supporting only binary collation.
*
* @since 1.3.0
*/
@Stable
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -205,6 +205,10 @@ object AnsiTypeCoercion extends TypeCoercionBase {
case (StringType, AnyTimestampType) =>
Some(AnyTimestampType.defaultConcreteType)

// If a function expects StringType, no StringType instance should be implicitly cast to
// StringType with default collation.
case (st: StringType, StringType) => Some(st)

case (DateType, AnyTimestampType) =>
Some(AnyTimestampType.defaultConcreteType)

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -994,8 +994,10 @@ object TypeCoercion extends TypeCoercionBase {
case (StringType, datetime: DatetimeType) => datetime
case (StringType, AnyTimestampType) => AnyTimestampType.defaultConcreteType
case (StringType, BinaryType) => BinaryType
case (st: StringType, StringType) => st
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reading the code around here, I think null means no implicit cast.

// Cast any atomic type to string.
case (any: AtomicType, StringType) if any != StringType => StringType
uros-db marked this conversation as resolved.
Show resolved Hide resolved
case (any: AtomicType, _: StringTypeCollated) if any != StringType => StringType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code can be more readable if we call StringTypeCollated#defaultConcreteType, which is StringType(0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this what you meant?


// When we reach here, input type is not acceptable for any types in this type collection,
// try to find the first one we can implicitly cast.
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.sql.catalyst.expressions

import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.DataTypeMismatch
import org.apache.spark.sql.catalyst.util.CollationFactory
import org.apache.spark.sql.types.{AbstractDataType, DataType, StringType}

object CollationTypeConstraints {

def checkCollationCompatibility(collationId: Int, dataTypes: Seq[DataType]): TypeCheckResult = {
val collationName = CollationFactory.fetchCollation(collationId).collationName
// Additional check needed for collation compatibility
dataTypes.collectFirst {
case stringType: StringType if stringType.collationId != collationId =>
val collation = CollationFactory.fetchCollation(stringType.collationId)
DataTypeMismatch(
errorSubClass = "COLLATION_MISMATCH",
messageParameters = Map(
"collationNameLeft" -> collationName,
"collationNameRight" -> collation.collationName
)
)
} getOrElse TypeCheckResult.TypeCheckSuccess
}

}

/**
* StringTypeCollated is an abstract class for StringType with collation support.
*/
abstract class StringTypeCollated extends AbstractDataType {
override private[sql] def defaultConcreteType: DataType = StringType
}

/**
* Use StringTypeBinary for expressions supporting only binary collation.
*/
case object StringTypeBinary extends StringTypeCollated {
override private[sql] def simpleString: String = "string_binary"
override private[sql] def acceptsType(other: DataType): Boolean =
other.isInstanceOf[StringType] && other.asInstanceOf[StringType].isBinaryCollation
}

/**
* Use StringTypeBinaryLcase for expressions supporting only binary and lowercase collation.
*/
case object StringTypeBinaryLcase extends StringTypeCollated {
override private[sql] def simpleString: String = "string_binary_lcase"
override private[sql] def acceptsType(other: DataType): Boolean =
other.isInstanceOf[StringType] && (other.asInstanceOf[StringType].isBinaryCollation ||
other.asInstanceOf[StringType].isLowercaseCollation)
}

/**
* Use StringTypeAnyCollation for expressions supporting all possible collation types.
*/
case object StringTypeAnyCollation extends StringTypeCollated {
override private[sql] def simpleString: String = "string_any_collation"
override private[sql] def acceptsType(other: DataType): Boolean = other.isInstanceOf[StringType]
}
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,7 @@ case class Collate(child: Expression, collationName: String)
extends UnaryExpression with ExpectsInputTypes {
private val collationId = CollationFactory.collationNameToId(collationName)
override def dataType: DataType = StringType(collationId)
override def inputTypes: Seq[AbstractDataType] = Seq(StringType)
override def inputTypes: Seq[AbstractDataType] = Seq(StringTypeAnyCollation)

override protected def withNewChildInternal(
newChild: Expression): Expression = copy(newChild)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -501,26 +501,15 @@ abstract class StringPredicate extends BinaryExpression

def compare(l: UTF8String, r: UTF8String): Boolean

override def inputTypes: Seq[DataType] = Seq(StringType, StringType)
override def inputTypes: Seq[AbstractDataType] =
Seq(StringTypeAnyCollation, StringTypeAnyCollation)

override def checkInputDataTypes(): TypeCheckResult = {
val checkResult = super.checkInputDataTypes()
if (checkResult.isFailure) {
return checkResult
}
// Additional check needed for collation compatibility
val rightCollationId: Int = right.dataType.asInstanceOf[StringType].collationId
if (collationId != rightCollationId) {
DataTypeMismatch(
errorSubClass = "COLLATION_MISMATCH",
messageParameters = Map(
"collationNameLeft" -> CollationFactory.fetchCollation(collationId).collationName,
"collationNameRight" -> CollationFactory.fetchCollation(rightCollationId).collationName
)
)
} else {
TypeCheckResult.TypeCheckSuccess
val defaultCheck = super.checkInputDataTypes()
if (defaultCheck.isFailure) {
return defaultCheck
}
CollationTypeConstraints.checkCollationCompatibility(collationId, children.map(_.dataType))
}

protected override def nullSafeEval(input1: Any, input2: Any): Any =
Expand Down
Loading