-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21247][SQL] Type comparison should respect case-sensitive SQL conf #18460
Conversation
Test build #78831 has finished for PR 18460 at commit
|
Retest this please. |
Test build #78845 has finished for PR 18460 at commit
|
Hi, @hvanhovell . |
Hi, @gatorsmile . |
Hi, @cloud-fan . |
@@ -79,8 +80,12 @@ abstract class DataType extends AbstractDataType { | |||
* Check if `this` and `other` are the same data type when ignoring nullability | |||
* (`StructField.nullable`, `ArrayType.containsNull`, and `MapType.valueContainsNull`). | |||
*/ | |||
private[spark] def sameType(other: DataType): Boolean = | |||
DataType.equalsIgnoreNullability(this, other) | |||
private[spark] def sameType(other: DataType, isCaseSensitive: Boolean = true): Boolean = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should not consider field names in sameType
, @gatorsmile what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that sounds to be a big change. Is there any side-effect to users with JSON and Parquet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May we have some cases that we do care about field names in sameType
? To completely ignore it in sameType
seems risky?
Hi, @cloud-fan and @gatorsmile . |
Test build #79192 has finished for PR 18460 at commit
|
Retest this please . |
Test build #79201 has finished for PR 18460 at commit
|
Retest this please |
Test build #79226 has finished for PR 18460 at commit
|
Hi, @cloud-fan and @gatorsmile . |
if (SQLConf.get.caseSensitiveAnalysis) { | ||
DataType.equalsIgnoreNullability(this, other) | ||
} else { | ||
DataType.equalsIgnoreCaseAndNullability(this, other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we already have DataType.equalsIgnoreCaseAndNullability
, we can use this according to the SQL configuration.
Hi, @hvanhovell . |
@@ -144,6 +144,8 @@ object TypeCoercion { | |||
.orElse((t1, t2) match { | |||
case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) => | |||
findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || containsNull2)) | |||
case (st1 @ StructType(_), st2 @ StructType(_)) if st1.sameType(st2) => | |||
Some(st1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should follow the ArrayType
case and update the nullability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for review, @cloud-fan . Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we handle metadata?
case (st1 @ StructType(fields1), st2 @ StructType(fields2)) if st1.sameType(st2) => | ||
Some(StructType(fields1.zip(fields2).map { case (sf1, sf2) => | ||
val name = if (sf1.name == sf2.name) sf1.name else sf1.name.toLowerCase(Locale.ROOT) | ||
val dataType = findWiderTypeForTwo(sf1.dataType, sf2.dataType).get |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is <i long>
a wider type of <i int>
? can we check with Hive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for making this confused.
I added the comment in the test.
StructType does not widen the types, but supports case-sensitive options.
This line are guarded by if st1.sameType(st2)
. So, we always have the same dataType
.
The reason to use findWiderTypeForTwo
is to get the final nested complex type with the new nullability.
Also, this function is findWiderTypeForTwo
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Hive, it's the same.
hive> select * from t1 union all select * from t2;
FAILED: SemanticException 1:41 Schema of both sides of union should match: Column _c0 is of type struct<a:int> on first table and type struct<a:bigint> on second table. Error encountered near token 't2'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we add the comment here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure!
Test build #79275 has finished for PR 18460 at commit
|
Rebased to the master to resolve conflicts. |
Test build #79295 has finished for PR 18460 at commit
|
Test build #79319 has finished for PR 18460 at commit
|
Retest this please. |
Test build #79322 has finished for PR 18460 at commit
|
Rebased to resolve conflicts. |
Test build #79338 has finished for PR 18460 at commit
|
Hi, @cloud-fan and @gatorsmile and @viirya . |
Hi, @cloud-fan and @gatorsmile . |
val name = if (sf1.name == sf2.name) sf1.name else sf1.name.toLowerCase(Locale.ROOT) | ||
val dataType = findWiderTypeForTwo(sf1.dataType, sf2.dataType).get | ||
StructField(name, dataType, nullable = sf1.nullable || sf2.nullable) | ||
})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we also do this in findWiderTypeWithoutStringPromotionForTwo
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you again, @viirya . I updated it.
The test cases are added. Thank you, @gatorsmile ! |
Test build #82523 has finished for PR 18460 at commit
|
Test build #82527 has finished for PR 18460 at commit
|
When you have a chance, could you review this please, @gatorsmile ? |
Retest this please. |
Test build #82544 has finished for PR 18460 at commit
|
Gentle ping~, @gatorsmile . :) |
Hi, @gatorsmile . |
// - Different nullabilities: `nullable` is true iff one of them is nullable. | ||
val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT) | ||
val dataType = findTightestCommonType(f1.dataType, f2.dataType).get | ||
StructField(name, dataType, nullable = f1.nullable || f2.nullable) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we follow what we are doing for union
/except
/intersect
? Always pick the name of the head one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the example,
sql("SELECT 1 as a UNION ALL (SELECT 1 as A)").show()
sql("SELECT 1 as A UNION ALL (SELECT 1 as a)").show()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR works as you want. This function is used to compare the equality only. BTW, for this function, it should use one of lower or upper case because it should be commutative.
scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").printSchema
root
|-- named_struct(a, 1 AS `a`): struct (nullable = false)
| |-- a: integer (nullable = false)
scala> sql("SELECT struct(1 A) UNION ALL (SELECT struct(2 a))").printSchema
root
|-- named_struct(A, 1 AS `A`): struct (nullable = false)
| |-- A: integer (nullable = false)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT)
The above code changes the case, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, right. It's for commutativity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see TypeCoercionSuite.checkWidenType.
In order to use the first type name, we need to loosen this test helper function and to break the existing commutative assumption. I'm ok for that if you want.
Please let me know if there is something to do more~ Thank you always, @gatorsmile . |
@@ -131,14 +131,17 @@ class TypeCoercionSuite extends AnalysisTest { | |||
widenFunc: (DataType, DataType) => Option[DataType], | |||
t1: DataType, | |||
t2: DataType, | |||
expected: Option[DataType]): Unit = { | |||
expected: Option[DataType], | |||
isSymmetric: Boolean = true): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile . I extended this function for using non-symmetric tests and addressed your comments.
Test build #82647 has finished for PR 18460 at commit
|
It seems to be an irrelevant Python failure.
|
Retest this please. |
Test build #82649 has finished for PR 18460 at commit
|
Thanks you, @gatorsmile . Now, it's simplified more. |
Hi, @gatorsmile and @cloud-fan . |
LGTM cc @cloud-fan |
BTW, we are unable to merge this to Spark 2.2 although this is a bug fix. |
Thank you, @gatorsmile . Sure, I agree. |
LGTM, merging to master! |
Thank you, @cloud-fan , @gatorsmile , and @viirya !!! |
What changes were proposed in this pull request?
This is an effort to reduce the difference between Hive and Spark. Spark supports case-sensitivity in columns. Especially, for Struct types, with
spark.sql.caseSensitive=true
, the following is supported.And vice versa, with
spark.sql.caseSensitive=false
, the following is supported.However, types are considered different. For example, SET operations fail.
This PR aims to support case-insensitive type equality. For example, in Set operation, the above operation succeed when
spark.sql.caseSensitive=false
.How was this patch tested?
Pass the Jenkins with a newly add test case.