[SPARK-21247][SQL] Type comparison should respect case-sensitive SQL conf #18460

dongjoon-hyun · 2017-06-28T22:35:14Z

What changes were proposed in this pull request?

This PR reduces a difference between Hive and Spark. Spark supports case-sensitive column names. In particular, for struct types with spark.sql.caseSensitive=true, the following works.

scala> sql("select named_struct('a', 1, 'A', 2).a").show
+--------------------------+
|named_struct(a, 1, A, 2).a|
+--------------------------+
|                         1|
+--------------------------+

scala> sql("select named_struct('a', 1, 'A', 2).A").show
+--------------------------+
|named_struct(a, 1, A, 2).A|
+--------------------------+
|                         2|
+--------------------------+

Conversely, with spark.sql.caseSensitive=false, the following works.

scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show
+--------------------+--------------------+
|named_struct(a, 1).A|named_struct(A, 1).a|
+--------------------+--------------------+
|                   1|                   1|
+--------------------+--------------------+

However, the struct types themselves are still considered different when their field names differ only in case. For example, set operations fail.

scala> sql("SELECT named_struct('a',1) union all (select named_struct('A',2))").show
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<A:int> <> struct<a:int> at the first column of the second table;;
'Union
:- Project [named_struct(a, 1) AS named_struct(a, 1)#57]
:  +- OneRowRelation$
+- Project [named_struct(A, 2) AS named_struct(A, 2)#58]
   +- OneRowRelation$

This PR aims to support case-insensitive type equality. For example, in a set operation, the query above succeeds when spark.sql.caseSensitive=false.

scala> sql("SELECT named_struct('a',1) union all (select named_struct('A',2))").show
+------------------+
|named_struct(a, 1)|
+------------------+
|               [1]|
|               [2]|
+------------------+

How was this patch tested?

Pass the Jenkins tests with a newly added test case.

SparkQA · 2017-06-29T00:36:31Z

Test build #78831 has finished for PR 18460 at commit f600448.

This patch fails due to an unknown error code, -10.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-06-29T00:49:07Z

Retest this please.

SparkQA · 2017-06-29T03:13:28Z

Test build #78845 has finished for PR 18460 at commit f600448.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-06-29T17:28:04Z

Hi, @hvanhovell .
Could you review this PR?

dongjoon-hyun · 2017-06-30T18:27:31Z

Hi, @gatorsmile .
Could you review this PR?

dongjoon-hyun · 2017-07-03T10:21:41Z

Hi, @cloud-fan .
Could you review this, too?

cloud-fan · 2017-07-04T03:30:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala

@@ -79,8 +80,12 @@ abstract class DataType extends AbstractDataType {
   * Check if `this` and `other` are the same data type when ignoring nullability
   * (`StructField.nullable`, `ArrayType.containsNull`, and `MapType.valueContainsNull`).
   */
-  private[spark] def sameType(other: DataType): Boolean =
-    DataType.equalsIgnoreNullability(this, other)
+  private[spark] def sameType(other: DataType, isCaseSensitive: Boolean = true): Boolean =


Maybe we should not consider field names in sameType. @gatorsmile, what do you think?

Oh, that sounds like a big change. Are there any side effects for users with JSON and Parquet?

Might there be cases where we do care about field names in sameType? Completely ignoring them in sameType seems risky.

dongjoon-hyun · 2017-07-05T06:03:45Z

Hi, @cloud-fan and @gatorsmile .
Could you review this PR again?
I simplified this PR with SQLConf.get.

SparkQA · 2017-07-05T07:04:57Z

Test build #79192 has finished for PR 18460 at commit b41a6b4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-07-05T07:09:49Z

Retest this please .

SparkQA · 2017-07-05T09:30:11Z

Test build #79201 has finished for PR 18460 at commit b41a6b4.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-07-05T14:51:45Z

Retest this please

SparkQA · 2017-07-05T17:11:59Z

Test build #79226 has finished for PR 18460 at commit b41a6b4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-07-05T17:14:38Z

Hi, @cloud-fan and @gatorsmile .
Could you review this PR when you have some time?

dongjoon-hyun · 2017-07-06T05:51:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/types/DataType.scala

+    if (SQLConf.get.caseSensitiveAnalysis) {
+      DataType.equalsIgnoreNullability(this, other)
+    } else {
+      DataType.equalsIgnoreCaseAndNullability(this, other)


Since we already have DataType.equalsIgnoreCaseAndNullability, we can use this according to the SQL configuration.
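
The dispatch in the diff above can be modeled outside Spark. The following is only a sketch: the DataType, Field, and Struct classes are simplified stand-ins invented for illustration (not Spark's classes), and the case-sensitivity setting is passed as a plain parameter instead of being read from SQLConf.

```scala
object SameTypeSketch {
  // Simplified stand-ins for Spark's type hierarchy (hypothetical, for illustration only).
  sealed trait DataType
  case object IntType extends DataType
  final case class Field(name: String, dataType: DataType, nullable: Boolean = true)
  final case class Struct(fields: Seq[Field]) extends DataType

  // Shared traversal: nameEq decides how field names are compared.
  // Nullability is never inspected, so it is ignored by construction.
  private def equalsWith(nameEq: (String, String) => Boolean)(l: DataType, r: DataType): Boolean =
    (l, r) match {
      case (Struct(lf), Struct(rf)) =>
        lf.length == rf.length && lf.zip(rf).forall { case (a, b) =>
          nameEq(a.name, b.name) && equalsWith(nameEq)(a.dataType, b.dataType)
        }
      case _ => l == r
    }

  def equalsIgnoreNullability(l: DataType, r: DataType): Boolean =
    equalsWith(_ == _)(l, r)

  def equalsIgnoreCaseAndNullability(l: DataType, r: DataType): Boolean =
    equalsWith(_.equalsIgnoreCase(_))(l, r)

  // The flag stands in for SQLConf.get.caseSensitiveAnalysis in the diff.
  def sameType(l: DataType, r: DataType, caseSensitiveAnalysis: Boolean): Boolean =
    if (caseSensitiveAnalysis) equalsIgnoreNullability(l, r)
    else equalsIgnoreCaseAndNullability(l, r)
}
```

With this model, struct<a:int> and struct<A:int> compare as different when the flag is true and as the same type when it is false, which is the behavior the PR description asks for.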

dongjoon-hyun · 2017-07-06T06:47:28Z

Hi, @hvanhovell .
Could you review this PR about case-sensitive/insensitive type comparison?

cloud-fan · 2017-07-06T07:53:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

@@ -144,6 +144,8 @@ object TypeCoercion {
      .orElse((t1, t2) match {
        case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) =>
          findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || containsNull2))
+        case (st1 @ StructType(_), st2 @ StructType(_)) if st1.sameType(st2) =>
+          Some(st1)


We should follow the ArrayType case and update the nullability.

Thank you for the review, @cloud-fan . Sure.

How can we handle metadata?

cloud-fan · 2017-07-06T10:54:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

+        case (st1 @ StructType(fields1), st2 @ StructType(fields2)) if st1.sameType(st2) =>
+          Some(StructType(fields1.zip(fields2).map { case (sf1, sf2) =>
+            val name = if (sf1.name == sf2.name) sf1.name else sf1.name.toLowerCase(Locale.ROOT)
+            val dataType = findWiderTypeForTwo(sf1.dataType, sf2.dataType).get


Is <i long> a wider type of <i int>? Can we check with Hive?

Sorry for the confusion.
I added a comment in the test.

StructType does not widen the types, but supports case-sensitive options.

This line is guarded by if st1.sameType(st2), so we always have the same dataType.
The reason to use findWiderTypeForTwo is to get the final nested complex type with the new nullability.
Also, this function is findWiderTypeForTwo.

For Hive, it's the same.

hive> select * from t1 union all select * from t2;
FAILED: SemanticException 1:41 Schema of both sides of union should match: Column _c0 is of type struct<a:int> on first table and type struct<a:bigint> on second table. Error encountered near token 't2'

Shall we add the comment here?
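
The merge discussed in this thread (no widening of field types, nullability combined with a logical OR) can be sketched as follows. This is a simplified model, not the code in the PR: the type classes are hypothetical stand-ins, names are always compared case-insensitively, and the left side's field name is kept as an arbitrary choice.

```scala
object StructMergeSketch {
  // Simplified stand-ins for Spark's type hierarchy (hypothetical, for illustration only).
  sealed trait DataType
  case object IntType extends DataType
  case object LongType extends DataType
  final case class Field(name: String, dataType: DataType, nullable: Boolean)
  final case class Struct(fields: Seq[Field]) extends DataType

  // Merge two types that match apart from field-name case and nullability.
  // Returns None when they differ in any other way (no int -> long widening).
  def merge(t1: DataType, t2: DataType): Option[DataType] = (t1, t2) match {
    case (Struct(f1), Struct(f2)) if f1.length == f2.length =>
      val merged: Seq[Option[Field]] = f1.zip(f2).map { case (a, b) =>
        if (a.name.equalsIgnoreCase(b.name)) {
          // Recurse so nested structs also pick up the combined nullability.
          merge(a.dataType, b.dataType)
            .map(dt => Field(a.name, dt, a.nullable || b.nullable))
        } else {
          None
        }
      }
      if (merged.forall(_.isDefined)) Some(Struct(merged.flatten)) else None
    case _ if t1 == t2 => Some(t1)
    case _ => None
  }
}
```

In this model, merging struct<a:int> with struct<a:bigint> yields None, mirroring the Hive error quoted above, while two structs that differ only in nullability merge into one whose fields are nullable if either side is.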

SparkQA · 2017-07-06T11:57:38Z

Test build #79275 has finished for PR 18460 at commit d3a9f73.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-07-06T17:51:24Z

Rebased to the master to resolve conflicts.

SparkQA · 2017-07-06T20:04:44Z

Test build #79295 has finished for PR 18460 at commit b46f067.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-07-07T07:04:56Z

Test build #79319 has finished for PR 18460 at commit 268367e.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-07-07T07:10:28Z

Retest this please.

SparkQA · 2017-07-07T09:27:47Z

Test build #79322 has finished for PR 18460 at commit 268367e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-07-07T16:22:06Z

Rebased to resolve conflicts.

SparkQA · 2017-07-07T18:44:39Z

Test build #79338 has finished for PR 18460 at commit 7c9bc7e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-07-07T21:30:05Z

Hi, @cloud-fan and @gatorsmile and @viirya .
I updated the PR according to the comments.
Could you review this PR about type comparison when you have some time?

dongjoon-hyun · 2017-07-08T07:29:24Z

Hi, @cloud-fan and @gatorsmile .
Please let me know if there is anything more to do.
Thank you!

viirya · 2017-07-08T09:38:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

+            val name = if (sf1.name == sf2.name) sf1.name else sf1.name.toLowerCase(Locale.ROOT)
+            val dataType = findWiderTypeForTwo(sf1.dataType, sf2.dataType).get
+            StructField(name, dataType, nullable = sf1.nullable || sf2.nullable)
+          }))


Shall we also do this in findWiderTypeWithoutStringPromotionForTwo?

Sure, let me try.

Thank you again, @viirya . I updated it.

dongjoon-hyun · 2017-10-06T19:59:18Z

The test cases are added. Thank you, @gatorsmile !

SparkQA · 2017-10-06T22:44:29Z

Test build #82523 has finished for PR 18460 at commit c72aa18.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-06T23:51:24Z

Test build #82527 has finished for PR 18460 at commit 67a037c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-10-08T19:02:30Z

When you have a chance, could you review this please, @gatorsmile ?

dongjoon-hyun · 2017-10-08T19:02:42Z

Retest this please.

SparkQA · 2017-10-08T21:48:04Z

Test build #82544 has finished for PR 18460 at commit 67a037c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-10-09T16:36:36Z

Gentle ping~, @gatorsmile . :)

dongjoon-hyun · 2017-10-10T19:40:08Z

Hi, @gatorsmile .
Could you review this?

gatorsmile · 2017-10-11T03:46:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

+        // - Different nullabilities: `nullable` is true iff one of them is nullable.
+        val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT)
+        val dataType = findTightestCommonType(f1.dataType, f2.dataType).get
+        StructField(name, dataType, nullable = f1.nullable || f2.nullable)


Should we follow what we are doing for union/except/intersect? Always pick the name of the head one?

See the example,

sql("SELECT 1 as a UNION ALL (SELECT 1 as A)").show()
sql("SELECT 1 as A UNION ALL (SELECT 1 as a)").show()

This PR works as you want. This function is used only to compare equality. BTW, this function should consistently use either lower or upper case because it should be commutative.

scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").printSchema
root
 |-- named_struct(a, 1 AS `a`): struct (nullable = false)
 |    |-- a: integer (nullable = false)

scala> sql("SELECT struct(1 A) UNION ALL (SELECT struct(2 a))").printSchema
root
 |-- named_struct(A, 1 AS `A`): struct (nullable = false)
 |    |-- A: integer (nullable = false)

val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT)

The above code changes the case, right?

Sure, right. It's for commutativity.

Please see TypeCoercionSuite.checkWidenType.

To use the first type's name, we need to loosen this test helper function and break the existing commutativity assumption. I'm OK with that if you want.
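
The trade-off in this thread comes down to two naming policies. The sketch below isolates them as plain string functions; the function names are made up for illustration, and the first one only mirrors the quoted line of the diff.

```scala
import java.util.Locale

object FieldNamePolicySketch {
  // Policy from the quoted diff: keep the name when both sides agree exactly,
  // otherwise fall back to the lower-cased name of the first side.
  def lowerCaseOnMismatch(n1: String, n2: String): String =
    if (n1 == n2) n1 else n1.toLowerCase(Locale.ROOT)

  // Alternative raised in review: always keep the name of the first (head) side,
  // as union/except/intersect do for top-level column names.
  def headName(n1: String, n2: String): String = n1
}
```

For names that differ only in case, lowerCaseOnMismatch gives the same result regardless of argument order, whereas headName depends on which side comes first. That is why adopting the head-name policy requires relaxing a test helper that checks both directions.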

dongjoon-hyun · 2017-10-11T17:07:59Z

Please let me know if there is anything more to do~ Thank you always, @gatorsmile .

dongjoon-hyun · 2017-10-11T20:04:55Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala

@@ -131,14 +131,17 @@ class TypeCoercionSuite extends AnalysisTest {
      widenFunc: (DataType, DataType) => Option[DataType],
      t1: DataType,
      t2: DataType,
-      expected: Option[DataType]): Unit = {
+      expected: Option[DataType],
+      isSymmetric: Boolean = true): Unit = {


@gatorsmile . I extended this function to support non-symmetric tests and addressed your comments.
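
A minimal sketch of what such a helper might look like, assuming it checks the reverse direction only when isSymmetric is set. It is generic over the type being widened and returns a Boolean instead of asserting, so it is not the helper in TypeCoercionSuite itself; firstName is a hypothetical widening function used only for the demonstration.

```scala
object WidenCheckSketch {
  // True when widenFunc yields `expected` for (t1, t2) and,
  // if isSymmetric is set, also for (t2, t1).
  def widenTypeHolds[T](
      widenFunc: (T, T) => Option[T],
      t1: T,
      t2: T,
      expected: Option[T],
      isSymmetric: Boolean = true): Boolean = {
    val forward = widenFunc(t1, t2) == expected
    val backward = !isSymmetric || widenFunc(t2, t1) == expected
    forward && backward
  }

  // Hypothetical non-commutative "widening": keep the first name
  // when the two names match case-insensitively.
  val firstName: (String, String) => Option[String] =
    (a, b) => if (a.equalsIgnoreCase(b)) Some(a) else None
}
```

With firstName, the check for ("a", "A") against Some("a") passes only when isSymmetric = false, because the reverse direction returns Some("A").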

SparkQA · 2017-10-11T22:18:40Z

Test build #82647 has finished for PR 18460 at commit 52d19d3.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-10-11T23:04:47Z

It seems to be an unrelated Python failure.

FAIL: test_package_dependency_on_cluster (pyspark.sql.tests.HiveSparkSubmitTests)
Submit and test a script with a dependency on a Spark Package on a cluster

dongjoon-hyun · 2017-10-11T23:04:54Z

Retest this please.

SparkQA · 2017-10-12T01:53:25Z

Test build #82649 has finished for PR 18460 at commit 52d19d3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-10-12T19:29:46Z

Thank you, @gatorsmile . Now it's simplified further.

dongjoon-hyun · 2017-10-13T04:29:14Z

Hi, @gatorsmile and @cloud-fan .
Could you review this again, too?

gatorsmile · 2017-10-13T05:27:51Z

LGTM cc @cloud-fan

gatorsmile · 2017-10-13T05:28:32Z

BTW, we are unable to merge this to Spark 2.2 although this is a bug fix.

dongjoon-hyun · 2017-10-13T05:47:27Z

Thank you, @gatorsmile . Sure, I agree.

cloud-fan · 2017-10-13T16:35:27Z

LGTM, merging to master!

dongjoon-hyun · 2017-10-13T16:43:53Z

Thank you, @cloud-fan , @gatorsmile , and @viirya !!!

dongjoon-hyun changed the title ~~[SPARK-21247][SQL] Allow case-insensitive type equality in Set operation~~ [SPARK-21247][SQL] Type comparision should respect case-sensitive SQL conf Jul 5, 2017

Add test cases.

c72aa18

Update test cases.

67a037c

Address comments.

52d19d3

dongjoon-hyun mentioned this pull request Oct 13, 2017

[SPARK-14387][SPARK-16628][SPARK-18355][SQL] Use Spark schema to read ORC table instead of ORC file schema #19470

Closed

asfgit closed this in 6412ea1 Oct 13, 2017

dongjoon-hyun deleted the SPARK-21247 branch October 13, 2017 16:43
