Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21247][SQL] Type comparison should respect case-sensitive SQL conf #18460

Closed
wants to merge 9 commits into from
Closed

[SPARK-21247][SQL] Type comparison should respect case-sensitive SQL conf #18460

wants to merge 9 commits into from

Conversation

dongjoon-hyun
Copy link
Member

@dongjoon-hyun dongjoon-hyun commented Jun 28, 2017

What changes were proposed in this pull request?

This is an effort to reduce the difference between Hive and Spark. Spark supports case-sensitivity in columns. Especially, for Struct types, with spark.sql.caseSensitive=true, the following is supported.

scala> sql("select named_struct('a', 1, 'A', 2).a").show
+--------------------------+
|named_struct(a, 1, A, 2).a|
+--------------------------+
|                         1|
+--------------------------+

scala> sql("select named_struct('a', 1, 'A', 2).A").show
+--------------------------+
|named_struct(a, 1, A, 2).A|
+--------------------------+
|                         2|
+--------------------------+

And vice versa, with spark.sql.caseSensitive=false, the following is supported.

scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show
+--------------------+--------------------+
|named_struct(a, 1).A|named_struct(A, 1).a|
+--------------------+--------------------+
|                   1|                   1|
+--------------------+--------------------+

However, types are considered different. For example, SET operations fail.

scala> sql("SELECT named_struct('a',1) union all (select named_struct('A',2))").show
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<A:int> <> struct<a:int> at the first column of the second table;;
'Union
:- Project [named_struct(a, 1) AS named_struct(a, 1)#57]
:  +- OneRowRelation$
+- Project [named_struct(A, 2) AS named_struct(A, 2)#58]
   +- OneRowRelation$

This PR aims to support case-insensitive type equality. For example, in Set operation, the above operation succeed when spark.sql.caseSensitive=false.

scala> sql("SELECT named_struct('a',1) union all (select named_struct('A',2))").show
+------------------+
|named_struct(a, 1)|
+------------------+
|               [1]|
|               [2]|
+------------------+

How was this patch tested?

Pass the Jenkins with a newly add test case.

@SparkQA
Copy link

SparkQA commented Jun 29, 2017

Test build #78831 has finished for PR 18460 at commit f600448.

  • This patch fails due to an unknown error code, -10.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Jun 29, 2017

Test build #78845 has finished for PR 18460 at commit f600448.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @hvanhovell .
Could you review this PR?

@dongjoon-hyun
Copy link
Member Author

Hi, @gatorsmile .
Could you review this PR?

@dongjoon-hyun
Copy link
Member Author

Hi, @cloud-fan .
Could you review this, too?

@@ -79,8 +80,12 @@ abstract class DataType extends AbstractDataType {
* Check if `this` and `other` are the same data type when ignoring nullability
* (`StructField.nullable`, `ArrayType.containsNull`, and `MapType.valueContainsNull`).
*/
private[spark] def sameType(other: DataType): Boolean =
DataType.equalsIgnoreNullability(this, other)
private[spark] def sameType(other: DataType, isCaseSensitive: Boolean = true): Boolean =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should not consider field names in sameType, @gatorsmile what do you think?

Copy link
Member Author

@dongjoon-hyun dongjoon-hyun Jul 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that sounds to be a big change. Is there any side-effect to users with JSON and Parquet?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May we have some cases that we do care about field names in sameType? To completely ignore it in sameType seems risky?

@dongjoon-hyun
Copy link
Member Author

Hi, @cloud-fan and @gatorsmile .
Could you review this PR again?
I simplified this PR with SQLConf.get.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21247][SQL] Allow case-insensitive type equality in Set operation [SPARK-21247][SQL] Type comparision should respect case-sensitive SQL conf Jul 5, 2017
@SparkQA
Copy link

SparkQA commented Jul 5, 2017

Test build #79192 has finished for PR 18460 at commit b41a6b4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please .

@SparkQA
Copy link

SparkQA commented Jul 5, 2017

Test build #79201 has finished for PR 18460 at commit b41a6b4.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please

@SparkQA
Copy link

SparkQA commented Jul 5, 2017

Test build #79226 has finished for PR 18460 at commit b41a6b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @cloud-fan and @gatorsmile .
Could you review this PR when you have sometime?

if (SQLConf.get.caseSensitiveAnalysis) {
DataType.equalsIgnoreNullability(this, other)
} else {
DataType.equalsIgnoreCaseAndNullability(this, other)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we already have DataType.equalsIgnoreCaseAndNullability, we can use this according to the SQL configuration.

@dongjoon-hyun
Copy link
Member Author

Hi, @hvanhovell .
Could you review this PR about case-sensitive/insensitive Type comparision?

@@ -144,6 +144,8 @@ object TypeCoercion {
.orElse((t1, t2) match {
case (ArrayType(et1, containsNull1), ArrayType(et2, containsNull2)) =>
findWiderTypeForTwo(et1, et2).map(ArrayType(_, containsNull1 || containsNull2))
case (st1 @ StructType(_), st2 @ StructType(_)) if st1.sameType(st2) =>
Some(st1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should follow the ArrayType case and update the nullability.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for review, @cloud-fan . Sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we handle metadata?

case (st1 @ StructType(fields1), st2 @ StructType(fields2)) if st1.sameType(st2) =>
Some(StructType(fields1.zip(fields2).map { case (sf1, sf2) =>
val name = if (sf1.name == sf2.name) sf1.name else sf1.name.toLowerCase(Locale.ROOT)
val dataType = findWiderTypeForTwo(sf1.dataType, sf2.dataType).get
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is <i long> a wider type of <i int>? can we check with Hive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for making this confused.
I added the comment in the test.

StructType does not widen the types, but supports case-sensitive options.

This line are guarded by if st1.sameType(st2). So, we always have the same dataType.
The reason to use findWiderTypeForTwo is to get the final nested complex type with the new nullability.
Also, this function is findWiderTypeForTwo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Hive, it's the same.

hive> select * from t1 union all select * from t2;
FAILED: SemanticException 1:41 Schema of both sides of union should match: Column _c0 is of type struct<a:int> on first table and type struct<a:bigint> on second table. Error encountered near token 't2'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add the comment here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

@SparkQA
Copy link

SparkQA commented Jul 6, 2017

Test build #79275 has finished for PR 18460 at commit d3a9f73.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Rebased to the master to resolve conflicts.

@SparkQA
Copy link

SparkQA commented Jul 6, 2017

Test build #79295 has finished for PR 18460 at commit b46f067.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 7, 2017

Test build #79319 has finished for PR 18460 at commit 268367e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Jul 7, 2017

Test build #79322 has finished for PR 18460 at commit 268367e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Rebased to resolve conflicts.

@SparkQA
Copy link

SparkQA commented Jul 7, 2017

Test build #79338 has finished for PR 18460 at commit 7c9bc7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Hi, @cloud-fan and @gatorsmile and @viirya .
I updated the PR according to the comments.
Could you review this PR about type comparision when you have some time?

@dongjoon-hyun
Copy link
Member Author

Hi, @cloud-fan and @gatorsmile .
Please let me know if there is something to do more.
Thank you!

val name = if (sf1.name == sf2.name) sf1.name else sf1.name.toLowerCase(Locale.ROOT)
val dataType = findWiderTypeForTwo(sf1.dataType, sf2.dataType).get
StructField(name, dataType, nullable = sf1.nullable || sf2.nullable)
}))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also do this in findWiderTypeWithoutStringPromotionForTwo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again, @viirya . I updated it.

@dongjoon-hyun
Copy link
Member Author

The test cases are added. Thank you, @gatorsmile !

@SparkQA
Copy link

SparkQA commented Oct 6, 2017

Test build #82523 has finished for PR 18460 at commit c72aa18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 6, 2017

Test build #82527 has finished for PR 18460 at commit 67a037c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

When you have a chance, could you review this please, @gatorsmile ?

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Oct 8, 2017

Test build #82544 has finished for PR 18460 at commit 67a037c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Gentle ping~, @gatorsmile . :)

@dongjoon-hyun
Copy link
Member Author

Hi, @gatorsmile .
Could you review this?

// - Different nullabilities: `nullable` is true iff one of them is nullable.
val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT)
val dataType = findTightestCommonType(f1.dataType, f2.dataType).get
StructField(name, dataType, nullable = f1.nullable || f2.nullable)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we follow what we are doing for union/except/intersect? Always pick the name of the head one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the example,

      sql("SELECT 1 as a UNION ALL (SELECT 1 as A)").show()
      sql("SELECT 1 as A UNION ALL (SELECT 1 as a)").show()

Copy link
Member Author

@dongjoon-hyun dongjoon-hyun Oct 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR works as you want. This function is used to compare the equality only. BTW, for this function, it should use one of lower or upper case because it should be commutative.

scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").printSchema
root
 |-- named_struct(a, 1 AS `a`): struct (nullable = false)
 |    |-- a: integer (nullable = false)

scala> sql("SELECT struct(1 A) UNION ALL (SELECT struct(2 a))").printSchema
root
 |-- named_struct(A, 1 AS `A`): struct (nullable = false)
 |    |-- A: integer (nullable = false)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val name = if (f1.name == f2.name) f1.name else f1.name.toLowerCase(Locale.ROOT)

The above code changes the case, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, right. It's for commutativity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see TypeCoercionSuite.checkWidenType.

In order to use the first type name, we need to loosen this test helper function and to break the existing commutative assumption. I'm ok for that if you want.

@dongjoon-hyun
Copy link
Member Author

Please let me know if there is something to do more~ Thank you always, @gatorsmile .

@@ -131,14 +131,17 @@ class TypeCoercionSuite extends AnalysisTest {
widenFunc: (DataType, DataType) => Option[DataType],
t1: DataType,
t2: DataType,
expected: Option[DataType]): Unit = {
expected: Option[DataType],
isSymmetric: Boolean = true): Unit = {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile . I extended this function for using non-symmetric tests and addressed your comments.

@SparkQA
Copy link

SparkQA commented Oct 11, 2017

Test build #82647 has finished for PR 18460 at commit 52d19d3.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

It seems to be an irrelevant Python failure.

FAIL: test_package_dependency_on_cluster (pyspark.sql.tests.HiveSparkSubmitTests)
Submit and test a script with a dependency on a Spark Package on a cluster

@dongjoon-hyun
Copy link
Member Author

Retest this please.

@SparkQA
Copy link

SparkQA commented Oct 12, 2017

Test build #82649 has finished for PR 18460 at commit 52d19d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member Author

Thanks you, @gatorsmile . Now, it's simplified more.

@dongjoon-hyun
Copy link
Member Author

Hi, @gatorsmile and @cloud-fan .
Could you review this again, too?

@gatorsmile
Copy link
Member

LGTM cc @cloud-fan

@gatorsmile
Copy link
Member

BTW, we are unable to merge this to Spark 2.2 although this is a bug fix.

@dongjoon-hyun
Copy link
Member Author

Thank you, @gatorsmile . Sure, I agree.

@cloud-fan
Copy link
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 6412ea1 Oct 13, 2017
@dongjoon-hyun
Copy link
Member Author

Thank you, @cloud-fan , @gatorsmile , and @viirya !!!

@dongjoon-hyun dongjoon-hyun deleted the SPARK-21247 branch October 13, 2017 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants