[SPARK-26402][SQL] Accessing nested fields with different cases in case insensitive mode #23353

dbtsai · 2018-12-19T18:38:33Z

What changes were proposed in this pull request?

GetStructField expressions that differ only in their optional names should be semantically equal. We will use this as a building block for comparing the nested fields used in plans that the Catalyst optimizer optimizes.

This PR also fixes the bug below, in which accessing nested fields with different cases in case-insensitive mode results in an AnalysisException.

sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")

which currently fails with

org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function

as @cloud-fan pointed out.
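
To illustrate the idea, here is a minimal sketch of name-stripping canonicalization on a toy expression tree. It does not use Spark: `ToyCanonicalize`, `Expr`, `Attr`, and `GetField` are hypothetical stand-ins for illustration only, and the real change lives in Catalyst's `Canonicalize`.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark's classes.
object ToyCanonicalize {
  sealed trait Expr
  final case class Attr(exprId: Long) extends Expr
  // `name` is only a display hint; `ordinal` is what identifies the field.
  final case class GetField(child: Expr, ordinal: Int, name: Option[String]) extends Expr

  // Drop the optional name at every nesting level, keeping child and ordinal.
  def canonicalize(e: Expr): Expr = e match {
    case GetField(child, ordinal, _) => GetField(canonicalize(child), ordinal, None)
    case other => other
  }

  // Two expressions are semantically equal when their canonical forms match.
  def semanticEquals(a: Expr, b: Expr): Boolean = canonicalize(a) == canonicalize(b)
}
```

In this sketch, `GetField(s, 0, Some("I"))` and `GetField(s, 0, Some("i"))` differ under plain `==` but compare equal through `semanticEquals`, which is the property the GROUP BY check needs.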

How was this patch tested?

New tests are added.

dbtsai · 2018-12-19T18:48:22Z

Cc @dongjoon-hyun @viirya @gatorsmile @cloud-fan

gatorsmile · 2018-12-19T18:55:14Z

Could we also have an end-to-end test case to show it?

dongjoon-hyun · 2018-12-19T19:42:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala

@@ -41,6 +41,7 @@ object Canonicalize {
  private[expressions] def ignoreNamesTypes(e: Expression): Expression = e match {
    case a: AttributeReference =>
      AttributeReference("none", a.dataType.asNullable)(exprId = a.exprId)
+    case GetStructField(child, ordinal, _) => GetStructField(child, ordinal, None)


+1 for this canonicalization.

dbtsai · 2018-12-19T22:25:19Z

@gatorsmile I added an end-to-end test. Let me know what you think.

SparkQA · 2018-12-19T22:29:12Z

Test build #100314 has finished for PR 23353 at commit 43351df.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-12-20T01:03:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala

@@ -41,6 +41,7 @@ object Canonicalize {
  private[expressions] def ignoreNamesTypes(e: Expression): Expression = e match {
    case a: AttributeReference =>
      AttributeReference("none", a.dataType.asNullable)(exprId = a.exprId)
+    case GetStructField(child, ordinal, Some(_)) => GetStructField(child, ordinal, None)


This no longer precisely matches the comment on ignoreNamesTypes. It would be better to update the comment accordingly.

/** Remove names and nullability from types. */

I can change it to and / or.

The comment of Canonicalize says:

The following rules are applied: * - Names and nullability hints for [[org.apache.spark.sql.types.DataType]]s are stripped. ...

It also needs to be updated.

/** Remove names and nullability from types. */

Actually, after this change it no longer applies only to types.

Thanks. I re-wrote it a bit. Should look okay now.

SparkQA · 2018-12-20T03:07:43Z

Test build #100323 has finished for PR 23353 at commit 4f21a36.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-12-20T03:13:14Z

Test build #100321 has finished for PR 23353 at commit a5998bb.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class CanonicalizeSuite extends SparkFunSuite with ExpressionEvalHelper with PlanTest

SparkQA · 2018-12-20T05:07:03Z

Test build #100326 has finished for PR 23353 at commit a22d13e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-12-20T05:13:26Z

Test build #100325 has finished for PR 23353 at commit 5f1cc66.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-12-20T07:48:42Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala

+    assert(fieldA1.semanticEquals(fieldA2))
+
+    // End-to-end test case
+    val testRelation = LocalRelation('a.int)


This is not a real end-to-end test...

How about adding the following test to SQLQuerySuite?

sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")

Currently it fails with

org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function

This one makes sense, and is addressed by this PR.

Then can we remove this part, i.e. the code between L89 and L99?

This test can be part of BinaryComparisonSimplificationSuite for SimplifyBinaryComparison. So far, there is no test case for struct types in BinaryComparisonSimplificationSuite. With this PR, SimplifyBinaryComparison can remove s.I <=> s.i.

I'm curious: is that removed too when case-sensitive mode is turned on?

It will fail at name resolution.

@viirya I added a test to show that it will fail in case-sensitive mode.
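
As a rough illustration of why the case-sensitive query never reaches the optimizer, here is a toy sketch of field-name resolution. `ToyResolution`, `Resolver`, and `fieldOrdinal` are hypothetical names for this example, not Spark's analyzer code; the only assumption carried over from the discussion above is that resolution compares names either exactly or ignoring case.

```scala
// Toy model only: illustrative names, not Spark's analyzer implementation.
object ToyResolution {
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Map a requested field name to its ordinal in the struct.
  // None models a failure at name resolution.
  def fieldOrdinal(fields: Seq[String], requested: String, resolver: Resolver): Option[Int] = {
    val idx = fields.indexWhere(f => resolver(f, requested))
    if (idx >= 0) Some(idx) else None
  }
}
```

With a struct whose only field is `i`, looking up `I` yields an ordinal under the case-insensitive resolver and nothing under the case-sensitive one, so in this model `s.I` and `s.i` only ever get compared in case-insensitive mode.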

dongjoon-hyun · 2018-12-20T22:28:37Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

+  test("SPARK-26402: GetStructField with different names are semantically equal") {
+    sql("create table t (s struct<i: Int>) using json")
+    sql("select s.I from t group by s.i")
+  }


dongjoon-hyun · 2018-12-20T22:37:48Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

@@ -2330,4 +2330,8 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
    }
  }

+  test("SPARK-26402: GetStructField with different names are semantically equal") {
+    sql("create table t (s struct<i: Int>) using json")
+    sql("select s.I from t group by s.i")


withTable?

withTable("t") {
  sql("create table t (s struct<i: Int>) using json")
  sql("select s.I from t group by s.i")
}

dongjoon-hyun · 2018-12-20T22:38:54Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala

@@ -37,10 +38,11 @@ object Canonicalize {
    expressionReorder(ignoreNamesTypes(e))
  }

-  /** Remove names and nullability from types. */
+  /** Remove names and nullability from types, and names from `GetStructField` */


nit. Ending with . will be more consistent with the other comments around this.

dongjoon-hyun · 2018-12-20T22:39:55Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala

@@ -2330,4 +2330,8 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
    }
  }

+  test("SPARK-26402: GetStructField with different names are semantically equal") {


Shall we move this to org.apache.spark.sql.SQLQuerySuite?

My bad. Two files have the same name. Moved to the right one. Thanks.

SparkQA · 2018-12-21T00:28:05Z

Test build #100347 has finished for PR 23353 at commit 2a4ec20.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait CreateHiveTableAsSelectBase extends DataWritingCommand
case class CreateHiveTableAsSelectCommand(
case class OptimizedCreateHiveTableAsSelectCommand(

viirya · 2018-12-21T00:36:57Z

retest this please.

dongjoon-hyun · 2018-12-21T00:38:13Z

Er, the test on the last commit is already running; Test build #100350 has started.

viirya · 2018-12-21T00:42:52Z

Oh, I missed it. ha.

cloud-fan · 2018-12-21T02:13:58Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+  test("SPARK-26402: GetStructField with different names are semantically equal") {
+    withTable("t") {
+      sql("create table t (s struct<i: Int>) using json")
+      sql("select s.I from t group by s.i")


It's good practice to always check the result. How about checkAnswer(sql("select s.I from t group by s.i"), Nil)?

Yes, that's better.

SparkQA · 2018-12-21T03:19:00Z

Test build #100350 has finished for PR 23353 at commit 5273c3c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-12-21T03:55:08Z

Test build #100348 has finished for PR 23353 at commit 9481785.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-12-21T04:29:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Canonicalize.scala

@@ -26,6 +26,7 @@ package org.apache.spark.sql.catalyst.expressions
 *
 * The following rules are applied:
 *  - Names and nullability hints for [[org.apache.spark.sql.types.DataType]]s are stripped.
+ *  - Names for [[org.apache.spark.sql.catalyst.expressions.GetStructField]] are stripped.


[[org.apache.spark.sql.catalyst.expressions.GetStructField]] -> [[GetStructField]]? GetStructField is in the same package.

SparkQA · 2018-12-21T04:32:59Z

Test build #100353 has finished for PR 23353 at commit 5273c3c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-12-21T20:36:14Z

...test/scala/org/apache/spark/sql/catalyst/optimizer/BinaryComparisonSimplificationSuite.scala

+        .analyze
+
+    val optimized = Optimize.execute(originalQuery)
+    val correctAnswer = nonNullableRelation.where(Literal.TrueLiteral).analyze


This is removed eventually. To pass the test, we need to remove where(Literal.TrueLiteral) here.

Oh, this has BooleanSimplification. Removed it.

SparkQA · 2018-12-21T21:51:26Z

Test build #100370 has finished for PR 23353 at commit 81f5e5e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-12-21T23:11:17Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala

+    // GetStructField with different names are semantically equal
+    val fieldB1 = GetStructField(
+      AttributeReference("data1", structType, false)(expId, qualifier),
+      0, Some("b1"))


Sorry for nit-picking. This should be a1 (and a2 at line 67) because this is the first level.
Consequently, fieldB1 -> fieldA1?

Thanks! Done.

val fieldA1 = GetStructField(
  AttributeReference("data1", structType, false)(expId, qualifier),
  0, Some("a1"))
val fieldA2 = GetStructField(
  AttributeReference("data2", structType, false)(expId, qualifier),
  0, Some("a2"))
assert(fieldA1.semanticEquals(fieldA2))

val fieldB1 = GetStructField(
  GetStructField(
    AttributeReference("data1", structType, false)(expId, qualifier),
    0, Some("a1")),
  0, Some("b1"))
val fieldB2 = GetStructField(
  GetStructField(
    AttributeReference("data2", structType, false)(expId, qualifier),
    0, Some("a2")),
  0, Some("b2"))
assert(fieldB1.semanticEquals(fieldB2))

@dongjoon-hyun I put the ordering wrong. Addressed as you suggested. Thanks!

dongjoon-hyun · 2018-12-21T23:58:05Z

+1, LGTM except for a minor comment on a misleading test case.

@gatorsmile, could you review this again?

SparkQA · 2018-12-22T00:45:06Z

Test build #100372 has finished for PR 23353 at commit 82fa2e1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-12-22T04:06:12Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CanonicalizeSuite.scala

+    val fieldA2 = GetStructField(
+      AttributeReference("data2", structType, false)(expId, qualifier),
+      0, Some("a2"))
+    assert(fieldB1.semanticEquals(fieldB2))


This line will fail to build.

SparkQA · 2018-12-22T04:11:28Z

Test build #100387 has finished for PR 23353 at commit f7a64cf.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-12-22T06:21:07Z

Test build #100379 has finished for PR 23353 at commit 1ce6487.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-12-22T07:53:13Z

Test build #100388 has finished for PR 23353 at commit e1da199.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…se insensitive mode

## What changes were proposed in this pull request?

GetStructField with different optional names should be semantically equal. We will use this as building block to compare the nested fields used in the plans to be optimized by catalyst optimizer.

This PR also fixes a bug below that accessing nested fields with different cases in case insensitive mode will result `AnalysisException`.

```
sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")
```

which is currently failing

```
org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function
```

as cloud-fan pointed out.

## How was this patch tested?

New tests are added.

Closes #23353 from dbtsai/nestedEqual.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: DB Tsai <dbtsai@dbtsai.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a5a24d9)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

dongjoon-hyun · 2018-12-22T18:36:22Z

Thank you all. Merged to master/branch-2.4

dongjoon-hyun · 2018-12-22T18:40:02Z

@cloud-fan and @gatorsmile, I thought the errors on GROUP BY queries were obvious bugs and that this PR would be helpful for branch-2.4. If this should not be in branch-2.4, please feel free to revert it there.

HyukjinKwon · 2018-12-24T03:07:05Z

Looks fine to me.

Canonicalization on GetStructField

43351df

dongjoon-hyun reviewed Dec 19, 2018

dbtsai added 2 commits December 19, 2018 15:18

Added end-to-end test

a5998bb

Minor

4f21a36

viirya reviewed Dec 20, 2018

dbtsai added 2 commits December 19, 2018 17:12

address feedback

5f1cc66

addressed feedback

a22d13e

cloud-fan reviewed Dec 20, 2018

dbtsai added 2 commits December 20, 2018 14:21

Added one end-to-end test

cd14e14

Merge branch 'master' into nestedEqual

2a4ec20

dongjoon-hyun reviewed Dec 20, 2018

address feedback

5273c3c

dbtsai force-pushed the nestedEqual branch from 9481785 to 5273c3c Compare December 20, 2018 23:24

viirya approved these changes Dec 21, 2018

cloud-fan reviewed Dec 21, 2018

dongjoon-hyun reviewed Dec 21, 2018

dbtsai changed the title ~~[SPARK-26402][SQL] Canonicalization on GetStructField~~ [SPARK-26402][SQL] Accessing nested fields with different cases in case insensitive mode Dec 21, 2018

address feedback

81f5e5e

dongjoon-hyun reviewed Dec 21, 2018

fix a bug

82fa2e1

dongjoon-hyun reviewed Dec 21, 2018

dbtsai added 2 commits December 21, 2018 18:21

address feedback

1ce6487

address feedback

f7a64cf

dongjoon-hyun reviewed Dec 22, 2018

minor

e1da199

dongjoon-hyun approved these changes Dec 22, 2018

asfgit closed this in a5a24d9 Dec 22, 2018

beliefer mentioned this pull request Jan 22, 2020

[CORE][ElasticSearch][Mongo] Support discover nested type for ElasticSearch. Qihoo360/XSQL#70

Merged
