[SPARK-26402][SQL] Accessing nested fields with different cases in case insensitive mode #23353

Closed
wants to merge 13 commits into from

dbtsai
Member

@dbtsai dbtsai commented Dec 19, 2018

What changes were proposed in this pull request?

GetStructField expressions that differ only in their optional names should be semantically equal. We will use this as a building block for comparing the nested fields used in plans that the Catalyst optimizer processes.

This PR also fixes the bug below: accessing a nested field with different letter cases in case-insensitive mode results in an AnalysisException.

sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")

which currently fails with

org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function

as @cloud-fan pointed out.
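
To illustrate the idea behind the change, here is a minimal, self-contained sketch. It is a toy model, not Spark's code: `Expr`, `Attr`, and `GetField` are hypothetical stand-ins for Catalyst's `Expression`, `AttributeReference`, and `GetStructField`, and `canonicalize` only mimics the name-stripping step described above.

```scala
// Toy model of name-insensitive canonicalization; NOT Spark's real classes.
object CanonicalizeSketch {
  sealed trait Expr
  // Stand-in for AttributeReference: the exprId identifies it, the name is cosmetic.
  final case class Attr(name: String, exprId: Long) extends Expr
  // Stand-in for GetStructField(child, ordinal, name): the ordinal selects the field.
  final case class GetField(child: Expr, ordinal: Int, name: Option[String]) extends Expr

  // Strip cosmetic names so that only exprId and ordinal remain significant.
  def canonicalize(e: Expr): Expr = e match {
    case Attr(_, id)                 => Attr("none", id)
    case GetField(child, ordinal, _) => GetField(canonicalize(child), ordinal, None)
  }

  // Two expressions are semantically equal when their canonical forms match.
  def semanticEquals(a: Expr, b: Expr): Boolean = canonicalize(a) == canonicalize(b)

  def main(args: Array[String]): Unit = {
    val s = Attr("s", exprId = 1L)
    val upper = GetField(s, 0, Some("I")) // models s.I
    val lower = GetField(s, 0, Some("i")) // models s.i
    println(upper == lower)               // false: plain equality still sees the names
    println(semanticEquals(upper, lower)) // true: names are ignored after canonicalization
  }
}
```

In this sketch the ordinal still takes part in the comparison, so two accesses to different fields of the same struct stay distinct.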

How was this patch tested?

New tests are added.

@dbtsai
Member Author

dbtsai commented Dec 19, 2018

@gatorsmile
Member

Could we also have an end-to-end test case to show it?

@@ -41,6 +41,7 @@ object Canonicalize {
private[expressions] def ignoreNamesTypes(e: Expression): Expression = e match {
case a: AttributeReference =>
AttributeReference("none", a.dataType.asNullable)(exprId = a.exprId)
case GetStructField(child, ordinal, _) => GetStructField(child, ordinal, None)
Member

+1 for this canonicalization.

@dbtsai
Member Author

dbtsai commented Dec 19, 2018

@gatorsmile I added an end-to-end test. Let me know what you think.

@SparkQA

SparkQA commented Dec 19, 2018

Test build #100314 has finished for PR 23353 at commit 43351df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -41,6 +41,7 @@ object Canonicalize {
private[expressions] def ignoreNamesTypes(e: Expression): Expression = e match {
case a: AttributeReference =>
AttributeReference("none", a.dataType.asNullable)(exprId = a.exprId)
case GetStructField(child, ordinal, Some(_)) => GetStructField(child, ordinal, None)
Member

This no longer precisely matches the comment on ignoreNamesTypes. It would be better to update that comment accordingly.

Member Author

/** Remove names and nullability from types. */

I can change it to and / or.

Member

The comment of Canonicalize says:

The following rules are applied:
 *  - Names and nullability hints for [[org.apache.spark.sql.types.DataType]]s are stripped.
...

It also needs to be updated.

Member

/** Remove names and nullability from types. */

Actually after this change it is not only for types.

Member Author

Thanks. I re-wrote it a bit. Should look okay now.

@SparkQA

SparkQA commented Dec 20, 2018

Test build #100323 has finished for PR 23353 at commit 4f21a36.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 20, 2018

Test build #100321 has finished for PR 23353 at commit a5998bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CanonicalizeSuite extends SparkFunSuite with ExpressionEvalHelper with PlanTest

@SparkQA

SparkQA commented Dec 20, 2018

Test build #100326 has finished for PR 23353 at commit a22d13e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 20, 2018

Test build #100325 has finished for PR 23353 at commit 5f1cc66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(fieldA1.semanticEquals(fieldA2))

// End-to-end test case
val testRelation = LocalRelation('a.int)
Contributor

This is not a real end-to-end test...

How about adding the following test to SQLQuerySuite?

sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")

Currently it fails with

org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function

Member Author

This one makes sense, and is addressed by this PR.
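
As background on why the query above trips the GROUP BY check, here is a rough, self-contained sketch. It assumes (as I understand the analyzer, not quoted from it) that each select expression is accepted if it is semantically equal to some grouping expression, and otherwise its children are checked until a bare attribute is reached and reported. All types and function names here are hypothetical stand-ins, not Spark's API.

```scala
// Toy model of the "neither present in the group by" check; NOT Spark's real code.
object GroupByCheckSketch {
  sealed trait Expr { def children: Seq[Expr] }
  final case class Attr(name: String, exprId: Long) extends Expr {
    def children: Seq[Expr] = Nil
  }
  final case class GetField(child: Expr, ordinal: Int, name: Option[String]) extends Expr {
    def children: Seq[Expr] = Seq(child)
  }

  // Comparison before the fix: the optional field name takes part in equality.
  def strictEquals(a: Expr, b: Expr): Boolean = a == b

  // Comparison after the fix: field names are ignored, ordinals still matter.
  def nameBlindEquals(a: Expr, b: Expr): Boolean = (a, b) match {
    case (Attr(_, x), Attr(_, y))                   => x == y
    case (GetField(c1, o1, _), GetField(c2, o2, _)) => o1 == o2 && nameBlindEquals(c1, c2)
    case _                                          => false
  }

  // Left(message) when `e` is not covered by any grouping expression.
  def check(e: Expr, grouping: Seq[Expr], same: (Expr, Expr) => Boolean): Either[String, Unit] =
    if (grouping.exists(g => same(g, e))) Right(())
    else e match {
      case Attr(name, _) =>
        Left(s"expression '$name' is neither present in the group by, nor is it an aggregate function")
      case other =>
        // No match at this level: every child must be covered instead.
        other.children.foldLeft[Either[String, Unit]](Right(())) { (acc, c) =>
          acc.flatMap(_ => check(c, grouping, same))
        }
    }
}
```

With `s.I` in the select list and `s.i` in the grouping list, the strict comparison finds no match, descends to the attribute `s`, and reports it, which mirrors the shape of the error message above. The name-blind comparison accepts the expression at the top level.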

Contributor

Then can we remove this part, i.e. the code between L89 and L99?

Member

+1

Member

@dongjoon-hyun dongjoon-hyun Dec 21, 2018

This test could be part of BinaryComparisonSimplificationSuite, which covers SimplifyBinaryComparison. So far, that suite has no test case for struct types. With this PR, SimplifyBinaryComparison can remove s.I <=> s.i.

Member

I'm curious: is it also removed when case-sensitive mode is turned on?

Member

It will fail at name resolution.

Member Author

@viirya I added a test showing that in case-sensitive mode it fails.
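
A small sketch of why the two modes behave differently at name resolution. It is a toy model under stated assumptions: `Resolver`, `caseSensitive`, `caseInsensitive`, and `resolveField` are illustrative names defined here, and a struct is reduced to a list of field names.

```scala
// Toy model of field-name resolution; the names here are illustrative only.
object ResolverSketch {
  // A resolver decides whether a declared name and a requested name match.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Resolve a requested field name to its ordinal in the struct, if any field matches.
  def resolveField(fields: Seq[String], requested: String, resolver: Resolver): Option[Int] = {
    val idx = fields.indexWhere(f => resolver(f, requested))
    if (idx >= 0) Some(idx) else None
  }

  def main(args: Array[String]): Unit = {
    val structFields = Seq("i") // models struct<i: Int>
    println(resolveField(structFields, "I", caseInsensitive)) // Some(0)
    println(resolveField(structFields, "I", caseSensitive))   // None: resolution fails
  }
}
```

In case-insensitive mode both `s.I` and `s.i` resolve to ordinal 0 and differ only in the recorded name, which is the situation this PR addresses. In case-sensitive mode `s.I` never resolves, so the comparison question does not arise.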

test("SPARK-26402: GetStructField with different names are semantically equal") {
  sql("create table t (s struct<i: Int>) using json")
  sql("select s.I from t group by s.i")
}
Member

Nice!

@@ -2330,4 +2330,8 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}

test("SPARK-26402: GetStructField with different names are semantically equal") {
sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")
Member

withTable?

    withTable("t") {
      sql("create table t (s struct<i: Int>) using json")
      sql("select s.I from t group by s.i")
    }
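
The point of the wrapper is that the table is dropped even when the test body throws, so it cannot leak into later tests. A minimal sketch of that pattern follows. This `withTable` is a hypothetical stand-in with its own signature (it takes an explicit `drop` callback), not the helper from Spark's test utilities.

```scala
// Toy loan-pattern helper; the signature is an assumption made for this sketch.
object WithTableSketch {
  // Runs `body`, then drops every named table, whether or not `body` threw.
  def withTable[A](names: String*)(drop: String => Unit)(body: => A): A =
    try body
    finally names.foreach(drop)

  def main(args: Array[String]): Unit = {
    val result = withTable("t")(name => println(s"dropping $name")) {
      // A real test would create and query the table here.
      "query ran"
    }
    println(result)
  }
}
```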

@@ -37,10 +38,11 @@ object Canonicalize {
expressionReorder(ignoreNamesTypes(e))
}

- /** Remove names and nullability from types. */
+ /** Remove names and nullability from types, and names from `GetStructField` */
Member

nit. Ending with . will be more consistent with the other comments around this.

@@ -2330,4 +2330,8 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
}
}

test("SPARK-26402: GetStructField with different names are semantically equal") {
Member

Shall we move this to org.apache.spark.sql.SQLQuerySuite?

Member Author

My bad. Two files have the same name. Moved to the right one. Thanks.

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100347 has finished for PR 23353 at commit 2a4ec20.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CreateHiveTableAsSelectBase extends DataWritingCommand
  • case class CreateHiveTableAsSelectCommand(
  • case class OptimizedCreateHiveTableAsSelectCommand(

@viirya
Member

viirya commented Dec 21, 2018

retest this please.

@dongjoon-hyun
Member

Er, the test on the last commit is already running; Test build #100350 has started.

@viirya
Member

viirya commented Dec 21, 2018

Oh, I missed it. ha.

test("SPARK-26402: GetStructField with different names are semantically equal") {
withTable("t") {
sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")
Contributor

It's good practice to always check the result. How about checkAnswer(sql("select s.I from t group by s.i"), Nil)?

Member

+1

Member

Yes, that's better.

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100350 has finished for PR 23353 at commit 5273c3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100348 has finished for PR 23353 at commit 9481785.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -26,6 +26,7 @@ package org.apache.spark.sql.catalyst.expressions
*
* The following rules are applied:
* - Names and nullability hints for [[org.apache.spark.sql.types.DataType]]s are stripped.
* - Names for [[org.apache.spark.sql.catalyst.expressions.GetStructField]] are stripped.
Member

[[org.apache.spark.sql.catalyst.expressions.GetStructField]] -> [[GetStructField]]? GetStructField is in the same package.

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100353 has finished for PR 23353 at commit 5273c3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai dbtsai changed the title from "[SPARK-26402][SQL] Canonicalization on GetStructField" to "[SPARK-26402][SQL] Accessing nested fields with different cases in case insensitive mode" on Dec 21, 2018

.analyze

val optimized = Optimize.execute(originalQuery)
val correctAnswer = nonNullableRelation.where(Literal.TrueLiteral).analyze
Member

This is removed eventually. To pass the test, we need to remove where(Literal.TrueLiteral) here.

Member Author

Oh, this has BooleanSimplification. Removed it.
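
For readers following the optimizer test, here is a hedged, self-contained sketch of the rewrite it exercises, assuming the rule replaces a null-safe equality with a true literal when both operands are semantically equal. The types are hypothetical stand-ins for Catalyst expressions; what happens to the surrounding filter afterwards is outside this sketch.

```scala
// Toy model of simplifying `a <=> b` when both sides are semantically equal.
object SimplifySketch {
  sealed trait Expr
  final case class Attr(name: String, exprId: Long) extends Expr
  final case class GetField(child: Expr, ordinal: Int, name: Option[String]) extends Expr
  // Stand-in for the null-safe equality operator `<=>`.
  final case class EqualNullSafe(left: Expr, right: Expr) extends Expr
  case object TrueLit extends Expr

  // Strip cosmetic names so only exprId and ordinal take part in comparison.
  def canonicalize(e: Expr): Expr = e match {
    case Attr(_, id)         => Attr("none", id)
    case GetField(c, o, _)   => GetField(canonicalize(c), o, None)
    case EqualNullSafe(l, r) => EqualNullSafe(canonicalize(l), canonicalize(r))
    case TrueLit             => TrueLit
  }

  // `x <=> x` is always true, even for null, so it can be replaced by a literal.
  def simplify(e: Expr): Expr = e match {
    case EqualNullSafe(l, r) if canonicalize(l) == canonicalize(r) => TrueLit
    case other                                                     => other
  }
}
```

Applied to `s.I <=> s.i` (same attribute, same ordinal, different recorded names), `simplify` returns the true literal. Two different ordinals are left unchanged.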

@SparkQA

SparkQA commented Dec 21, 2018

Test build #100370 has finished for PR 23353 at commit 81f5e5e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// GetStructField with different names are semantically equal
val fieldB1 = GetStructField(
AttributeReference("data1", structType, false)(expId, qualifier),
0, Some("b1"))
Member

@dongjoon-hyun dongjoon-hyun Dec 21, 2018

Sorry for nit-picking. This should be a1 (and a2 at line 67) because this is the first level.
Consequently, fieldB1 -> fieldA1?

Member Author

Thanks! Done.

Member

    val fieldA1 = GetStructField(
      AttributeReference("data1", structType, false)(expId, qualifier),
      0, Some("a1"))
    val fieldA2 = GetStructField(
      AttributeReference("data2", structType, false)(expId, qualifier),
      0, Some("a2"))
    assert(fieldA1.semanticEquals(fieldA2))

    val fieldB1 = GetStructField(
      GetStructField(
        AttributeReference("data1", structType, false)(expId, qualifier),
        0, Some("a1")),
      0, Some("b1"))
    val fieldB2 = GetStructField(
      GetStructField(
        AttributeReference("data2", structType, false)(expId, qualifier),
        0, Some("a2")),
      0, Some("b2"))
    assert(fieldB1.semanticEquals(fieldB2))

Member Author

@dongjoon-hyun I got the ordering wrong. Addressed as you suggested. Thanks!

@dongjoon-hyun
Member

+1, LGTM except for a minor comment on a misleading test case.

@gatorsmile, could you review this again?

@SparkQA

SparkQA commented Dec 22, 2018

Test build #100372 has finished for PR 23353 at commit 82fa2e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val fieldA2 = GetStructField(
AttributeReference("data2", structType, false)(expId, qualifier),
0, Some("a2"))
assert(fieldB1.semanticEquals(fieldB2))
Member

This line will fail to build.

@SparkQA

SparkQA commented Dec 22, 2018

Test build #100387 has finished for PR 23353 at commit f7a64cf.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 22, 2018

Test build #100379 has finished for PR 23353 at commit 1ce6487.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 22, 2018

Test build #100388 has finished for PR 23353 at commit e1da199.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Dec 22, 2018
…se insensitive mode

## What changes were proposed in this pull request?

GetStructField expressions that differ only in their optional names should be semantically equal. We will use this as a building block for comparing the nested fields used in plans that the Catalyst optimizer processes.

This PR also fixes the bug below: accessing a nested field with different letter cases in case-insensitive mode results in an `AnalysisException`.

```
sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")
```
which currently fails with
```
org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function
```
as cloud-fan pointed out.

## How was this patch tested?

New tests are added.

Closes #23353 from dbtsai/nestedEqual.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: DB Tsai <dbtsai@dbtsai.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a5a24d9)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun
Member

Thank you all. Merged to master/branch-2.4

@asfgit asfgit closed this in a5a24d9 Dec 22, 2018
@dongjoon-hyun
Member

@cloud-fan and @gatorsmile, I thought the errors on GROUP BY queries were obvious bugs and that this PR would be helpful for branch-2.4. If it should not be in branch-2.4, please feel free to revert it there.

@HyukjinKwon
Member

Looks fine to me.

holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
Closes apache#23353 from dbtsai/nestedEqual.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: DB Tsai <dbtsai@dbtsai.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#23353 from dbtsai/nestedEqual.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: DB Tsai <dbtsai@dbtsai.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Jul 23, 2019
Closes apache#23353 from dbtsai/nestedEqual.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: DB Tsai <dbtsai@dbtsai.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a5a24d9)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
kai-chi pushed a commit to kai-chi/spark that referenced this pull request Aug 1, 2019
Closes apache#23353 from dbtsai/nestedEqual.

Lead-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: DB Tsai <dbtsai@dbtsai.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a5a24d9)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

7 participants