-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-26402][SQL] Accessing nested fields with different cases in case insensitive mode #23353
Conversation
Could we also have an end-to-end test case to show it? |
@@ -41,6 +41,7 @@ object Canonicalize { | |||
private[expressions] def ignoreNamesTypes(e: Expression): Expression = e match { | |||
case a: AttributeReference => | |||
AttributeReference("none", a.dataType.asNullable)(exprId = a.exprId) | |||
case GetStructField(child, ordinal, _) => GetStructField(child, ordinal, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for this canonicalization.
@gatorsmile I added an end-to-end test. Let me know what you think. |
Test build #100314 has finished for PR 23353 at commit
|
@@ -41,6 +41,7 @@ object Canonicalize { | |||
private[expressions] def ignoreNamesTypes(e: Expression): Expression = e match { | |||
case a: AttributeReference => | |||
AttributeReference("none", a.dataType.asNullable)(exprId = a.exprId) | |||
case GetStructField(child, ordinal, Some(_)) => GetStructField(child, ordinal, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks not precisely matched the comments of ignoreNamesTypes
. It's better to change it accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** Remove names and nullability from types. */
I can change it to and / or
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment of Canonicalize
says:
The following rules are applied:
* - Names and nullability hints for [[org.apache.spark.sql.types.DataType]]s are stripped.
...
It's also needed to be update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** Remove names and nullability from types. */
Actually after this change it is not only for types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I re-wrote it a bit. Should look okay now.
Test build #100323 has finished for PR 23353 at commit
|
Test build #100321 has finished for PR 23353 at commit
|
Test build #100326 has finished for PR 23353 at commit
|
Test build #100325 has finished for PR 23353 at commit
|
assert(fieldA1.semanticEquals(fieldA2)) | ||
|
||
// End-to-end test case | ||
val testRelation = LocalRelation('a.int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a real end-to-end test...
How about add the following test to SQLQuerySuite?
sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")
currently it fials with
org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one makes sense, and is addressed by this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then can we remove this part? i.e. code between L89 to L99
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test can be a part of BinaryComparisonSimplificationSuite
for SimplifyBinaryComparison
. So far, there is no test case for struct
type in BinaryComparisonSimplificationSuite
. Since this PR, SimplifyBinaryComparison
can remove s.I <=> s.i
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious that is that removed too when case sensitive mode is turned on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will fail at name resolution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya I added a test to show in case insensitive mode, it will fail.
test("SPARK-26402: GetStructField with different names are semantically equal") { | ||
sql("create table t (s struct<i: Int>) using json") | ||
sql("select s.I from t group by s.i") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
@@ -2330,4 +2330,8 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { | |||
} | |||
} | |||
|
|||
test("SPARK-26402: GetStructField with different names are semantically equal") { | |||
sql("create table t (s struct<i: Int>) using json") | |||
sql("select s.I from t group by s.i") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
withTable
?
withTable("t") {
sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")
}
@@ -37,10 +38,11 @@ object Canonicalize { | |||
expressionReorder(ignoreNamesTypes(e)) | |||
} | |||
|
|||
/** Remove names and nullability from types. */ | |||
/** Remove names and nullability from types, and names from `GetStructField` */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit. Ending with .
will be more consistent with the other comments around this.
@@ -2330,4 +2330,8 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton { | |||
} | |||
} | |||
|
|||
test("SPARK-26402: GetStructField with different names are semantically equal") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we move this to org.apache.spark.sql.SQLQuerySuite
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. Two files have the same name. Moved to the right one. Thanks.
Test build #100347 has finished for PR 23353 at commit
|
retest this please. |
Ur, the test on last commit is already running; |
Oh, I missed it. ha. |
test("SPARK-26402: GetStructField with different names are semantically equal") { | ||
withTable("t") { | ||
sql("create table t (s struct<i: Int>) using json") | ||
sql("select s.I from t group by s.i") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's a good practice to always check the result, how about checkAnswer(sql("select s.I from t group by s.i"), Nil)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's better.
Test build #100350 has finished for PR 23353 at commit
|
Test build #100348 has finished for PR 23353 at commit
|
@@ -26,6 +26,7 @@ package org.apache.spark.sql.catalyst.expressions | |||
* | |||
* The following rules are applied: | |||
* - Names and nullability hints for [[org.apache.spark.sql.types.DataType]]s are stripped. | |||
* - Names for [[org.apache.spark.sql.catalyst.expressions.GetStructField]] are stripped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[[org.apache.spark.sql.catalyst.expressions.GetStructField]]
-> [[GetStructField]]
? GetStructField
is in the same package.
Test build #100353 has finished for PR 23353 at commit
|
.analyze | ||
|
||
val optimized = Optimize.execute(originalQuery) | ||
val correctAnswer = nonNullableRelation.where(Literal.TrueLiteral).analyze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is removed eventually. To pass the test, we need to remove where(Literal.TrueLiteral)
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this has BooleanSimplification
. Removed it.
Test build #100370 has finished for PR 23353 at commit
|
// GetStructField with different names are semantically equal | ||
val fieldB1 = GetStructField( | ||
AttributeReference("data1", structType, false)(expId, qualifier), | ||
0, Some("b1")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for nit-picking. This should be a1
(and a2
at line 67) because this is the first level.
Consequently, fieldB1
-> fieldA1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val fieldA1 = GetStructField(
AttributeReference("data1", structType, false)(expId, qualifier),
0, Some("a1"))
val fieldA2 = GetStructField(
AttributeReference("data2", structType, false)(expId, qualifier),
0, Some("a2"))
assert(fieldA1.semanticEquals(fieldA2))
val fieldB1 = GetStructField(
GetStructField(
AttributeReference("data1", structType, false)(expId, qualifier),
0, Some("a1")),
0, Some("b1"))
val fieldB2 = GetStructField(
GetStructField(
AttributeReference("data2", structType, false)(expId, qualifier),
0, Some("a2")),
0, Some("b2"))
assert(fieldB1.semanticEquals(fieldB2))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun I put the ordering wrong. Addressed as you suggested. Thanks!
+1, LGTM except a minor comment on misleading test case. @gatorsmile . Could you review this again? |
Test build #100372 has finished for PR 23353 at commit
|
val fieldA2 = GetStructField( | ||
AttributeReference("data2", structType, false)(expId, qualifier), | ||
0, Some("a2")) | ||
assert(fieldB1.semanticEquals(fieldB2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line will fail to build.
Test build #100387 has finished for PR 23353 at commit
|
Test build #100379 has finished for PR 23353 at commit
|
Test build #100388 has finished for PR 23353 at commit
|
…se insensitive mode ## What changes were proposed in this pull request? GetStructField with different optional names should be semantically equal. We will use this as building block to compare the nested fields used in the plans to be optimized by catalyst optimizer. This PR also fixes a bug below that accessing nested fields with different cases in case insensitive mode will result `AnalysisException`. ``` sql("create table t (s struct<i: Int>) using json") sql("select s.I from t group by s.i") ``` which is currently failing ``` org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function ``` as cloud-fan pointed out. ## How was this patch tested? New tests are added. Closes #23353 from dbtsai/nestedEqual. Lead-authored-by: DB Tsai <d_tsai@apple.com> Co-authored-by: DB Tsai <dbtsai@dbtsai.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a5a24d9) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Thank you all. Merged to |
@cloud-fan and @gatorsmile . I thought the error on |
Looks fine to me. |
…se insensitive mode ## What changes were proposed in this pull request? GetStructField with different optional names should be semantically equal. We will use this as building block to compare the nested fields used in the plans to be optimized by catalyst optimizer. This PR also fixes a bug below that accessing nested fields with different cases in case insensitive mode will result `AnalysisException`. ``` sql("create table t (s struct<i: Int>) using json") sql("select s.I from t group by s.i") ``` which is currently failing ``` org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function ``` as cloud-fan pointed out. ## How was this patch tested? New tests are added. Closes apache#23353 from dbtsai/nestedEqual. Lead-authored-by: DB Tsai <d_tsai@apple.com> Co-authored-by: DB Tsai <dbtsai@dbtsai.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…se insensitive mode ## What changes were proposed in this pull request? GetStructField with different optional names should be semantically equal. We will use this as building block to compare the nested fields used in the plans to be optimized by catalyst optimizer. This PR also fixes a bug below that accessing nested fields with different cases in case insensitive mode will result `AnalysisException`. ``` sql("create table t (s struct<i: Int>) using json") sql("select s.I from t group by s.i") ``` which is currently failing ``` org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function ``` as cloud-fan pointed out. ## How was this patch tested? New tests are added. Closes apache#23353 from dbtsai/nestedEqual. Lead-authored-by: DB Tsai <d_tsai@apple.com> Co-authored-by: DB Tsai <dbtsai@dbtsai.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…se insensitive mode ## What changes were proposed in this pull request? GetStructField with different optional names should be semantically equal. We will use this as building block to compare the nested fields used in the plans to be optimized by catalyst optimizer. This PR also fixes a bug below that accessing nested fields with different cases in case insensitive mode will result `AnalysisException`. ``` sql("create table t (s struct<i: Int>) using json") sql("select s.I from t group by s.i") ``` which is currently failing ``` org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function ``` as cloud-fan pointed out. ## How was this patch tested? New tests are added. Closes apache#23353 from dbtsai/nestedEqual. Lead-authored-by: DB Tsai <d_tsai@apple.com> Co-authored-by: DB Tsai <dbtsai@dbtsai.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a5a24d9) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…se insensitive mode ## What changes were proposed in this pull request? GetStructField with different optional names should be semantically equal. We will use this as building block to compare the nested fields used in the plans to be optimized by catalyst optimizer. This PR also fixes a bug below that accessing nested fields with different cases in case insensitive mode will result `AnalysisException`. ``` sql("create table t (s struct<i: Int>) using json") sql("select s.I from t group by s.i") ``` which is currently failing ``` org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function ``` as cloud-fan pointed out. ## How was this patch tested? New tests are added. Closes apache#23353 from dbtsai/nestedEqual. Lead-authored-by: DB Tsai <d_tsai@apple.com> Co-authored-by: DB Tsai <dbtsai@dbtsai.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit a5a24d9) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
GetStructField with different optional names should be semantically equal. We will use this as building block to compare the nested fields used in the plans to be optimized by catalyst optimizer.
This PR also fixes a bug below that accessing nested fields with different cases in case insensitive mode will result
AnalysisException
.which is currently failing
as @cloud-fan pointed out.
How was this patch tested?
New tests are added.