Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-26402][SQL] Accessing nested fields with different cases in case insensitive mode #23353

Closed
wants to merge 13 commits into from
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@ package org.apache.spark.sql.catalyst.expressions
*
* The following rules are applied:
* - Names and nullability hints for [[org.apache.spark.sql.types.DataType]]s are stripped.
* - Names for [[org.apache.spark.sql.catalyst.expressions.GetStructField]] are stripped.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[[org.apache.spark.sql.catalyst.expressions.GetStructField]] -> [[GetStructField]]? GetStructField is in the same package.

* - Commutative and associative operations ([[Add]] and [[Multiply]]) have their children ordered
* by `hashCode`.
* - [[EqualTo]] and [[EqualNullSafe]] are reordered by `hashCode`.
Expand All @@ -37,10 +38,11 @@ object Canonicalize {
expressionReorder(ignoreNamesTypes(e))
}

/** Remove names and nullability from types. */
/** Remove names and nullability from types, and names from `GetStructField`. */
private[expressions] def ignoreNamesTypes(e: Expression): Expression = e match {
case a: AttributeReference =>
AttributeReference("none", a.dataType.asNullable)(exprId = a.exprId)
case GetStructField(child, ordinal, Some(_)) => GetStructField(child, ordinal, None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks not precisely matched the comments of ignoreNamesTypes. It's better to change it accordingly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/** Remove names and nullability from types. */

I can change it to and / or.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment of Canonicalize says:

The following rules are applied:
 *  - Names and nullability hints for [[org.apache.spark.sql.types.DataType]]s are stripped.
...

It's also needed to be update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/** Remove names and nullability from types. */

Actually after this change it is not only for types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I re-wrote it a bit. Should look okay now.

case _ => e
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -18,10 +18,19 @@
package org.apache.spark.sql.catalyst.expressions

import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.dsl.expressions._
import org.apache.spark.sql.catalyst.dsl.plans._
import org.apache.spark.sql.catalyst.plans.logical.Range
import org.apache.spark.sql.catalyst.optimizer._
import org.apache.spark.sql.catalyst.plans.PlanTest
import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, LogicalPlan, Range}
import org.apache.spark.sql.catalyst.rules.RuleExecutor
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class CanonicalizeSuite extends SparkFunSuite {
class CanonicalizeSuite extends SparkFunSuite with ExpressionEvalHelper with PlanTest {

object Optimize extends RuleExecutor[LogicalPlan] {
val batches = Batch("SimplifyBinaryComparison", Once, SimplifyBinaryComparison) :: Nil
}

test("SPARK-24276: IN expression with different order are semantically equal") {
val range = Range(1, 1, 1, 1)
Expand Down Expand Up @@ -50,4 +59,43 @@ class CanonicalizeSuite extends SparkFunSuite {
assert(range.where(arrays1).sameResult(range.where(arrays2)))
assert(!range.where(arrays1).sameResult(range.where(arrays3)))
}

test("SPARK-26402: GetStructField with different names are semantically equal") {
val expId = NamedExpression.newExprId
val qualifier = Seq.empty[String]
val structType = StructType(
StructField("a", StructType(StructField("b", IntegerType, false) :: Nil), false) :: Nil)

val fieldB1 = GetStructField(
AttributeReference("data1", structType, false)(expId, qualifier),
0, Some("b1"))
Copy link
Member

@dongjoon-hyun dongjoon-hyun Dec 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for nit-picking. This should be a1 (and a2 at line 67) because this is the first level.
Consequently, fieldB1 -> fieldA1?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    val fieldA1 = GetStructField(
      AttributeReference("data1", structType, false)(expId, qualifier),
      0, Some("a1"))
    val fieldA2 = GetStructField(
      AttributeReference("data2", structType, false)(expId, qualifier),
      0, Some("a2"))
    assert(fieldA1.semanticEquals(fieldA2))

    val fieldB1 = GetStructField(
      GetStructField(
        AttributeReference("data1", structType, false)(expId, qualifier),
        0, Some("a1")),
      0, Some("b1"))
    val fieldB2 = GetStructField(
      GetStructField(
        AttributeReference("data2", structType, false)(expId, qualifier),
        0, Some("a2")),
      0, Some("b2"))
    assert(fieldB1.semanticEquals(fieldB2))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun I put the ordering wrong. Addressed as you suggested. Thanks!

val fieldB2 = GetStructField(
AttributeReference("data2", structType, false)(expId, qualifier),
0, Some("b2"))
assert(fieldB1.semanticEquals(fieldB2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line will fail to build.


val fieldA1 = GetStructField(
GetStructField(
AttributeReference("data1", structType, false)(expId, qualifier),
0, Some("a1")),
0, Some("b1"))
val fieldA2 = GetStructField(
GetStructField(
AttributeReference("data2", structType, false)(expId, qualifier),
0, Some("a2")),
0, Some("b2"))
assert(fieldA1.semanticEquals(fieldA2))

val testRelation = LocalRelation('a.int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a real end-to-end test...

How about add the following test to SQLQuerySuite?

sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")

currently it fials with

org.apache.spark.sql.AnalysisException: expression 'default.t.`s`' is neither present in the group by, nor is it an aggregate function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one makes sense, and is addressed by this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then can we remove this part? i.e. code between L89 to L99

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

@dongjoon-hyun dongjoon-hyun Dec 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test can be a part of BinaryComparisonSimplificationSuite for SimplifyBinaryComparison. So far, there is no test case for struct type in BinaryComparisonSimplificationSuite. Since this PR, SimplifyBinaryComparison can remove s.I <=> s.i.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious that is that removed too when case sensitive mode is turned on?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will fail at name resolution.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya I added a test to show in case insensitive mode, it will fail.


val originalQuery =
LocalRelation('a.int)
.where(EqualTo(fieldA1, fieldA2))
.analyze

val optimized = Optimize.execute(originalQuery)
val correctAnswer = testRelation.where(Literal.TrueLiteral).analyze

comparePlans(optimized, correctAnswer)
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -2937,6 +2937,13 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
}
}
}

test("SPARK-26402: GetStructField with different names are semantically equal") {
withTable("t") {
sql("create table t (s struct<i: Int>) using json")
sql("select s.I from t group by s.i")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a good practice to always check the result, how about checkAnswer(sql("select s.I from t group by s.i"), Nil)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's better.

}
}
}

case class Foo(bar: Option[String])