
[SPARK-26078][SQL] Dedup self-join attributes on IN subqueries #23057

Closed
wants to merge 10 commits

Conversation

mgaido91
Contributor

What changes were proposed in this pull request?

When a self-join is produced as the result of an IN subquery, the join condition may be invalid, resulting in trivially true predicates and wrong results.

The PR deduplicates the subquery output in order to avoid the issue.
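As a rough illustration of the fix (a hypothetical, simplified model — Spark's actual implementation works on Catalyst's `Attribute`/`Alias` classes, not these stand-ins): subquery output attributes that collide with the outer plan's references are re-aliased under fresh expression ids, so the generated join condition no longer compares an attribute with itself.

```scala
// Simplified model of the dedup idea. `AttrRef` and `DedupSketch` are
// hypothetical stand-ins for Catalyst's Attribute/Alias machinery, where
// attribute identity is driven by the expression id.
case class AttrRef(name: String, exprId: Long)

object DedupSketch {
  private var nextId = 1000L
  private def freshId(): Long = { nextId += 1; nextId }

  // Re-alias every subquery output attribute that also appears among the
  // outer references, so the generated join condition stays meaningful.
  def dedup(outerRefs: Set[AttrRef], subOutput: Seq[AttrRef]): Seq[AttrRef] =
    subOutput.map { attr =>
      if (outerRefs.contains(attr)) attr.copy(exprId = freshId())
      else attr
    }
}
```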

How was this patch tested?

added UT

@mgaido91
Contributor Author

cc @cloud-fan @viirya

@SparkQA

SparkQA commented Nov 16, 2018

Test build #98913 has finished for PR 23057 at commit 2af656a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Nov 16, 2018

Thanks @mgaido91. I will review this tomorrow.

@SparkQA

SparkQA commented Nov 16, 2018

Test build #98914 has finished for PR 23057 at commit a71b1c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 16, 2018

Test build #98920 has finished for PR 23057 at commit a71b1c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


private def dedupSubqueryOnSelfJoin(values: Seq[Expression], sub: LogicalPlan): LogicalPlan = {
  val leftRefs = AttributeSet.fromAttributeSets(values.map(_.references))
  val rightRefs = AttributeSet(sub.output)
Member

This is just outputSet?

@@ -119,7 +139,7 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
// (A.A1 = B.B1 OR ISNULL(A.A1 = B.B1)) AND (B.B2 = A.A2) AND B.B3 > 1
val finalJoinCond = (nullAwareJoinConds ++ conditions).reduceLeft(And)
// Deduplicate conflicting attributes if any.
-      dedupJoin(Join(outerPlan, sub, LeftAnti, Option(finalJoinCond)))
+      dedupJoin(Join(outerPlan, newSub, LeftAnti, Option(finalJoinCond)))
case (p, predicate) =>
val (newCond, inputPlan) = rewriteExistentialExpr(Seq(predicate), p)
Project(p.output, Filter(newCond.get, inputPlan))
Member

In rewriteExistentialExpr, there is a similar logic for InSubquery. Should we also do dedupSubqueryOnSelfJoin for it?

Contributor Author

mmmh... rewriteExistentialExpr operates on the result of the foldLeft, so every InSubquery there was already transformed using dedupSubqueryOnSelfJoin, right? So I don't think it is needed.

Member

Can you try this test case?

val df1 = spark.sql(
        """
          |SELECT id,num,source FROM (
          |  SELECT id, num, 'a' as source FROM a
          |  UNION ALL
          |  SELECT id, num, 'b' as source FROM b
          |) AS c WHERE c.id IN (SELECT id FROM b WHERE num = 2) OR
          |c.id IN (SELECT id FROM b WHERE num = 3)
        """.stripMargin)
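For context on why this query goes wrong: the outer plan `c` contains table `b`, so (before this fix) the rewritten join could presumably end up comparing `b`'s `id` attribute against itself, yielding a trivially true condition. A hypothetical, simplified model in plain Scala (`Attr` is a stand-in for Catalyst's attribute classes, where equality follows the expression id):

```scala
// Hypothetical simplified model of Catalyst attributes: equality is
// driven by the expression id, just as in Catalyst.
case class Attr(name: String, exprId: Long)

// Without dedup, the outer plan (which contains table b) and the
// rewritten subquery can expose b's `id` with the SAME expression id,
// so the equi-join condition degenerates to id#7 = id#7.
val outerId = Attr("id", 7L)
val subId   = Attr("id", 7L)
val triviallyTrue = outerId == subId

// After dedup, the subquery side gets a fresh expression id, so the
// condition no longer collapses to true by construction.
val dedupedSubId = subId.copy(exprId = 42L)
val stillTrivial = outerId == dedupedSubId
```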

Contributor Author

this fails indeed. I'll investigate it, thanks.

Contributor Author

thanks for your help here @viirya. I added the check also to rewriteExistentialExpr. I was missing the case when it is invoked not only on the result of foldLeft. Thanks.

val a = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row("a", 2), Row("b", 1))),
  StructType(Seq(StructField("id", StringType), StructField("num", IntegerType))))
val b = spark.createDataFrame(spark.sparkContext.parallelize(Seq(Row("a", 2), Row("b", 1))),
  StructType(Seq(StructField("id", StringType), StructField("num", IntegerType))))
Member

The two schemas are the same. Can we define it just once?

@SparkQA

SparkQA commented Nov 17, 2018

Test build #98969 has finished for PR 23057 at commit 86106fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@SparkQA

SparkQA commented Nov 19, 2018

Test build #99008 has finished for PR 23057 at commit 3d010fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

any more comments @cloud-fan @viirya ?

Project(aliasedExpressions, plan)
}

private def dedupSubqueryOnSelfJoin(values: Seq[Expression], sub: LogicalPlan): LogicalPlan = {
Member

Add a simple code comment for this method?

@viirya
Member

viirya commented Nov 21, 2018

The change looks fine to me. cc @cloud-fan

@mgaido91
Contributor Author

thanks @viirya , I added a comment

@SparkQA

SparkQA commented Nov 21, 2018

Test build #99111 has finished for PR 23057 at commit 65fca4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

any comments @cloud-fan ?

@mgaido91
Contributor Author

cc @gatorsmile too

@mccheah
Contributor

mccheah commented Dec 1, 2018

Is this ready to merge?

@mgaido91
Contributor Author

mgaido91 commented Dec 3, 2018

@mccheah this is waiting for reviews by committers

@mgaido91
Contributor Author

@cloud-fan @gatorsmile may you please take a look at this? Thanks.

StructType(Seq(StructField("id", StringType), StructField("num", IntegerType))))
df.createOrReplaceTempView(name)
}
genTestViewWithName("a")
Contributor

nit:

Seq("a" -> 2, "b" -> 1).toDF("id", "num").createTempView("a")
Seq("a" -> 2, "b" -> 1).toDF("id", "num").createTempView("b")

   val (joinCond, outerPlan) = rewriteExistentialExpr(inConditions ++ conditions, p)
   // Deduplicate conflicting attributes if any.
-  dedupJoin(Join(outerPlan, sub, LeftSemi, joinCond))
+  dedupJoin(Join(outerPlan, newSub, LeftSemi, joinCond))
Contributor

do we still need to dedup here?

Contributor Author

I think we don't, let me remove it.

@@ -92,18 +114,20 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
// Deduplicate conflicting attributes if any.
dedupJoin(Join(outerPlan, sub, LeftAnti, joinCond))
Contributor

Looks like we don't need dedupJoin, but always dedup the subquery before putting it in a join.

Contributor Author

I think it makes sense to dedup the subquery only when the join condition has not been created yet (so in the case of InSubquery). In the Exists case, instead, the condition is already there, so I think we still have to use dedupJoin.

Contributor

dedupJoin will eventually dedup the subquery, IIUC.

What I'd like to do is to unify dedupJoin and dedupSubqueryOnSelfJoin, so that the code will be consistent for all cases:

val newSub = dedup(sub, values)
// create join condition if any
Join(outerPlan, newSub, ...)

Contributor Author

the main problem is that in the other cases, i.e. when Exists is there, the condition is already created. So we would need to complicate the method quite a lot in order to handle the 2 cases, and I am not sure whether it is worth it. For instance, in the Exists case the values should be taken from the conditions, as the expressions referencing attributes from one side, and the join condition needs to be rewritten. So I don't think it is a good idea to have a common rewrite for both of them: it would be overcomplicated IMHO.

Contributor

ah thanks for the explanation! Makes sense to me.

@SparkQA

SparkQA commented Dec 11, 2018

Test build #99968 has finished for PR 23057 at commit 1beb40c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 11, 2018

Test build #99970 has finished for PR 23057 at commit 1beb40c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 12, 2018

Test build #100018 has finished for PR 23057 at commit 0312558.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 12, 2018

Test build #100023 has finished for PR 23057 at commit 0312558.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

aliasMap.getOrElse(ref, ref)
}
-  val newRight = Project(aliasedExpressions, right)
+  val newRight = rewriteDedupPlan(right, aliasMap)
val newJoinCond = joinCond.map { condExpr =>
Contributor

not related to your PR, but is this correct? The duplicated attributes in the join condition may refer to the left or the right child; how can we blindly replace them with new attributes from the right side?

Contributor Author

yes, I actually think this is useless. Let me try and remove it.

condition: Option[Expression]): Join = {
// Deduplicate conflicting attributes if any.
val dedupSubplan = dedupSubqueryOnSelfJoin(outerPlan, subplan)
Join(outerPlan, dedupSubplan, joinType, condition)
Contributor

shall we add an assert to make sure the condition doesn't contain conflicting attributes?

Contributor Author

I am not sure about this: how do we check it? If the same attribute is present on both sides of a BinaryOperator? Is this always forbidden?

Contributor

we need to refactor the code a little bit

...
val duplicates = outerRefs.intersect(subplan.outputSet)
condition.foreach {
  case a: Attribute if duplicates.contains(a) => fail
  case _ =>
}
...
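A self-contained sketch of that check (plain Scala with a hypothetical `ARef` attribute model and an `assertNoConflicts` helper; the real code would use Spark's `AttributeSet` and `Expression` classes):

```scala
// Hypothetical simplified attribute model: as in Catalyst, identity is
// the expression id, which is what makes the intersection meaningful.
case class ARef(name: String, exprId: Long)

// Fail fast if the join condition references an attribute that is
// visible on both sides of the join (i.e. a conflicting attribute).
def assertNoConflicts(outerRefs: Set[ARef],
                      subOutput: Set[ARef],
                      conditionRefs: Set[ARef]): Unit = {
  val duplicates = outerRefs.intersect(subOutput)
  val conflicting = conditionRefs.intersect(duplicates)
  require(conflicting.isEmpty, s"Found conflicting attributes: $conflicting")
}
```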

Contributor Author

I see what you mean now. I'll do that, thanks.

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100083 has finished for PR 23057 at commit 6528582.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2018

Test build #100112 has started for PR 23057 at commit ec710d7.

@mgaido91
Contributor Author

retest this please

@SparkQA

SparkQA commented Dec 14, 2018

Test build #100145 has finished for PR 23057 at commit ec710d7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

condition.foreach { e =>
  val conflictingAttrs = e.references.intersect(duplicates)
  if (conflictingAttrs.nonEmpty) {
    throw new AnalysisException("Found conflicting attributes " +
Contributor

just for curiosity, when can this happen? or how we guarantee this will never happen?

Contributor Author

this can happen when the condition is built in advance (e.g. the correlated condition of Exists) and it contains some attribute which is not deduplicated. I am not sure whether this scenario can actually happen, though, or whether our dedup logic in the previous rules guarantees it never will.

@cloud-fan
Contributor

thanks, merging to master!

asfgit closed this in cd815ae on Dec 16, 2018
@cloud-fan
Contributor

Hi @mgaido91 , since we are going to have new releases for branch 2.3 and 2.4, do you know if this bug exists in 2.3/2.4 and shall we backport it? thanks!

@mgaido91
Contributor Author

mgaido91 commented Jan 3, 2019

@cloud-fan yes, this affects 2.3/2.4 too. Let me know if you want me to open the PRs for backporting it there.

I am just wondering what to do for 2.2, since there is a discussion about its last release. If we want this there too, we should backport also SPARK-21835. What do you think? cc @viirya too

@cloud-fan
Contributor

also cc @dongjoon-hyun @HyukjinKwon

IIRC there were some refactorings about subquery rewrite, not sure how hard it is to backport to 2.2.

@mgaido91
Contributor Author

mgaido91 commented Jan 3, 2019

backporting to 2.2 requires SPARK-21835, not sure if that is built on top of other changes...

@dongjoon-hyun
Member

Hi, @cloud-fan, @gatorsmile, @rxin. I know the correctness-issue policy on releases, so for most correctness issues I'm trying to review and cover them.

However, we already have one exceptional correctness issue, SPARK-25206, which was decided not to go into branch-2.2 due to technical difficulty and risk. For me, Spark 2.2.3 is a somewhat exceptional release, since it is a farewell release and branch-2.2 is already EOL and too far from the active master branch.

So, for these risky issues (this one, SPARK-26078, and SPARK-25206), I'd like to put them out of the scope of the farewell release and recommend that users move to the latest release.

What do you think about that?

@cloud-fan
Contributor

SGTM, seems 2.3 and 2.4 are good enough to backport. @mgaido91 do you mind sending a new PR to backport? thanks!

@dongjoon-hyun
Member

Thank you, @cloud-fan !

mgaido91 added a commit to mgaido91/spark that referenced this pull request Jan 4, 2019
…queries

When there is a self-join as result of a IN subquery, the join condition may be invalid, resulting in trivially true predicates and return wrong results.

The PR deduplicates the subquery output in order to avoid the issue.

added UT

Closes apache#23057 from mgaido91/SPARK-26078.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
mgaido91 added a commit to mgaido91/spark that referenced this pull request Jan 4, 2019
holdenk pushed a commit to holdenk/spark that referenced this pull request Jan 5, 2019
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019