[SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan #28695

HyukjinKwon · 2020-06-01T14:09:30Z

What changes were proposed in this pull request?

This PR proposes to check DetectAmbiguousSelfJoin only if there is Join in the plan. Currently, the checking is too strict even to non-join queries.

For example, the codes below don't have join at all but it fails as the ambiguous self-join:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val df = Seq(1, 1, 2, 2).toDF("A")
val w = Window.partitionBy(df("A"))
df.select(df("A").alias("X"), sum(df("A")).over(w)).explain(true)

It is because ExtractWindowExpressions can create a AttributeReference with the same metadata but a different expression ID, see:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Line 2679 in 0fd98ab

Project(windowOps.output ++ newExpressionsWithWindowFunctions, windowOps)

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

Line 63 in 71c73d5

override def output: Seq[Attribute] = projectList.map(_.toAttribute)

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala

Line 180 in 5945d46

    
           AttributeReference(name, child.dataType, child.nullable, metadata)(exprId, qualifier)

Before:

'Project [A#19 AS X#21, sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
+- Relation[A#19] parquet

After:

Project [X#21, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
+- Project [X#21, A#19, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
   +- Window [sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L], [A#19]
      +- Project [A#19 AS X#21, A#19]
         +- Relation[A#19] parquet

X#21 holds the same metadata of DataFrame ID and column position with A#19 but it has a different expression ID which ends up with the checking fails.

Why are the changes needed?

To loose the checking and make users not surprised.

Does this PR introduce any user-facing change?

It's the changes in unreleased branches only.

How was this patch tested?

Manually tested and unittest was added.

HyukjinKwon · 2020-06-01T14:09:55Z

I believe this fix is conservative enough to land branch-3.0 ...

sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala

SparkQA · 2020-06-01T18:32:48Z

Test build #123377 has finished for PR 28695 at commit 00bbe9d.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-01T21:08:49Z

Test build #123381 has finished for PR 28695 at commit 21e4241.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM. Thank you, @HyukjinKwon and @cloud-fan .
Merged to master/3.0.

…here is a join in the plan ### What changes were proposed in this pull request? This PR proposes to check `DetectAmbiguousSelfJoin` only if there is `Join` in the plan. Currently, the checking is too strict even to non-join queries. For example, the codes below don't have join at all but it fails as the ambiguous self-join: ```scala import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.sum val df = Seq(1, 1, 2, 2).toDF("A") val w = Window.partitionBy(df("A")) df.select(df("A").alias("X"), sum(df("A")).over(w)).explain(true) ``` It is because `ExtractWindowExpressions` can create a `AttributeReference` with the same metadata but a different expression ID, see: https://github.com/apache/spark/blob/0fd98abd859049dc3b200492487041eeeaa8f737/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2679 https://github.com/apache/spark/blob/71c73d58f6e88d2558ed2e696897767d93bac60f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L63 https://github.com/apache/spark/blob/5945d46c11a86fd85f9e65f24c2e88f368eee01f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L180 Before: ``` 'Project [A#19 AS X#21, sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L] +- Relation[A#19] parquet ``` After: ``` Project [X#21, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L] +- Project [X#21, A#19, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L] +- Window [sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L], [A#19] +- Project [A#19 AS X#21, A#19] +- Relation[A#19] parquet ``` `X#21` holds the same metadata of DataFrame ID and column position with `A#19` but it has a different expression ID which ends up with the checking fails. ### Why are the changes needed? To loose the checking and make users not surprised. ### Does this PR introduce _any_ user-facing change? It's the changes in unreleased branches only. ### How was this patch tested? Manually tested and unittest was added. Closes #28695 from HyukjinKwon/SPARK-28344-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit ea45fc5) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

HyukjinKwon · 2020-06-02T00:31:00Z

Thanks @cloud-fan and @dongjoon-hyun.

### What changes were proposed in this pull request? This is a followup of #28695 , to fix the problem completely. The root cause is that, `df("col").as("name")` is not a column reference anymore, and should not have the special column metadata. However, this was broken in ba7adc4#diff-ac415c903887e49486ba542a65eec980L1050-L1053 This PR fixes the regression, by strip the special column metadata in `Column.name`, which is the behavior before #28326 . ### Why are the changes needed? Fix a regression. We shouldn't fail if there is no ambiguous self-join. ### Does this PR introduce _any_ user-facing change? Yes, the query in the test can run now. ### How was this patch tested? updated test Closes #28783 from cloud-fan/self-join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request? This is a followup of #28695 , to fix the problem completely. The root cause is that, `df("col").as("name")` is not a column reference anymore, and should not have the special column metadata. However, this was broken in ba7adc4#diff-ac415c903887e49486ba542a65eec980L1050-L1053 This PR fixes the regression, by strip the special column metadata in `Column.name`, which is the behavior before #28326 . ### Why are the changes needed? Fix a regression. We shouldn't fail if there is no ambiguous self-join. ### Does this PR introduce _any_ user-facing change? Yes, the query in the test can run now. ### How was this patch tested? updated test Closes #28783 from cloud-fan/self-join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c400519) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

### What changes were proposed in this pull request? This is a followup of apache#28695 , to fix the problem completely. The root cause is that, `df("col").as("name")` is not a column reference anymore, and should not have the special column metadata. However, this was broken in apache@ba7adc4#diff-ac415c903887e49486ba542a65eec980L1050-L1053 This PR fixes the regression, by strip the special column metadata in `Column.name`, which is the behavior before apache#28326 . ### Why are the changes needed? Fix a regression. We shouldn't fail if there is no ambiguous self-join. ### Does this PR introduce _any_ user-facing change? Yes, the query in the test can run now. ### How was this patch tested? updated test Closes apache#28783 from cloud-fan/self-join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit c400519) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

…here is a join in the plan ### What changes were proposed in this pull request? This PR proposes to check `DetectAmbiguousSelfJoin` only if there is `Join` in the plan. Currently, the checking is too strict even to non-join queries. For example, the codes below don't have join at all but it fails as the ambiguous self-join: ```scala import org.apache.spark.sql.expressions.Window import org.apache.spark.sql.functions.sum val df = Seq(1, 1, 2, 2).toDF("A") val w = Window.partitionBy(df("A")) df.select(df("A").alias("X"), sum(df("A")).over(w)).explain(true) ``` It is because `ExtractWindowExpressions` can create a `AttributeReference` with the same metadata but a different expression ID, see: https://github.com/apache/spark/blob/0fd98abd859049dc3b200492487041eeeaa8f737/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2679 https://github.com/apache/spark/blob/71c73d58f6e88d2558ed2e696897767d93bac60f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L63 https://github.com/apache/spark/blob/5945d46c11a86fd85f9e65f24c2e88f368eee01f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L180 Before: ``` 'Project [A#19 AS X#21, sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L] +- Relation[A#19] parquet ``` After: ``` Project [X#21, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L] +- Project [X#21, A#19, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L] +- Window [sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L], [A#19] +- Project [A#19 AS X#21, A#19] +- Relation[A#19] parquet ``` `X#21` holds the same metadata of DataFrame ID and column position with `A#19` but it has a different expression ID which ends up with the checking fails. ### Why are the changes needed? To loose the checking and make users not surprised. ### Does this PR introduce _any_ user-facing change? It's the changes in unreleased branches only. ### How was this patch tested? Manually tested and unittest was added. Closes apache#28695 from HyukjinKwon/SPARK-28344-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

This is a followup of apache#28695 , to fix the problem completely. The root cause is that, `df("col").as("name")` is not a column reference anymore, and should not have the special column metadata. However, this was broken in apache@ba7adc4#diff-ac415c903887e49486ba542a65eec980L1050-L1053 This PR fixes the regression, by strip the special column metadata in `Column.name`, which is the behavior before apache#28326 . Fix a regression. We shouldn't fail if there is no ambiguous self-join. Yes, the query in the test can run now. updated test Closes apache#28783 from cloud-fan/self-join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

Check the ambiguous self-join only if there's join in the plan

00bbe9d

probot-autolabeler bot added the SQL label Jun 1, 2020

HyukjinKwon requested a review from cloud-fan June 1, 2020 14:09

HyukjinKwon changed the title ~~[SPARK-28344][SQL] Check the ambiguous self-join only if there is a join in the plan~~ [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan Jun 1, 2020

cloud-fan reviewed Jun 1, 2020

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala Outdated Show resolved Hide resolved

Address comments

21e4241

dongjoon-hyun approved these changes Jun 1, 2020

View reviewed changes

dongjoon-hyun closed this in ea45fc5 Jun 1, 2020

cloud-fan mentioned this pull request Jun 10, 2020

[SPARK-31956][SQL] Do not fail if there is no ambiguous self join #28783

Closed

HyukjinKwon deleted the SPARK-28344-followup branch July 27, 2020 07:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan #28695

[SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan #28695

HyukjinKwon commented Jun 1, 2020

HyukjinKwon commented Jun 1, 2020

SparkQA commented Jun 1, 2020

SparkQA commented Jun 1, 2020

dongjoon-hyun left a comment •

edited

Loading

HyukjinKwon commented Jun 2, 2020

[SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan #28695

[SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan #28695

Conversation

HyukjinKwon commented Jun 1, 2020

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

HyukjinKwon commented Jun 1, 2020

SparkQA commented Jun 1, 2020

SparkQA commented Jun 1, 2020

dongjoon-hyun left a comment • edited Loading

Choose a reason for hiding this comment

HyukjinKwon commented Jun 2, 2020

dongjoon-hyun left a comment •

edited

Loading