[SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan #28695
Conversation
I believe this fix is conservative enough to land
sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala
Test build #123377 has finished for PR 28695 at commit
Test build #123381 has finished for PR 28695 at commit
+1, LGTM. Thank you, @HyukjinKwon and @cloud-fan .
Merged to master/3.0.
### What changes were proposed in this pull request?

This PR proposes to run the `DetectAmbiguousSelfJoin` check only if there is a `Join` in the plan. Currently, the check is too strict even for non-join queries. For example, the code below has no join at all, but it fails with the ambiguous self-join error:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val df = Seq(1, 1, 2, 2).toDF("A")
val w = Window.partitionBy(df("A"))
df.select(df("A").alias("X"), sum(df("A")).over(w)).explain(true)
```

This happens because `ExtractWindowExpressions` can create an `AttributeReference` with the same metadata but a different expression ID, see:

https://github.com/apache/spark/blob/0fd98abd859049dc3b200492487041eeeaa8f737/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2679
https://github.com/apache/spark/blob/71c73d58f6e88d2558ed2e696897767d93bac60f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L63
https://github.com/apache/spark/blob/5945d46c11a86fd85f9e65f24c2e88f368eee01f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L180

Before:

```
'Project [A#19 AS X#21, sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
+- Relation[A#19] parquet
```

After:

```
Project [X#21, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
+- Project [X#21, A#19, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
   +- Window [sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L], [A#19]
      +- Project [A#19 AS X#21, A#19]
         +- Relation[A#19] parquet
```

`X#21` holds the same DataFrame ID and column position metadata as `A#19`, but it has a different expression ID, which causes the check to fail.

### Why are the changes needed?

To loosen the check so that users are not surprised.

### Does this PR introduce _any_ user-facing change?

The changes are in unreleased branches only.

### How was this patch tested?

Manually tested, and a unit test was added.

Closes #28695 from HyukjinKwon/SPARK-28344-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ea45fc5)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
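The "only check when a `Join` is present" guard can be illustrated with a small, self-contained sketch. Note this does not use Spark's actual classes: the tiny `Plan` tree and `containsJoin` below are hypothetical stand-ins for Catalyst's `LogicalPlan` and the traversal inside `DetectAmbiguousSelfJoin`.

```scala
// Hypothetical miniature plan tree, modeling why the ambiguous-self-join
// check should be skipped entirely for join-free plans.
sealed trait Plan { def children: Seq[Plan] }
case class Relation(name: String) extends Plan { def children = Nil }
case class Project(child: Plan) extends Plan { def children = Seq(child) }
case class WindowNode(child: Plan) extends Plan { def children = Seq(child) }
case class JoinNode(left: Plan, right: Plan) extends Plan {
  def children = Seq(left, right)
}

// Recursively walk the tree looking for a join node.
def containsJoin(plan: Plan): Boolean = plan match {
  case _: JoinNode => true
  case p           => p.children.exists(containsJoin)
}

// The window-only repro above has no join, so the check should not run.
val windowOnly = Project(WindowNode(Relation("df")))
val selfJoin   = JoinNode(Relation("df"), Relation("df"))

assert(!containsJoin(windowOnly)) // skip the ambiguity check
assert(containsJoin(selfJoin))    // run the ambiguity check
```

In Catalyst the same idea amounts to a cheap tree traversal before the per-column metadata comparison, so join-free queries like the window example never reach the strict check.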
Thanks @cloud-fan and @dongjoon-hyun.
### What changes were proposed in this pull request?

This is a followup of #28695, to fix the problem completely. The root cause is that `df("col").as("name")` is no longer a column reference and should not carry the special column metadata. However, this was broken in ba7adc4#diff-ac415c903887e49486ba542a65eec980L1050-L1053

This PR fixes the regression by stripping the special column metadata in `Column.name`, which was the behavior before #28326.

### Why are the changes needed?

Fix a regression. We shouldn't fail if there is no ambiguous self-join.

### Does this PR introduce _any_ user-facing change?

Yes, the query in the test can run now.

### How was this patch tested?

Updated test.

Closes #28783 from cloud-fan/self-join.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
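The metadata-stripping idea can be sketched in a self-contained way. This is a hypothetical model, not Spark's actual `Column` API or its internal metadata keys: the point is only that aliasing produces a new column, so the per-DataFrame bookkeeping used by the self-join detector should not survive the alias.

```scala
// Hypothetical attribute with metadata, standing in for Catalyst's
// AttributeReference and the detector's special metadata entries.
case class Meta(entries: Map[String, Long])
case class Attr(name: String, metadata: Meta)

// Keys assumed for illustration; Spark uses its own internal key names.
val DatasetIdKey = "__dataset_id"
val ColPosKey    = "__col_position"

// Aliasing creates a new column rather than a reference to the original,
// so the special self-join metadata is stripped from the result.
def aliasColumn(a: Attr, newName: String): Attr =
  Attr(newName, Meta(a.metadata.entries - DatasetIdKey - ColPosKey))

val a = Attr("A", Meta(Map(DatasetIdKey -> 1L, ColPosKey -> 0L)))
val x = aliasColumn(a, "X")

assert(x.metadata.entries.isEmpty) // "X" no longer looks like df("A")
assert(a.metadata.entries.size == 2) // the original reference is untouched
```

With the metadata gone, the detector no longer mistakes `df("A").as("X")` for a second reference to `df("A")`, so join-free queries pass the check.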