[SPARK-48503][SQL] Fix invalid scalar subqueries with group-by on non-equivalent columns that were incorrectly allowed #46839

jchen5 · 2024-06-03T00:00:29Z

What changes were proposed in this pull request?

Fixes CheckAnalysis to reject invalid scalar subquery group-bys that were previously allowed and returned wrong results.

For example, this query is not legal and should raise an error, but prior to this PR it was incorrectly allowed and returned wrong results (full repro with table data in the JIRA):

select *, (select count(*) from y where y1 > x1 group by y1) from x;

It returns two rows, even though there's only one row of x. The correct result is an error, because there is more than one row returned by the scalar subquery.
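To see why the subquery is not scalar, here is a minimal sketch that evaluates only the subquery body for a single outer row. The table contents are hypothetical (they are not the JIRA repro), and SQLite is used purely as a convenient stand-in engine, not as a model of Spark's behavior.

```python
import sqlite3

# Hypothetical data (not the JIRA repro): one outer row, two inner rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (x1 INTEGER)")
conn.execute("CREATE TABLE y (y1 INTEGER)")
conn.execute("INSERT INTO x VALUES (1)")
conn.executemany("INSERT INTO y VALUES (?)", [(2,), (3,)])

# Take the single outer row and bind its x1 into the subquery body.
(x1,) = conn.execute("SELECT x1 FROM x").fetchone()
subquery_rows = conn.execute(
    "SELECT count(*) FROM y WHERE y1 > ? GROUP BY y1 ORDER BY y1", (x1,)
).fetchall()

# One result row per distinct y1 that passes the filter.
print(subquery_rows)
```

With this data the subquery body yields one row per qualifying value of y1, so a single outer row sees more than one subquery row.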

Another problem case is when the correlation condition is an equality but sits under another operator such as an OUTER JOIN or UNION. Various other expressions that are not equi-joins between the inner and outer fields hit this too, e.g. where y1 + y2 = x1 group by y1. See the comments in the code and the tests for more examples.
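The `y1 + y2 = x1` case can be contrasted with a plain column equality in a small sketch. The rows and the helper `subquery_rows` are hypothetical, and SQLite again only serves to evaluate the subquery body for one outer value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE y (y1 INTEGER, y2 INTEGER)")
# Hypothetical rows: two of them satisfy y1 + y2 = 3 with different y1.
conn.executemany("INSERT INTO y VALUES (?, ?)", [(1, 2), (2, 1), (3, 5)])


def subquery_rows(predicate, x1):
    """Run the subquery body for one outer value x1 and return its rows."""
    sql = f"SELECT count(*) FROM y WHERE {predicate} GROUP BY y1 ORDER BY y1"
    return conn.execute(sql, (x1,)).fetchall()


x1 = 3
# y1 = x1 pins the grouping column to one value: at most one group.
column_equality = subquery_rows("y1 = ?", x1)
# y1 + y2 = x1 is an equality, but y1 itself can still take several values.
expression_equality = subquery_rows("y1 + y2 = ?", x1)

print(column_equality, expression_equality)
```

The equality on the bare grouping column produces a single group, while the equality on `y1 + y2` leaves `y1` free to vary and produces two.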

This PR fixes the logic which checks for valid vs invalid group-bys. However, note that this new logic may block some queries that are actually valid, for example a + 1 = outer(b) is valid but would be rejected. Therefore, we add a conf flag that can be used to restore the legacy behavior, as well as logging for when the legacy behavior is used and differs from the new behavior. (In general, to accurately run valid queries and reject invalid queries, the check must be moved from compile-time to run-time - see https://issues.apache.org/jira/browse/SPARK-48501.)
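As an illustration of what a run-time check could look like, the sketch below counts the rows a subquery actually produces for one outer row and fails only when there is more than one. This is not Spark's implementation or the design in SPARK-48501; `TooManyRowsError`, `grouped_counts`, and `scalar_value` are hypothetical names, and the data is made up.

```python
class TooManyRowsError(Exception):
    """Hypothetical error for this sketch; not a Spark exception class."""


def grouped_counts(inner_rows, predicate):
    """Emulate `select count(*) ... where <predicate> group by a`."""
    counts = {}
    for a in inner_rows:
        if predicate(a):
            counts[a] = counts.get(a, 0) + 1
    return [counts[key] for key in sorted(counts)]


def scalar_value(rows):
    """Run-time check: a scalar subquery may yield at most one row."""
    if len(rows) > 1:
        raise TooManyRowsError(f"scalar subquery returned {len(rows)} rows")
    return rows[0] if rows else None


inner = [1, 1, 2]  # hypothetical values of inner column a
outer_b = 2        # hypothetical value of outer(b) for one outer row

# a + 1 = outer(b) pins a to a single value, so the query is valid.
valid = scalar_value(grouped_counts(inner, lambda a: a + 1 == outer_b))

# a <= outer(b) admits several values of a, so the run-time check fires.
try:
    scalar_value(grouped_counts(inner, lambda a: a <= outer_b))
    rejected = False
except TooManyRowsError:
    rejected = True

print(valid, rejected)
```

A check of this kind accepts `a + 1 = outer(b)` because it looks at the data rather than the shape of the predicate, which is exactly what a compile-time check cannot do.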

This is a longstanding bug. The bug is in CheckAnalysis in checkAggregateInScalarSubquery. It allows grouping columns that are present in correlation predicates, but doesn’t check whether those predicates are equalities - because when that code was written, non-equality correlation wasn’t allowed. Therefore, this bug has existed since non-equality correlation was added (~2 years ago).
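The difference between the legacy rule and the stricter one can be sketched with a toy predicate model. This is a simplified illustration under stated assumptions, not the code in checkAggregateInScalarSubquery; the class `CorrelatedPredicate` and the helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrelatedPredicate:
    """Toy stand-in for one correlated conjunct of the subquery filter."""
    op: str                     # comparison operator, e.g. "=" or ">"
    inner_columns: frozenset    # inner columns referenced by the predicate
    inner_is_bare_column: bool  # True for y1 <op> outer(x1), False for y1 + y2 <op> outer(x1)


def legacy_allowed(predicates):
    """Legacy rule: any inner column mentioned in a correlated predicate."""
    allowed = set()
    for p in predicates:
        allowed |= p.inner_columns
    return allowed


def fixed_allowed(predicates):
    """Stricter rule: only a bare inner column compared by equality."""
    allowed = set()
    for p in predicates:
        if p.op == "=" and p.inner_is_bare_column:
            allowed |= p.inner_columns
    return allowed


def group_by_ok(group_columns, allowed):
    """A group-by is accepted when every grouping column is allowed."""
    return set(group_columns) <= allowed


cases = {
    "y1 = x1": CorrelatedPredicate("=", frozenset({"y1"}), True),
    "y1 > x1": CorrelatedPredicate(">", frozenset({"y1"}), True),
    "y1 + y2 = x1": CorrelatedPredicate("=", frozenset({"y1", "y2"}), False),
}

# For each predicate, record (legacy verdict, stricter verdict) for `group by y1`.
results = {
    name: (group_by_ok(["y1"], legacy_allowed([p])),
           group_by_ok(["y1"], fixed_allowed([p])))
    for name, p in cases.items()
}
print(results)
```

In this model the legacy rule accepts all three shapes, while the stricter rule accepts only the bare column equality.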

Why are the changes needed?

Fix invalid queries returning wrong results

Does this PR introduce any user-facing change?

Yes. Scalar subqueries with invalid group-bys are now blocked.

How was this patch tested?

Add tests

Was this patch authored or co-authored using generative AI tooling?

No

jchen5 · 2024-06-03T00:05:41Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala

+    plan match {
+      case Filter(cond, child) =>
+        val correlated = AttributeSet(splitConjunctivePredicates(cond)
+          .filter(containsOuter) // TODO: can remove this line to allow e.g. where x = 1 group by x


I intend to enable that in a separate PR, to reduce risk here.

jchen5 · 2024-06-03T00:08:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala

+        _: SubqueryAlias =>
+        AttributeSet(plan.children.flatMap(child => getCorrelatedEquivalentInnerColumns(child)))
+
+      case _ => AttributeSet.empty


The list of operators handled here is by no means comprehensive and ensuring it covers enough is tricky. I used the list in LogicalPlanVisitor as a starting point, but in my testing I discovered that e.g. SubqueryAlias also needs to be handled to cover cases with FROM subqueries inside the scalar subquery.

Suggestions on other important operators to handle or other potential approaches welcome.

(In the long run I think we need to replace this entire check with a runtime check as described in https://issues.apache.org/jira/browse/SPARK-48501, but that's highly nontrivial)
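The conservative shape of this traversal can be sketched on a toy plan tree: a filter contributes its equality-correlated columns, a short allow-list of operators passes results up from its children, and everything else contributes nothing. This is not Catalyst code. The `Plan` class is hypothetical, only SubqueryAlias is confirmed as a pass-through operator by the comment above, `Project` in the allow-list is an assumption, and so is the choice to merge a filter's columns with those found below it.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    """Toy logical plan node; not Catalyst's LogicalPlan."""
    op: str
    children: list = field(default_factory=list)
    # For a Filter only: inner columns c with a conjunct `c = outer(...)`.
    equal_to_outer: frozenset = frozenset()


# Only SubqueryAlias is confirmed by the comment above; Project is an assumption.
PASS_THROUGH = {"SubqueryAlias", "Project"}


def correlated_equivalent_columns(plan):
    """Collect inner columns known to equal an outer value for each outer row."""
    if plan.op == "Filter":
        below = correlated_equivalent_columns(plan.children[0])
        return plan.equal_to_outer | below
    if plan.op in PASS_THROUGH:
        found = frozenset()
        for child in plan.children:
            found |= correlated_equivalent_columns(child)
        return found
    # Unknown operators (Union, outer joins, leaves, ...) contribute nothing.
    return frozenset()


filtered = Plan("Filter", [Plan("Relation")], frozenset({"y1"}))
aliased = Plan("SubqueryAlias", [filtered])
under_union = Plan("Union", [filtered, filtered])

print(correlated_equivalent_columns(aliased))      # y1 survives the alias
print(correlated_equivalent_columns(under_union))  # nothing survives the union
```

Defaulting to the empty set means an operator that is missing from the allow-list can only cause a valid query to be rejected, never an invalid one to be accepted.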

jchen5 · 2024-06-03T01:46:49Z

@agubichev @andylam-db @cloud-fan

cloud-fan · 2024-06-03T17:50:40Z

thanks, merging to master!

…-equivalent columns that were incorrectly allowed

Closes apache#46839 from jchen5/scalar-subq-gby.

Authored-by: Jack Chen <jack.chen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…ual to constant

### What changes were proposed in this pull request?
We can enable scalar subqueries that have `group by a` if there's a predicate `a = 1`, because such a predicate guarantees the group-by produces at most one row. (This builds on top of #46839 and enables shapes that were unsupported prior to that PR as well.)

### Why are the changes needed?
Support valid subquery shapes.

### Does this PR introduce _any_ user-facing change?
Yes, support subquery shapes.

### How was this patch tested?
Unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46902 from jchen5/subq-gby-eq.

Authored-by: Jack Chen <jack.chen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
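The at-most-one-row guarantee for a grouping column that equals a constant can be checked on a small example. The table contents and the helper `grouped` are hypothetical, and SQLite is used only to evaluate the query shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE y (a INTEGER)")
# Hypothetical rows: two with a = 1, one with a = 2.
conn.executemany("INSERT INTO y VALUES (?)", [(1,), (1,), (2,)])


def grouped(constant):
    """Evaluate `select count(*) from y where a = <constant> group by a`."""
    return conn.execute(
        "SELECT count(*) FROM y WHERE a = ? GROUP BY a", (constant,)
    ).fetchall()


match = grouped(1)      # every surviving row has a = 1: a single group
no_match = grouped(99)  # no surviving rows: no groups at all

print(match, no_match)
```

Either the filter keeps rows that all share the same value of `a`, giving one group, or it keeps nothing, giving zero rows. Both outcomes are acceptable for a scalar subquery.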

…, if they are bound to outer rows

### What changes were proposed in this pull request?
Extends previous work in #46839, allowing the grouping expressions to be bound to outer references.

The most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`

Here, we group by cast(T_inner.x as date), which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it. Previously, we required that only columns could be bound to outer expressions, which forbade such subqueries.

### Why are the changes needed?
Extends supported subqueries

### Does this PR introduce _any_ user-facing change?
Yes, previously failing queries are now passing

### How was this patch tested?
Query tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
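A small sketch of the grouping-expression case follows. The timestamps are hypothetical, and SQLite's `date()` function stands in for `cast(... as date)`, so this shows only the shape of the argument and not Spark's cast semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_inner (x TEXT)")
# Hypothetical timestamps: two on 2024-06-03, one on 2024-06-04.
conn.executemany(
    "INSERT INTO t_inner VALUES (?)",
    [("2024-06-03 10:00:00",), ("2024-06-03 12:30:00",), ("2024-06-04 09:00:00",)],
)

outer_date = "2024-06-03"  # value of T_outer.date for one outer row

# The grouping expression is the same expression that is equated to the
# outer value, so every surviving row falls into the same group.
rows = conn.execute(
    "SELECT count(*) FROM t_inner WHERE date(x) = ? GROUP BY date(x)",
    (outer_date,),
).fetchall()

print(rows)
```

The inner column `x` takes two distinct values among the surviving rows, but the grouping expression takes only one, which is why grouping on the expression is safe even though grouping on `x` would not be.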

Fix scalar subquery group-by check analysis

f722554

github-actions bot added the SQL label Jun 3, 2024

todo comment

0c4f686


jchen5 changed the title ~~[SPARK-48503][SQL] Fix invalid scalar subquery with group-by and non-equality that was incorrectly allowed~~ [SPARK-48503][SQL] Fix invalid scalar subqueries with group-by on non-equivalent columns that were incorrectly allowed Jun 3, 2024

agubichev approved these changes Jun 3, 2024

cloud-fan approved these changes Jun 3, 2024

cloud-fan closed this in 5d71ef0 Jun 3, 2024

jchen5 mentioned this pull request Jun 6, 2024

[SPARK-48557][SQL] Support scalar subquery with group-by on column equal to constant #46902

Closed

agubichev mentioned this pull request Jul 17, 2024

[SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows #47388

Closed
