Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48503][SQL] Fix invalid scalar subqueries with group-by on non-equivalent columns that were incorrectly allowed #46839

Closed
wants to merge 2 commits into from

Conversation

jchen5
Copy link
Contributor

@jchen5 jchen5 commented Jun 3, 2024

What changes were proposed in this pull request?

Fixes CheckAnalysis to reject invalid scalar subquery group-bys that were previously allowed and returned wrong results.

For example, this query is not legal and should give an error, but instead we incorrectly allowed it and it returns wrong results prior to this PR (full repro with table data in the jira):

select *, (select count(*) from y where y1 > x1 group by y1) from x;

It returns two rows, even though there's only one row of x. The correct result is an error, because there is more than one row returned by the scalar subquery.

Another problem case is if the correlation condition is an equality but it's under another operator like an OUTER JOIN or UNION. Various other expressions that are not equi-joins between the inner and outer fields hit this too, e.g. where y1 + y2 = x1 group by y1. See the comments in the code and the tests for more examples.

This PR fixes the logic which checks for valid vs invalid group-bys. However, note that this new logic may block some queries that are actually valid, for example a + 1 = outer(b) is valid but would be rejected. Therefore, we add a conf flag that can be used to restore the legacy behavior, as well as logging for when the legacy behavior is used and differs from the new behavior. (In general, to accurately run valid queries and reject invalid queries, the check must be moved from compile-time to run-time - see https://issues.apache.org/jira/browse/SPARK-48501.)

This is a longstanding bug. The bug is in CheckAnalysis in checkAggregateInScalarSubquery. It allows grouping columns that are present in correlation predicates, but doesn’t check whether those predicates are equalities -  because when that code was written, non-equality correlation wasn’t allowed. Therefore, this bug has existed since non-equality correlation was added (~2 years ago).

Why are the changes needed?

Fix invalid queries returning wrong results

Does this PR introduce any user-facing change?

Yes, block subqueries with invalid group-bys.

How was this patch tested?

Add tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jun 3, 2024
plan match {
case Filter(cond, child) =>
val correlated = AttributeSet(splitConjunctivePredicates(cond)
.filter(containsOuter) // TODO: can remove this line to allow e.g. where x = 1 group by x
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intend to enable that in a separate PR, to reduce risk here.

_: SubqueryAlias =>
AttributeSet(plan.children.flatMap(child => getCorrelatedEquivalentInnerColumns(child)))

case _ => AttributeSet.empty
Copy link
Contributor Author

@jchen5 jchen5 Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list of operators handled here is by no means comprehensive and ensuring it covers enough is tricky. I used the list in LogicalPlanVisitor as a starting point, but in my testing I discovered that e.g. SubqueryAlias also needs to be handled to cover cases with FROM subqueries inside the scalar subquery.

Suggestions on other important operators to handle or other potential approaches welcome.

(In the long run I think we need to replace this entire check with a runtime check as described in https://issues.apache.org/jira/browse/SPARK-48501, but that's highly nontrivial)

@jchen5 jchen5 changed the title [SPARK-48503][SQL] Fix invalid scalar subquery with group-by and non-equality that was incorrectly allowed [SPARK-48503][SQL] Fix invalid scalar subqueries with group-by on non-equivalent columns that were incorrectly allowed Jun 3, 2024
@jchen5
Copy link
Contributor Author

jchen5 commented Jun 3, 2024

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 5d71ef0 Jun 3, 2024
riyaverm-db pushed a commit to riyaverm-db/spark that referenced this pull request Jun 7, 2024
…-equivalent columns that were incorrectly allowed

### What changes were proposed in this pull request?

Fixes CheckAnalysis to reject invalid scalar subquery group-bys that were previously allowed and returned wrong results.

For example, this query is not legal and should give an error, but instead we incorrectly allowed it and it returns wrong results prior to this PR (full repro with table data in the jira):

```
select *, (select count(*) from y where y1 > x1 group by y1) from x;
```

It returns two rows, even though there's only one row of x. The correct result is an error, because there is more than one row returned by the scalar subquery.

Another problem case is if the correlation condition is an equality but it's under another operator like an OUTER JOIN or UNION. Various other expressions that are not equi-joins between the inner and outer fields hit this too, e.g. `where y1 + y2 = x1 group by y1`. See the comments in the code and the tests for more examples.

This PR fixes the logic which checks for valid vs invalid group-bys. However, note that this new logic may block some queries that are actually valid, for example `a + 1 = outer(b)` is valid but would be rejected. Therefore, we add a conf flag that can be used to restore the legacy behavior, as well as logging for when the legacy behavior is used and differs from the new behavior. (In general, to accurately run valid queries and reject invalid queries, the check must be moved from compile-time to run-time - see https://issues.apache.org/jira/browse/SPARK-48501.)

This is a longstanding bug. The bug is in CheckAnalysis in checkAggregateInScalarSubquery. It allows grouping columns that are present in correlation predicates, but doesn’t check whether those predicates are equalities -  because when that code was written, non-equality correlation wasn’t allowed. Therefore, this bug has existed since non-equality correlation was added (~2 years ago).

### Why are the changes needed?
Fix invalid queries returning wrong results

### Does this PR introduce _any_ user-facing change?
Yes, block subqueries with invalid group-bys.

### How was this patch tested?
Add tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#46839 from jchen5/scalar-subq-gby.

Authored-by: Jack Chen <jack.chen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jun 17, 2024
…ual to constant

### What changes were proposed in this pull request?
We can enable scalar subqueries that have `group by a` if there's a predicate `a = 1`, because these predicates guarantee the group-by produces at most one row. (This builds on top of #46839 and enables shapes there were unsupported prior to that PR as well.)

### Why are the changes needed?
Support valid subquery shapes.

### Does this PR introduce _any_ user-facing change?
Yes, support subquery shapes.

### How was this patch tested?
Unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46902 from jchen5/subq-gby-eq.

Authored-by: Jack Chen <jack.chen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jul 26, 2024
…, if they are bound to outer rows

### What changes were proposed in this pull request?

Extends previous work in #46839, allowing the grouping expressions to be bound to outer references.

Most common example is
`select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`

Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it.
Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries.

### Why are the changes needed?

Extends supported subqueries

### Does this PR introduce _any_ user-facing change?

Yes, previously failing queries are now passing

### How was this patch tested?

Query tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
ilicmarkodb pushed a commit to ilicmarkodb/spark that referenced this pull request Jul 29, 2024
…, if they are bound to outer rows

### What changes were proposed in this pull request?

Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references.

Most common example is
`select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`

Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it.
Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries.

### Why are the changes needed?

Extends supported subqueries

### Does this PR introduce _any_ user-facing change?

Yes, previously failing queries are now passing

### How was this patch tested?

Query tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
fusheng-rd pushed a commit to fusheng-rd/spark that referenced this pull request Aug 6, 2024
…, if they are bound to outer rows

### What changes were proposed in this pull request?

Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references.

Most common example is
`select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`

Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it.
Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries.

### Why are the changes needed?

Extends supported subqueries

### Does this PR introduce _any_ user-facing change?

Yes, previously failing queries are now passing

### How was this patch tested?

Query tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…, if they are bound to outer rows

### What changes were proposed in this pull request?

Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references.

Most common example is
`select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`

Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it.
Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries.

### Why are the changes needed?

Extends supported subqueries

### Does this PR introduce _any_ user-facing change?

Yes, previously failing queries are now passing

### How was this patch tested?

Query tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants