Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows #47388

Closed
wants to merge 4 commits into from

Conversation

agubichev
Copy link
Contributor

@agubichev agubichev commented Jul 17, 2024

What changes were proposed in this pull request?

Extends previous work in #46839, allowing the grouping expressions to be bound to outer references.

Most common example is
select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))

Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it.
Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries.

Why are the changes needed?

Extends supported subqueries

Does this PR introduce any user-facing change?

Yes, previously failing queries are now passing

How was this patch tested?

Query tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Jul 17, 2024
@agubichev
Copy link
Contributor Author

@andylam-db

@HyukjinKwon HyukjinKwon changed the title [SPARK-48503] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows [SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows Jul 18, 2024
// Collect the inner query expressions that are guaranteed to have a single value for each
// outer row. See comment on getCorrelatedEquivalentInnerExpressions.
val correlatedEquivalentExprs = getCorrelatedEquivalentInnerExpressions(query)
// Unlike 'groupByCols', preserve the entire grouping expression.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move groupByCols up, or this declaration + comment down? Forward comment reference here is confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! just rephrased the comment

Comment on lines 18 to 26
-- Group-by expression is the same as the one we filter on - legal
select *, (select count(*) from y where x1 = y1 and cast(y2 as double) = x1 + 1
group by cast(y2 as double)) from x;


-- Illegal queries
select * from x where (select count(*) from y where y1 > x1 group by y1) = 1;
select *, (select count(*) from y where y1 + y2 = x1 group by y1) from x;
select *, (select count(*) from y where x1 = y1 and y2 + 10 = x1 + 1 group by y2) from x;
Copy link
Contributor

@andylam-db andylam-db Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I'm not really sure how these new test cases reflect the changes made in the code. Could you elaborate how Group-by expression is the same as the one we filter on is legal and why the other one is illegal?
  • edit: the idea is grouping expressions need to be present as a whole in predicates.
  1. Could you also more test cases for the positive and negative cases you commented in getEquivalentToOuter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1-discussed offline

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2- added a test on expr = outer1 + outer2

@agubichev
Copy link
Contributor Author

@cloud-fan

@cloud-fan
Copy link
Contributor

To extend it a bit more, shall we allow where func(T_inner.x) = T_outer.date group by T_inner.x if the func guarantees to produce different results for different values of T_inner.x?

@agubichev
Copy link
Contributor Author

To extend it a bit more, shall we allow where func(T_inner.x) = T_outer.date group by T_inner.x if the func guarantees to produce different results for different values of T_inner.x?

We will look into this as follow-up steps, but ultimately this should be a runtime-level check (checking that the join returns at most 1 row), so that we can allow all kinds of subqueries and let runtime throw a runtime error for bad cases.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in f3b819e Jul 26, 2024
ilicmarkodb pushed a commit to ilicmarkodb/spark that referenced this pull request Jul 29, 2024
…, if they are bound to outer rows

### What changes were proposed in this pull request?

Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references.

Most common example is
`select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`

Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it.
Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries.

### Why are the changes needed?

Extends supported subqueries

### Does this PR introduce _any_ user-facing change?

Yes, previously failing queries are now passing

### How was this patch tested?

Query tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
fusheng-rd pushed a commit to fusheng-rd/spark that referenced this pull request Aug 6, 2024
…, if they are bound to outer rows

### What changes were proposed in this pull request?

Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references.

Most common example is
`select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`

Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it.
Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries.

### Why are the changes needed?

Extends supported subqueries

### Does this PR introduce _any_ user-facing change?

Yes, previously failing queries are now passing

### How was this patch tested?

Query tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…, if they are bound to outer rows

### What changes were proposed in this pull request?

Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references.

Most common example is
`select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))`

Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it.
Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries.

### Why are the changes needed?

Extends supported subqueries

### Does this PR introduce _any_ user-facing change?

Yes, previously failing queries are now passing

### How was this patch tested?

Query tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47388 from agubichev/group_by_cols.

Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants