Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-41631][SQL] Support implicit lateral column alias resolution on Aggregate #39040

Closed
wants to merge 37 commits into from

Conversation

anchovYu
Copy link
Contributor

@anchovYu anchovYu commented Dec 13, 2022

What changes were proposed in this pull request?

This PR implements the implicit lateral column alias on Aggregate case. For example,

-- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias
SELECT dept, average(salary) AS avg_salary, avg_salary + average(bonus)
FROM employee
GROUP BY dept

The high level implementation idea is to insert the Project node above, and falling back to the resolution of lateral alias of Project code path in the last PR.

  • Phase 1: recognize resolved lateral alias, wrap the attributes referencing them with LateralColumnAliasReference
  • Phase 2: when the Aggregate operator is resolved, it goes through the whole aggregation list, extracts the aggregation expressions and grouping expressions to keep them in this Aggregate node, and add a Project above with the original output. It doesn't do anything on LateralColumnAliasReference, but completely leave it to the Project in the future turns of this rule.

Example:

 // Before rewrite:
 Aggregate [dept#14] [dept#14 AS a#12, 'a + 1, avg(salary#16) AS b#13, 'b + avg(bonus#17)]
 +- Child [dept#14,name#15,salary#16,bonus#17]

 // After phase 1:
 Aggregate [dept#14] [dept#14 AS a#12, lca(a) + 1, avg(salary#16) AS b#13, lca(b) + avg(bonus#17)]
 +- Child [dept#14,name#15,salary#16,bonus#17]

 // After phase 2:
 Project [dept#14 AS a#12, lca(a) + 1, avg(salary)#26 AS b#13, lca(b) + avg(bonus)#27]
 +- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) AS avg(bonus)#27, dept#14]
     +- Child [dept#14,name#15,salary#16,bonus#17]

 // Now the problem falls back to the lateral alias resolution in Project.
 // After future rounds of this rule:
 Project [a#12, a#12 + 1, b#13, b#13 + avg(bonus)#27]
 +- Project [dept#14 AS a#12, avg(salary)#26 AS b#13]
    +- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) AS avg(bonus)#27, dept#14]
       +- Child [dept#14,name#15,salary#16,bonus#17]

Similar as the last PR (#38776), because lateral column alias has higher resolution priority than outer reference, it will try to resolve an OuterReference using lateral column alias, similar as an UnresolvedAttribute. If success, it strips OuterReference and also wraps it with LateralColumnAliasReference.

Why are the changes needed?

Similar as stated in #38776.

Does this PR introduce any user-facing change?

Yes, as shown in the above example, it will be able to resolve lateral column alias in Aggregate.

How was this patch tested?

Existing tests and newly added tests.

@anchovYu anchovYu changed the title [WIP][SPARK-27561][SQL] Support implicit lateral column alias resolution on Aggregate [SPARK-27561][SQL] Support implicit lateral column alias resolution on Aggregate Dec 13, 2022
@anchovYu
Copy link
Contributor Author

@gengliangwang sure, will do today, fyi i was sick yesterday.

@anchovYu
Copy link
Contributor Author

I will shortly address the left todos in this PR.

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
…s into one file

### What changes were proposed in this pull request?

Move the two rules `WrapLateralColumnAliasReference` and `ResolveLateralColumnAliasReference` into one file `ResolveLateralColumnAlias.scala`, instead of one rule in `Analyzer.scala` and the other one in another file.
Also update the code comments.

### Why are the changes needed?

Found this issue during reviewing apache#39040
This refactor should make the code easier to read and review.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes apache#39054 from gengliangwang/refactorLCA.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
// perform an early check on current Aggregate before any lift-up / push-down to throw
// the same exception such as non-aggregate expressions not in group by, which becomes
// missing input after transformation
earlyCheckAggregate(agg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems from checkAnalysis. Could you refactor and reduce duplicated code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, actually simplified from checkAnalysis. If this solution works and all tests pass, I can move this method to Aggregate or AnalysisHelper object.


test("Non-aggregating expressions not in group by still throws the same error") {
// query without lateral alias
assert(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use checkError?

@@ -182,6 +183,157 @@ object AnalysisContext {
}
}

object Analyzer extends Logging {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is even more change than reverting #39054 ...

@anchovYu anchovYu changed the title [SPARK-27561][SQL][FOLLOWUP] Support implicit lateral column alias resolution on Aggregate [SPARK-41631][SQL] Support implicit lateral column alias resolution on Aggregate Dec 20, 2022
@anchovYu anchovYu requested review from gengliangwang and cloud-fan and removed request for gengliangwang and cloud-fan December 20, 2022 18:26
@anchovYu
Copy link
Contributor Author

Is that a Github issue that I can re-request review from only one person? Every time I requested a second reviewer it automatically remove the request from the first reviewer.. But anyway cc @gengliangwang and @cloud-fan to review this PR.

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in fd6d226 Dec 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants