Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Support for aggregation expressions that use multiple AggExprs #3296

Merged
merged 6 commits into from
Nov 15, 2024

Conversation

kevinzwang
Copy link
Member

This enables expressions such as sum("a") + sum("b") or mean("a") / 100 in aggregations. This PR enables Q8 and Q14 of TPC-H and is also necessary for Q17 and Q20 (which are also missing subquery).

@github-actions github-actions bot added the enhancement New feature or request label Nov 14, 2024
Copy link

codspeed-hq bot commented Nov 14, 2024

CodSpeed Performance Report

Merging #3296 will degrade performances by 22.71%

Comparing kevin/multi-aggs (da49fd0) with main (25c3b26)

Summary

❌ 2 regressions
✅ 15 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main kevin/multi-aggs Change
test_count[1 Small File] 3.6 ms 4.2 ms -14.78%
test_iter_rows_first_row[100 Small Files] 253.4 ms 327.9 ms -22.71%

Copy link

codecov bot commented Nov 14, 2024

Codecov Report

Attention: Patch coverage is 92.11823% with 32 lines in your changes missing coverage. Please review.

Project coverage is 77.57%. Comparing base (711e862) to head (da49fd0).
Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
...ft-physical-plan/src/physical_planner/translate.rs 68.08% 15 Missing ⚠️
src/daft-dsl/src/resolve_expr/mod.rs 83.33% 11 Missing ⚠️
src/daft-logical-plan/src/ops/pivot.rs 80.95% 4 Missing ⚠️
src/daft-local-execution/src/pipeline.rs 87.50% 2 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3296      +/-   ##
==========================================
+ Coverage   77.50%   77.57%   +0.07%     
==========================================
  Files         666      667       +1     
  Lines       81335    81685     +350     
==========================================
+ Hits        63041    63371     +330     
- Misses      18294    18314      +20     
Files with missing lines Coverage Δ
src/daft-dsl/src/lib.rs 100.00% <ø> (ø)
src/daft-local-plan/src/plan.rs 96.38% <100.00%> (ø)
src/daft-logical-plan/src/logical_plan.rs 62.72% <100.00%> (ø)
...rc/daft-logical-plan/src/ops/actor_pool_project.rs 41.89% <100.00%> (+1.61%) ⬆️
src/daft-logical-plan/src/ops/agg.rs 58.13% <100.00%> (+5.50%) ⬆️
src/daft-logical-plan/src/ops/explode.rs 80.85% <100.00%> (+1.30%) ⬆️
src/daft-logical-plan/src/ops/filter.rs 64.28% <100.00%> (+5.95%) ⬆️
src/daft-logical-plan/src/ops/join.rs 90.23% <100.00%> (+0.18%) ⬆️
src/daft-logical-plan/src/ops/project.rs 64.54% <100.00%> (+0.16%) ⬆️
src/daft-logical-plan/src/ops/repartition.rs 66.66% <100.00%> (+1.28%) ⬆️
... and 12 more

... and 6 files with indirect coverage changes

src/daft-logical-plan/src/builder.rs Outdated Show resolved Hide resolved
src/daft-logical-plan/src/builder.rs Outdated Show resolved Hide resolved
src/daft-dsl/src/resolve_expr/mod.rs Outdated Show resolved Hide resolved
src/daft-physical-plan/src/physical_planner/translate.rs Outdated Show resolved Hide resolved
Comment on lines 14 to 16
/// Optimization rule for lifting expressions that can be done in a project out of an aggregation.
///
/// After a pass of this rule, the top level expressions in each aggregate should all be aliases or agg exprs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it'd be helpful to add a bit more to the docstring here.
Can you add an example of before+after the optimizer pass.

Copy link
Contributor

@universalmind303 universalmind303 Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinzwang, so i know this is just semantics, but this seems like more of a rewrite instead of an optimization. I wonder if it's worth making the distinction between optimizing and rewriting.

to elaborate on how i see them as different, an optimization implies that it is rewriting the query to make it more performant. On the other hand, this rule seems like it's only rewriting the query in a way that makes it executable. (I could be wrong here though).

Copy link
Member Author

@kevinzwang kevinzwang Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's true. I mean technically I have this rule because moving things out into a project allows for other optimizations, however the implementation of the aggregate translation assumes that this rule is applied.

In general, should we have rewriting rules in the optimizer instead of in each of the ops themselves? AKA is it okay for the logical plan to be in an invalid state until it has been optimized? I personally don't have very strong thoughts about this

Copy link
Contributor

@universalmind303 universalmind303 Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I don't like the idea of having invalid plans, as that wouldnt allow us to disable certain optimizations.

I know there's been some discussion with @samster25 about potentially introducing a 'unresolved' or 'dsl' plan that wouldn't be guaranteed to be semantically valid, only syntactically valid, but that's definitely out of scope for this PR.

Maybe just add a comment here saying that this is actually a 'rewrite' to make it into a valid plan, just so we don't lose track of it over time.

Copy link
Contributor

@universalmind303 universalmind303 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the ExprResolver is a much better abstraction. Good work!

@kevinzwang kevinzwang enabled auto-merge (squash) November 14, 2024 21:53
@kevinzwang kevinzwang disabled auto-merge November 14, 2024 21:53
@kevinzwang kevinzwang enabled auto-merge (squash) November 14, 2024 21:55
@kevinzwang kevinzwang disabled auto-merge November 14, 2024 21:59
@kevinzwang kevinzwang enabled auto-merge (squash) November 14, 2024 23:47
@kevinzwang kevinzwang merged commit e18b719 into main Nov 15, 2024
41 of 42 checks passed
@kevinzwang kevinzwang deleted the kevin/multi-aggs branch November 15, 2024 00:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants