Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert first, last aggregate function to UDAF #10648

Merged
merged 86 commits into from
May 27, 2024

Conversation

mustafasrepo
Copy link
Contributor

@mustafasrepo mustafasrepo commented May 24, 2024

Which issue does this PR close?

Closes #9957
Part of #8708

Rationale for this change

This work is a continuation of the work done by @jayzhan211 for UDAF support.
This PR converts FIRST_VALUE_AGG, LAST_VALUE_AGG Aggregate functions to the UDAF.

As part of this PR, I have also updated refactored OptimizeAggregateOrder rule such that it no longer depends on the
.as_any().downcast_ref for to work (e.g it is not overfitted to the FirstValue, LastValue aggregate expressions). File name is changed from convert_first_last.rs to update_aggr_exprs.rs also.

What changes are included in this PR?

Are these changes tested?

Yes

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate labels May 24, 2024
Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is really impressive!

.map(Some)
}

fn reverse_expr(&self) -> Option<Arc<dyn AggregateExpr>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

datafusion/functions-aggregate/src/first_last.rs Outdated Show resolved Hide resolved
datafusion/functions-aggregate/src/first_last.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM as well

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at this PR and I agree, great work @mustafasrepo .

I had some comments on the public API design but nothing that would block the PR merging, in my opinon.

@@ -360,6 +391,27 @@ pub trait AggregateUDFImpl: Debug + Send + Sync {
&[]
}

/// Sets the flag specifying whether the requirement of the UDF is satisfied.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this name / documenation to be a little confusing - specifically it isn't clear from just this documentation (the public API of AggregageUDFImpl what type of requirement is being satisfied.

The documentation on AggregateExpr makes it clearer -- maybe we could find a way to bring some of that description to this location, or link to there for more details

Also I think we could improve the name of this function

  1. with_... is often used for builder style APIs in rust, which this is not
  2. It uses "requirement" which is general

Maybe it would be clearer it we called it has_beneficial_ordering or input_ordered_correctly 🤔

datafusion/physical-expr-common/src/utils.rs Outdated Show resolved Hide resolved
@mustafasrepo mustafasrepo merged commit 549cf84 into apache:main May 27, 2024
25 checks passed
))
})
.collect::<Vec<Expr>>();
.map(|e| first_value(vec![e], false, None, sort_expr.clone(), None));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed that we import functions-aggregate in optimizer for first_value, should we move this rule into functions-aggregate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I know, All LogicalPlan rules are under optimizer/src/ folder. Also, I don't see any downside in having this rule here. Hence, I think it is better to keep rule in its current place. I wonder what others think though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps what @jayzhan211 is noticing is that by directly relying on datafusion-aggregate-functions in the optimizer it means that now the optimizer is still treating "built in" aggregates specially (and for example, user defined functions couldn't take advantage of this)

So if we moved this rule (or somehow made the dependency via an interface) it would be more general and decouple the code and make datafusion more extensible.

One way is to move the entire rule into datafusion-aggregate-functions but maybe we could instead have replace_distinct_aggregate depend on FunctionRegistry rather than a direct call to first_value for example

Maybe we can file a TODO ticket to track it

Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Jun 13, 2024
andygrove pushed a commit to apache/datafusion-python that referenced this pull request Jun 14, 2024
* deps: update datafusion to 39.0.0, pyo3 to 0.21, and object_store to 0.10.1

`datafusion-common` also depends on `pyo3`, so they need to be upgraded together.

* feat: remove GetIndexField

datafusion replaced Expr::GetIndexField with a FieldAccessor trait.

Ref apache/datafusion#10568
Ref apache/datafusion#10769

* feat: update ScalarFunction

The field `func_name` was changed to `func` as part of removing `ScalarFunctionDefinition` upstream.

Ref apache/datafusion#10325

* feat: incorporate upstream array_slice fixes

Fixes #670

* update ExectionPlan::children impl for DatasetExec

Ref apache/datafusion#10543

* update value_interval_daytime

Ref apache/arrow-rs#5769

* update regexp_replace and regexp_match

Fixes #677

* add gil-refs feature to pyo3

This silences pyo3's deprecation warnings for its new Bounds api.

It's the 1st step of the migration, and should be removed before merge.

Ref https://pyo3.rs/v0.21.0/migration#from-020-to-021

* fix signature for octet_length

Ref apache/datafusion#10726

* update signature for covar_samp

AggregateUDF expressions now have a builder API design, which removes arguments like filter and order_by

Ref apache/datafusion#10545
Ref apache/datafusion#10492

* convert covar_pop to expr_fn api

Ref: https://github.com/apache/datafusion/pull/10418/files

* convert median to expr_fn api

Ref apache/datafusion#10644

* convert variance sample to UDF

Ref apache/datafusion#10667

* convert first_value and last_value to UDFs

Ref apache/datafusion#10648

* checkpointing with a few todos to fix remaining compile errors

* impl PyExpr::python_value for IntervalDayTime and IntervalMonthDayNano

* convert sum aggregate function to UDF

* remove unnecessary clone on double reference

* apply cargo fmt

* remove duplicate allow-dead-code annotation

* update tpch examples for new pyarrow interval

Fixes #665

* marked q11 tpch example as expected fail

Ref #730

* add default stride of None back to array_slice
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* move out the ordering ruel

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* introduce rule

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* revert test result

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* pass mulit order test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* with new childes

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* revert slt

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* revert back

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* rm rewrite in new child

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* backup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* only move conversion to optimizer

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* find test that do reverse

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add test for first and last

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* pass all test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* upd test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* upd test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add aggregate test

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* final draft

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup again

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* pull out finer ordering code and reuse

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* clippy

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* remove finer in optimize rule

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add comments and clenaup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* rename fun

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* rename fun

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fmt

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* avoid unnecessary recursion and rename

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* Minor changes

* Add new API for aggregate optimization

* Minor changes

* Minor changes

* Remove old code

* Minor changes

* Minor changes

* Minor changes

* Minor changes

* Minor changes

* Minor changes

* Update comments

* Minor changes

* Minor changes

* Review Part 1

* TMP

* Update display of aggregate fun exprs

* TMP

* TMP

* Update tests

* TMP buggy

* modify name in place

* Minor changes

* Tmp

* Tmp

* Tmp

* TMP

* Simplifications

* Tmp

* Tmp

* Compiles

* Resolve linter errors

* Resolve linter errors

* Minor changes

* Simplifications

* Minor chagnes

* Move cast to common

* Minor changes

* Fix test

* Minor changes

* Simplifications

* Review

* Address reviews

* Address reviews

* Update documentation, rename method

* Minor changes

---------

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Co-authored-by: jayzhan211 <jayzhan211@gmail.com>
Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai>
Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove builtin aggregate function FirstValue
5 participants