-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert first, last aggregate function to UDAF #10648
Convert first, last aggregate function to UDAF #10648
Conversation
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is really impressive!
.map(Some) | ||
} | ||
|
||
fn reverse_expr(&self) -> Option<Arc<dyn AggregateExpr>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look at this PR and I agree, great work @mustafasrepo .
I had some comments on the public API design but nothing that would block the PR merging, in my opinon.
datafusion/expr/src/udaf.rs
Outdated
@@ -360,6 +391,27 @@ pub trait AggregateUDFImpl: Debug + Send + Sync { | |||
&[] | |||
} | |||
|
|||
/// Sets the flag specifying whether the requirement of the UDF is satisfied. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this name / documenation to be a little confusing - specifically it isn't clear from just this documentation (the public API of AggregageUDFImpl
what type of requirement is being satisfied.
The documentation on AggregateExpr makes it clearer -- maybe we could find a way to bring some of that description to this location, or link to there for more details
Also I think we could improve the name of this function
with_...
is often used for builder style APIs in rust, which this is not- It uses "requirement" which is general
Maybe it would be clearer it we called it has_beneficial_ordering
or input_ordered_correctly
🤔
)) | ||
}) | ||
.collect::<Vec<Expr>>(); | ||
.map(|e| first_value(vec![e], false, None, sort_expr.clone(), None)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just noticed that we import functions-aggregate in optimizer for first_value, should we move this rule into functions-aggregate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know, All LogicalPlan
rules are under optimizer/src/
folder. Also, I don't see any downside in having this rule here. Hence, I think it is better to keep rule in its current place. I wonder what others think though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps what @jayzhan211 is noticing is that by directly relying on datafusion-aggregate-functions
in the optimizer it means that now the optimizer is still treating "built in" aggregates specially (and for example, user defined functions couldn't take advantage of this)
So if we moved this rule (or somehow made the dependency via an interface) it would be more general and decouple the code and make datafusion more extensible.
One way is to move the entire rule into datafusion-aggregate-functions
but maybe we could instead have replace_distinct_aggregate
depend on FunctionRegistry
rather than a direct call to first_value
for example
Maybe we can file a TODO ticket to track it
* deps: update datafusion to 39.0.0, pyo3 to 0.21, and object_store to 0.10.1 `datafusion-common` also depends on `pyo3`, so they need to be upgraded together. * feat: remove GetIndexField datafusion replaced Expr::GetIndexField with a FieldAccessor trait. Ref apache/datafusion#10568 Ref apache/datafusion#10769 * feat: update ScalarFunction The field `func_name` was changed to `func` as part of removing `ScalarFunctionDefinition` upstream. Ref apache/datafusion#10325 * feat: incorporate upstream array_slice fixes Fixes #670 * update ExectionPlan::children impl for DatasetExec Ref apache/datafusion#10543 * update value_interval_daytime Ref apache/arrow-rs#5769 * update regexp_replace and regexp_match Fixes #677 * add gil-refs feature to pyo3 This silences pyo3's deprecation warnings for its new Bounds api. It's the 1st step of the migration, and should be removed before merge. Ref https://pyo3.rs/v0.21.0/migration#from-020-to-021 * fix signature for octet_length Ref apache/datafusion#10726 * update signature for covar_samp AggregateUDF expressions now have a builder API design, which removes arguments like filter and order_by Ref apache/datafusion#10545 Ref apache/datafusion#10492 * convert covar_pop to expr_fn api Ref: https://github.com/apache/datafusion/pull/10418/files * convert median to expr_fn api Ref apache/datafusion#10644 * convert variance sample to UDF Ref apache/datafusion#10667 * convert first_value and last_value to UDFs Ref apache/datafusion#10648 * checkpointing with a few todos to fix remaining compile errors * impl PyExpr::python_value for IntervalDayTime and IntervalMonthDayNano * convert sum aggregate function to UDF * remove unnecessary clone on double reference * apply cargo fmt * remove duplicate allow-dead-code annotation * update tpch examples for new pyarrow interval Fixes #665 * marked q11 tpch example as expected fail Ref #730 * add default stride of None back to array_slice
* move out the ordering ruel Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * introduce rule Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * revert test result Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * pass mulit order test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * with new childes Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * revert slt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * revert back Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rm rewrite in new child Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * backup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * only move conversion to optimizer Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * find test that do reverse Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add test for first and last Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * pass all test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * upd test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * upd test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add aggregate test Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * final draft Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * cleanup again Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * pull out finer ordering code and reuse Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * clippy Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * remove finer in optimize rule Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * add comments and clenaup Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rename fun Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * rename fun Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * fmt Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * avoid unnecessary recursion and rename Signed-off-by: jayzhan211 <jayzhan211@gmail.com> * Minor changes * Add new API for aggregate optimization * Minor changes * Minor changes * Remove old code * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Minor changes * Update comments * Minor changes * Minor changes * Review Part 1 * TMP * Update display of aggregate fun exprs * TMP * TMP * Update tests * TMP buggy * modify name in place * Minor changes * Tmp * Tmp * Tmp * TMP * Simplifications * Tmp * Tmp * Compiles * Resolve linter errors * Resolve linter errors * Minor changes * Simplifications * Minor chagnes * Move cast to common * Minor changes * Fix test * Minor changes * Simplifications * Review * Address reviews * Address reviews * Update documentation, rename method * Minor changes --------- Signed-off-by: jayzhan211 <jayzhan211@gmail.com> Co-authored-by: jayzhan211 <jayzhan211@gmail.com> Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai> Co-authored-by: Mehmet Ozan Kabak <ozankabak@gmail.com>
Which issue does this PR close?
Closes #9957
Part of #8708
Rationale for this change
This work is a continuation of the work done by @jayzhan211 for UDAF support.
This PR converts
FIRST_VALUE_AGG
,LAST_VALUE_AGG
Aggregate functions to the UDAF.As part of this PR, I have also updated refactored
OptimizeAggregateOrder
rule such that it no longer depends on the.as_any().downcast_ref
for to work (e.g it is not overfitted to theFirstValue
,LastValue
aggregate expressions). File name is changed fromconvert_first_last.rs
toupdate_aggr_exprs.rs
also.What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?