Move Median to functions-aggregate and Introduce Numeric signature #10644
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
data_type,
distinct,
aliases: vec!["median".to_string()],
signature: Signature::numeric(1, Volatility::Immutable),
I introduced this signature, which computes the final coerced types directly instead of generating a long list of valid types, many of which may be the same type. The numeric signature also considers decimal types, which are needed but are not included in the Numeric types array.
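To make the idea concrete, here is a minimal sketch of a signature that validates and coerces arguments directly rather than matching them against an enumerated list. Everything in it is an assumption for illustration: the toy `DataType` enum, `is_numeric`, `rank`, and `coerce_numeric` are hypothetical names, not DataFusion's actual types or functions, and the widening order is made up.

```rust
// Toy model of a "numeric" signature check. These types and functions are
// hypothetical stand-ins, not DataFusion's real API.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Float64,
    // Decimal carries precision and scale, so it cannot be listed
    // exhaustively in a fixed array of valid types.
    Decimal128(u8, i8),
    Utf8,
}

/// In this toy model every type except Utf8 counts as numeric.
fn is_numeric(t: &DataType) -> bool {
    !matches!(t, DataType::Utf8)
}

/// Rank used to pick the wider of two numeric types (assumed ordering).
fn rank(t: &DataType) -> u8 {
    match t {
        DataType::Int32 => 0,
        DataType::Int64 => 1,
        DataType::Decimal128(_, _) => 2,
        DataType::Float64 => 3,
        DataType::Utf8 => u8::MAX,
    }
}

/// Check `args` against a signature of `n` numeric arguments and compute the
/// coerced types directly, instead of matching against an enumerated list.
fn coerce_numeric(n: usize, args: &[DataType]) -> Result<Vec<DataType>, String> {
    if args.len() != n {
        return Err(format!("expected {n} arguments, got {}", args.len()));
    }
    if let Some(bad) = args.iter().find(|t| !is_numeric(t)) {
        return Err(format!("{bad:?} is not numeric"));
    }
    // Coerce every argument to the widest type present. On a tie the last
    // argument's type is kept; merging decimal precision and scale is
    // deliberately left out of this sketch.
    let widest = args.iter().max_by_key(|t| rank(t)).cloned();
    Ok(widest.map(|w| vec![w; n]).unwrap_or_default())
}
```

With a one-argument signature, as used for median here, `coerce_numeric(1, &[DataType::Decimal128(10, 2)])` accepts the decimal argument without any decimal variant having been listed up front.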
@@ -39,6 +39,7 @@ path = "src/lib.rs"

[dependencies]
arrow = { workspace = true }
arrow-schema = { workspace = true }
This dependency is needed for the downcast_integer macro.
Is the issue that it is not re-exported in arrow?
Yes, some macros have an unavoidable arrow-schema dependency; see apache/arrow-rs#5676.
Love it -- thank you @jayzhan211
datafusion/expr/src/signature.rs
Outdated
@@ -119,6 +119,8 @@ pub enum TypeSignature {
    OneOf(Vec<TypeSignature>),
    /// Specifies Signatures for array functions
    ArraySignature(ArrayFunctionSignature),
    /// Fixed number of arguments of numeric types
Could we please document (or link to documentation) about what types are "numeric"? Is it https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#method.is_numeric ?
/// MEDIAN aggregate expression. If using the non-distinct variation, then this uses a
I think this comment still holds and would be valuable to bring over to the new Median UDF, specifically the note that median uses substantial memory.
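For context on why the memory note matters, here is a rough sketch of an exact median accumulator. It is an illustration under assumptions, not the code in this PR: `MedianAccumulator` and its methods are hypothetical and do not implement DataFusion's accumulator trait. The point it shows is that an exact median has to buffer every input value until evaluation, so memory grows linearly with the number of rows.

```rust
/// Hypothetical accumulator for illustration; not DataFusion's API.
struct MedianAccumulator {
    // Every input value is retained, so memory grows linearly with the input.
    values: Vec<f64>,
}

impl MedianAccumulator {
    fn new() -> Self {
        Self { values: Vec::new() }
    }

    /// Buffer a batch of values; nothing can be discarded early.
    fn update(&mut self, batch: &[f64]) {
        self.values.extend_from_slice(batch);
    }

    /// Bytes currently reserved for buffered values.
    fn size(&self) -> usize {
        self.values.capacity() * std::mem::size_of::<f64>()
    }

    /// Compute the median by partially ordering the buffered values.
    fn evaluate(&mut self) -> Option<f64> {
        let len = self.values.len();
        if len == 0 {
            return None;
        }
        let mid = len / 2;
        // Places the element of rank `mid` at its sorted position, with all
        // smaller-or-equal elements to its left.
        let (lower, upper_mid, _) = self
            .values
            .select_nth_unstable_by(mid, |a, b| a.total_cmp(b));
        let upper = *upper_mid;
        if len % 2 == 1 {
            Some(upper)
        } else {
            // Even count: average the two middle values. The lower middle is
            // the maximum of the left partition.
            let lo = lower.iter().copied().fold(f64::NEG_INFINITY, f64::max);
            Some((lo + upper) / 2.0)
        }
    }
}
```

After `update(&[5.0, 1.0, 3.0])`, `size()` already reports at least three `f64` values held, and it keeps growing with each batch, which is the behaviour the doc comment should warn about.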
@@ -257,6 +257,56 @@ impl OptimizerRule for SingleDistinctToGroupBy {
)))
}
}
Expr::AggregateFunction(AggregateFunction {
From what I can tell, this is now a general optimization (as in, it will rewrite any distinct user-defined aggregate as well). If so, that is quite cool. Is that correct?
Can we please add a test (if not already done) showing that the distinct aggregate is rewritten? Or perhaps there is already a test for median 🤔
logical_plan
01)Projection: MEDIAN(alias1) AS MEDIAN(DISTINCT t.c)
02)--Aggregate: groupBy=[[]], aggr=[[MEDIAN(alias1)]]
03)----Aggregate: groupBy=[[t.c AS alias1]], aggr=[[]]
This shows that the distinct aggregate is optimized.
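The shape of that plan can be mimicked in plain Rust. This is only a sketch of the two-phase idea, assuming an integer column and an `f64` result for simplicity; `group_by_value`, `median`, and `median_distinct` are hypothetical helpers, not functions from the optimizer rule.

```rust
use std::collections::BTreeSet;

/// Inner step, mirroring `Aggregate: groupBy=[[t.c AS alias1]], aggr=[[]]`:
/// grouping by the value itself leaves one row per distinct value.
fn group_by_value(column: &[i64]) -> Vec<i64> {
    column
        .iter()
        .copied()
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect()
}

/// Outer step, mirroring `Aggregate: groupBy=[[]], aggr=[[MEDIAN(alias1)]]`:
/// a plain median with no DISTINCT handling of its own.
fn median(values: &[i64]) -> Option<f64> {
    let mut sorted = values.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    match n {
        0 => None,
        _ if n % 2 == 1 => Some(sorted[n / 2] as f64),
        _ => Some((sorted[n / 2 - 1] as f64 + sorted[n / 2] as f64) / 2.0),
    }
}

/// MEDIAN(DISTINCT c) expressed as the two steps composed.
fn median_distinct(column: &[i64]) -> Option<f64> {
    median(&group_by_value(column))
}
```

For the column `[1, 1, 1, 2, 3]`, the plain `median` is 1 while `median_distinct` is 2, so the rewrite moves the deduplication into the inner group-by and leaves the outer aggregate non-distinct.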
Thanks again @jayzhan211
…pache#10644)
* introduce median udaf
* rm agg median
* rm old median
* introduce numeric signature
* address comment
* fix doc
* add proto roundtrip
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
* deps: update datafusion to 39.0.0, pyo3 to 0.21, and object_store to 0.10.1
  `datafusion-common` also depends on `pyo3`, so they need to be upgraded together.
* feat: remove GetIndexField
  datafusion replaced Expr::GetIndexField with a FieldAccessor trait.
  Ref apache/datafusion#10568
  Ref apache/datafusion#10769
* feat: update ScalarFunction
  The field `func_name` was changed to `func` as part of removing `ScalarFunctionDefinition` upstream.
  Ref apache/datafusion#10325
* feat: incorporate upstream array_slice fixes
  Fixes #670
* update ExecutionPlan::children impl for DatasetExec
  Ref apache/datafusion#10543
* update value_interval_daytime
  Ref apache/arrow-rs#5769
* update regexp_replace and regexp_match
  Fixes #677
* add gil-refs feature to pyo3
  This silences pyo3's deprecation warnings for its new Bounds api. It's the 1st step of the migration, and should be removed before merge.
  Ref https://pyo3.rs/v0.21.0/migration#from-020-to-021
* fix signature for octet_length
  Ref apache/datafusion#10726
* update signature for covar_samp
  AggregateUDF expressions now have a builder API design, which removes arguments like filter and order_by
  Ref apache/datafusion#10545
  Ref apache/datafusion#10492
* convert covar_pop to expr_fn api
  Ref: https://github.com/apache/datafusion/pull/10418/files
* convert median to expr_fn api
  Ref apache/datafusion#10644
* convert variance sample to UDF
  Ref apache/datafusion#10667
* convert first_value and last_value to UDFs
  Ref apache/datafusion#10648
* checkpointing with a few todos to fix remaining compile errors
* impl PyExpr::python_value for IntervalDayTime and IntervalMonthDayNano
* convert sum aggregate function to UDF
* remove unnecessary clone on double reference
* apply cargo fmt
* remove duplicate allow-dead-code annotation
* update tpch examples for new pyarrow interval
  Fixes #665
* marked q11 tpch example as expected fail
  Ref #730
* add default stride of None back to array_slice