Move Covariance (Sample) `covar` / `covar_samp` to be a User Defined Aggregate Function #10372
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@@ -63,8 +63,6 @@ pub enum AggregateFunction {
    Stddev,
    /// Standard Deviation (Population)
    StddevPop,
    /// Covariance (Sample)
The point of this PR is to remove this variant and make it a user defined aggregate.
Thank you @jayzhan211
This looks great and I think it is a nice verification that we can extract aggregates from the code. I left some small suggestions but I don't think they are necessary. I also updated the title and description of this PR to be more descriptive.
I believe this function (and related ones) are part of what @yyy1000 was working on when we started down the path to extract this type of thing from the core.
// specific language governing permissions and limitations
// under the License.

//! Defines the covariance aggregations.
- //! Defines the covariance aggregations.
+ //! [`CovarianceSample`]: covariance aggregations.
f.debug_struct("CovarianceSample")
    .field("name", &self.name())
    .field("signature", &self.signature)
    .field("accumulator", &"<FUNC>")
We probably don't need the `accumulator` field in the debug printout, as it doesn't exist in the structure.
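A minimal sketch of what this suggestion could look like: a hand-written `Debug` impl that prints only what the struct actually holds. The struct below is a hypothetical stand-in; its `String` signature field and the `new()` constructor are assumptions for illustration, not the real DataFusion types.

```rust
use std::fmt;

/// Hypothetical stand-in for the struct under review.
/// The real `signature` field has a DataFusion type; a String is assumed here.
pub struct CovarianceSample {
    signature: String,
}

impl CovarianceSample {
    pub fn new() -> Self {
        Self {
            signature: String::from("two numeric arguments"),
        }
    }

    fn name(&self) -> &str {
        "covar"
    }
}

impl fmt::Debug for CovarianceSample {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Only report values that exist on the struct; no placeholder
        // "accumulator" entry.
        f.debug_struct("CovarianceSample")
            .field("name", &self.name())
            .field("signature", &self.signature)
            .finish()
    }
}
```

Formatting a value with `{:?}` then lists `name` and `signature` only.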
}

/// An accumulator to compute covariance
/// The algrithm used is an online implementation and numerically stable. It is derived from the following paper
- /// The algrithm used is an online implementation and numerically stable. It is derived from the following paper
+ /// The algorithm used is an online implementation and numerically stable. It is derived from the following paper
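For readers unfamiliar with what "online" and "numerically stable" mean for covariance, here is a rough sketch of one common single-pass scheme (a Welford-style co-moment update, with a pairwise merge for combining partial states). It is not the code from this PR and may differ from the paper the doc comment refers to; the `CovarianceState` type and its method names are made up for illustration.

```rust
/// Hypothetical running state for a single-pass covariance computation.
#[derive(Debug, Default, Clone, Copy)]
pub struct CovarianceState {
    count: u64,
    mean_x: f64,
    mean_y: f64,
    /// Running sum of (x - mean_x) * (y - mean_y).
    co_moment: f64,
}

impl CovarianceState {
    /// Fold one (x, y) pair into the state without storing past values.
    pub fn update(&mut self, x: f64, y: f64) {
        self.count += 1;
        let n = self.count as f64;
        let dx = x - self.mean_x; // deviation from the old mean of x
        self.mean_x += dx / n;
        self.mean_y += (y - self.mean_y) / n;
        // Uses the old x deviation and the updated mean of y.
        self.co_moment += dx * (y - self.mean_y);
    }

    /// Combine two partial states, e.g. from separate partitions.
    pub fn merge(&mut self, other: &Self) {
        if other.count == 0 {
            return;
        }
        if self.count == 0 {
            *self = *other;
            return;
        }
        let n1 = self.count as f64;
        let n2 = other.count as f64;
        let n = n1 + n2;
        let dx = other.mean_x - self.mean_x;
        let dy = other.mean_y - self.mean_y;
        self.co_moment += other.co_moment + dx * dy * n1 * n2 / n;
        self.mean_x += dx * n2 / n;
        self.mean_y += dy * n2 / n;
        self.count += other.count;
    }

    /// Sample covariance (divides by n - 1); None with fewer than two pairs.
    pub fn sample(&self) -> Option<f64> {
        (self.count > 1).then(|| self.co_moment / (self.count as f64 - 1.0))
    }

    /// Population covariance (divides by n); None with no pairs.
    pub fn population(&self) -> Option<f64> {
        (self.count > 0).then(|| self.co_moment / self.count as f64)
    }
}
```

Because the state tracks deviations from running means rather than raw sums of products, it avoids subtracting two large, nearly equal numbers, which is the usual source of precision loss in the naive formula.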
}

fn name(&self) -> &str {
    "covar"
A minor nitpick here is that the name of the struct is `CovarianceSample` but `name()` returns `covar` (with alias `covar_samp`).
It would be better, in my opinion, if `name()` and the struct name were consistent: either rename the struct to `Covariance` or have `name()` return "covariance_pop".
I renamed it to `covar_samp`, like Postgres; `covar` is now an alias.
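To illustrate the primary-name-plus-alias arrangement described in this reply, here is a small sketch. The `AggregateLike` trait and the `resolves_to` helper are hypothetical simplifications invented for this example; they are not DataFusion's actual aggregate UDF interface or function registry.

```rust
/// Hypothetical, simplified interface: a primary name plus optional aliases.
pub trait AggregateLike {
    fn name(&self) -> &str;
    fn aliases(&self) -> &[String];
}

pub struct CovarianceSample {
    aliases: Vec<String>,
}

impl CovarianceSample {
    pub fn new() -> Self {
        Self {
            // `covar` is kept as an alias so existing queries still resolve.
            aliases: vec![String::from("covar")],
        }
    }
}

impl AggregateLike for CovarianceSample {
    fn name(&self) -> &str {
        // Primary name, matching the Postgres spelling.
        "covar_samp"
    }

    fn aliases(&self) -> &[String] {
        &self.aliases
    }
}

/// Returns true if `requested` matches the primary name or any alias.
pub fn resolves_to(func: &dyn AggregateLike, requested: &str) -> bool {
    func.name() == requested || func.aliases().iter().any(|a| a.as_str() == requested)
}
```

With this layout both `covar_samp` and `covar` resolve to the same function, while the struct name and the primary name stay aligned.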
@@ -0,0 +1,25 @@
// Licensed to the Apache Software Foundation (ASF) under one
I think `StatsType` is only used for functions defined in `datafusion-functions-aggregate`, so this module could go in `datafusion-functions-aggregate`.
Let me take a to-do note. Until all the other functions are moved to `functions-aggregate`, we need to keep it in `common`, since we don't import `physical-expr` into `functions-aggregate`.
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Thanks for your review.
Thanks again @jayzhan211
Upstream is continuing its migration to UDFs. Ref apache/datafusion#10098 Ref apache/datafusion#10372
* chore: upgrade datafusion Deps
  Ref #690
* update concat and concat_ws to use datafusion_functions
  Moved in apache/datafusion#10089
* feat: upgrade functions.rs
  Upstream is continuing its migration to UDFs. Ref apache/datafusion#10098 Ref apache/datafusion#10372
* fix ScalarUDF import
* feat: remove deprecated supports_filter_pushdown and impl supports_filters_pushdown
  Deprecated function removed in apache/datafusion#9923
* use `unnest_columns_with_options` instead of deprecated `unnest_column_with_option`
* remove ScalarFunction wrappers
  These relied on upstream BuiltinScalarFunction, which are now removed. Ref apache/datafusion#10098
* update dataframe `test_describe`
  `null_count` was fixed upstream. Ref apache/datafusion#10260
* remove PyDFField and related methods
  DFField was removed upstream. Ref: apache/datafusion#9595
* bump `datafusion-python` package version to 38.0.0
* re-implement `PyExpr::column_name`
  The previous implementation relied on `DFField` which was removed upstream. Ref: apache/datafusion#9595
Which issue does this PR close?
Part of #8708
Rationale for this change
We are moving aggregate functions out of the core to ensure the core APIs are sufficient to implement all aggregates and to make DataFusion more configurable.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?