Move `Covariance` (Sample) `covar` / `covar_samp` to be a User Defined Aggregate Function #10372

jayzhan211 · 2024-05-04T01:32:41Z

Which issue does this PR close?

Part of #8708

Rationale for this change

We are moving aggregate functions out of the core to ensure the core APIs are sufficient to implement all aggregates and to make DataFusion more configurable.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

alamb · 2024-05-05T12:55:31Z

datafusion/expr/src/aggregate_function.rs

@@ -63,8 +63,6 @@ pub enum AggregateFunction {
    Stddev,
    /// Standard Deviation (Population)
    StddevPop,
-    /// Covariance (Sample)


The point of this PR is to remove this variant and make it a user defined aggregate

alamb

Thank you @jayzhan211

This looks great and I think it is a nice verification that we can extract aggregates from the code. I left some small suggestions but I don't think they are necessary -- I also updated the title and description of this PR to be more descriptive

I believe this function (and related ones) are part of what @yyy1000 was working on when we started down the path to extract this type of thing from the core.

alamb · 2024-05-05T12:56:30Z

datafusion/functions-aggregate/src/covariance.rs

+// specific language governing permissions and limitations
+// under the License.
+
+//! Defines the covariance aggregations.


Suggested change

//! Defines the covariance aggregations.

//! [`CovarianceSample`]: covariance aggregations.

alamb · 2024-05-05T12:58:31Z

datafusion/functions-aggregate/src/covariance.rs

+        f.debug_struct("CovarianceSample")
+            .field("name", &self.name())
+            .field("signature", &self.signature)
+            .field("accumulator", &"<FUNC>")


We probably don't need the accumulator field in the debug printout as it doesn't exist in the structure
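For illustration, a minimal sketch of a hand-written `Debug` impl that prints only what the struct actually exposes. The `Signature` type and its `arg_count` field here are hypothetical stand-ins, and the returned name is illustrative; this is not the code in the PR.

```rust
use std::fmt;

/// Hypothetical stand-in for the function's type signature.
#[derive(Debug)]
struct Signature {
    arg_count: usize,
}

/// Simplified version of the struct under review: it holds only a signature.
struct CovarianceSample {
    signature: Signature,
}

impl CovarianceSample {
    /// Illustrative name; the real value is whatever `name()` returns in the PR.
    fn name(&self) -> &str {
        "covar_samp"
    }
}

impl fmt::Debug for CovarianceSample {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Only report data the struct really has; no placeholder "accumulator" entry.
        f.debug_struct("CovarianceSample")
            .field("name", &self.name())
            .field("signature", &self.signature)
            .finish()
    }
}
```

With this shape, the debug output lists `name` and `signature` only, so it cannot drift from the struct definition.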

alamb · 2024-05-05T12:59:11Z

datafusion/functions-aggregate/src/covariance.rs

+}
+
+/// An accumulator to compute covariance
+/// The algrithm used is an online implementation and numerically stable. It is derived from the following paper


Suggested change

/// The algrithm used is an online implementation and numerically stable. It is derived from the following paper

/// The algorithm used is an online implementation and numerically stable. It is derived from the following paper
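As a rough sketch of what an online, numerically stable covariance computation can look like (a Welford-style update plus a pairwise merge of partial states). `OnlineCovariance` and its fields are hypothetical names chosen for this example and are not taken from the PR; the paper the doc comment refers to is not reproduced here.

```rust
/// Hypothetical accumulator for illustration; not the DataFusion type.
#[derive(Debug, Default, Clone, Copy)]
struct OnlineCovariance {
    count: u64,
    mean_x: f64,
    mean_y: f64,
    /// Running co-moment: sum of (x - mean_x) * (y - mean_y) over all rows seen.
    co_moment: f64,
}

impl OnlineCovariance {
    /// Fold one (x, y) pair into the running state without storing the row.
    fn update(&mut self, x: f64, y: f64) {
        self.count += 1;
        let n = self.count as f64;
        let dx = x - self.mean_x; // deviation from the *old* mean of x
        self.mean_x += dx / n;
        self.mean_y += (y - self.mean_y) / n;
        // Old-mean deviation of x times new-mean deviation of y.
        self.co_moment += dx * (y - self.mean_y);
    }

    /// Combine two partial states, e.g. from different partitions.
    fn merge(&mut self, other: &Self) {
        if other.count == 0 {
            return;
        }
        if self.count == 0 {
            *self = *other;
            return;
        }
        let n1 = self.count as f64;
        let n2 = other.count as f64;
        let n = n1 + n2;
        let dx = other.mean_x - self.mean_x;
        let dy = other.mean_y - self.mean_y;
        self.co_moment += other.co_moment + dx * dy * n1 * n2 / n;
        self.mean_x += dx * n2 / n;
        self.mean_y += dy * n2 / n;
        self.count += other.count;
    }

    /// Sample covariance (divides by n - 1); `None` with fewer than two rows.
    fn sample(&self) -> Option<f64> {
        if self.count < 2 {
            None
        } else {
            Some(self.co_moment / (self.count - 1) as f64)
        }
    }
}
```

The state is four numbers regardless of input size, which is what makes the approach usable for both streaming updates and merging partial aggregates.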

alamb · 2024-05-05T13:03:50Z

datafusion/functions-aggregate/src/covariance.rs

+    }
+
+    fn name(&self) -> &str {
+        "covar"


A minor nitpick here is that the name of the struct is CovarianceSample but the name is covar (with alias covar_samp)

It would be better in my opinion if name() and the struct name were consistent -- so Covariance or name() to return "covariance_pop"

I renamed it to covar_samp, like Postgres; covar is now an alias
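To make the name/alias relationship concrete, here is a small self-contained sketch in which a function registers under its primary name and its aliases. The `AggregateUdf` trait and `Registry` type are hypothetical simplifications written for this example; they are not DataFusion's actual traits or registry.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical, simplified view of a user defined aggregate.
trait AggregateUdf {
    /// Primary name of the function.
    fn name(&self) -> &str;
    /// Alternative names that resolve to the same function.
    fn aliases(&self) -> &[&'static str] {
        &[]
    }
}

struct CovarianceSample;

impl AggregateUdf for CovarianceSample {
    fn name(&self) -> &str {
        "covar_samp"
    }
    fn aliases(&self) -> &[&'static str] {
        &["covar"]
    }
}

/// Hypothetical registry mapping every name and alias to one shared instance.
#[derive(Default)]
struct Registry {
    functions: HashMap<String, Arc<dyn AggregateUdf>>,
}

impl Registry {
    fn register(&mut self, udf: Arc<dyn AggregateUdf>) {
        for alias in udf.aliases() {
            self.functions.insert(alias.to_string(), Arc::clone(&udf));
        }
        self.functions.insert(udf.name().to_string(), udf);
    }

    fn lookup(&self, name: &str) -> Option<&Arc<dyn AggregateUdf>> {
        self.functions.get(name)
    }
}
```

In this sketch both `covar_samp` and `covar` resolve to the same instance, and `name()` always reports the primary name.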

alamb · 2024-05-05T13:06:47Z

datafusion/physical-expr-common/src/aggregate/stats.rs

@@ -0,0 +1,25 @@
+// Licensed to the Apache Software Foundation (ASF) under one


I think the StatsType is only used for functions defined in datafusion-functions-aggregate so this module could go in datafusion-functions-aggregate

Let me add a to-do note. Until all other functions are moved to functions-aggregate, we need to keep it in common, since we don't import physical-expr into functions-aggregate
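For context on why a shared `StatsType` is convenient, a small sketch of how one enum can select between the population and sample variants of a statistic when finalizing. The enum mirrors the `Population` / `Sample` distinction; the `finalize_covariance` helper and its parameters are hypothetical and exist only for this example.

```rust
/// Which flavor of a statistic to compute.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StatsType {
    Population,
    Sample,
}

/// Hypothetical helper: turn an accumulated co-moment into a covariance.
/// Returns `None` when there are too few rows for the requested flavor.
fn finalize_covariance(co_moment: f64, count: u64, stats_type: StatsType) -> Option<f64> {
    let divisor = match stats_type {
        // Population statistics divide by n.
        StatsType::Population => count,
        // Sample statistics divide by n - 1; an empty input yields `None`.
        StatsType::Sample => count.checked_sub(1)?,
    };
    if divisor == 0 {
        None
    } else {
        Some(co_moment / divisor as f64)
    }
}
```

Because covariance, variance, and standard deviation all need the same switch, keeping the enum in a crate that every one of them can depend on avoids duplicating it while the migration is in progress.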

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

jayzhan211 · 2024-05-05T15:06:00Z

Thank you @jayzhan211

This looks great and I think it is a nice verification that we can extract aggregates from the code. I left some small suggestions but I don't think they are necessary -- I also updated the title and description of this PR to be more descriptive

I believe this function (and related ones) are part of what @yyy1000 was working on when we started down the path to extract this type of thing from the core.

Thanks for your review.
I plan to work on Sum or others that have additional features, but not too complex ones. covar_pop can be a good first issue.

alamb · 2024-05-06T09:46:11Z

Thanks again @jayzhan211

Upstream is continuing its migration to UDFs. Ref apache/datafusion#10098 Ref apache/datafusion#10372

* chore: upgrade datafusion Deps Ref #690
* update concat and concat_ws to use datafusion_functions Moved in apache/datafusion#10089
* feat: upgrade functions.rs Upstream is continuing its migration to UDFs. Ref apache/datafusion#10098 Ref apache/datafusion#10372
* fix ScalarUDF import
* feat: remove deprecated supports_filter_pushdown and impl supports_filters_pushdown Deprecated function removed in apache/datafusion#9923
* use `unnest_columns_with_options` instead of deprecated `unnest_column_with_option`
* remove ScalarFunction wrappers These relied on upstream BuiltinScalarFunction, which are now removed. Ref apache/datafusion#10098
* update dataframe `test_describe` `null_count` was fixed upstream. Ref apache/datafusion#10260
* remove PyDFField and related methods DFField was removed upstream. Ref: apache/datafusion#9595
* bump `datafusion-python` package version to 38.0.0
* re-implement `PyExpr::column_name` The previous implementation relied on `DFField` which was removed upstream. Ref: apache/datafusion#9595

jayzhan211 added 6 commits May 4, 2024 08:36

introduce CovarianceSample

c6d41b7

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

rewrite macro

fa1c55a

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

rm old statstype

7fe2049

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

register

3a53b82

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

state field

ebc1d8f

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

rm builtin

aa9e800

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels May 4, 2024

jayzhan211 marked this pull request as ready for review May 4, 2024 03:26

alamb changed the title ~~UDAF: Covariance Sample~~ Move Covariance (Sample) covariance_samp to be a User Defined Aggregate Function May 5, 2024

alamb reviewed May 5, 2024

View reviewed changes

alamb changed the title ~~Move Covariance (Sample) covariance_samp to be a User Defined Aggregate Function~~ Move Covariance (Sample) covar / covar_samp to be a User Defined Aggregate Function May 5, 2024

alamb approved these changes May 5, 2024

View reviewed changes

alamb added the api change Changes the API exposed to users of the crate label May 5, 2024

address comments

3e5cb0a

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

alamb merged commit a0fccbf into apache:main May 6, 2024
23 checks passed

This was referenced May 6, 2024

Move Covariance (Population) covar_pop to be a UDAF #10389

Closed

[Epic] Unify AggregateFunction Interface (remove built in list of AggregateFunction s), improve the system #8708

Open

alamb mentioned this pull request May 6, 2024

DataFusion weekly project plan (Andrew Lamb) - May 6, 2024 #10395

Closed

7 tasks

Michael-J-Ward mentioned this pull request May 13, 2024

Tracking Upgrade to Datafusion 38 apache/datafusion-python#690

Closed

3 tasks

Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request May 13, 2024

feat: upgrade functions.rs

4d89cd7

Upstream is continuing its migration to UDFs. Ref apache/datafusion#10098 Ref apache/datafusion#10372

jayzhan211 mentioned this pull request May 25, 2024

Convert Variance Sample to UDAF #10667

Closed
