feat: Supports Stddev #348

huaxingao · 2024-04-29T17:12:51Z

Which issue does this PR close?

Closes #.

Rationale for this change

Supports STDDEV_SAMP and STDDEV_POP.
The implementation is mostly the same as DataFusion's. We keep our own
implementation because DataFusion uses UInt64 for the count state field,
while Spark uses Double. This PR also adds null_on_divide_by_zero
to be consistent with Spark's implementation.
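
A minimal sketch of the idea described above, not the code in this PR. It assumes a Welford-style running state of (count, mean, m2) with the count held as `f64`, and it assumes that `null_on_divide_by_zero` selects between a null result and NaN when the sample divisor `n - 1` is zero. The names `StddevSketch` and `StatsType` are hypothetical and exist only for illustration.

```rust
/// Which flavor of standard deviation to compute (hypothetical name).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum StatsType {
    Population,
    Sample,
}

/// Hypothetical accumulator. The count is an f64, mirroring the Double
/// count state that the PR description attributes to Spark.
#[derive(Clone, Debug)]
pub struct StddevSketch {
    count: f64,
    mean: f64,
    m2: f64,
    stats_type: StatsType,
    null_on_divide_by_zero: bool,
}

impl StddevSketch {
    pub fn new(stats_type: StatsType, null_on_divide_by_zero: bool) -> Self {
        Self { count: 0.0, mean: 0.0, m2: 0.0, stats_type, null_on_divide_by_zero }
    }

    /// Welford's online update for one non-null input value.
    pub fn update(&mut self, value: f64) {
        self.count += 1.0;
        let delta = value - self.mean;
        self.mean += delta / self.count;
        self.m2 += delta * (value - self.mean);
    }

    /// Combines another partial state into this one (pairwise merge formula).
    pub fn merge(&mut self, other: &Self) {
        if other.count == 0.0 {
            return;
        }
        let new_count = self.count + other.count;
        let delta = other.mean - self.mean;
        self.m2 += other.m2 + delta * delta * self.count * other.count / new_count;
        self.mean += delta * other.count / new_count;
        self.count = new_count;
    }

    /// Returns None for "SQL null". With no input the result is null.
    /// When the divisor is zero (sample stddev of a single row), the flag
    /// decides between null and NaN (assumed semantics).
    pub fn evaluate(&self) -> Option<f64> {
        if self.count == 0.0 {
            return None;
        }
        let divisor = match self.stats_type {
            StatsType::Population => self.count,
            StatsType::Sample => self.count - 1.0,
        };
        if divisor == 0.0 {
            return if self.null_on_divide_by_zero { None } else { Some(f64::NAN) };
        }
        Some((self.m2 / divisor).sqrt())
    }
}
```

Keeping the count as a floating-point value means the partial state is three doubles, which is the shape the description says Spark expects when partial aggregates are exchanged.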

What changes are included in this PR?

How are these changes tested?

kazuyukitanimura · 2024-04-29T21:35:46Z

spark/src/test/resources/tpcds-query-results/v1_4/q39a.sql.out

@@ -31,7 +31,7 @@ struct<w_warehouse_sk:int,i_item_sk:int,d_moy:int,mean:double,cov:double,w_wareh
 1	12259	1	326.5	1.219693210219279	1	12259	2	292.6666666666667	1.2808898286830026
 1	12641	1	321.25	1.1286221893301993	1	12641	2	279.25	1.129134558577743
 1	13043	1	260.5	1.355894484625015	1	13043	2	295.0	1.056210118409035
-1	13157	1	260.5	1.5242630430075292	1	13157	2	413.5	1.0422561797285326
+1	13157	1	260.5	1.524263043007529	1	13157	2	413.5	1.0422561797285326


Wondering what is causing the digit difference...

I am not sure what caused the digit difference.
Actually, with SortMergeJoin I got 1.524263043007529, but with BroadCastJoin I still got 1.5242630430075292. Is it OK if I change the expected result based on join type?

also cc @viirya

Ideally, it would be good to compare floating point numbers based on an epsilon to make sure they are within some tolerance threshold. I assume we are currently just comparing text file output directly? Do we have a way to generate the output into a structured file type such as CSV or JSON?

The difference may be down to order of operations: for example, the order in which batches from different partitions are processed. I don't think we can expect it to be 100% deterministic in a distributed system.
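
To illustrate the order-of-operations point: floating-point addition is not associative, so accumulating the same values in a different order can change the last digit. This is a generic sketch with a hypothetical helper `sum_in_order`; it does not reproduce the q39a values.

```rust
/// Sums the values strictly left to right, the way an accumulator would
/// see them if batches arrived in this order.
fn sum_in_order(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, v| acc + v)
}

/// Returns the two sums obtained from the same three values in forward
/// and reverse order. They differ only in the last bit.
fn forward_and_reverse() -> (f64, f64) {
    let forward = sum_in_order(&[0.1, 0.2, 0.3]);
    let reverse = sum_in_order(&[0.3, 0.2, 0.1]);
    (forward, reverse)
}
```

Here the forward sum is `0.6000000000000001` and the reverse sum is `0.6`, a difference on the same scale as the one seen in the q39a output.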

Yes, we are currently just comparing text file output directly. We are using Spark's TPCDSQuerySuite. There doesn't seem to be a way to generate the output in a structured file type.
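
One way the epsilon-based comparison discussed in this thread could look, sketched under assumptions: result rows are tab-separated text, any field that parses as a float is compared with a relative tolerance, and every other field is compared as a string. The helpers `approx_eq` and `rows_match` are hypothetical and are not part of Spark's TPCDSQuerySuite or of this PR.

```rust
/// Returns true when two floats agree within a relative tolerance.
fn approx_eq(a: f64, b: f64, rel_tol: f64) -> bool {
    if a == b || (a.is_nan() && b.is_nan()) {
        return true; // exact match, equal infinities, or both NaN
    }
    let scale = a.abs().max(b.abs());
    (a - b).abs() <= rel_tol * scale
}

/// Compares two tab-separated result rows field by field.
fn rows_match(expected: &str, actual: &str, rel_tol: f64) -> bool {
    let exp: Vec<&str> = expected.split('\t').collect();
    let act: Vec<&str> = actual.split('\t').collect();
    exp.len() == act.len()
        && exp.iter().zip(act.iter()).all(|(e, a)| {
            match (e.parse::<f64>(), a.parse::<f64>()) {
                // Both fields are numeric: compare with tolerance.
                (Ok(x), Ok(y)) => approx_eq(x, y, rel_tol),
                // Otherwise fall back to exact text comparison.
                _ => e == a,
            }
        })
}
```

With a tolerance such as `1e-12`, the two q39a rows from the diff above would be treated as equal, while a change in any earlier digit would still be reported.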

andygrove · 2024-05-03T11:28:35Z

core/src/execution/datafusion/expressions/stddev.rs

+        match variance {
+            ScalarValue::Float64(e) => {
+                if e.is_none() {
+                    Ok(ScalarValue::Float64(None))
+                } else {
+                    Ok(ScalarValue::Float64(e.map(|f| f.sqrt())))
+                }
+            }
+            _ => internal_err!("Variance should be f64"),
+        }


We can leverage pattern matching to simplify this.

Suggested change

-        match variance {
-            ScalarValue::Float64(e) => {
-                if e.is_none() {
-                    Ok(ScalarValue::Float64(None))
-                } else {
-                    Ok(ScalarValue::Float64(e.map(|f| f.sqrt())))
-                }
-            }
-            _ => internal_err!("Variance should be f64"),
-        }
+        match variance {
+            ScalarValue::Float64(Some(e)) => Ok(ScalarValue::Float64(Some(e.sqrt()))),
+            ScalarValue::Float64(None) => Ok(ScalarValue::Float64(None)),
+            _ => internal_err!("Variance should be f64"),
+        }
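
For reference, a self-contained sketch of the same simplification. It uses a stand-in enum `Scalar` instead of DataFusion's `ScalarValue` and a `String` error instead of `internal_err!`, both of which are assumptions made so the snippet compiles on its own. It also shows that `Option::map` already passes `None` through, so the null case does not need its own branch.

```rust
/// Stand-in for DataFusion's `ScalarValue`, reduced to two variants
/// (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
enum Scalar {
    Float64(Option<f64>),
    Int64(Option<i64>),
}

/// Converts a variance scalar into a standard deviation scalar.
fn stddev_from_variance(variance: Scalar) -> Result<Scalar, String> {
    match variance {
        // `map` applies sqrt to Some(v) and leaves None untouched.
        Scalar::Float64(v) => Ok(Scalar::Float64(v.map(f64::sqrt))),
        _ => Err("Variance should be f64".to_string()),
    }
}
```

Whether to spell out the `Some` and `None` arms, as in the suggestion, or to rely on `map` is a style choice; both forms return the same values.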

Changed. Thanks

andygrove

LGTM. Thanks @huaxingao

viirya · 2024-05-06T21:14:29Z

core/src/execution/datafusion/expressions/stddev.rs

+// specific language governing permissions and limitations
+// under the License.
+
+//! Defines physical expressions that can evaluated at runtime during query execution


Seems copied from somewhere and not related?

removed. Thanks

viirya · 2024-05-06T21:15:29Z

core/src/execution/datafusion/expressions/stddev.rs

+        // the result of stddev just support FLOAT64 and Decimal data type.
+        assert!(matches!(data_type, DataType::Float64));


Hmm? So we also need to add DecimalType here?

It's FLOAT64 only. Removed "and Decimal data type" from the comment.

viirya · 2024-05-06T21:19:00Z

core/src/execution/datafusion/expressions/stddev.rs

+    }
+}
+
+/// An accumulator to compute the average


Suggested change

-/// An accumulator to compute the average
+/// An accumulator to compute the standard deviation

Changed. Thanks

viirya · 2024-05-06T21:22:30Z

spark/src/test/scala/org/apache/spark/sql/CometTPCDSQuerySuite.scala

+        // TODO: comment 39a and 39b for now because the expected result for stddev failed:
+        //  expected: 1.5242630430075292, actual: 1.524263043007529.
+        //  Will change the comparison logic to detect floating-point numbers and compare
+        //  with epsilon
+        // "q39a",
+        // "q39b",


We should create a ticket for this.

opened #392

viirya · 2024-05-06T21:23:17Z

Some minor comments.

viirya · 2024-05-07T00:21:12Z

Merged. Thanks @huaxingao @kazuyukitanimura @andygrove

huaxingao · 2024-05-07T01:11:04Z

Thanks, everyone!

* feat: Supports Stddev
* fix fmt
* update q39a.sql.out
* address comments
* disable q93a and q93b for now
* address comments

---------

Co-authored-by: Huaxin Gao <huaxin.gao@apple.com>

Huaxin Gao added 2 commits April 29, 2024 09:27

feat: Supports Stddev

1d8c67d

fix fmt

b91969c

huaxingao force-pushed the stddev branch from 296b641 to b91969c Compare April 29, 2024 20:39

update q39a.sql.out

66c8b8d

kazuyukitanimura reviewed Apr 29, 2024

andygrove reviewed May 3, 2024

andygrove reviewed May 3, 2024

Huaxin Gao added 2 commits May 5, 2024 18:08

address comments

ee05e3a

disable q93a and q93b for now

63deea4

andygrove approved these changes May 6, 2024

viirya reviewed May 6, 2024

viirya approved these changes May 6, 2024

huaxingao mentioned this pull request May 6, 2024

support epsilon based floating point numbers comparison in TPCDS results #392

Open

address comments

5b09775

viirya merged commit c40bc7c into apache:main May 7, 2024
28 checks passed

huaxingao deleted the stddev branch May 7, 2024 01:11

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024

fix: Diff merged into Spark (apache#348)

da2bd39
