feat: expand division function options #615

westonpace · 2024-03-26T22:00:13Z

The division functions do not allow you to specify on_domain_error or on_division_by_zero for integer division operations. However, some engines return an error in this case and some return null.

In addition, when dealing with floating point operations, some engines return null, some return an error, and some return nan. The option to return null was missing. This PR adds that.

richtia · 2024-03-27T00:46:27Z

extensions/functions_arithmetic.yaml

@@ -257,7 +258,7 @@ scalar_functions:
          on_domain_error:
            values: [ NAN, "NULL", ERROR ]
          on_division_by_zero:
-            values: [ LIMIT, NAN, "NULL", ERROR ]
+            values: [ NAN, "NULL", ERROR ]


the IEEE option is only for fp32/fp32?

Yes. IEEE behavior is:

0 / 0 => NAN
NAN / 0 => NAN
Infinity / 0 => Infinity
-Infinity / 0 => -Infinity

Integer division has no concept of NAN / Infinity and so NULL and ERROR are the only two possible options.
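For illustration only, here is a minimal Python sketch of the cases listed above. The helper names `ieee_divide` and `int_divide` are hypothetical and are not part of Substrait or of any engine. Plain Python raises `ZeroDivisionError` for float division by zero, so the IEEE 754 results are spelled out by hand. The sketch also assumes that `None` stands in for NULL and uses Python's floor division for integers, since rounding is not the point here.

```python
import math
from typing import Optional


def ieee_divide(x: float, y: float) -> float:
    """Float division that returns the IEEE 754 result instead of raising."""
    try:
        return x / y
    except ZeroDivisionError:
        # Python raises for any float divided by zero; IEEE 754 defines a value.
        if x == 0.0 or math.isnan(x):
            return math.nan  # 0 / 0 and NaN / 0 are both NaN
        # Any other dividend (finite or infinite) gives a signed infinity.
        sign = math.copysign(1.0, x) * math.copysign(1.0, y)
        return math.copysign(math.inf, sign)


def int_divide(x: int, y: int, on_division_by_zero: str = "ERROR") -> Optional[int]:
    """Integer division: with no NaN or Infinity, only NULL or ERROR remain."""
    if y == 0:
        if on_division_by_zero == "NULL":
            return None  # None stands in for SQL NULL (assumption)
        raise ZeroDivisionError("integer division by zero")
    return x // y  # Python floor division; rounding mode is not the point here
```

Calling `ieee_divide(math.inf, 0.0)` returns infinity, while `int_divide(7, 0, "NULL")` returns `None` rather than any numeric placeholder.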

Oh, I was more wondering why fp64/fp64 didn't have the IEEE option.

Oh! Good catch 😶

westonpace · 2024-03-27T04:15:17Z

There is another option.

It seems that there is very little consensus amongst engines for this function. MySQL and SQL Server have no concept of NaN. Postgres does its own thing. DuckDB does its own thing. SQLite does its own thing.

If our rationale is "nothing is an option unless there are at least 2 implementations" (everything else is a vendor-specific extension) then we could actually get away with a single option for floating point division which is:

on_invalid_input: [ "IEEE", "NULL", "ERROR" ]

IEEE -> datafusion / pandas / etc.
NULL -> sqlite
ERROR -> SQL Server
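A rough Python sketch of how a single `on_invalid_input` option could drive floating point division. This is not taken from the PR: the function names and the definition of "invalid input" (zero divisor, NaN operand, or infinity divided by infinity) are assumptions made for illustration, and `None` stands in for NULL.

```python
import math
from typing import Optional


def is_invalid_input(x: float, y: float) -> bool:
    """Assumed definition: zero divisor, NaN operand, or infinity / infinity."""
    return (
        y == 0.0
        or math.isnan(x)
        or math.isnan(y)
        or (math.isinf(x) and math.isinf(y))
    )


def fp_divide(x: float, y: float, on_invalid_input: str = "IEEE") -> Optional[float]:
    """Hypothetical divide with one option covering every invalid case."""
    if not is_invalid_input(x, y):
        return x / y
    if on_invalid_input == "NULL":
        return None  # None stands in for SQL NULL
    if on_invalid_input == "ERROR":
        raise ArithmeticError(f"invalid input for division: {x} / {y}")
    if on_invalid_input != "IEEE":
        raise ValueError(f"unknown option: {on_invalid_input}")
    # IEEE: follow IEEE 754 and never raise.
    if y != 0.0:
        return x / y  # NaN operand or infinity / infinity yields NaN in Python
    if x == 0.0 or math.isnan(x):
        return math.nan
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    return math.copysign(math.inf, sign)
```

With this shape, `fp_divide(1.0, 0.0, "NULL")` gives `None`, `fp_divide(1.0, 0.0, "ERROR")` raises, and the default follows IEEE 754.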

If our rationale is "we will try to support all big engines" then we are going to need at least 5 different options (but at most 9).

This leaves DuckDB, MySQL, and Postgres (of the ones I tested) as "vendor-specific".

Actually, even the above could be further simplified to 0 options with IEEE as the only official (i.e. implemented by at least 2 engines) behavior and any other behavior is "vendor-specific".

EpsilonPrime · 2024-03-29T01:52:01Z

If our rationale is "nothing is an option unless there are at least 2 implementations" (everything else is a vendor-specific extension) then we could actually get away with a single option for floating point division which is:
on_invalid_input: [ "IEEE", "NULL", "ERROR" ]

Implementing 5-9 options would allow us to say what the behavior is but I'm not sure that helps the ecosystem. Ideally we have a conformance test that shows that an engine doesn't support any option (and maybe that conformance test supports the other options so we know exactly how it performs) so that vendors can consciously move to a more standard behavior over time.

Having one specific option is making a choice for the community which also feels wrong. Although there is a de facto consensus I don't want us to start dictating.

The happy medium of having the three options you list here seems reasonable.

EpsilonPrime · 2024-03-29T01:52:30Z

extensions/functions_arithmetic.yaml

          on_division_by_zero:
-            values: [ LIMIT, NAN, ERROR ]


Is this going to be a breaking change because we no longer support LIMIT?

Technically, yes. However, I do not know of any engine that implements LIMIT. So, practically speaking, no.

cudf

import cudf
a = cudf.Series(5)
b = cudf.Series(0)
a.divide(b)
0    inf
dtype: float64

Well, darn. I'll add limit back in then. I wouldn't introduce limit for this case (since cudf is the only engine) but I think we can keep an option even if there aren't two engines.

I've added LIMIT back as an option.
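As a hedged illustration of what the four `on_division_by_zero` values discussed in this thread could mean for a float divided by zero, here is a small Python sketch. The helper `divide_by_zero_result` is hypothetical. LIMIT is modeled on the cudf output above (5 / 0 gives `inf`); taking the sign of the result from the dividend is an assumption, and `None` stands in for NULL.

```python
import math
from typing import Optional


def divide_by_zero_result(x: float, option: str) -> Optional[float]:
    """Result of x / 0 under each on_division_by_zero value (illustrative only)."""
    if option == "LIMIT":
        # Mirrors the cudf output in the thread: 5 / 0 -> inf.
        # Using the dividend's sign for the infinity is an assumption.
        return math.copysign(math.inf, x)
    if option == "NAN":
        return math.nan
    if option == "NULL":
        return None  # None stands in for SQL NULL
    if option == "ERROR":
        raise ZeroDivisionError("division by zero")
    raise ValueError(f"unknown option: {option}")
```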

westonpace · 2024-03-29T12:22:23Z

Implementing 5-9 options would allow us to say what the behavior is but I'm not sure that helps the ecosystem. Ideally we have a conformance test that shows that an engine doesn't support any option (and maybe that conformance test supports the other options so we know exactly how it performs) so that vendors can consciously move to a more standard behavior over time.

Agreed. For BFT / dialect testing I think we will need to add the ability for an engine to have an "engine-specific behavior" which is defined as part of the dialect but not rooted in any substrait yaml. I think, for now, a minimal standard of "at least 2 engines" will be a good guideline.

westonpace requested review from jacques-n, cpcloud, EpsilonPrime and vbarua as code owners March 26, 2024 22:00

richtia previously approved these changes Mar 26, 2024

westonpace dismissed richtia’s stale review via 615234e March 26, 2024 22:26

EpsilonPrime previously approved these changes Mar 26, 2024

richtia reviewed Mar 27, 2024

EpsilonPrime reviewed Mar 29, 2024

westonpace dismissed EpsilonPrime’s stale review via 985108f March 29, 2024 12:20

richtia mentioned this pull request Mar 29, 2024

Add ability for backend engines to have "engine-specific behavior" substrait-io/bft#65

Open

westonpace requested review from richtia and EpsilonPrime April 11, 2024 13:20

westonpace added 4 commits April 11, 2024 06:29

Add null to division functions. Add error handling to integer division.

c7cef86

Clarify limit/nan/ieee behavior

8a1c775

Add IEEE option for fp64 as well

839cc3e

Add back in LIMIT to support cuda

5cb5d51

westonpace force-pushed the feat/division-null branch from 1923a2c to 5cb5d51 Compare April 11, 2024 13:29

richtia approved these changes Apr 11, 2024

EpsilonPrime approved these changes Apr 12, 2024

westonpace merged commit 7b79437 into substrait-io:main Apr 12, 2024
13 checks passed
