Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: expand division function options #615

Merged
merged 4 commits into from
Apr 12, 2024

Conversation

westonpace
Copy link
Member

The division functions do not allow you to specify on_domain_error or on_division_by_zero for integer division operations. However, some engines return an error in this case and some return null.

In addition, when dealing with floating point operations, some engines return null, some return an error, and some return nan. The option to return null was missing. This PR adds that.

richtia
richtia previously approved these changes Mar 26, 2024
EpsilonPrime
EpsilonPrime previously approved these changes Mar 26, 2024
@@ -257,7 +258,7 @@ scalar_functions:
on_domain_error:
values: [ NAN, "NULL", ERROR ]
on_division_by_zero:
values: [ LIMIT, NAN, "NULL", ERROR ]
values: [ NAN, "NULL", ERROR ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the IEEE option is only for fp32/fp32?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. IEEE behavior is:

0 / 0 => NAN
NAN / 0 => NAN
Infinity / 0 => Infinity
-Infinity / 0 => -Infinity

Integer division has no concept of NAN / Infinity and so NULL and ERROR are the only two possible options.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, i was more of wondering why fp64/fp64 didn't have the IEEE option.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Good catch 😶

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@westonpace
Copy link
Member Author

westonpace commented Mar 27, 2024

There is another option.

It seems that there is very little consensus amongst engines for this function. MySQL and SQL server have no concept of NaN. Postgres does its own thing. DuckDb does its own thing. Sqlite does its own thing.

If our rationale is "nothing is an option unless there are at least 2 implementations" (everything else is a vendor-specific extension) then we could actually get away with a single option for floating point division which is:

on_invalid_input: [ "IEEE", "NULL", "ERROR" ]

IEEE -> datafusion / pandas / etc.
NULL -> sqlite
ERROR -> SQL Server

If our rational is "we will try to support all big engines" then we are going to need at least 5 different options (but at most 9).

This leaves DuckDb, MySQL, and postgres (of the ones I tested) as "vendor-specific".

Actually, even the above could be further simplified to 0 options with IEEE as the only official (i.e. implemented by at least 2 engines) behavior and any other behavior is "vendor-specific".

@EpsilonPrime
Copy link
Member

If our rationale is "nothing is an option unless there are at least 2 implementations" (everything else is a vendor-specific extension) then we could actually get away with a single option for floating point division which is:

on_invalid_input: [ "IEEE", "NULL", "ERROR" ]

Implementing 5-9 options would allow us to say what the behavior is but I'm not sure that helps the ecosystem. Ideally we have a conformance test that shows that it doesn't support any option (any maybe that conformance test supports the other options so we know exactly how it performs) so that vendors can consciously move to a more standard behavior over time.

Having one specific option is making a choice for the community which also feels wrong. Although there is a defacto consensus I don't want us to start dictating.

The happy medium of having the three options you list here seems reasonable.

on_division_by_zero:
values: [ LIMIT, NAN, ERROR ]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to be a broken change because we no longer support LIMIT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, yes. However, I do not know of any engine that implements LIMIT. So, practically speaking, no.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cudf

import cudf
a = cudf.Series(5)
b = cudf.Series(0)
a.divide(b)
0    inf
dtype: float64

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, darn. I'll add limit back in then. I wouldn't introduce limit for this case (since cudf is the only engine) but I think we can keep an option even if there aren't two engines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added LIMIT back as an option.

@westonpace
Copy link
Member Author

Implementing 5-9 options would allow us to say what the behavior is but I'm not sure that helps the ecosystem. Ideally we have a conformance test that shows that it doesn't support any option (any maybe that conformance test supports the other options so we know exactly how it performs) so that vendors can consciously move to a more standard behavior over time.

Agreed. For BFT / dialect testing I think we will need to add the ability for an engine to have an "engine-specific behavior" which is defined as part of the dialect but not rooted in any substrait yaml. I think, for now, a minimal standard of "at least 2 engines" will be a good guideline.

@westonpace westonpace merged commit 7b79437 into substrait-io:main Apr 12, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants