-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: expand division function options #615
Conversation
extensions/functions_arithmetic.yaml
Outdated
@@ -257,7 +258,7 @@ scalar_functions: | |||
on_domain_error: | |||
values: [ NAN, "NULL", ERROR ] | |||
on_division_by_zero: | |||
values: [ LIMIT, NAN, "NULL", ERROR ] | |||
values: [ NAN, "NULL", ERROR ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the IEEE
option is only for fp32/fp32?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. IEEE behavior is:
0 / 0 => NAN
NAN / 0 => NAN
Infinity / 0 => Infinity
-Infinity / 0 => -Infinity
Integer division has no concept of NAN / Infinity and so NULL and ERROR are the only two possible options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, i was more of wondering why fp64/fp64 didn't have the IEEE
option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Good catch 😶
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
There is another option. It seems that there is very little consensus amongst engines for this function. MySQL and SQL server have no concept of NaN. Postgres does its own thing. DuckDb does its own thing. Sqlite does its own thing. If our rationale is "nothing is an option unless there are at least 2 implementations" (everything else is a vendor-specific extension) then we could actually get away with a single option for floating point division which is:
If our rational is "we will try to support all big engines" then we are going to need at least 5 different options (but at most 9). This leaves DuckDb, MySQL, and postgres (of the ones I tested) as "vendor-specific". Actually, even the above could be further simplified to 0 options with IEEE as the only official (i.e. implemented by at least 2 engines) behavior and any other behavior is "vendor-specific". |
Implementing 5-9 options would allow us to say what the behavior is but I'm not sure that helps the ecosystem. Ideally we have a conformance test that shows that it doesn't support any option (any maybe that conformance test supports the other options so we know exactly how it performs) so that vendors can consciously move to a more standard behavior over time. Having one specific option is making a choice for the community which also feels wrong. Although there is a defacto consensus I don't want us to start dictating. The happy medium of having the three options you list here seems reasonable. |
on_division_by_zero: | ||
values: [ LIMIT, NAN, ERROR ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this going to be a broken change because we no longer support LIMIT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, yes. However, I do not know of any engine that implements LIMIT. So, practically speaking, no.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cudf
import cudf
a = cudf.Series(5)
b = cudf.Series(0)
a.divide(b)
0 inf
dtype: float64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, darn. I'll add limit back in then. I wouldn't introduce limit for this case (since cudf is the only engine) but I think we can keep an option even if there aren't two engines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added LIMIT back as an option.
Agreed. For BFT / dialect testing I think we will need to add the ability for an engine to have an "engine-specific behavior" which is defined as part of the dialect but not rooted in any substrait yaml. I think, for now, a minimal standard of "at least 2 engines" will be a good guideline. |
1923a2c
to
5cb5d51
Compare
The division functions do not allow you to specify
on_domain_error
oron_division_by_zero
for integer division operations. However, some engines return an error in this case and some return null.In addition, when dealing with floating point operations, some engines return null, some return an error, and some return nan. The option to return null was missing. This PR adds that.