Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should comparison operators return Bool values? #7

Open
ti-s opened this issue Oct 28, 2016 · 5 comments
Open

Should comparison operators return Bool values? #7

ti-s opened this issue Oct 28, 2016 · 5 comments

Comments

@ti-s
Copy link

ti-s commented Oct 28, 2016

From the discussions in JuliaStats/NullableArrays.jl#85, https://github.com/JuliaData/Roadmap.jl/issues/3, and JuliaData/DataFramesMeta.jl#58 it seems that the main open question is the behavior of comparison operators. I understand that there are good reasons for the current behavior of NAables. Unfortunately, it could lead to subtle errors, e.g.:

julia> include("NAable.jl"); using .NAables

julia> a = [NAable(1), NAable(-1), NAable{Int}()]
3-element Array{NAables.NAable{Int64},1}:
1  
-1 
#NA

julia> positive = broadcast(>=, a, 0)
3-element Array{Bool,1}:
true
false
false

julia> b = Any[NA, NA, NA]; b[positive] = "+"; b[~positive] = "-"; b
3-element Array{Any,1}:
"+"
"-"
"-"

julia> getsign(x) = x < 0 ? "-" : "+"
getsign (generic function with 1 method)

julia> getsign.(a)
3-element Array{String,1}:
"+"
"-"
"+"

If comparison operators return NAable{Bool} instead, and a function like
bool{T<:Bool}(x::NAable{T}) = isna(x) ? false : x.value exists, the above would become:

julia> include("NAable3VL.jl"); using .NAables

julia> a = [NAable(1), NAable(-1), NAable{Int}()]
3-element Array{NAables.NAable{Int64},1}:
1  
-1 
#NA

julia> positive = broadcast(>=, a, 0)
3-element Array{NAables.NAable{Bool},1}:
true 
false
#NA  

julia> b = Any[NA, NA, NA]; b[bool.(positive)] = "+"; b[bool.(~positive)] = "-"; b
3-element Array{Any,1}:
"+"
"-"
NA 

julia> getsign(x) = x < 0 ? "-" : "+"
getsign (generic function with 1 method)

julia> getsign.(a)
ERROR: TypeError: non-boolean (NAables.NAable{Bool}) used in boolean context

Maybe one can automatically treat NA as false in filter-like contexts like Boolean indexing, generators, @where, ..., so one does not have to wrap such conditions in bool(...). But because it is not safe to assume that the author of an arbitrary function using if ... else .. end or the ternary operator has anticipated the possibility of NAs, an error should be thrown if NA occurs in a comparison in a control flow statement. As Stefan Karpinsky pointed out in JuliaStats/NullableArrays.jl#85 (comment), it could still be possible to allow non-NA NAable{Bool}s in such situations.

@davidanthoff
Copy link
Member

The whole point of having a data type for missing values is to have something that is not in base so that we can iterate faster. But if this data type is not in base, then a filter in a generator can't special case it, so there is no way to get filter expressions in a generator to drop NA rows. So there already we would end up with a discrepancy between generators and queries, which I want to avoid at all cost.

I also just generally think that you make a system a LOT more complicated the moment you introduce 3VL. That aspect of SQL is not a strength, it makes things much more complicated.

I'd also like to point out that C# had this behavior for over a decade, and there is almost no complaining about it (unlike the SQL story).

Finally, I think there are a gazillion examples of how people can write functions where they didn't properly deal with all the input combinations. This is one more case, I don't see why this is any different than say a function that forgot to deal with negative numbers or something like that.

@davidanthoff
Copy link
Member

@ti-s My current plan is this: have == and all the other comparison operators return Bool. And then add lifting support via the dot syntax, so that .== returns a Nullable{Bool}.

@ti-s
Copy link
Author

ti-s commented Apr 19, 2017

I have never used C# but one difference could be that in Julia, it is common to type function signatures as loosely as possible and rely on duck typing to catch errors. If that is not the case in C#, it might happen more often in Julia that one gets silently wrong results when passing a DataValue to a function from a completely unrelated package that was written without missing values in mind.

Here is an example from base that gives unexpected results with DataValues (it uses x!=x to test for NaN but that does not work for NA):

julia> const NA = DataValue()
DataValue{Union{}}()

julia> findmax([NA, 1, 2, 3])
(DataValue{Int64}(), 1)

julia> findmax([1, 2, 3, NA])
(DataValue{Int64}(3), 3)

It would be safer to have comparisons return a Nullable{Bool} (or DataValue{Bool}) and dotted operators return a Bool. Then, conditionals would not silently give wrong results if the author didn't anticipate the use of DataValues. Also, it would need only one additional dot to fix the cases where the fallback to false is correct.

This is also different to SQL, which uses 3VL but silently interprets Unknown as falsein conditionals.

@davidanthoff
Copy link
Member

Yeah, things can go wrong with this. But I almost feel that in this example findmax should be changed, it seems to rely on a pretty specific implementation detail of floating point numbers in an algorithm that is supposed to work with all sorts of types.

The core problem is that the convention in base now is that the lifted version of a function that you get with a . will return a datatype that can represent missingness (i.e. Nullable in base). To change that convention for one operator in this package here seems really confusing and inconsistent.

This is also different to SQL, which uses 3VL but silently interprets Unknown as false in conditionals.

I know, but I don't like that one bit, I think it makes things way too difficult. I also don't see any chance that base will adopt this convention for things like if etc..

@ti-s
Copy link
Author

ti-s commented Apr 20, 2017

I agree that findmax needs to change. But my point is that these kind of errors might be more frequent in Julia than in C# because in Julia functions often rely on duck typing (I don't know if that is common in C#).

To clarify, I don't propose to copy SQL. I mentioned the difference to SQL as a potential advantage of my proposal.

An alternative could be to have comparisons with NA return DataValue(false) instead of NA or false, thereby avoiding 3VL and still having the safety of error messages in boolean contexts. I would like to see an example where this would be simpler than 3VL.

Not being able to use dotted operators because of the inconsistency with Nullable is unfortunate. The need to wrap every expression with potential NAs into a function in boolean contexts could be a bit annoying. But personally I would prefer the increased safety over the loss of convenience.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants