Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

[RFC] Fix [map]reduce with no non-null and single non-null element #158

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

alyst
Copy link
Member

@alyst alyst commented Oct 5, 2016

Also test [map]reduce() with N=2 to expose this corner case.

@alyst
Copy link
Member Author

alyst commented Oct 5, 2016

0.4 and 0.5 ok, failing on nightlies in broadcast.jl, I'm not sure this is related to the PR.

@codecov-io
Copy link

codecov-io commented Oct 7, 2016

Codecov Report

Merging #158 into master will decrease coverage by -27.77%.

@@             Coverage Diff             @@
##           master     #158       +/-   ##
===========================================
- Coverage   85.89%   58.12%   -27.77%     
===========================================
  Files          14       13        -1     
  Lines         865      855       -10     
===========================================
- Hits          743      497      -246     
- Misses        122      358      +236
Impacted Files Coverage Δ
src/reduce.jl 83.58% <55.55%> (+0.72%)
src/map0_4.jl 0% <ø> (-100%)
src/subarray.jl 18.51% <ø> (-52.92%)
src/show.jl 0% <ø> (-51.88%)
src/operators.jl 52.5% <ø> (-42.63%)
src/broadcast.jl 56.32% <ø> (-41.38%)
src/indexing.jl 78.04% <ø> (-21.96%)
src/nullablevector.jl 86.84% <ø> (-12.29%)
src/map.jl 87.87% <ø> (-3.04%)
src/primitives.jl 98.8% <ø> (-1.2%)
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b7b50c3...80f62d9. Read the comment docs.

@alyst
Copy link
Member Author

alyst commented Oct 7, 2016

Also fixed the return type of mapreduce() when the non-empty array contains no non-null elements (was returning T, should return Nullable{T}) and added the tests.

Actually, what should be the behaviour of mapreduce(skipnull=true) in the situations, where the are no non-null elements and the reduction over an empty array is not defined (no matching Base.mr_empty())? Right now it throws ArgumentError (actually thrown by the default Base.mr_empty()), but maybe it should return Nullable{T}() instead?

cc @davidagold, @nalimilan

@alyst
Copy link
Member Author

alyst commented Oct 7, 2016

Now fails on nightlies and also on 0.4 (all nulls test) I would like to fix that, but we should clarify the correct behaviour first.

@alyst alyst changed the title Fix reduce with single non-null Fix [map]reduce with no non-null and single non-null element Oct 7, 2016
@alyst alyst changed the title Fix [map]reduce with no non-null and single non-null element [RFC] Fix [map]reduce with no non-null and single non-null element Oct 7, 2016
@nalimilan
Copy link
Member

Actually, what should be the behaviour of mapreduce(skipnull=true) in the situations, where the are no non-null elements and the reduction over an empty array is not defined (no matching Base.mr_empty())? Right now it throws ArgumentError (actually thrown by the default Base.mr_empty()), but maybe it should return Nullable{T}() instead?

We should do the same thing as what Base does on empty arrays. Apparently it throws an error, so let's continue doing that. If that's annoying, we should file an issue in Julia.

@alyst
Copy link
Member Author

alyst commented Oct 7, 2016

@nalimilan I suppose one of the Nullable roles is to silence exceptions like this. Otherwise one always has to check for such corner cases in advance, instead of doing simple post-hoc isnull(). And that's not fun for by(df, ...). I think it could be fixed by defining the default mr_empty()

Base.mr_empty{T}(f, op, Nullable{T}) = Nullable{T}()

@nalimilan
Copy link
Member

Otherwise one always has to check for such corner cases in advance, instead of doing simple post-hoc isnull().

OK, that's an argument. Yet I guess Base has a strong reason for throwing an error instead of guessing the return type. I'm not sure which one, since in theory we should be able to infer the return type even when no element is present as long as we know the element type. But the definition you give doesn't do that: it assumes f and op will preserve the type, which isn't necessarily the case and will introduce type instabilities in some cases.

I think we need 1) to understand why Base does this, and 2) see how we can use Base.return_type to infer a correct type.

@alyst
Copy link
Member Author

alyst commented Oct 7, 2016

@nalimilan You're right, this definition is not correct, it should be something like

function Base.mr_empty{T<:DataType}(f, op, ::Type{Nullable{T}})
    f_T = Core.Inference.return_type(f, Tuple{T})
    op_T = Core.Inference.return_type(Base.r_promote, Tuple{typeof(op), f_T})
    return Nullable{op_T}()
end

Base defines the result of reduction over an empty array when "mathematically" it has some sense: sum([]) = 0.0, prod([]) = 1.0, maximum(abs, []) = 0.0 and few similar cases (I guess there are some algorithms relying on maximum(abs, []) == 0.0). By default it throws an error. So I think it would make "mathematical" sense to return Nullable by default instead of throwing an error. Trying to get the value would result in an exception anyway.

alyst added 5 commits October 7, 2016 12:28
exposes some corner cases that are not observed with N>2
Avoid counting the zeros twice (for missingdata flag and for nnull) if
skipnull=true.
* mapreduce() returns Nullable{T}(mr_empty) instead of just mr_empty
* added tests for [map]reduce() over non-empty collection that
  contain no non-null
@nalimilan
Copy link
Member

Shouldn't the function call Base.r_promote instead of calling return_type on it?

Anyway, I wonder whether we couldn't get rid of the mapreduce method for NullableArray and use the generic AbstractArray version. See #153 for the same discussion about broadcast. For example, this already works with Array{Nullable}:

x = [Nullable{Int}()]
mapreduce(x->x*Nullable(2), +, x) # -> Nullable{Int64}()

So I'm not sure why we would need our custom method.

@alyst
Copy link
Member Author

alyst commented Oct 7, 2016

Shouldn't the function call Base.r_promote instead of calling return_type on it?

The problem is that there is no valid element of type T to promote.

Anyway, I wonder whether we couldn't get rid of the mapreduce method for NullableArray and use the generic AbstractArray version. See #153 for the same discussion about broadcast. For example, this already works with Array{Nullable}:

x = [Nullable{Int}()]
mapreduce(x->x*Nullable(2), +, x) # -> Nullable{Int64}()
So I'm not sure why we would need our custom method.

Since Base.mapreduce() doesn't have skipnull, this PR is more about these cases (on 0.5):

julia> x = Nullable{Int}[]
# 0-element Array{Nullable{Int64},1}

julia> mapreduce(x->x*Nullable(2), min, x)
# Nullable{Any}()

julia> mapreduce(x->x*Nullable(2), sum, x)
# Nullable{Any}()

julia> reduce(min, x)
# Nullable{Int64}()

@nalimilan
Copy link
Member

The problem is that there is no valid element of type T to promote.

Use Base.r_promote_type then.

But I'd really like to know why Base doesn't does the same when the type can be inferred. There must be an issue about it.

@alyst
Copy link
Member Author

alyst commented Oct 7, 2016

The problem is that there is no valid element of type T to promote.
Use Base.r_promote_type then.

I cannot find r_promote_type() in the Base.

But I'd really like to know why Base doesn't does the same when the type can be inferred. There must be an issue about it.

I think it's just "an issue" to be created. There's also e.g.

julia> sum(Nullable{Float64}[])
ERROR: MethodError: no method matching zero(::Type{Nullable{Float64}})
Closest candidates are:
  zero(::Type{Base.LibGit2.Oid}) at libgit2/oid.jl:88
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg/resolve/versionweight.jl:80
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg/resolve/versionweight.jl:120
  ...
 in _mapreduce(::Base.#identity, ::Base.#+, ::Base.LinearFast, ::Array{Nullable{Float64},1}) at ./reduce.jl:148
 in sum(::Array{Nullable{Float64},1}) at ./reduce.jl:229
 in eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:64
 in macro expansion at ./REPL.jl:95 [inlined]
 in (::Base.REPL.##3#4{Base.REPL.REPLBackend})() at ./event.jl:68

@nalimilan
Copy link
Member

Funny, it's been added two days ago in master. So I guess we need to use the less pretty form for now. Would be nice to have a VERSION branch so that we can easily clean it up later.

The issue of reductions on empty collections is hardly new. See the links at:
JuliaLang/julia#18073 (comment)

Though it might not have been discussed again since the introduction of Base.return_type. You could try filing an issue in Julia to get some thoughts.

As regard the sum(Nullable{Int}[]) failure, maybe we'll have to define zero and one for Nullable, as they are quite basic functions (just like operators). Doesn't sound great, but lifting cannot help us here...

@alyst
Copy link
Member Author

alyst commented Oct 7, 2016

As regard the sum(Nullable{Int}[]) failure, maybe we'll have to define zero and one for Nullable, as they are quite basic functions (just like operators). Doesn't sound great, but lifting cannot help us here...

Nullable support for zero() and one() would be nice, but in general there should be a mechanism to make [map]reduce([f,] op, Nullable{T}[]) consistent with [map]reduce([f,] op, T[]), i.e. returning Nullable{T}(), when the original version throws ArgumentError and Nullable{T}(v) when the original version returns v.

@davidagold
Copy link
Contributor

davidagold commented Oct 15, 2016

What alternative is there to throwing an error when one reduces over an empty array? Since reduce does not, in general, return a container, you can't use the inferred return type, say U to somehow inform whatever such a reduction yields -- that is, one can't just return an empty Container{U}.

The case of NullableArrays is different. Using the generic (AbstractArray) reduce along with a higher-order lift should (unless I'm missing something) give the following behavior for X::NullableArray{T}:

  • If X is non-empty but has no non-null values, then reduce(lift(op), X) returns an empty Nullable{U} where U is determined by type inference conducted in lift(op, x::T, y::T).
  • If X is empty, then reduce(lift(op), X) errors
  • If X is non-empty but has no non-null values, then reduce(op, dropnull(X)) errors
  • If X is non-empty and has non-null values, then reduce(op, dropnull(X)) returns the reduction applied to an Array containing only the non-null values of X.

I'd argue that this is almost exactly what we want. I kind of want to have a discussion about whether dropnull should return an Array or a NullableArray, which would affect the fourth behavior mentioned above. But in any case, I think that trying to reduce over missing data is not the same as trying to reduce over no data, and I'm happy if the former returns Nullable while the latter errors.

I think that, in general, we should try to rely extant frameworks in Base as much as possible. The main reasons not to are (1) performance concerns and (2) usability concerns. The community seems to be heading towards addressing (2) in packages that are closer to (tabular) data structures built on top of NullableArrays. As for (1), I think that, at this point, we should really be working with the Base language to improve compiler support for Nullable rather than hacking it onto NullableArrays. This would mean less maintenance work for us in the short term and would probably be better for both the language and code that deals with Nullables elsewhere in the long term.

@alyst
Copy link
Member Author

alyst commented Oct 15, 2016

@davidagold Thanks for your thorough considerations.

I totally agree that the Nullable behaviour should be defined by the Base as much possible. Actually, this PR started as a simple fix for the typo in 1-non-null case (it throws an error), which was not covered by the tests, and then 0-non-null return type. So while the behaviour fixes should be probably done in Base, I guess there should be some immediate fix for the bug.

Re the behaviour. It could work that the downstream packages (e.g. DataFrames) do the preliminary non-null checks during aggregation to prevent NullableArray throwing an exception in all-nulls skipnulls=true case, or catch that exception (whatever creates smaller overhead). I hope there would be just a very places where such check have to be injected, and the code for the check and Nullable{U}() type inference would be compact. OTOH, if DataFrames aggregation is the main/typical use case of reduction over nullables, it would make sense to make non-throwing nullable-returning behaviour the default and implement it in Base. The third option is to add strict=true/false kw argument for mapreduce() dealing with nullables (strict throws an error, non-strict returns null).

@davidagold
Copy link
Contributor

@alyst hmm, you're definitely right that there's a fifth behavior we want, namely non-throwing over non-empty but all-null X::NullableArray. What do you think about the following proposal: we let reduce(op, X) fall back on the generic implementation, but implement lift such that lift(reduce)(op, X) has the following behavior:

  • if X is empty, throw an error
  • if X is non-empty but all-null, return Nullable{U}(), where U is chosen according via type inference (with some fall-back for non-concrete inferred types, if this makes sense).
  • if X is non-empty and not all null, return Nullable(reduce(op, dropnull(X)) (or reduce(lift(op), dropnull(X)) if we decide that dropnull should actually return a NullableArray).

@nalimilan
Copy link
Member

lift(reduce)(op, X) sounds a bit surprising to me. Couldn't we use reduce(lift(op), X) for this? Or do you plan to give it different semantics?

@davidagold
Copy link
Contributor

The difference between lift(reduce)(op, X) and reduce(lift(op), X) as I've delineated them above presents in the case where X has some null, some non-null values. In this case, lift(reduce)(op, X) will return reduce(op, dropnull(X)) which is non-null, whereas reduce(lift(op), X) will return (typically, assuming standard lifting semantics for op), an empty Nullable. This makes a difference in the case Alexey was talking about, aggregating over grouped data. If you want to handle aggregation with the same function over all groups, then reduce(lift(op), col) will return an empty Nullable both when a group has all null values and when a group has only some non-null values. I assume that's not what we want?

We could handle this by requiring/lowering to reduce(lift(op), dropnull(col)) and making reduction over empty NullableArrays (that is, NullableArrays with no elements, rather than an all-null NullableArray) not throw an error. Turns out this is what R does (y'all probably knew this already), but with a warning:

> x <- c()
> mean(x)
[1] NA
Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA

Though it doesn't seem to be returning NA because x is empty per se. I'd be fine with breaking from the AbstractArray generic behavior and returning an empty Nullable with a warning rather than throwing an exception. Requiring dropnull may be a bit annoying to users, though from a statistical perspective making that step explicit is definitely preferable. In any case, including dropnull could be done automatically in a querying macro if we decide later that the usability factor is a big concern.

@alyst
Copy link
Member Author

alyst commented Oct 18, 2016

Actually, R doesn't warn if it's an empty logical or numeric, c() is something different:

> x <- c()
> x
NULL
> x <- as.integer()
> x
integer(0)
> mean(x)
[1] NaN
> mean(x, na.rm=TRUE)
[1] NaN

@davidagold
Copy link
Contributor

Good point. Looks like Julia implements the same behavior for mean, at least for empty vectors with numeric eltypes.

@nalimilan
Copy link
Member

The difference between lift(reduce)(op, X) and reduce(lift(op), X) as I've delineated them above presents in the case where X has some null, some non-null values. In this case, lift(reduce)(op, X) will return reduce(op, dropnull(X)) which is non-null, whereas reduce(lift(op), X) will return (typically, assuming standard lifting semantics for op), an empty Nullable.

Got it, thanks. Indeed returning null whenever the input contains a null isn't really useful for a type which is designed precisely to support missing values. I've always argued that this apparent correctness wouldn't offer any additional safety to users for who there always are nulls in the data. But as you say this can be handled by convenience macros.

@nalimilan
Copy link
Member

For consistency with #166, I think we should have (map)reduce automatically lift. To implement this, we could do the same as for broadcast, i.e. call mapreduce(x -> lift(op, x...), ...). That would return null if the input contains a null. Then we could have an alternative version of lift which would skip nulls, i.e. lift_skipnull(op, x::Nullable, Nullable()) -> x. The choice between these two could be made via skipnull=true as currently.

This approach avoids calling dropnull, which creates a copy.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants