[RFC] Fix [map]reduce with no non-null and single non-null element #158

alyst · 2016-10-05T19:43:42Z

Also test [map]reduce() with N=2 to expose this corner case.

alyst · 2016-10-05T19:47:28Z

0.4 and 0.5 ok, failing on nightlies in broadcast.jl, I'm not sure this is related to the PR.

codecov-io · 2016-10-07T08:06:20Z

Codecov Report

Merging #158 into master will decrease coverage by -27.77%.

@@             Coverage Diff             @@
##           master     #158       +/-   ##
===========================================
- Coverage   85.89%   58.12%   -27.77%     
===========================================
  Files          14       13        -1     
  Lines         865      855       -10     
===========================================
- Hits          743      497      -246     
- Misses        122      358      +236

Impacted Files	Coverage Δ
src/reduce.jl	`83.58% <55.55%> (+0.72%)`	✅
src/map0_4.jl	`0% <ø> (-100%)`	❌
src/subarray.jl	`18.51% <ø> (-52.92%)`	❌
src/show.jl	`0% <ø> (-51.88%)`	❌
src/operators.jl	`52.5% <ø> (-42.63%)`	❌
src/broadcast.jl	`56.32% <ø> (-41.38%)`	❌
src/indexing.jl	`78.04% <ø> (-21.96%)`	❌
src/nullablevector.jl	`86.84% <ø> (-12.29%)`	❌
src/map.jl	`87.87% <ø> (-3.04%)`	❌
src/primitives.jl	`98.8% <ø> (-1.2%)`	❌
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b7b50c3...80f62d9. Read the comment docs.

alyst · 2016-10-07T08:13:08Z

Also fixed the return type of mapreduce() when the non-empty array contains no non-null elements (was returning T, should return Nullable{T}) and added the tests.

Actually, what should be the behaviour of mapreduce(skipnull=true) in the situations, where the are no non-null elements and the reduction over an empty array is not defined (no matching Base.mr_empty())? Right now it throws ArgumentError (actually thrown by the default Base.mr_empty()), but maybe it should return Nullable{T}() instead?

cc @davidagold, @nalimilan

alyst · 2016-10-07T08:15:13Z

Now fails on nightlies and also on 0.4 (all nulls test) I would like to fix that, but we should clarify the correct behaviour first.

nalimilan · 2016-10-07T08:39:23Z

Actually, what should be the behaviour of mapreduce(skipnull=true) in the situations, where the are no non-null elements and the reduction over an empty array is not defined (no matching Base.mr_empty())? Right now it throws ArgumentError (actually thrown by the default Base.mr_empty()), but maybe it should return Nullable{T}() instead?

We should do the same thing as what Base does on empty arrays. Apparently it throws an error, so let's continue doing that. If that's annoying, we should file an issue in Julia.

alyst · 2016-10-07T08:48:30Z

@nalimilan I suppose one of the Nullable roles is to silence exceptions like this. Otherwise one always has to check for such corner cases in advance, instead of doing simple post-hoc isnull(). And that's not fun for by(df, ...). I think it could be fixed by defining the default mr_empty()

Base.mr_empty{T}(f, op, Nullable{T}) = Nullable{T}()

nalimilan · 2016-10-07T09:06:55Z

Otherwise one always has to check for such corner cases in advance, instead of doing simple post-hoc isnull().

OK, that's an argument. Yet I guess Base has a strong reason for throwing an error instead of guessing the return type. I'm not sure which one, since in theory we should be able to infer the return type even when no element is present as long as we know the element type. But the definition you give doesn't do that: it assumes f and op will preserve the type, which isn't necessarily the case and will introduce type instabilities in some cases.

I think we need 1) to understand why Base does this, and 2) see how we can use Base.return_type to infer a correct type.

alyst · 2016-10-07T09:46:34Z

@nalimilan You're right, this definition is not correct, it should be something like

function Base.mr_empty{T<:DataType}(f, op, ::Type{Nullable{T}})
    f_T = Core.Inference.return_type(f, Tuple{T})
    op_T = Core.Inference.return_type(Base.r_promote, Tuple{typeof(op), f_T})
    return Nullable{op_T}()
end

Base defines the result of reduction over an empty array when "mathematically" it has some sense: sum([]) = 0.0, prod([]) = 1.0, maximum(abs, []) = 0.0 and few similar cases (I guess there are some algorithms relying on maximum(abs, []) == 0.0). By default it throws an error. So I think it would make "mathematical" sense to return Nullable by default instead of throwing an error. Trying to get the value would result in an exception anyway.

exposes some corner cases that are not observed with N>2

Avoid counting the zeros twice (for missingdata flag and for nnull) if skipnull=true.

* mapreduce() returns Nullable{T}(mr_empty) instead of just mr_empty * added tests for [map]reduce() over non-empty collection that contain no non-null

nalimilan · 2016-10-07T11:51:45Z

Shouldn't the function call Base.r_promote instead of calling return_type on it?

Anyway, I wonder whether we couldn't get rid of the mapreduce method for NullableArray and use the generic AbstractArray version. See #153 for the same discussion about broadcast. For example, this already works with Array{Nullable}:

x = [Nullable{Int}()]
mapreduce(x->x*Nullable(2), +, x) # -> Nullable{Int64}()

So I'm not sure why we would need our custom method.

alyst · 2016-10-07T12:03:17Z

Shouldn't the function call Base.r_promote instead of calling return_type on it?

The problem is that there is no valid element of type T to promote.

Anyway, I wonder whether we couldn't get rid of the mapreduce method for NullableArray and use the generic AbstractArray version. See #153 for the same discussion about broadcast. For example, this already works with Array{Nullable}:

x = [Nullable{Int}()]
mapreduce(x->x*Nullable(2), +, x) # -> Nullable{Int64}()
So I'm not sure why we would need our custom method.

Since Base.mapreduce() doesn't have skipnull, this PR is more about these cases (on 0.5):

julia> x = Nullable{Int}[]
# 0-element Array{Nullable{Int64},1}

julia> mapreduce(x->x*Nullable(2), min, x)
# Nullable{Any}()

julia> mapreduce(x->x*Nullable(2), sum, x)
# Nullable{Any}()

julia> reduce(min, x)
# Nullable{Int64}()

nalimilan · 2016-10-07T12:21:49Z

The problem is that there is no valid element of type T to promote.

Use Base.r_promote_type then.

But I'd really like to know why Base doesn't does the same when the type can be inferred. There must be an issue about it.

alyst · 2016-10-07T13:25:03Z

The problem is that there is no valid element of type T to promote.
Use Base.r_promote_type then.

I cannot find r_promote_type() in the Base.

But I'd really like to know why Base doesn't does the same when the type can be inferred. There must be an issue about it.

I think it's just "an issue" to be created. There's also e.g.

julia> sum(Nullable{Float64}[])
ERROR: MethodError: no method matching zero(::Type{Nullable{Float64}})
Closest candidates are:
  zero(::Type{Base.LibGit2.Oid}) at libgit2/oid.jl:88
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg/resolve/versionweight.jl:80
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg/resolve/versionweight.jl:120
  ...
 in _mapreduce(::Base.#identity, ::Base.#+, ::Base.LinearFast, ::Array{Nullable{Float64},1}) at ./reduce.jl:148
 in sum(::Array{Nullable{Float64},1}) at ./reduce.jl:229
 in eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:64
 in macro expansion at ./REPL.jl:95 [inlined]
 in (::Base.REPL.##3#4{Base.REPL.REPLBackend})() at ./event.jl:68

nalimilan · 2016-10-07T13:48:30Z

Funny, it's been added two days ago in master. So I guess we need to use the less pretty form for now. Would be nice to have a VERSION branch so that we can easily clean it up later.

The issue of reductions on empty collections is hardly new. See the links at:
JuliaLang/julia#18073 (comment)

Though it might not have been discussed again since the introduction of Base.return_type. You could try filing an issue in Julia to get some thoughts.

As regard the sum(Nullable{Int}[]) failure, maybe we'll have to define zero and one for Nullable, as they are quite basic functions (just like operators). Doesn't sound great, but lifting cannot help us here...

alyst · 2016-10-07T13:55:09Z

As regard the sum(Nullable{Int}[]) failure, maybe we'll have to define zero and one for Nullable, as they are quite basic functions (just like operators). Doesn't sound great, but lifting cannot help us here...

Nullable support for zero() and one() would be nice, but in general there should be a mechanism to make [map]reduce([f,] op, Nullable{T}[]) consistent with [map]reduce([f,] op, T[]), i.e. returning Nullable{T}(), when the original version throws ArgumentError and Nullable{T}(v) when the original version returns v.

davidagold · 2016-10-15T04:29:58Z

What alternative is there to throwing an error when one reduces over an empty array? Since reduce does not, in general, return a container, you can't use the inferred return type, say U to somehow inform whatever such a reduction yields -- that is, one can't just return an empty Container{U}.

The case of NullableArrays is different. Using the generic (AbstractArray) reduce along with a higher-order lift should (unless I'm missing something) give the following behavior for X::NullableArray{T}:

If X is non-empty but has no non-null values, then reduce(lift(op), X) returns an empty Nullable{U} where U is determined by type inference conducted in lift(op, x::T, y::T).
If X is empty, then reduce(lift(op), X) errors
If X is non-empty but has no non-null values, then reduce(op, dropnull(X)) errors
If X is non-empty and has non-null values, then reduce(op, dropnull(X)) returns the reduction applied to an Array containing only the non-null values of X.

I'd argue that this is almost exactly what we want. I kind of want to have a discussion about whether dropnull should return an Array or a NullableArray, which would affect the fourth behavior mentioned above. But in any case, I think that trying to reduce over missing data is not the same as trying to reduce over no data, and I'm happy if the former returns Nullable while the latter errors.

I think that, in general, we should try to rely extant frameworks in Base as much as possible. The main reasons not to are (1) performance concerns and (2) usability concerns. The community seems to be heading towards addressing (2) in packages that are closer to (tabular) data structures built on top of NullableArrays. As for (1), I think that, at this point, we should really be working with the Base language to improve compiler support for Nullable rather than hacking it onto NullableArrays. This would mean less maintenance work for us in the short term and would probably be better for both the language and code that deals with Nullables elsewhere in the long term.

alyst · 2016-10-15T08:10:30Z

@davidagold Thanks for your thorough considerations.

I totally agree that the Nullable behaviour should be defined by the Base as much possible. Actually, this PR started as a simple fix for the typo in 1-non-null case (it throws an error), which was not covered by the tests, and then 0-non-null return type. So while the behaviour fixes should be probably done in Base, I guess there should be some immediate fix for the bug.

Re the behaviour. It could work that the downstream packages (e.g. DataFrames) do the preliminary non-null checks during aggregation to prevent NullableArray throwing an exception in all-nulls skipnulls=true case, or catch that exception (whatever creates smaller overhead). I hope there would be just a very places where such check have to be injected, and the code for the check and Nullable{U}() type inference would be compact. OTOH, if DataFrames aggregation is the main/typical use case of reduction over nullables, it would make sense to make non-throwing nullable-returning behaviour the default and implement it in Base. The third option is to add strict=true/false kw argument for mapreduce() dealing with nullables (strict throws an error, non-strict returns null).

davidagold · 2016-10-18T02:24:55Z

@alyst hmm, you're definitely right that there's a fifth behavior we want, namely non-throwing over non-empty but all-null X::NullableArray. What do you think about the following proposal: we let reduce(op, X) fall back on the generic implementation, but implement lift such that lift(reduce)(op, X) has the following behavior:

if X is empty, throw an error
if X is non-empty but all-null, return Nullable{U}(), where U is chosen according via type inference (with some fall-back for non-concrete inferred types, if this makes sense).
if X is non-empty and not all null, return Nullable(reduce(op, dropnull(X)) (or reduce(lift(op), dropnull(X)) if we decide that dropnull should actually return a NullableArray).

nalimilan · 2016-10-18T07:40:44Z

lift(reduce)(op, X) sounds a bit surprising to me. Couldn't we use reduce(lift(op), X) for this? Or do you plan to give it different semantics?

davidagold · 2016-10-18T14:23:32Z

The difference between lift(reduce)(op, X) and reduce(lift(op), X) as I've delineated them above presents in the case where X has some null, some non-null values. In this case, lift(reduce)(op, X) will return reduce(op, dropnull(X)) which is non-null, whereas reduce(lift(op), X) will return (typically, assuming standard lifting semantics for op), an empty Nullable. This makes a difference in the case Alexey was talking about, aggregating over grouped data. If you want to handle aggregation with the same function over all groups, then reduce(lift(op), col) will return an empty Nullable both when a group has all null values and when a group has only some non-null values. I assume that's not what we want?

We could handle this by requiring/lowering to reduce(lift(op), dropnull(col)) and making reduction over empty NullableArrays (that is, NullableArrays with no elements, rather than an all-null NullableArray) not throw an error. Turns out this is what R does (y'all probably knew this already), but with a warning:

> x <- c()
> mean(x)
[1] NA
Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA

Though it doesn't seem to be returning NA because x is empty per se. I'd be fine with breaking from the AbstractArray generic behavior and returning an empty Nullable with a warning rather than throwing an exception. Requiring dropnull may be a bit annoying to users, though from a statistical perspective making that step explicit is definitely preferable. In any case, including dropnull could be done automatically in a querying macro if we decide later that the usability factor is a big concern.

alyst · 2016-10-18T14:50:40Z

Actually, R doesn't warn if it's an empty logical or numeric, c() is something different:

> x <- c()
> x
NULL
> x <- as.integer()
> x
integer(0)
> mean(x)
[1] NaN
> mean(x, na.rm=TRUE)
[1] NaN

davidagold · 2016-10-18T15:00:07Z

Good point. Looks like Julia implements the same behavior for mean, at least for empty vectors with numeric eltypes.

nalimilan · 2016-10-18T18:49:25Z

The difference between lift(reduce)(op, X) and reduce(lift(op), X) as I've delineated them above presents in the case where X has some null, some non-null values. In this case, lift(reduce)(op, X) will return reduce(op, dropnull(X)) which is non-null, whereas reduce(lift(op), X) will return (typically, assuming standard lifting semantics for op), an empty Nullable.

Got it, thanks. Indeed returning null whenever the input contains a null isn't really useful for a type which is designed precisely to support missing values. I've always argued that this apparent correctness wouldn't offer any additional safety to users for who there always are nulls in the data. But as you say this can be handled by convenience macros.

nalimilan · 2017-01-23T09:25:34Z

For consistency with #166, I think we should have (map)reduce automatically lift. To implement this, we could do the same as for broadcast, i.e. call mapreduce(x -> lift(op, x...), ...). That would return null if the input contains a null. Then we could have an alternative version of lift which would skip nulls, i.e. lift_skipnull(op, x::Nullable, Nullable()) -> x. The choice between these two could be made via skipnull=true as currently.

This approach avoids calling dropnull, which creates a copy.

alyst force-pushed the fix_reduce branch from c398995 to d5b3a71 Compare October 7, 2016 08:06

alyst changed the title ~~Fix reduce with single non-null~~ Fix [map]reduce with no non-null and single non-null element Oct 7, 2016

alyst changed the title ~~Fix [map]reduce with no non-null and single non-null element~~ [RFC] Fix [map]reduce with no non-null and single non-null element Oct 7, 2016

alyst added 5 commits October 7, 2016 12:28

fix typo (mapreduce -> reduce)

1e2d6a0

fix mapreduce_skipnull for single nonnull element

1b75f28

test reduce with vectors of length 2

1b25f71

exposes some corner cases that are not observed with N>2

don't count missing data twice

c0f76ae

Avoid counting the zeros twice (for missingdata flag and for nnull) if skipnull=true.

fix mapreduce() if all values are null + tests

80f62d9

* mapreduce() returns Nullable{T}(mr_empty) instead of just mr_empty * added tests for [map]reduce() over non-empty collection that contain no non-null

alyst force-pushed the fix_reduce branch from d5b3a71 to 80f62d9 Compare October 7, 2016 10:29

alyst mentioned this pull request Jan 22, 2017

Enhance joining and grouping JuliaData/DataFrames.jl#850

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Fix [map]reduce with no non-null and single non-null element #158

[RFC] Fix [map]reduce with no non-null and single non-null element #158

alyst commented Oct 5, 2016

alyst commented Oct 5, 2016 •

edited

Loading

codecov-io commented Oct 7, 2016 •

edited

Loading

alyst commented Oct 7, 2016 •

edited

Loading

alyst commented Oct 7, 2016

nalimilan commented Oct 7, 2016

alyst commented Oct 7, 2016 •

edited

Loading

nalimilan commented Oct 7, 2016

alyst commented Oct 7, 2016

nalimilan commented Oct 7, 2016

alyst commented Oct 7, 2016 •

edited

Loading

nalimilan commented Oct 7, 2016

alyst commented Oct 7, 2016

nalimilan commented Oct 7, 2016

alyst commented Oct 7, 2016

davidagold commented Oct 15, 2016 •

edited

Loading

alyst commented Oct 15, 2016

davidagold commented Oct 18, 2016

nalimilan commented Oct 18, 2016

davidagold commented Oct 18, 2016

alyst commented Oct 18, 2016

davidagold commented Oct 18, 2016

nalimilan commented Oct 18, 2016

nalimilan commented Jan 23, 2017

[RFC] Fix [map]reduce with no non-null and single non-null element #158

Are you sure you want to change the base?

[RFC] Fix [map]reduce with no non-null and single non-null element #158

Conversation

alyst commented Oct 5, 2016

alyst commented Oct 5, 2016 • edited Loading

codecov-io commented Oct 7, 2016 • edited Loading

Codecov Report

alyst commented Oct 7, 2016 • edited Loading

alyst commented Oct 7, 2016

nalimilan commented Oct 7, 2016

alyst commented Oct 7, 2016 • edited Loading

nalimilan commented Oct 7, 2016

alyst commented Oct 7, 2016

nalimilan commented Oct 7, 2016

alyst commented Oct 7, 2016 • edited Loading

nalimilan commented Oct 7, 2016

alyst commented Oct 7, 2016

nalimilan commented Oct 7, 2016

alyst commented Oct 7, 2016

davidagold commented Oct 15, 2016 • edited Loading

alyst commented Oct 15, 2016

davidagold commented Oct 18, 2016

nalimilan commented Oct 18, 2016

davidagold commented Oct 18, 2016

alyst commented Oct 18, 2016

davidagold commented Oct 18, 2016

nalimilan commented Oct 18, 2016

nalimilan commented Jan 23, 2017

alyst commented Oct 5, 2016 •

edited

Loading

codecov-io commented Oct 7, 2016 •

edited

Loading

alyst commented Oct 7, 2016 •

edited

Loading

alyst commented Oct 7, 2016 •

edited

Loading

alyst commented Oct 7, 2016 •

edited

Loading

davidagold commented Oct 15, 2016 •

edited

Loading