unspecified behavior of `mapreduce` is bad #49042

fedeinthemix · 2023-03-18T11:19:14Z

The documentation for mapreduce says that it is unspecified if the init key argument is used for non-empty collections. Since there is a single julia implementation leaving the behavior unspecified is bad. In general the init argument is a starting point that you want to combine with the rest. For example

sol = nlsolve(f!, initial_x)
 # mapreduce doesn't specify if `init` is used. Need to use reduce(..., map(...))
 if mapreduce(map(v -> v > 0, &,  sol.zero, init=converged(sol))
     return ...
 else
   ...
 end

Using the init argument to specify the result in case of an empty collection is a special case.

The text was updated successfully, but these errors were encountered:

jishnub · 2023-03-21T07:29:01Z

From what I understand, the nomenclature is unfortunate, and init doesn't specify a starting point, rather it provides the identity element in the reduction. The documentation of mapreduce states this as well:

If provided, `init` must be a neutral element for `op` that will be returned for empty collections.

This implies that init is not arbitrary, and must satisfy op(op(init, x), y) == op(x, y). For the operator &, therefore, init should be set to true. In such a case, whether init is used or not doesn't change the result, and it's left as an implementation detail.

In your case, perhaps you might prefer foldl which guarantees that init is used exactly once, even if it's not set to the identity element. Or, perhaps the more explicit

if converged(sol) && mapreduce(map(v -> v > 0, &, sol.zero)

fedeinthemix · 2023-03-21T09:15:31Z

You are right that the documentation states that init must be a neutral element of the operation, but then the similarly worded documentation for reduce shows the following example

reduce(*, [2; 3; 4]; init=-1)

There are plenty of alternatives for my example, that's not the issue. My points are two. The first is that mathematically init doesn't have to be the neutral element (like -1 above) and it would be useful to allow this generality (like in the example I gave). Some other languages embrace this freedom
https://hackage.haskell.org/package/base-4.18.0.0/docs/Prelude.html#v:foldr

Second, documenting some behavior of an algorithm implementation as unspecified is unhelpful. Unspecified behavior may be acceptable in standards that want to allow for radically different implementations.

jakobnissen · 2023-03-21T11:04:38Z

Second, documenting some behavior of an algorithm implementation as unspecified is unhelpful. Unspecified behavior may be acceptable in standards that want to allow for radically different implementations.

Documenting it as unspecified allows Julia's implementation to change without being a breaking change. In other words, this is about being able to have different implementations across time.

That being said, the example you point to with reduce(*, [2; 3; 4]; init=-1) should indeed be changed, since its usage is incorrect.

Edit: Actually, it might be an issue (a bug?) that the two documented methods of reduce disagree on what init is. Is it the initial value, or the value returned on empty collections? Is it guaranteed to be used once? We should definitely clean this up.

Given that init has been documented to be the initial value for reduce on arrays, and it's called init, and the similar keyword in foldl, I think we should make it documented behaviour that it's used exactly once. Since we don't guarantee associativity, I'm not sure it's even meaningful to define it as the "initial value".

Are there any optimisation opportunities we lose if the guarantee that init is used exactly once?

N5N3 · 2023-03-21T11:07:18Z

See #44819.

fedeinthemix · 2023-03-21T13:05:23Z

Since we don't guarantee associativity, I'm not sure it's even meaningful to define it as the "initial value".

I guess I don't clearly understand the difference between reduce and foldr/foldl. Using (for clarity) Haskell's notation for the type of functions, the latter two require functions with a different signature

foldr :: (a -> b -> b) -> b -> t a -> b
foldl :: (b -> a -> b) -> b -> t a -> b

Associativity is irrelevant if the types a and b are the same, and the function is associative. In that case the two give the same result, but with different memory and speed results. Was reduce added as a way of helping picking the fastest one? If not, what is the actual semantic of reduce then?

jakobnissen · 2023-03-21T13:11:06Z

Re-ordering the operations can have various benefits. For example, re-ordering floating point operations allow you to use SIMD, making it much faster. It also enables pairwise summation, reducing the impact of float rounding errors.

So, if you have a function which is associative, you would use reduce in order to let the implementation decide the best order of operations. In that sense, foldl and foldr can be seen as special cases of reduce.

mcabbott · 2023-03-21T13:16:32Z

Are there any optimisation opportunities we lose if the guarantee that init is used exactly once?

CUDA does parallel reduction, and relies on init being the neutral element:

julia> let x = collect(1:10^4), init=2
        a = reduce(+, x; init)
        b = reduce(+, CuArray(x); init)
        a, b, sum(x)
       end
(50005002, 50046150, 50005000)

ctarn · 2023-04-05T12:27:54Z

It confuses me too. Both C++ and Python treat init of reduce as initial value or starting point.

C++: https://en.cppreference.com/w/cpp/algorithm/reduce

Generalized sum of init and *first, *(first+1), ... *(last-1) over binary_op

Python: https://docs.python.org/3/library/functools.html#functools.reduce

If the optional initializer is present, it is placed before the items of the iterable in the calculation, and serves as a default when the iterable is empty.

jishnub · 2023-04-05T12:36:52Z

The definition of functools.reduce in python seems to be equivalent to foldl in Julia

fedeinthemix · 2023-04-06T06:24:16Z

If init must be a unit and the operation op associative, then op and type ty form a monoid. Wouldn't it be useful to add something like monoidunit(ty, op) generalizing e.g. one, ones, zero and zeros. Then the init parameter could be either dropped or, more usefully, used as an initial value.

jariji · 2023-04-06T06:34:36Z

There is InitialValues.jl which says

InitialValues.jl provides a generic singleton initial value Init(f) that can be used as a₀ in f(a₀, x). For a binary operator op, it means that Init(op) acts like the left identity for any type of x

I'm not sure if it's better to do it with the two-argument monoidunit(t, op) or the one-argument wrapper Init(op).

fedeinthemix · 2023-04-06T06:58:21Z

I'm not sure if it's better to do it with the two-argument monoidunit(t, op) or the one-argument wrapper Init(op).

The type is important. For example for + if you specify the type you can use it for any scalar number, but also matrices, vectors, ... Also, without type you may loose type information. For example for vector concatenation the unit is the empty one, but without specifying the type you get

julia> v = [1, 2]
2-element Vector{Int64}:
 1
 2

julia> vcat(v, [])
2-element Vector{Any}:
 1
 2

ctarn · 2023-04-06T07:10:53Z

InitialValues.jl gives the following example:

julia> float(InitialValue(*))
1.0

julia> Integer(InitialValue(+))
0

So an InitialValue can also be casted to a specific type.

Personally I prefer to leave the feature as InitialValues.jl instead of adding it to standard lib of Julia.

fedeinthemix · 2023-04-06T07:24:11Z

Personally I prefer to leave the feature as InitialValues.jl instead of adding it to standard lib of Julia.

It is a suggestion to remove the requirement that the init parameter of reduce and mapreduce be a unit. Instead the implementation could use monoidunit.

vtjnash · 2023-10-31T14:48:59Z

Trying to wrap my head around this question a bit, it seems like the original PR for init left the meaning of init unspecified for mapreduce, but did specify that reduce is defined in terms of mapreduce and specified that init have the meaning of being the first element of the reduction, thus transitively enforcing this upon mapreduce also:
https://github.com/JuliaLang/julia/pull/27711/files#diff-cc983f27f9c33e9b278f74989b8aaca48e461e0717bb96ea116d02754554746fR290. I don't know if @andyferris remembers this from that time, and if there was a counter-example that prevented it from being defined at that point?

nlw0 · 2023-10-31T15:17:08Z

Perhaps it's not correct to say this is a case of "unspecified behavior". My understanding is that you should assume we could apply "op(init,x)" an arbitrary number of times to your input. Only the number of times is not specified. The idea is that we are enabling a potential parallel version of the reduce based on a number of folds that start with init. Is this fair to say?

Or I guess the number of applications is specified as "at least once", what includes the case of an empty list? And this would be the slightly complicated detail. Everything else, whether it's something suitable to do or not according to the operator in eg maximum or sum is just a consequence of the operator. My point is to clarify that it all sounds crazy unless you know you're talking about sum etc, and then it all sounds perfectly reasonable. And saying this is "unspecified behavior" doesn't do justice to this reasonableness.

vtjnash · 2023-10-31T20:20:50Z

There is no reason that fold needs to use init other than exactly once. In a parallel implementation, the init would likely be dealt with at the end, in the last reducer step, since non-empty chunks do not need it. Indeed, it is required that the implementations can use init zero times already, so it merely needs to arrange for init to only be used in the final serial potion (or any one of the initial parallel portions--either option is equivalent)

nlw0 · 2023-11-01T07:49:27Z

I didn't mean it would have to be like this, it's just one option on my mind. I believe you're looking at it from the point of view of what is strictly required, that init is the output at an empty list, and that's great. I'm merely illustrating a situation that exercises the other requirement: that it might be applied an indefinite number of times. When you write down your fold(s) as a for-loop it's very handy to initialize an accumulator variable before the loop with something such as an "init" value (eg zero in a summation) instead of carefully loading the first datapoint and then proceeding from there --- what actually implies testing for an empty list as well, I believe that may be the best illustration to why this makes sense.

andyferris · 2023-11-01T08:27:25Z

My memory is a bit vague. I feel init should probably be used if specified, not just for empty collections.

There might be an argument that (eg parallel) algorithms might use a divide-and-conquer approach and benefit from using init more than once.

I’m not sure if a different verb (generic function) is necessary for when we assume the operator is associative, or commutative, or that init is an identity, as opposed to one where the result must match the accumulation in iteration order as if you wrote it out imperatively?

mikmoore · 2024-03-20T16:37:55Z

EDIT: it's pointed out below that my tirade here only concerns one of several (sometimes conflicting) purposes of init. As such, it is incomplete in both its analysis and proposed "solution."

The init keyword is ridiculous except in the fold family where it is guaranteed to be applied exactly once. The reduce family should never have had this (badly-defined) init. It's a pun and it's a bad one.

If it wasn't a pun, then we would have maximum(Float64[]) == -Inf because the only valid init for this call is -Inf.
Instead it's a MethodError. (ASIDE: this should be an ArgumentError). We also have a minor infraction with our current sum(Float64[]) == +0.0 because the additive identity for IEEEFloat is -0.0 (compare +(-0.0, -0.0) and +(-0.0, +0.0).

Do I think that we should have maximum(Float64[]) == -Inf? No, it's a footgun. Do I think we should have sum(Float64[]) == -0.0? No, it's obnoxious and would be wildly unpopular. But this just shows that init was the wrong keyword with the wrong semantics.

An init keyword is useless for a reduce because the reduction order is unspecified. In other words, sum(x) + y is programmatically equivalent to what would want from sum(x; init=y) (if this were usable). I think we should deprecate init for all reduce functions.

Right now the correct thing to do is write (e.g.) isempty(x) ? DEFAULT_VALUE : maximum(x). If we want a keyword for this, we should make one like ifempty (insert bikeshedding) with usable semantics.

Under the semantics of init, we could make a nonbreaking change to deprecate init and introduce ifempty if we default ifempty = init when init is specified. This would simply resolve the ambiguous behavior of init to the semantic "only apply to empty collections when ifempty is not defined."

mbauman · 2024-03-20T17:55:41Z

It's worth connecting this to #52397 — one of the sticky points there is figuring out exactly how far to lean into init's unspecified-ness and what its current docs mean.

I can count four distinct (but entwined) motivating reasons why someone might use init in the context of foldl:

Provide the return value when empty
Ensure op is called for all non-empty collections (including single-element collections)
A shorthand to pretend init exists in the collection exactly once like foldl(op, [init; x]) or sum(x) + init.
Ensure all operations happen on a promoted type like foldl(+, rand(UInt8,100), init=0)

But when it comes to a order-unspecified (possibly parallel) reduce, points 3 and 4 are in direct conflict with eachother. And while I am in near-complete agreement with @mikmoore's points above, renaming this to ifempty leaves point 2 in the dark.

vtjnash · 2024-03-20T18:37:51Z

I think the aside comment points to #52004 and a couple others like it. Just waiting for a rebase if someone wants to take those over.

JeffBezanson · 2024-03-26T19:26:01Z

I agree it would have been better for reduce not to have the init argument. But it does, and what it does in Base seems to be Matt's point (3) above, pretend it's added to the collection once, so that's probably what the spec should be.

ctarn mentioned this issue Apr 5, 2023

Base.reduce(::typeof(vcat), ...) on DataFrames does not support init JuliaData/DataFrames.jl#3309

Closed

N5N3 added the fold sum, maximum, reduce, foldl, etc. label Apr 5, 2023

vtjnash mentioned this issue Oct 30, 2023

Clarify the possible uses of the init keyword in minimum, maximum and extrema #44819

Open

mbauman mentioned this issue Mar 26, 2024

Change init's role in reduce-like functions: remove "neutral element" restriction and guarantee its use #53871

Closed

mbauman linked a pull request Apr 3, 2024 that will close this issue

Document that reduce-like functions use init one or more times #53945

Open

non-Jedi mentioned this issue Aug 28, 2024

Add init kwarg to mean/median to match other reductions in Base JuliaStats/Statistics.jl#105

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

unspecified behavior of `mapreduce` is bad #49042

unspecified behavior of `mapreduce` is bad #49042

fedeinthemix commented Mar 18, 2023

jishnub commented Mar 21, 2023

fedeinthemix commented Mar 21, 2023 •

edited

Loading

jakobnissen commented Mar 21, 2023 •

edited

Loading

N5N3 commented Mar 21, 2023

fedeinthemix commented Mar 21, 2023

jakobnissen commented Mar 21, 2023

mcabbott commented Mar 21, 2023

ctarn commented Apr 5, 2023 •

edited

Loading

jishnub commented Apr 5, 2023

fedeinthemix commented Apr 6, 2023

jariji commented Apr 6, 2023

fedeinthemix commented Apr 6, 2023 •

edited

Loading

ctarn commented Apr 6, 2023 •

edited

Loading

fedeinthemix commented Apr 6, 2023

vtjnash commented Oct 31, 2023

nlw0 commented Oct 31, 2023 •

edited

Loading

vtjnash commented Oct 31, 2023 •

edited

Loading

nlw0 commented Nov 1, 2023

andyferris commented Nov 1, 2023 •

edited

Loading

mikmoore commented Mar 20, 2024 •

edited

Loading

mbauman commented Mar 20, 2024 •

edited

Loading

vtjnash commented Mar 20, 2024

JeffBezanson commented Mar 26, 2024

unspecified behavior of mapreduce is bad #49042

unspecified behavior of mapreduce is bad #49042

Comments

fedeinthemix commented Mar 18, 2023

jishnub commented Mar 21, 2023

fedeinthemix commented Mar 21, 2023 • edited Loading

jakobnissen commented Mar 21, 2023 • edited Loading

N5N3 commented Mar 21, 2023

fedeinthemix commented Mar 21, 2023

jakobnissen commented Mar 21, 2023

mcabbott commented Mar 21, 2023

ctarn commented Apr 5, 2023 • edited Loading

jishnub commented Apr 5, 2023

fedeinthemix commented Apr 6, 2023

jariji commented Apr 6, 2023

fedeinthemix commented Apr 6, 2023 • edited Loading

ctarn commented Apr 6, 2023 • edited Loading

fedeinthemix commented Apr 6, 2023

vtjnash commented Oct 31, 2023

nlw0 commented Oct 31, 2023 • edited Loading

vtjnash commented Oct 31, 2023 • edited Loading

nlw0 commented Nov 1, 2023

andyferris commented Nov 1, 2023 • edited Loading

mikmoore commented Mar 20, 2024 • edited Loading

mbauman commented Mar 20, 2024 • edited Loading

vtjnash commented Mar 20, 2024

JeffBezanson commented Mar 26, 2024

unspecified behavior of `mapreduce` is bad #49042

unspecified behavior of `mapreduce` is bad #49042

fedeinthemix commented Mar 21, 2023 •

edited

Loading

jakobnissen commented Mar 21, 2023 •

edited

Loading

ctarn commented Apr 5, 2023 •

edited

Loading

fedeinthemix commented Apr 6, 2023 •

edited

Loading

ctarn commented Apr 6, 2023 •

edited

Loading

nlw0 commented Oct 31, 2023 •

edited

Loading

vtjnash commented Oct 31, 2023 •

edited

Loading

andyferris commented Nov 1, 2023 •

edited

Loading

mikmoore commented Mar 20, 2024 •

edited

Loading

mbauman commented Mar 20, 2024 •

edited

Loading