Clarify the possible uses of the init keyword in minimum, maximum and extrema #44819

Open
wants to merge 3 commits into
base: master

@greimel greimel commented Apr 1, 2022

The init keyword is quite nice because it allows computing extrema iteratively (e.g., updating bounds as new data come in).

julia> x = 0:10
0:10

julia> y = -1:5
-1:5

julia> extrema(x)
(0, 10)

julia> extrema(y)
(-1, 5)

julia> extrema(y, init=extrema(x)) == extrema(x ∪ y) == (-1, 10)
true

This PR documents and tests this use case. See also #43604 (comment)

Member

@tkf tkf left a comment

This is not a valid use case. For the reduce-family of API, init does not mean initial value. It means identity/neutral element. If we were to mention this type of usage, I believe that we should mention that the user has to combine the result using the corresponding binary operator:

x1, x2 = extrema(xs)
y1, y2 = extrema(ys)
min(x1, y1), max(x2, y2)
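
A minimal sketch of the combination described above, made runnable with the ranges from the opening post. The helper name `merge_extrema` is hypothetical and not part of Base; it only wraps the `min`/`max` pair shown in the snippet.

```julia
# Hypothetical helper (not part of Base): merge two (min, max) pairs
# using the binary operators that extrema is built on.
merge_extrema((lo1, hi1), (lo2, hi2)) = (min(lo1, lo2), max(hi1, hi2))

xs = 0:10
ys = -1:5

merged = merge_extrema(extrema(xs), extrema(ys))

# Merging the two partial results agrees with one pass over all the data.
@assert merged == (-1, 10)
@assert merged == extrema(vcat(xs, ys))
```

This never passes a data-dependent value as init, so it does not rely on whether init is consulted for non-empty input.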

Member

N5N3 commented Apr 1, 2022

I believe many users think init means initial value (and the default init is named as _InitialValue()).
The current implementation in Base uses init as an initial value rather than an identity/neutral element.

If init is used as the identity/neutral element, we should never use it if the input is not empty.

@N5N3 N5N3 added docs This change adds or pertains to documentation fold sum, maximum, reduce, foldl, etc. labels Apr 1, 2022
Member

tkf commented Apr 1, 2022

the default init is named as _InitialValue()

This is mainly a mechanism for foldl. That's why I only talked about "the reduce-family of API."

If it's an identity/neutral element, we should never use it if the input is not empty.

An empty collection is the identity element of the free monoid (i.e., isequal(vcat(vector, []), vector)). Considering reduction as a monoid morphism (i.e., isequal(reduce(⊗, vcat(xs, ys)), reduce(⊗, xs) ⊗ reduce(⊗, ys))), it is exactly what we should use.
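
The morphism property can be checked on a small example. This is a sketch under the assumption that `+` on integers with `init=0` stands in for the generic `⊗`; the sample vectors are made up for illustration.

```julia
xs = [3, 1, 4]
ys = [1, 5, 9, 2]

# The empty vector is the identity for concatenation.
@assert isequal(vcat(xs, Int[]), xs)

# Reduction maps that identity to the identity of the operator.
@assert reduce(+, Int[]; init=0) == 0

# Reducing a concatenation equals combining the partial reductions.
lhs = reduce(+, vcat(xs, ys); init=0)
rhs = reduce(+, xs; init=0) + reduce(+, ys; init=0)
@assert lhs == rhs == 25
```

With a value that is not an identity (say `init=1`), the two sides can disagree, because the right-hand side folds init in once per partial reduction.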

Member

N5N3 commented Apr 1, 2022

This is mainly a mechanism for foldl. That's why I only talked about "the reduce-family of API."

This makes sense. I just realized that treating init as an identity/neutral element also simplifies parallel reduction. (Although we don't have that in Base.)

Author

greimel commented Apr 4, 2022

If it's an identity/neutral element, we should never use it if the input is not empty.

An empty collection is the identity element of the free monoid (i.e., isequal(vcat(vector, []), vector)). Considering reduction as a monoid morphism (i.e., isequal(reduce(⊗, vcat(xs, ys)), reduce(⊗, xs) ⊗ reduce(⊗, ys))), it is exactly what we should use.

I don't understand this, sorry.

Could you elaborate why my attempt seemed to work in my examples above? What would have to happen so that extrema(x, init=extrema(y)) != extrema(x ∪ y)?

Given this potential source of confusion, shouldn't init give an error if the itr is not empty?

Contributor

nlw0 commented Apr 5, 2022

Completely agree with @tkf. If I may try to help explain, the problem is that there are clear constraints expected from the arguments, but unfortunately it is not easy, or maybe not even possible, to enforce these constraints in code.

I think we can say init will probably always behave like it was an extra element appended to the input vector, and we kind of have to live with that. It's not supposed to be always like that, though. The user is expected to provide a value that is consistent with a monoid. Otherwise, it's basically abusing the semantics of the function, and the user is susceptible to facing bugs if this was a parallelized version of the function, for instance.

There seems to be little that can be done other than offering defaults and explaining in the documentation that init is supposed to be the Identity element of a monoid defined along with op. Changing the name seems pretty drastic to me.

To be clear, the idea is that this function has great potential for being parallelized. Or even more than that, the user should not assume that underneath the function we will simply iterate over the list and do ((a+b)+c), or ((init+a)+b)+c, etc. ((((a+init)+b)+init)+(init+c)) should return the same result. That's the "contract" if you will. It so happens that in the way things are implemented, especially in single-threaded execution, using init as a means to append a value to the input works. But this is not the "contract", and it is not guaranteed. reduce should feel free to op(init,...) your data as many times as it wants.

In fact, it's easy to see how this is a problem when you consider strings or list concatenation. Should this init go to the left or right of the result? Only if it's an empty list or string this will not matter. Once you consider non-commutative operations, it becomes clearer that you cannot really use init like that.

It might be great, in fact, to have knowledge about whether op is commutative or not. Working with commutative monoids and Abelian groups can bring advantages. reduce is intended for the more general cases, though.
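
To make the "init may be applied many times" point concrete, here is a sketch of a chunked reduction. `chunked_reduce` and its `chunksize` parameter are hypothetical and are not how Base implements `reduce`; they only imitate what a parallel version might do, folding every chunk from init and then combining the partial results.

```julia
# Hypothetical chunked reduction (not Base's implementation): each chunk is
# folded starting from init, then the partial results are folded again.
function chunked_reduce(op, xs; init, chunksize=2)
    partials = [foldl(op, chunk; init=init)
                for chunk in Iterators.partition(xs, chunksize)]
    return foldl(op, partials; init=init)
end

words = ["aa", "bb", "cc"]

# With the identity element "", the chunking is invisible.
@assert foldl(*, words; init="") == "aabbcc"
@assert chunked_reduce(*, words; init="") == "aabbcc"

# With a non-identity init, the result depends on how the work was split.
@assert foldl(*, words; init="zz") == "zzaabbcc"
@assert chunked_reduce(*, words; init="zz") == "zzzzaabbzzcc"
```

Both strategies are legitimate ways to reduce with an associative operator, which is why only an identity element gives a result that does not depend on the strategy.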

Author

greimel commented Apr 5, 2022

Thanks, @nlw0 for your input.

Not sure I've understood it yet. You are saying that one requirement for init is that one should be able to push!(vec, init) arbitrarily often.

I see that this is a problem for sum.

sum([itr; init]) != sum([itr; init; init])

But this is not a problem for minimum, maximum, and extrema.

minimum([itr; init]) == minimum([itr; init; init]) == minimum([itr; fill(init, N)])

What am I missing?

EDIT: Does the contract imply that init is "appended" at least once?
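
The two comparisons above can be run with concrete values. The numbers below are made up for illustration; the point is only that `min` is idempotent (`min(a, a) == a`) while `+` is not.

```julia
itr = [5, 3, 8]
init = 4

# + is not idempotent, so an extra copy of init changes the sum.
@assert sum([itr; init]) != sum([itr; init; init])

# min is idempotent, so extra copies of init do not change the minimum.
@assert minimum([itr; init]) == minimum([itr; init; init]) == minimum([itr; fill(init, 10)])

# Idempotence does not make init an identity: a smaller init still wins.
@assert minimum([itr; 1]) == 1
```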

Author

greimel commented Apr 5, 2022

And to add a little more confusion, here are two contradicting lines from the reduce docstring.

The first line

collections. It is unspecified whether `init` is used for non-empty collections.

answers my previous question (no, it's not guaranteed that init is used for non-empty collections). But the second line

julia/base/reduce.jl

Lines 454 to 455 in bf53498

julia> reduce(*, [2; 3; 4]; init=-1)
-24

relies on its use for a non-empty collection.

EDIT: Even worse: this example relies on applying init exactly once. So this would qualify as an invalid use case according to @tkf, I suppose?

Contributor

nlw0 commented Apr 5, 2022

The contract implies that init might be inserted within your data zero or more times, unless the list is empty, in which case it's one or more times. Reducing a list of n copies of init should give init.

min and max are a bit of a weird case. Julia does not specify a default init for reduce, and minimum and maximum do not accept empty lists. I suspect +Inf and -Inf would work as neutral elements to constitute a monoid with min and max, respectively, over numbers, but I understand most programmers would rather have minimum(Int32[]) fail on an empty list than get 2147483647 as an answer.

In fact, you won't have a problem with min and max as long as init is a data value in your original list. The result is going to be consistent. It's like you're defining a peculiar, ad-hoc monoid, but it does work. It's a bit of a quirk of this specific case... And it implies you're not processing an empty list.

Contributor

nlw0 commented Apr 5, 2022

The case reduce(*, [2; 3; 4]; init=-1) is that same story: it runs, but it's breaking the monoid laws. It's not really guaranteed that the result will be -24; it might be 24 as well. But you can bet a milkshake it's going to be -24. Just like reduce(*, ["aa","bb","cc"], init="zz") is probably always going to be "zzaabbcc", but it's not guaranteed. Nothing that is inconsistent with the monoid laws is guaranteed, but you may still get deterministic behavior.

The docstring is saying that you "can't" use -, but actually you "can", it's just bonkers. Again, it violates the monoid laws, but the function call will run...

About the function not specifying that init should be returned, I don't know about that, I would imagine that it should specify, but I kind of understand being conservative and leaving up to the user the responsibility of handling empty inputs.

Contributor

nlw0 commented Apr 5, 2022

A short summary, perhaps:

  • These are good to go reduce(+, [1,2,3], init=0), reduce(*, [1,2,3], init=1) and reduce(*, ["aa","bb","cc"], init="")
  • These don't obey the monoid laws: reduce(-, [1,2,3], init=0), reduce(*, [1,2,3], init=-1) and reduce(*, ["aa","bb","cc"], init="zz")
    (- is not associative, the others don't have valid identity elements)
  • reduce(min, [1,2,3],init=3) sort of follows the monoid laws as well... There's probably a special name for this group.
  • reduce(op, [], init=init) the docstring says init is necessary for empty inputs, but that it is unspecified whether it is going to be used. I would suggest we would perhaps like to specify that reduce will return init for empty inputs as long as it follows the monoid laws. Otherwise getting a result of op(init, init) should be permissible as well. I don't know if there's a reason for leaving it completely unspecified other than not willing to commit to anything in the case it's not following the monoid laws, or just that it's difficult to explain.
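
The classification in this summary can be spot-checked with two small predicates. `is_identity` and `is_associative` are hypothetical names, not Base functions, and they only test the laws on the sample elements given, so passing them is evidence rather than proof.

```julia
# Hypothetical checks (not part of Base); they test the laws only on xs.
is_identity(op, e, xs) = all(x -> op(e, x) == x && op(x, e) == x, xs)
is_associative(op, xs) =
    all(op(op(a, b), c) == op(a, op(b, c)) for a in xs, b in xs, c in xs)

nums = [1, 2, 3]
strs = ["aa", "bb", "cc"]

# "Good to go": associative operator with a true identity element.
@assert is_associative(+, nums) && is_identity(+, 0, nums)
@assert is_associative(*, nums) && is_identity(*, 1, nums)
@assert is_associative(*, strs) && is_identity(*, "", strs)

# Not monoids: - is not associative; -1 and "zz" are not identities for *.
@assert !is_associative(-, nums)
@assert !is_identity(*, -1, nums)
@assert !is_identity(*, "zz", strs)

# min with init=3 is an identity for this input, but not for a wider one.
@assert is_identity(min, 3, nums)
@assert !is_identity(min, 3, [1, 2, 3, 4])
```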

Comment on lines 726 to +727
  other element) as it is unspecified whether `init` is used
- for non-empty collections.
+ for non-empty collections. If `init > maximum(itr)`, return `init`.
Member

I am okay with making the behavior specified here (cf. #49042), but the docstring needs to be consistent: it should not both say that the behavior is unspecified and also specify it here.

Contributor

My understanding is that the current maximum (and minimum, and maybe other) documentation is merely repeating what is said about the init parameter of mapreduce. But this advice is only true for mapreduce in general, and not for the specific cases of maximum and minimum. Here we should be able to say init will indeed be treated as just another element in the input, and will be returned as the output in case of an empty list. It would just be nice to have 1. a confirmation that this is the desired behavior for maximum, minimum and mapreduce and 2. perhaps the ability to prove this is the case by looking at the implementation and at which methods are called in the specific case of maximum and minimum. I'm unfortunately not too familiar with the implementation, I can't easily make sense of it myself, and I'm not sure where the implementation for these methods diverges from other reducing operations (if it diverges at all; is it still unspecified in general for mapreduce?).

In other words, I suggest removing the text that mentions anything unspecified for these methods, and only stating that init will be returned for an empty list and will generally be treated as an extra item in the input.

Member

That behavior is already specified for mapfoldr, mapfoldl, and mapreduce(identity), so it seems reasonable to assume the same for mapreduce(max) and thus maximum as well. But that is the question intended to be answered by #49042 (it looks only like a doc change to me now).

Aside, to be pedantic about this question:

reduce(min, [1,2,3],init=3) sort of follows the monoid laws as well... There's probably a special name for this group

I think this is exactly the same monoid law as the first example. In particular, the init is supposed to be ranging over the domain of the inputs. So if the input was UInt8 instead, then the init is 0xff instead. But if the input function generating that array was 2pi*sin%Int, then the init is arguably 6, since the domain of that input function is [-6, 6]. Using typemax is just a rough approximation of the expected domain in any case.
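
A short sketch of the domain argument. It assumes that `2pi*sin%Int` means truncating `2pi * sin(t)` to an integer, which is an interpretation of the shorthand above; the sample points `0:0.5:6` are made up.

```julia
# For UInt8 inputs, typemax(UInt8) == 0xff is the identity for min.
@assert typemax(UInt8) == 0xff
@assert minimum(UInt8[]; init=typemax(UInt8)) == 0xff
@assert minimum(UInt8[0x07, 0x2a]; init=typemax(UInt8)) == 0x07

# Assumed reading of "2pi*sin%Int": truncate 2pi * sin(t) toward zero.
vals = [trunc(Int, 2pi * sin(t)) for t in 0:0.5:6]
@assert all(v -> -6 <= v <= 6, vals)

# Every value is at most 6, so 6 acts as the identity for min on this domain.
@assert minimum(vals; init=6) == minimum(vals)
@assert minimum(Int[]; init=6) == 6
```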

@fingolfin fingolfin changed the title Clarify the possible uses of theinit keyword in minimum, maximum and extrema Clarify the possible uses of the init keyword in minimum, maximum and extrema Feb 8, 2024
Contributor

adienes commented Sep 11, 2024

probably subsumed / closed by documentation improvements in #53945
