
RFC: Add support for keys(g::Generator) (fixes #27612) #27640

Closed
wants to merge 2 commits into from

Conversation

drewrobson

As discussed here and here, this PR proposes that argmax(f(x) for x ∈ itr) should work and return an x in itr that maximizes f(x). This is closely related to the mathematical definition of argmax, and consistent with other current 1-arg usages of argmax, such as argmax(A::AbstractArray) and argmax(d::Dict) as pointed out here.

*: As pointed out by @Nosferican, this PR is not precisely the same as the mathematical definition of argmax, because it follows the existing pragmatic convention of returning a single maximizer instead of all of them. Discussion of this point can continue here.

For context, I also considered whether it would make sense to have a 2-arg flavor of argmax(f, itr) that returns an x in itr that maximizes f(x). Helpful feedback from @ararslan and @martinholters persuaded me that this 2-arg syntax should be reserved for a different purpose, explained here.

The strategy taken here is to define keys(g::Generator). This gives us pairs(g), argmin(g), argmax(g), findmin(g), and findmax(g) for free. Here's a demo:

julia> g = (-x^2 for x in -5:5)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##3#4"))}(getfield(Main, Symbol("##3#4"))(), -5:5)

julia> for p in pairs(g)
           display(p)
       end
-5 => -25
-4 => -16
-3 => -9
-2 => -4
-1 => -1
0 => 0
1 => -1
2 => -4
3 => -9
4 => -16
5 => -25

julia> argmin(g)
-5

julia> argmax(g)
0

julia> findmin(g)
(-25, -5)

julia> findmax(g)
(0, 0)
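For readers who want to try these semantics without patching Base, here is a standalone sketch of what the PR proposes; the helper names `gen_pairs` and `gen_argmax` are my own, not part of the actual diff:

```julia
# Standalone sketch of the proposed semantics: treat g.iter as the
# keys of the generator, and pair each key with g.f applied to it.
gen_pairs(g::Base.Generator) = (k => g.f(k) for k in g.iter)

# argmax then falls out: return the key whose mapped value is largest.
function gen_argmax(g::Base.Generator)
    best = nothing
    for (k, v) in gen_pairs(g)
        if best === nothing || v > best[2]
            best = (k, v)
        end
    end
    best === nothing && throw(ArgumentError("collection must be non-empty"))
    return best[1]
end

g = (-x^2 for x in -5:5)
gen_argmax(g)  # 0, matching argmax(g) in the demo above
```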

If there is support for this PR, then I will add tests and update the docs as well. I should probably check the performance of this solution as well.

This gives us pairs(g), argmin(g), argmax(g), findmin(g), and findmax(g)
@ararslan added the "needs tests" (Unit tests are required for this change) and "collections" (Data structures holding multiple items, e.g. sets) labels Jun 18, 2018
@stevengj
Member

Should we also define getindex(g::Generator, i) = g.f(i), then? I'm a little confused by a keys method that doesn't correspond to a getindex...

@mbauman
Member

mbauman commented Jun 18, 2018

Yes, I feel this exact tension that you described well: #27612 (comment)

Maybe that's a bad idea, since g[10] doesn't actually work so keys(g) = 1:length(g) would be false advertising.

Alternatively, one could have a specialized definition for argmax for generators. The implementation would be nearly identical, except replace keys(g) with 1:length(g). This way, there's no need to define keys(g) in general if that is a concern.
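That alternative can be sketched as follows (the name `argmax_by_position` is hypothetical; this is not the PR's diff):

```julia
# Sketch of a specialized argmax for generators that uses positions
# (effectively 1:length) rather than g.iter as the key space.
function argmax_by_position(g::Base.Generator)
    best_i, best_v = 0, nothing
    for (i, v) in enumerate(g)
        if best_v === nothing || v > best_v
            best_i, best_v = i, v
        end
    end
    best_i == 0 && throw(ArgumentError("collection must be non-empty"))
    return best_i
end

argmax_by_position(-x^2 for x in -5:5)  # 6: the position where x == 0
```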

Also note that g.iter isn't guaranteed to be something that's well-ordered or even independently iterable:

julia> g = (x for x in eachline("NEWS.md"))
Base.Generator{Base.EachLine,getfield(Main, Symbol("##3#4"))}(getfield(Main, Symbol("##3#4"))(), Base.EachLine(IOStream(<file NEWS.md>), getfield(Base, Symbol("##295#296"))(Core.Box(IOStream(<file NEWS.md>))), false))

julia> g.iter
Base.EachLine(IOStream(<file NEWS.md>), getfield(Base, Symbol("##295#296"))(Core.Box(IOStream(<file NEWS.md>))), false)

julia> g = (x for x in 1:4, y in 1:5)
Base.Generator{Base.Iterators.ProductIterator{Tuple{UnitRange{Int64},UnitRange{Int64}}},getfield(Main, Symbol("##5#6"))}(getfield(Main, Symbol("##5#6"))(), Base.Iterators.ProductIterator{Tuple{UnitRange{Int64},UnitRange{Int64}}}((1:4, 1:5)))

julia> g.iter
Base.Iterators.ProductIterator{Tuple{UnitRange{Int64},UnitRange{Int64}}}((1:4, 1:5))

julia> collect(g.iter) # this would typically be CartesianIndices, not tuples
4×5 Array{Tuple{Int64,Int64},2}:
 (1, 1)  (1, 2)  (1, 3)  (1, 4)  (1, 5)
 (2, 1)  (2, 2)  (2, 3)  (2, 4)  (2, 5)
 (3, 1)  (3, 2)  (3, 3)  (3, 4)  (3, 5)
 (4, 1)  (4, 2)  (4, 3)  (4, 4)  (4, 5)

@mbauman
Member

mbauman commented Jun 18, 2018

Maybe we could support keys and indexing just for Generator{<:AbstractArray} and Generator{<:Base.Iterators.ProductIterator{<:Tuple{Vararg{AbstractArray}}}}?
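A minimal sketch of that restriction, using free functions instead of extending Base (names are hypothetical, and the ProductIterator case is omitted for brevity):

```julia
# Only define key/index access when the wrapped iterator is an
# AbstractArray, so that indexing into it is well-defined.
safe_keys(g::Base.Generator{<:AbstractArray}) = keys(g.iter)
safe_getindex(g::Base.Generator{<:AbstractArray}, i...) = g.f(g.iter[i...])

g = (x^2 for x in [10, 20, 30])
collect(safe_keys(g))   # [1, 2, 3]
safe_getindex(g, 2)     # 400
```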

@drewrobson
Author

drewrobson commented Jun 18, 2018

Also note that g.iter isn't guaranteed to be something that's well-ordered or even independently iterable:

Maybe I'm misunderstanding, but in every case I've tried (including the examples you wrote) I can do this:

for i in g.iter
    println(i)
end

Doesn't that mean g.iter is independently iterable?

As for "well-ordered", if the order of iteration is not defined and/or not deterministic, then this would return the first maximizer encountered. I suppose the worry here is that it is hard to say what "first" means in this case?

Edit: I was thinking this could work on argmax(f(x) for x ∈ S) for S::Set just fine. Since Set is unordered, the only assurance is that I will get an element of S that is a maximizer. I don't have an actual use case for this though, so if people are uncomfortable with that I'm not so attached to it.

@Nosferican
Contributor

I think, generally, if a method takes an iterable, why shouldn't it support Generator... However, for this particular case, it doesn't actually require it if we go with the actual mathematical definition, which is defined for sets (#27639).

@mbauman
Member

mbauman commented Jun 18, 2018

By independently iterable, I mean that you can iterate g.iter without affecting g itself:

julia> g = (x for x in eachline("NEWS.md"))
Base.Generator{Base.EachLine,getfield(Main, Symbol("##11#12"))}(getfield(Main, Symbol("##11#12"))(), Base.EachLine(IOStream(<file NEWS.md>), getfield(Base, Symbol("##295#296"))(Core.Box(IOStream(<file NEWS.md>))), false))

julia> first(g.iter)
"Julia v0.7.0 Release Notes"

julia> first(g)
"=========================="

julia> collect(g.iter);

julia> collect(g)
0-element Array{Any,1}

@drewrobson
Author

Aha! Thanks for clarifying.

@bramtayl
Contributor

bramtayl commented Jun 19, 2018

Maybe a getindex_unsafe, which would fall back on getindex but also support potentially unsafe operations? It could be used for all sorts of things, e.g. zips, lags, etc. Something else to consider would be a view_unsafe for ultimate laziness.

@drewrobson
Author

drewrobson commented Jun 19, 2018

An alternative would be to define pairs(g::Generator) directly. This way, g.iter is not exposed and cannot be consumed independently of g. Then there is also no need for getindex at all.

Edit: In other words, don't define keys(g::Generator) at all.

@JeffBezanson
Member

keys and pairs go together; both or neither should be supported.

I'm torn on this one. The functionality seems very useful but I'm not sure whether generators have indices.

I'm also not sure argmax(f, x) needs to mean argmax(map(f, x)). In the standard definition of argmax it operates on both a function and a set, which you'd think correspond to the two arguments. There is potential for confusion there. The "implicit map" convention can't be taken too seriously, since there are too many other things passing a function as the first argument could mean.

@nalimilan
Member

All find* functions used to support any collection, assuming 1-based indices for non-AbstractArray. This was removed intentionally after discussion at #25999.

We could imagine having a special case for iterators which are based on AbstractArrays, which could inherit the indices of the original collection over which they iterate. But then we should also define keys. That's quite radical.

A more limited approach would be to add a convenience wrapper iterator withlinearindices, which would be indexable and would support all find* functions (including argmax).

@drewrobson
Author

drewrobson commented Jun 20, 2018

The discussion here has prompted me to think about what a generator is. Here's an attempt: (sorry it's so long!)

I briefly thought a generator is a lazy Dict, but that's wrong. Iterating a Dict yields Pair{K, V} iterates, and collect(d::Dict) yields a collection with eltype Pair{K, V}. Iterating a Generator yields only values (specifically, values in the range of g.f), and collect(g::Generator) yields a collection of values (again, in the range of g.f).

So a generator is more like an array. I think g::Generator is an AbstractArray that lazily computes its values on demand. Its indices are precisely g.iter. It would follow that g::Generator is an AbstractArray even if g.iter is something else, such as a Set. The keys of an AbstractArray can be any collection, not just AbstractArray, so we have to be really careful which "collection" we are talking about: the collection of values, g, which I claim is an AbstractArray, or the collection of keys, which I claim can be any collection (*1).

Unlike in #25999, where find* functions had to be restricted because not all collections have keys/indices, there is no need to restrict find* functions to a "safe" subset of generators because we always have keys(g::Generator) = g.iter. In particular, it doesn't matter whether g.iter itself has keys (*2) or whether g.iter has a well defined ordering. These are both irrelevant since g.iter is the collection of keys, not the collection of values. The lessons of #25999 only apply to the collection of values (i.e. g).

As an experiment, I have updated this PR so that getindex(g::Generator, i) is supported and pairs(g::Generator) returns a Pairs iterator that efficiently iterates the indices and values. In other words, pairs converts a lazy array to a lazy dict-like collection. Indeed, we can now write Dict(pairs(-x^2 for x in -5:5)) to get a proper (i.e. eager, non-lazy) Dict, generated by efficiently iterating the Pairs iterator derived from the generator.

*1: @mbauman made a good point that some iterables modify underlying state and are not great candidates to be a stable keys collection. It would be great to have a way to restrict the keys collection to collections that are independently iterable, but I think this is currently hard to specify. FWIW, one can also get into trouble even applying map to collections where iteration has side effects, and yet it is still allowed. To some extent, I think we just have to trust people to use tricky iterables correctly, but maybe there's a better way (IndependentlyIterable trait? caveat: I don't know what I'm talking about).

*2: @nalimilan's suggestions (e.g. "inherit the indices of the original collection" and inventing linear indices like before #25999) remind me of the alternative way that some people were interpreting a 2-arg argmax discussed in #27613. Perhaps some people regard the expression (-x^2 for x in -5:5) as a composition of -x^2 on top of g.iter, which is itself a mapping over its indices (in this case a map from 1:11 to -5:5). But wait, those indices have indices too! We have to designate some collection to be the domain / keys and then stop, even if that collection can be indexed and could be viewed as a map over its indices as well. I am learning that there are two popular ideas about how many steps to go back. I like my way because it allows me to write best_angle = argmax(score(angle) for angle in angles), which was more or less the 1 line of code that launched me on this crusade. 😃
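The Dict construction described in this comment can be imitated without the PR by spelling the key => value pairs out as their own generator; the helper `to_dict` below is my own name for it, not the PR's implementation:

```julia
# Eagerly turn a generator into a dict-like collection: pair each
# element of the key collection (g.iter) with the mapped value.
to_dict(g::Base.Generator) = Dict(k => g.f(k) for k in g.iter)

d = to_dict(-x^2 for x in -5:5)
d[-5]      # -25
d[0]       # 0
length(d)  # 11
```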

@vtjnash
Member

vtjnash commented Jun 20, 2018

We have to designate some collection to be the domain / keys and then stop, even if that collection can be indexed and could be viewed as a map over its indices as well

That's a really interesting point.

Its indices are precisely g.iter.
*2

I came here to mention that the indices might also be 1:length(iter) (e.g. Iterators.countfrom()). These are the indices that would be assigned to the elements by collect when converting from a lazy map to an eager map (especially as we are likely converging on defining map as an order-preserving operation).

However, that re-indexing operation is perhaps already expressed by enumerate, leaving us free to consider defining that the keys (multi-)set is g.iter.

@drewrobson
Author

Yes, it's interesting. Right now, though, I am worrying about the fact that keys(collect(g)) ≠ keys(g). To fix that, collect(g) would have to return something like an OffsetArray, at least in my simple (-x^2 for x in -5:5) example (generally, it would mean: when applying a function to a collection, retain that collection as the keys of the resulting array), and that seems nuts.

@mbauman
Member

mbauman commented Jun 20, 2018

We could punt the keys and getindex method errors to the g.iter:

Base.keys(g::Base.Generator) = keys(g.iter)
Base.getindex(g::Base.Generator, i) = g.f(g.iter[i])

This fits quite nicely with the design we're trying out in #27038.

@vtjnash
Member

vtjnash commented Jun 20, 2018

We could punt the keys and getindex method errors to the g.iter:

That's not right though, since – for the OP use case – the goal here is to treat (f(x) for x in y) as Dict(y => f(x) for x in y) when passed to argmax as a single value

@Nosferican
Contributor

After the discussion, I believe argmax/argmin should be restricted to two arguments, a f::Function and an iterable.

@vtjnash
Member

vtjnash commented Jun 20, 2018

keys(collect(g)) ≠ keys(g)

I think collect (possibly renamed Array in the future?) must be defined essentially as "take the elements from g, in order, and re-index them with enumerate, such that they are compatible with Array" – i.e. it's an operation that explicitly changes the keys to 1:length while preserving the elements. In the past, for some of the related operations like map, we've attempted to preserve some alternative key-mappings (such as Dict=>Dict, Set=>Set, and String=>String), but I think we've found that it can't be done in a consistent manner. And, as a corollary, it's more important / useful to maintain the exact output and iteration order through any eager-lazy re-expressions.

One follow-on observation derived from this is that it seems like lazy transforms may be able to always preserve their original key set (Generator -> g.iter, Pairs -> p.itr, Enumerate -> Count, Zip -> Zip(keys), ValueIterator -> keys(iter.dict)), but can't be certain to provide random access. But eager transforms (map, collect, for(each), tuple-splat / Core._apply) always re-index them, guaranteeing efficient random-access as 1:n (but sometimes also preserving the shape of higher-order dimensions, if defined and applicable), but no longer providing the same keys (unless the old keys also happened to be 1:n). This is just my theory though.

(And there always seem to be counter-examples to any theory. For example, broadcast appears to be a mapping of keys, via getindex, not iteration. This requires that the lazy operation changes the key-set (to the broadcast over the input key-sets), and that the eager operation (aka copyto) preserves those new keys.)

@vtjnash added the "triage" (This should be discussed on a triage call) label Jun 20, 2018
@mbauman
Member

mbauman commented Jun 20, 2018

That's not right though, since – for the OP use case – the goal here is to treat (f(x) for x in y) as Dict(y => f(x) for x in y) when passed to argmax as a single value

It's not really a question of right or wrong — it's simply trying to decide how we define keys(::Generator). Is it array-like? Or is it dictionary-like?

Given its multidimensional behavior and symmetry with comprehensions, I'd say it's more array-like.

@drewrobson
Author

drewrobson commented Jun 20, 2018

Yes, definitely array-like. But since dictionaries and arrays both have keys, I've been looking for a definitive statement about their differences. So far, I know that a dictionary has eltype Pair{K, V} whereas an array keeps its keys separate from its values (e.g. iterates only its values even though it also has keys). And arrays have dimensions. But none of this precludes arrays having nonstandard keys (e.g. OffsetArray), although it does seem like there is a desire to keep indexing as predictable / well-behaved as possible.

Perhaps this PR violates the spirit, if not the letter, of the dict/array divide. Edit: Ok, spirit and letter, as pointed out below.

@mbauman
Member

mbauman commented Jun 20, 2018

I've been looking for a definitive statement about their differences

Arrays require their keys to be isa AbstractUnitRange or a Cartesian product thereof. I think that is the most salient difference.

My definition in #27640 (comment) is indeed wrong, though, because, as you note, dictionaries iterate over their pairs whereas arrays iterate over the values:

Base.keys(g::Base.Generator) = keys(g.iter)
Base.getindex(g::Base.Generator, i) = g.f(g.iter[i])

julia> g = (i for i in Dict((:a=>1,:b=>2)))
Base.Generator{Dict{Symbol,Int64},getfield(Main, Symbol("##7#8"))}(getfield(Main, Symbol("##7#8"))(), Dict(:a=>1,:b=>2))

julia> g[:a]
1

julia> g[:b]
2

julia> collect(g)
2-element Array{Pair{Symbol,Int64},1}:
 :a => 1
 :b => 2

@drewrobson
Author

drewrobson commented Jun 20, 2018

Thanks, that was very helpful! Well, in that case a generator isn't guaranteed to be array-like, because it is possible that neither g.iter nor keys(g.iter) is isa AbstractUnitRange or a Cartesian product thereof. So I guess we can either treat a generator as array-like in the subset of cases where it makes sense (which now reveals to me the reasoning behind some of the suggestions above), or else embrace that a generator is really its own thing and consider supporting it as such.

@bramtayl
Contributor

What about just pointing people to packages like MappedArrays (or ZippedArrays, or ShiftedArrays, etc.)

@vtjnash
Member

vtjnash commented Jun 20, 2018

I've been looking for a definitive statement about their differences. So far, I know that a dictionary has eltype Pair{K, V} whereas an array keeps its keys separate from its values (e.g. iterates only its values even though it also has keys). And arrays have dimensions

I think, most nearly, the difference is that for an Array, the key (aka index) associated with an element is defined by its order within that container (and thus it's sensible to say the array has dimensions, since the keys are ordered), whereas with a Dict, the key is instead a computation on the element. Like the definition for egal, this difference is primarily visible as an aspect of mutation (insertion / removal). And there's some overlap in many cases (an OrderedDict is capable of preserving a notion of the original array axis it was constructed from, an Array can be transformed to look like a Dict with pairs), which makes it convenient to want to ignore this distinction in some cases.

@JeffBezanson
Member

A good rule of thumb is that queries about an object should only deal with the semantics of that specific object, and not look too deeply inside its constituent objects. So if keys is supported at all, keys(g::Generator) = g.iter would make the most sense to me.

@drewrobson
Author

drewrobson commented Jun 21, 2018

Another interesting hint:

julia> f(x) = -x^2;
julia> s = Set(-5:5);
julia> map(f, s)
┌ Warning: `map(f, s::AbstractSet)` is deprecated, use `Set((f(v) for v = s))` instead.
│   caller = top-level scope at none:0
└ @ Core none:0
Set([0, -16, -4, -1, -25, -9])

(f(v) for v = s) is not simply array-like. It really is its own thing, like a precursor to various collections, e.g. Set(f(v) for v = s) and collect(f(v) for v = s).

If collect is renamed Array (#27640 (comment)) then that latter expression would become Array(f(v) for v = s) which is very pretty! This rename would make it clearer that the values are gaining array indices as they go into the array, which is less clear if this operation is called collect.

With this PR, it would make sense to write Dict(f(v) for v = s) as well.

In summary: As a lazy object, a generator is a precursor to a collection, but we are afforded the choice of collection (e.g. Set, Array, Dict) when converting from lazy to eager.

@JeffBezanson
Member

With this PR, it would make sense to write Dict(f(v) for v = s) as well.

I don't think it's that simple. Currently Dict( ) expects an iterator of pairs. We already use Dict( i=>i for i in x) for example. So allowing that other form would mean sometimes Dict calls keys or pairs on its argument, and sometimes it doesn't. Or, it would mean we're changing the Dict interface to accept an indexed collection, instead of a pair iterator. That could certainly be a good design, it just has a lot more implications.

@StefanKarpinski
Member

StefanKarpinski commented Jun 21, 2018

Triage agrees that generators don't and shouldn't have indices. Regarding the argmax question, instead this is the right way forward:

  1. argmax(f, v) returns the first value x in v at which f(x) is maximal.
  2. argmax(v) = argmax(i -> v[i], keys(v)), i.e. v is viewed as a map from indices to values.
  3. argmax(::Generator) is not defined.
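For reference, the two-argument form in point 1 can be sketched as below; Julia 1.7 later added argmax(f, domain) with exactly these semantics to Base, so the helper (my own name, `argmax2`) is only needed on older versions:

```julia
# Point 1: return the first value x in v at which f(x) is maximal.
function argmax2(f, v)
    best_x, best_fx = nothing, nothing
    for x in v
        fx = f(x)
        if best_fx === nothing || fx > best_fx
            best_x, best_fx = x, fx
        end
    end
    best_fx === nothing && throw(ArgumentError("collection must be non-empty"))
    return best_x
end

argmax2(x -> -x^2, -5:5)     # 0
# Point 2: argmax(v) views v as a map from indices to values.
v = [3, 1, 4, 1, 5]
argmax2(i -> v[i], keys(v))  # 5, the index of the maximum
```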

@StefanKarpinski removed the "triage" label Jun 21, 2018
@drewrobson
Author

Thanks everyone for weighing in, I learned a lot! The way forward is clear, and the 2-arg form will satisfy my use case very nicely.

@drewrobson closed this Jun 22, 2018