Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: make indexable collection API more uniform (keys, values, pairs) #22907

Merged
merged 2 commits into from
Sep 5, 2017

Conversation

JeffBezanson
Copy link
Member

The first commit adds pairs, which is supposed to return a key-value iterator for any indexable collection, and adds methods of keys and values for arrays and sets.

The second commit is just a demo of this, generalizing findmin and findmax so that the identity findmax(array) == findmax(Dict(pairs(array))) holds. This is a breaking change, since previously these functions only returned indices 1:n.

Discussion questions:

  • Do we want fallbacks values(x) = x and (maybe) keys(x) = countfrom(1)? This would allow us to keep the old behavior of findmin on generic iterators with the same code. Or, we need some way to identify iterators that support keys and values, or we could decide that findmin only works on indexable collections.
  • keys for a 1-d array returns an integer range, but for other arrays returns a CartesianRange. Seems ok to me?
  • Do we want CartesianIndex to be the "official" array index type, or should we deprecate it and just use tuples?

@JeffBezanson JeffBezanson added the collections Data structures holding multiple items, e.g. sets label Jul 21, 2017
@tkelman
Copy link
Contributor

tkelman commented Jul 21, 2017

How is pairs any different than enumerate ?

@JeffBezanson
Copy link
Member Author

enumerate(x) is exactly equivalent to zip(countfrom(1), x); it does not look at the indices of x when x is indexable.

@timholy
Copy link
Member

timholy commented Jul 22, 2017

Do we want CartesianIndex to be the "official" array index type, or should we deprecate it and just use tuples?

How do you feel about supporting t1 + t2 and t1 - 1? If you're reluctant to do that, then we need to keep CartesianIndex.

base/set.jl Outdated
@@ -8,6 +8,9 @@ mutable struct Set{T} <: AbstractSet{T}
end
Set() = Set{Any}()

# a set iterates its values
values(s::AbstractSet) = s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is keys(s) supposed to return an error?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getindex is

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the new findmin now errors on sets? I think that makes sense.

@andyferris
Copy link
Member

Would it be better to use indices or something to match getindex?

@JeffBezanson
Copy link
Member Author

So I think keys and eachindex are basically the same function, the only potential difference being that eachindex returns "efficient" indices that might differ from "canonical" indices. This difference is a bit unfortunate. If we keep both, the key question is what kind of indices to return from functions like indmin. I would favor using the official CartesianIndex indices for all such purposes, and reserving eachindex as an optimization for when the indices won't be observed. @timholy @mbauman does that make sense?

@nalimilan
Copy link
Member

If we keep both, the key question is what kind of indices to return from functions like indmin. I would favor using the official CartesianIndex indices for all such purposes, and reserving eachindex as an optimization for when the indices won't be observed.

That's an issue that affects all find/search functions, not only indmin/indmax. Currently all these functions return a linear index, which is convenient for general iterables because it's just an integer value. OTOH returning a cartesian index for AbstractArray objects would mean that these functions would also have to return one-element tuples or CartesianIndex objects even for one-dimensional arrays and other iterables, which would be quite inconvenient.

See also #14086 (comment) and following comments, and the short mention about this (second bullet of the section) in the Search & Find Julep.

@mbauman
Copy link
Member

mbauman commented Aug 24, 2017

My gut reaction would be to:

  • Keep CartesianIndex objects. They're very handy.
  • Have keys always return integers for vectors.
  • Have keys always return CartesianIndices for multidimensional arrays.
  • Use eachindex to get the efficient index type for multidimensional arrays when you don't care what that index type is.

I think this sort of scheme would be okay propagating through to the find/etc APIs. You would opt-in to a linear index for a multidimensional array by simply doing a findmax(reshape(A, :)).

@JeffBezanson
Copy link
Member Author

I would be ok with that.

We could also add something like Indexed(keys, vals), which lets you pair an index iterator with a value iterator for the purpose of e.g. calling indmin on an arbitrary iterator. Or this could be accomplished via

pairs(x::Zip2) = x

allowing zip to be used for this purpose.

Related: #20684

@JeffBezanson JeffBezanson force-pushed the jb/keyvalue branch 2 times, most recently from 25915e7 to 68ed301 Compare August 24, 2017 20:58
@JeffBezanson
Copy link
Member Author

OK, I'm starting to get a handle on this. Here's what I did:

  • keys returns an integer range for vectors and a CartesianRange for n-d arrays
  • Make keys the basic function for getting the indices of something. eachindex falls back to it, with other methods as optimizations.
  • Use @timholy 's code for enumerate to implement pairs more efficiently for arrays, and make enumerate(::IndexStyle, a) a method of pairs instead.
  • keys can also accept an IndexStyle argument, in which case it's the same as eachindex.
  • findmin, findmax, indmin, and indmax return the same values as keys

However, I have not tackled the findmin(A, region) case yet. I assume this should change too?

@timholy
Copy link
Member

timholy commented Aug 24, 2017

You could probably get rid of linearindices in favor of keys. And I favor changes to findmin to return the same thing as eachindex.

@JeffBezanson
Copy link
Member Author

You could probably get rid of linearindices in favor of keys.

They don't do the same thing --- linearindices always returns an iterator that yields integers.

And I favor changes to findmin to return the same thing as eachindex.

This is a possible design, but it conflicts with a generic definition of findmin in terms of keys and values (or pairs), unless we fully merge eachindex and keys. It would be nice to have fewer functions, but I worry about using linear indices too much, and getting different types of indices if you change the representation of an array.

@timholy
Copy link
Member

timholy commented Aug 24, 2017

Right on linearindices. I had someone missed that keys would return CartesianRange for n-d.

While you're at this and thinking about enumerate, any thoughts on the annoyance that arose from the fact that enumerate once returned something that was both a counter and a valid index, but now that array indexing is more flexible you have to distinguish between these? #17978 and linked PR/issue therein. It seems terrible to have two different functions that do almost the same thing, yet the current state is a little weird, too.

@JeffBezanson
Copy link
Member Author

I think enumerate(x) is just zip(countfrom(1), x), and should perhaps be deprecated to that (except that it's convenient and quite heavily used). Callers that use the integer as an index should use pairs instead.

@andyferris
Copy link
Member

andyferris commented Aug 25, 2017

To offer a dissenting voice here - have we considered going the other direction, making indices better and removing keys? I mean, we have getindex on an Associative, so why don't Associatives have indices?

Or, another way to think of this is that If the set of things that I can index with are called keys, then surely I would get/set a value via getkey and setkey! rather than getindex/setindex!. However, I feel that "indices" are a better generic term to describe the indices of multidimensional arrays, associatives, and other containers. For whatever reason, "keys" gives me the distinct feel of hash maps and database tables, or else encryption.

Note that I do strongly support that we unify associatives and arrays and other containers that use getindex in some way. I also feel that indices(multidimensionalarray) should have always been some iterable of actual indices like CartesianRange.

@StefanKarpinski
Copy link
Member

I think enumerate(x) is just zip(countfrom(1), x), and should perhaps be deprecated to that (except that it's convenient and quite heavily used).

Please don't do that – enumerate is so useful in working with collections, and it's a nice little bit of familiarity for Python users that I think it would would be really sad to lose. Aside: I've started using enumerate in a handy pattern for printing separators between collection elements:

for (i, x) in enumerate(itr)
    i > 1 && print(io, ", ")
    print(io, x)
end

It doesn't require the length of the iterable, which the way I was doing this before did.

@JeffBezanson
Copy link
Member Author

I agree, I would be perfectly happy to rename keys to indices. (see #23434)

Especially with dictionaries, people very often use "key/value" terminology, so it's nice to have keys and values as a pair of functions. However, keys could just be an alias for indices. But of course, if dictionaries iterated their values then we wouldn't need values at all.

@JeffBezanson
Copy link
Member Author

I agree re: enumerate as well; I use it a lot and would not want to actually deprecate it.

@iamed2
Copy link
Contributor

iamed2 commented Aug 25, 2017

Idea: move enumerate to Base.Iterators. That way it's clear that it's a generic iterator function and not related to indices.

@JeffBezanson
Copy link
Member Author

OK, I think this is ready to go. The function is still called keys, but we can decide separately on renaming it.

I've updated all findmin and findmax methods to return CartesianIndex for non-vectors.

@StefanKarpinski
Copy link
Member

What's up with the build failures?

@ararslan ararslan added the breaking This change will break code label Sep 6, 2017
@JeffBezanson
Copy link
Member Author

In a way that's a good sign, since it means the new behavior gives what that code actually wanted, without needing to call ind2sub.

@martinholters
Copy link
Member

Would it make sense to add a deprecated ind2sub(::Tuple, ::CartesianIndex) to make this less breaking?

martinholters added a commit to HSU-ANT/ACME.jl that referenced this pull request Sep 13, 2017
martinholters added a commit to HSU-ANT/ACME.jl that referenced this pull request Sep 13, 2017
martinholters added a commit to HSU-ANT/ACME.jl that referenced this pull request Sep 14, 2017
martinholters added a commit that referenced this pull request Sep 14, 2017
Makes the return type change of #22907 less breaking by allowing the
common pattern `ind2sum(size(a), indmax(a))` to still work.
martinholters added a commit that referenced this pull request Sep 21, 2017
Makes the return type change of #22907 less breaking by allowing the
common pattern `ind2sub(size(a), indmax(a))` to still work.
mbauman pushed a commit that referenced this pull request Sep 22, 2017
…23708)

Makes the return type change of #22907 less breaking by allowing the
common pattern `ind2sub(size(a), indmax(a))` to still work.
@Jutho
Copy link
Contributor

Jutho commented Oct 13, 2017

+1 for this change but type inference on (f)indmin for tuples is completely lost. This should clearly be fixed (and tested).
v0.6:

julia> @code_typed indmin((1,2,3,4))
CodeInfo(:(begin 
        SSAValue(0) = $(Expr(:invoke, MethodInstance for findmin(::NTuple{4,Int64}), :(Base.findmin), :(a)))
        return (Base.getfield)(SSAValue(0), 2)::Int64
    end))=>Int64

v0.7:

julia> @code_typed indmin((1,2,3,4))
CodeInfo(:(begin
        SSAValue(0) = $(Expr(:invoke, MethodInstance for findmin(::NTuple{4,Int64}), :(Base.findmin), :(a)))::Tuple{Any,Any}
        return (Base.getfield)(SSAValue(0), 2, true)
    end)) => Any

@Jutho
Copy link
Contributor

Jutho commented Oct 13, 2017

Ok so I looked at the new implementation of this PR. The obvious solution is to also use the IndexValue iterator for tuples. Is the type bound on A in IndexValue{I,A<:AbstractArray} really necessary?

A second question about the following lines:

pairs(::IndexLinear,    A::AbstractArray) = IndexValue(A, linearindices(A))
pairs(::IndexCartesian, A::AbstractArray) = IndexValue(A, CartesianRange(indices(A)))

# faster than zip(keys(a), values(a)) for arrays
pairs(A::AbstractArray)  = pairs(IndexCartesian(), A)
pairs(A::AbstractVector) = pairs(IndexLinear(), A)

Why not just pairs(A::AbstractArray) = IndexValue(A, eachindex(A)). Or does the eachindex iterator need to go away in time?

@Jutho
Copy link
Contributor

Jutho commented Oct 13, 2017

Actually also for Dicts the current behavior, while nicer than the previous, is completely type unstable. The use of generators for this does not seem like a good idea, unless the type inference of generators can be fixed. In general I find myself not using generators as much as I would have liked because of the associated type instabilities they bring along.

@Jutho
Copy link
Contributor

Jutho commented Oct 13, 2017

Ok here are a few possible solutions that seem to work: I can prepare a PR for any of them, once a concensus is reached:

  • Solution 1:
    Replace
pairs(collection) = Generator(=>, keys(collection), values(collection))

with

pairs(collection) = Generator(=>, zip(keys(collection), values(collection)))

and add a definition Pair(p::Tuple{Any,Any}) = Pair(p[1],p[2])

This seems to fix the type instability with dicts and tuples.

  • Solution 2:
    Widen type restriction in IndexValue iterator and make pairs(::Tuple) use IndexValue. Solves the problem for tuples, not for dicts, unless ...

  • Solution 3

Replace IndexValue iterator with a KeyValue iterator that is used for AbstractArray, Tuple and Dict (which corresponds to the goal of this PR to make these more uniform and reduce the distinction between them).

@Jutho
Copy link
Contributor

Jutho commented Oct 14, 2017

Actually, solution 4 is even simpler:

pairs(collection) = (k=>v for (k,v) in zip(keys(collection), values(collection))

This looks like it should be identical to the current definition

pairs(collection) = Generator(=>, keys(collection), values(collection))

yet the former is inferable, the latter is not. Not sure why that is, but I expect that changing the definition to the first one is something everybody could agree on?

@JeffBezanson
Copy link
Member Author

I think this can be fixed by adding an extra Generator constructor. I'll try it.

@Jutho
Copy link
Contributor

Jutho commented Oct 14, 2017

I've just tried adding Generator(f,I1,I2) = ... explicitly and didn't work.

pairs(collection)  = Generator(kv->(kv[1]=>kv[2]), zip(keys(collection), values(collection)))

seems to work though. I was just preparing a PR.

@JeffBezanson
Copy link
Member Author

Should be fixed by #24145.

@GiggleLiu
Copy link
Contributor

So what should I use instead of ind2sub to get the true index?

@StefanKarpinski
Copy link
Member

Please ask questions on the Julia discourse discussion forum.

@tkf tkf mentioned this pull request Aug 24, 2019
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] breaking This change will break code collections Data structures holding multiple items, e.g. sets
Projects
None yet
Development

Successfully merging this pull request may close these issues.