Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add into(T::Type, iterable) -> collection::T #36537

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Conversation

tkf
Copy link
Member

@tkf tkf commented Jul 4, 2020

close #36288

This PR implements the API I proposed in #36288. Here is the (refined) API I ended up implementing in this PR. Quoting the docstrings:

Surface API

into(T::Type, iterable) -> collection::T
into(T::Type) -> iterable -> collection::T

Construct a new collection of type T that contains the elements in iterable. If iterable is also a container, it acts as a shallow-copy.

If T has eltype, keytype, or valtype information, all elements in collection are converted to the destination type to guarantee the constraint collection isa T.

If T has size or length information (e.g., NTuple and StaticArray), providing collection with unmatched size or length throws an error.

Unary form into(T::Type) returns a callable iterable -> into(T, iterable).

Extended help

Example

into takes care of the conversion of storage and element types:

julia> into(Array{Int}, BitVector([0, 1, 0, 0]))
4-element Array{Int64,1}:
 0
 1
 0
 0

into acts like a shallow copy:

julia> xs = Ref.([1, 2, 3]);

julia> ys = into(Vector, xs)
3-element Array{Base.RefValue{Int64},1}:
 Base.RefValue{Int64}(1)
 Base.RefValue{Int64}(2)
 Base.RefValue{Int64}(3)

julia> ys[1] = Ref(100);

julia> xs
3-element Array{Base.RefValue{Int64},1}:
 Base.RefValue{Int64}(1)
 Base.RefValue{Int64}(2)
 Base.RefValue{Int64}(3)

julia> ys[2][] = 200;

julia> xs
3-element Array{Base.RefValue{Int64},1}:
 Base.RefValue{Int64}(1)
 Base.RefValue{Int64}(200)
 Base.RefValue{Int64}(3)

into always treats input iterable as a collection:

julia> into(Dict, (:a => 1) => (:b => 2))
Dict{Symbol,Int64} with 2 entries:
  :a => 1
  :b => 2

julia> Dict((:a => 1) => (:b => 2))  # but the constructor may not
Dict{Pair{Symbol,Int64},Pair{Symbol,Int64}} with 1 entry:
  :a=>1 => :b=>2

into(T) returns a function iterable -> into(T, iterable) which is appropriate for using with |>:

julia> 1:3 |> into(NTuple{3})
(1, 2, 3)

Implementation

The owner of the type of iterable should implement __from__. The owner of the output type T should implement __into__. If it is desirable to apply pre- and/or post-processing, the owner of the output type T may implement into.

Overload API

Base.__into__(T::Type, iterable) -> collection::T
Base.__from__(T::Type, iterable) -> collection::T

Overload-only API for providing the implementation for into(T, iterable).

To avoid method ambiguities, __into__ should be implemented only by the owner of T and __from__ should be implemented only by the owner of typeof(iterable). The owner of T (resp. typeof(iterable)) may choose to allow typeof(iterable) (resp. T) to overload __into__ (resp. __from__) by documenting specific type bounds for T (resp. typeof(iterable)).

into(T, iterable) calls __from__(T, iterable) which in turn by default calls __into__(T, iterable).

Implementation

If T is a subtype of AbstractArray,

isequal(vec(collect′(iterable)), vec(collect(collection)))

must hold where collect′ is defined as

collect′(iterable) =
    if IteratorEltype(collection) isa HasEltype
        collect(eltype(collection), iterable)
    else
        collect(iterable)
    end

If iterable is a stateful iterator, collect inside collect′ is "run" as if the world is rolled back to the state just before calling into.

If the collections of type T do not support duplicates, issubset may be used instead of isequal. In particular, subtypes of AbstractDict and AbstractSet must satisfy the above equality with issubset.

If the collections of type T do not maintain insertion order, issetequal may be used instead of isequal.

@tkf tkf added the collections Data structures holding multiple items, e.g. sets label Jul 4, 2020
@StefanKarpinski
Copy link
Member

How about making this a two-argument method of collect?

@nalimilan
Copy link
Member

One problem is that collect(T, iterable) already has a meaning: T is the eltype. We would have to use something else, e.g. collect(container_type, eltype, iterable).

That said, I'm not a fan of adding a new function, nor of the into name. We already have constructors, convert and collect, which is complex enough to master...

@tkf
Copy link
Member Author

tkf commented Jul 7, 2020

@nalimilan

One problem is that collect(T, iterable) already has a meaning: T is the eltype.

Yes, that's the biggest reason why we can't use convert.

We would have to use something else, e.g. collect(container_type, eltype, iterable).

We can't use this when container_type cannot specify eltype (e.g., BitArray, IdDict). This is also inappropriate when you want the input collection to specify the output eltype.

We already have constructors, convert and collect

I extensively discussed why constructor and collect are not appropriate in #36288 and waited more than three weeks to accumulate possible objections. As of writing, I only see positive reactions in the proposal. convert is not enough because it is inappropriate to use it for into(Set, ::Vector).

@nalimilan
Copy link
Member

Sorry, I hadn't seen the issue as I was completely drowned in the previous weeks. But it's not unusual in my experience to get comments only when making the PR...

I think I'd prefer calling this collectas or collectto, to indicate clearly that it does exactly the same thing as collect but allowing you to choose the container type. That we wouldn't really introduce a new concept.

Also, maybe the fallback could be defined as copy!(similar(T, 0), iterable)? Currently copy! only supports AbstractArray sources, but the documentation is more general. That way custom array types would implement multiple useful APIs at the same time, and the relationship between all of these would be quite clear.

@KristofferC
Copy link
Member

FWIW, into seems clear to me. collectas or collectto both read quite poorly.

@test @inferred(into(Tuple{}, 1:0)) === ()
@test @inferred(into(NTuple{1,Int}, [10])) === (10,)
@test @inferred(into(NTuple{2,Int}, [10, 20])) === (10, 20)
@test @inferred(into(NTuple{2}, [10, 20])) === (10, 20)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the inferrability here

@quinnj
Copy link
Member

quinnj commented Jul 8, 2020

I really like this proposal; I think the functionality is clear, and immediately seems more flexible/useful than just collect without being harder to use. I don't love the __into__ and __from__ names for overloading, and I also think we'll probably see cases of people just overloading into directly even when they should overload __into__, but I'm not sure that can be helped. I also don't have any better suggestions for overload names, and given that it should be rare (right?) to need to overload, I think they're probably fine as-is.

For a custom array type, it wasn't clear to me what I might need to do in order for things to "just work" for my array type? I think it'd be helpful to have some docs/walk-through of how to implement for a custom array type (and do so correctly). I guess there's not really a good "default implementation" for AbstractArray?

@mbauman
Copy link
Member

mbauman commented Jul 8, 2020

Is there really a need to separate dispatch here? What is the use-case in specializing __from__? How would an iterable even do so? What's an example of a __from__ implementation? This feels like a place for multiple dispatch to me — ambiguities are a feature.

@tkf
Copy link
Member Author

tkf commented Jul 8, 2020

But it's not unusual in my experience to get comments only when making the PR...

@nalimilan Yes, I agree. I should have clarified in the OP that the motivation and background are discussed elsewhere.

I think I'd prefer calling this collectas or collectto, to indicate clearly that it does exactly the same thing as collect but allowing you to choose the container type.

I think these are reasonable alternatives and I'm open to rename.

Also, maybe the fallback could be defined as copy!(similar(T, 0), iterable)?

I think the problem is that similar(T, 0) may not return T. Also, this fallback doesn't work well when T does not have the eltype like similar(Vector, 0).

I've struggled about what to do (or don't) with the fallback. I think the question is if it is better to enforce the invariance into(T, iterable) isa T or fallback to a wrong type? This constraint is ensured ATM by ::T here

into(::Type{T}, iterable) where {T} = __from__(T, iterable)::T

I think it's valuable that you can depend on the invariance like into(T, iterable) isa T. For more flexible fallback, I think a specific API like into(Similar(T), iterable) would be nice. This can be done by defining the default fallback as __into__(::Type, iterable) = nothing.

I also don't have any better suggestions for overload names, and given that it should be rare (right?) to need to overload, I think they're probably fine as-is.

@quinnj Actually, this interface is opt-in as I couldn't think of any better way to ensure the invariances you'd want from this interface. For example, if we have the fallback into(T, iterable) = T(iterable), the caller can't rely on that into acts as shallow copy anymore.

What's an example of a __from__ implementation?

@mbauman For example, we can make eachline exception-safe like this:

function __from__(T, it::EachLine)
    try
        return __into__(T, it)
    finally
        it.ondone()
    end
end

It's also useful for something like Iterators.Flatten where you'd want to hand the control of the iterations to the data source (although it's not possible because Base doesn't have extensible mutate-or-widen API).

(If we had an extensible mutate-or-widen API like BangBang.append!! and iteration control API like Transducers.__foldl__, we don't need __from__. This way, the default into can be defined in terms of foldl and so it's the data source that controls the iteration by default. Then, if the data sink T wants to take the control of the iteration, it can define into(::Type{T}, ::Any). However, implementing this in Base would be much more amphibious project.)

This feels like a place for multiple dispatch to me — ambiguities are a feature.

I think ambiguity is going to get in our way for this. Something like eachline above seems to be impossible to define.

In general, I think we should decouple surface/call API and overload API whenever makes sense. This way, we can do more pre/post processing like error checks and also impose more invariance of the API (like into(T, iterable) isa T).

@nalimilan
Copy link
Member

I think the problem is that similar(T, 0) may not return T. Also, this fallback doesn't work well when T does not have the eltype like similar(Vector, 0).

Ah, right. These things are incredibly tricky...

I've struggled about what to do (or don't) with the fallback. I think the question is if it is better to enforce the invariance into(T, iterable) isa T or fallback to a wrong type?

Do you think there can be situations where this invariance cannot be satisfied? I wouldn't have thought so (in the worst case, conversion of source elements fails, but that's expected).

@tkf
Copy link
Member Author

tkf commented Jul 9, 2020

Do you think there can be situations where this invariance cannot be satisfied?

I was rather worried about the cases where the function accidentally returns and the caller gets a collection of a wrong type. This is very easy to happen if we use something based on similar due to the fallback similar(::AbstractArray, ...).

Perhaps more illustrative points are other properties like "into acts as a shallow copy" (i.e., clear ownership) and "input is always treated as iterator" (so into(Dict, (:a => 1) => (:b => 2)) works). They are very important to get right if you want to rely on them. However, since the constructor can do anything, it's not possible to enforce them without the type authors to implement into directly.

@stevengj
Copy link
Member

stevengj commented Jul 16, 2020

I would prefer that the function name started with collect, e.g. collectinto, to make it clearer that this is essentially a variant of collect. (It would also make the function more discoverable via tab-completion.)

(The fact that other languages call their type-conversion function into seems less relevant to me.)

It's not clear to me why we need two overload entry points. If package A defines the collection, and package B defines the iterable, then whichever package imports both A and B defines the collectinto method. Where is the the potential conflict?

@tkf
Copy link
Member Author

tkf commented Jul 16, 2020

collectinto sounds OK to me. I suggested collectto as well in #36288.

other languages call their type-conversion function into

Not that Clojure uses into to mean as proposed in this PR and not for type conversion.

Where is the the potential conflict?

Do you think it's possible to implement __from__(T, it::EachLine) in the above example #36537 (comment) without the double-dispatch pattern?

I think the double-dispatch is required when the "data source" want to take control in general. For example, I'd like to use this for Transducers.eduction to implement into(T, eduction(...)) for any (supportable) T.

@stevengj
Copy link
Member

I like collectinto rather than collectto if only to avoid the doubled consonant.

I see your point with the double dispatch…

@KristofferC
Copy link
Member

Just call it collect_to then?

@tkf
Copy link
Member Author

tkf commented Jul 17, 2020

I like collectinto rather than collectto if only to avoid the doubled consonant.

How about collectas?

@quinnj
Copy link
Member

quinnj commented Jul 17, 2020

I still prefer into; it's short, concise, and unlikely to clash with existing variable names. I also like that it's different from collect because it deals with collection types instead of element types.

@tkf
Copy link
Member Author

tkf commented Jul 17, 2020

Yeah, I like into, too. I think it's neat in the curried form:

itr |> Map(f) |> Filter(p) |> into(DataFrame)

@StefanKarpinski
Copy link
Member

It does seem like the best to me given that collect doesn't work.

@nalimilan
Copy link
Member

collectas is better IMHO. It’s relatively short, readable, and makes it clear it’s a generalization of collect.

@KristofferC
Copy link
Member

In that case collect_as, otherwise it has the same awkwardness as occursin.

@quinnj
Copy link
Member

quinnj commented Jul 20, 2020

collectas is better IMHO. It’s relatively short, readable, and makes it clear it’s a generalization of collect.

I disagree. I don't think it's super readable (makes me think of some kind of spanish/french verb conjugation). And I think the underscore version is uglier, though more clear.

@quinnj
Copy link
Member

quinnj commented Mar 29, 2021

Bump; recently ran into unexpected inconsistency between map and collect and wished I had this.

@tkf
Copy link
Member Author

tkf commented Mar 29, 2021

Lately, I started thinking that I should create a package for this since I feel the double dispatch system is still rather incomplete. The dispatch pipeline would be much more powerful if we had a standard mutate-or-widen collection interface (i.e., what I call append!!, merge!!, and union!!). Without such an interface, implementing a generic __from__ is hard and we'd rely on __into__ all the time. It would then need to use iterate which is bad when the source collection has some complex structure (e.g., nested/blocked memory) than the destination collection. But it'd increase the API surface and so I'd have a much harder time convincing people.

@quinnj
Copy link
Member

quinnj commented Mar 30, 2021

Yeah, having this as a package would be great to play around with.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC: into(T::Type, iterable) -> collection::T
7 participants