
stack(vec_of_vecs) for vcat(vec_of_vecs...) #21672

Closed
StefanKarpinski opened this issue May 2, 2017 · 74 comments · Fixed by #27188
Labels
arrays, help wanted

Comments

@StefanKarpinski (Member) commented May 2, 2017

This comes up periodically and it's a bit embarrassing that our official answer right now involves an inefficient splat. We should have a function – maybe call it stack? – that efficiently does what vcat(vec_of_vecs...) does without the splatting. It should also be able to stack slices along different dimensions, so that you can stack slices as rows or as columns, etc.
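To make the contrast concrete, here is a minimal sketch of the splatting idiom versus the splat-free reduce form (plain Julia, nothing issue-specific assumed):

```julia
# Splatting a long vector of vectors into vcat forces compilation for
# that many arguments and scales badly; reduce avoids the splat.
vov = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

splatted = vcat(vov...)       # the "official answer" being criticized
reduced  = reduce(vcat, vov)  # same result, no splatting

@assert splatted == reduced == [1, 2, 3, 4, 5, 6, 7, 8, 9]
```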

@StefanKarpinski added this to the 1.0 milestone May 2, 2017
@ararslan added the arrays label May 2, 2017
@mbauman (Member) commented May 2, 2017

And return an Array or a custom array? Either way, there's a simple implementation here: http://stackoverflow.com/a/37488453/176071. In this context I'd swap the index ordering so it's column-major contiguous.

@StefanKarpinski (Member, Author)

For the base implementation, this would just convert a vector of vectors to a matrix, as a specific case. I think whether it stacks in column-major or row-major order should probably be controlled by a dimension argument. There are generalizations one can imagine doing here.

@TotalVerb (Contributor) commented May 3, 2017

There's a general issue here: the names of reduction operations on collections of things (e.g. one-argument join) are sometimes different from the names of the corresponding operations on things given as varargs (e.g. string, which is effectively a splatted-out one-argument join), even when they are conceptually the same operation.

Examples are:

  • maximum (collection) vs. max (splat)
  • minimum vs. min
  • sum vs. +
  • prod vs. *
  • join vs. string
  • append! vs. push!

These are not so bad because the names are similar and standard, but vcat and stack would exacerbate the problem, because stack is certainly not a standard name for this operation.

In my opinion, a better name for stack is simply reduce(vcat, ...). This is inefficient because of substantial copying, but it would not change the semantics to define a new overload of reduce specialized on typeof(vcat) so as to avoid the excessive copying that the pairwise reduction does presently. Overall there should be no performance difference between a dedicated stack function and a specialized reduce(vcat, ...) with the same code. This has the benefit of not requiring additional nomenclature to be learned.

For example, the task of "make all these vectors xss row vectors, then stack them vertically" translates to

reduce(vcat, map(transpose, xs))

which is quite readable, combines existing vocabulary and is nevertheless (thanks to RowVector) possible to make efficient through adding a specialization to reduce.
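A minimal sketch of the kind of specialization being proposed (myreduce_vcat is a hypothetical stand-in, not an actual Base method): compute the total length up front, preallocate, and copy each block exactly once.

```julia
# Hypothetical sketch of what a reduce specialization on typeof(vcat)
# could do internally: preallocate instead of repeatedly copying.
function myreduce_vcat(xs::Vector{<:Vector{T}}) where {T}
    out = Vector{T}(undef, sum(length, xs))
    i = 1
    for x in xs
        copyto!(out, i, x, 1, length(x))  # copy each block into place
        i += length(x)
    end
    return out
end

@assert myreduce_vcat([[1, 2], [3, 4, 5]]) == [1, 2, 3, 4, 5]
```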

@nalimilan (Member) commented May 3, 2017

FWIW stack is already a standard term in the data frames/data tables world (and not only in Julia). See the DataTables docs. So I'd rather use a different name -- or go with @TotalVerb's suggestion.

@dpsanders (Contributor)

I agree with @StefanKarpinski that we need a single-word function for this.

@Godisemo commented May 3, 2017

What about flatten?

@kmsquire (Member) commented May 3, 2017

FWIW, numpy uses stack for this operation.

@StefanKarpinski (Member, Author)

flatten is a quite distinct operation that already exists.

@StefanKarpinski (Member, Author)

The function stack could be used to mean one thing with arrays and another with data frames/tables – I don't think the operations are really at odds with each other, although IIUC the data table operation might be better called vcat.

@andyferris (Member)

I've been wondering if it's better to first support a "view" which you can copy, rather than the reverse process (where we have a function in Base which makes a contiguous copy, and later on we decide it would be nice to have a view for such an operation).

Thus, we could potentially have

m = NestedArray([[1, 2], [3, 4]]) # A 2x2 AbstractMatrix{Int} which is a "view", perhaps called `StackedArray`
copy(m) # contiguous in memory, `Matrix{Int}`. Perhaps `Array(m)` would be clearer/allow possibility for `similar` to return a nested structure.

and no stack function.

@TotalVerb (Contributor)

See also CatViews.jl by @ahwillia.

@nalimilan (Member)

The function stack could be used to mean one thing with arrays and another with data frames/tables – I don't think the operations are really at odds with each other, although IIUC the data table operation might be better called vcat.

Yes, that's certainly possible. I just wanted to mention this because if other terms are available it would be better not to mix these two different operations (I wasn't aware of the Numpy precedent).

FWIW, data frames/tables already implement vcat, but stack is a very different operation (which operates on a single data frame/table).

@StefanKarpinski (Member, Author)

Another idea that occurred to me yesterday is to just have specialized methods for reduce(vcat, Vector{<:Vector}) and reduce(hcat, Vector{<:Vector}). It doesn't generalize, but it may be sufficient. There's no reason not to have optimized methods for those specific cases anyway.

@StefanKarpinski (Member, Author)

What does stack do on data frames? From the name, I would expect it to stack one data frame on top of another with the same columns.

@Godisemo commented May 4, 2017

I think stack on data frames is the same operation that in R is called gather. Correct me if I'm wrong. See this cheat sheet from RStudio.

@nalimilan (Member)

Another idea that occurred to me yesterday is to just have specialized methods for reduce(vcat, Vector{<:Vector}) and reduce(hcat, Vector{<:Vector}). It doesn't generalize, but it may be sufficient. There's no reason not to have optimized methods for those specific cases anyway.

Yes, that's what @TotalVerb proposed above, right?

What does stack do on data frames? From the name, I would expect it to stack one data frame on top of another with the same columns.

This is vcat. stack rearranges rows and columns in a data frame/table by creating one column for each level of a variable. It's a bit hard to describe in a few words, but see this example. That's indeed called gather or melt in the recent iterations of Hadley Wickham's (RStudio) R packages, but the term in base R and in Pandas is stack. Now that several variants of the name are commonly used, it's not very clear which terminology is the most standard, though...

@andyferris (Member)

Correct me if I'm wrong, but I thought the "stacked" form of a data frame was a bit like a sparse representation of the table, storing (row, column, value) tuples. (Or maybe that was called something else?)

@nalimilan (Member)

Yes, that's more or less the result of stack, i.e. only three columns giving variable name, level and value, and which can be converted to one column for each variable (plus the value column) using unstack.

@StefanKarpinski (Member, Author) commented May 5, 2017

It's rather unfortunate that we can't use stack, which is the most natural name for this operation, because the name is used for an operation on data frames for which the term "stack" is neither standard nor intuitive. I'm somewhat inclined to "take" the term and let data frames use something more intuitively appropriate like gather or melt for that operation. I always liked the "melt" / "cast" terminology from R's reshape2.

@nalimilan (Member)

Sure, the existence of the data frames/tables function shouldn't be a blocker if stack is natural in the present context. I'd need to do more research to check which of stack, gather and melt is the most common terminology (and most likely to remain so in the future) for data frames. BTW, we already have melt, which is equivalent to stack but with a slightly different interface (for those interested, the original discussion is here).

Sorry, I didn't expect my original comment to generate such a long sub-thread.

@mbauman (Member) commented May 5, 2017

I'm not sure stack is really such an ideal name here; it's fine, but I don't think it's all that obvious or meaningful. I was hoping to find some inspiration from the myriad of SO posts asking about this, but they almost all use verbiage like convert or transform. I don't particularly like pigeon-holing this to just reduce(hcat, ...) given that it's so easy to generalize with Cartesian iteration such that the resulting size is (size(A[1])..., size(A)...).

One possible alternative: unnest. This makes it clear that it's operating on arrays of arrays and evokes the imagery of transforming A[i][j] to A[j, i].

I think I'd also lean towards doing the Array copy in base, with a package like CatViews or GrowableArrays supporting the custom array type.

@StefanKarpinski (Member, Author) commented May 5, 2017

The term "stack" makes sense to me because you're taking a bunch of separate vectors/slices of some shape and stacking them along an axis to create a structure with the same shape but an additional "stacked" dimension. I agree that it makes less sense when you're thinking of it in terms of indices, where nesting and unnesting is clearer – that could be a better way to think about it.

@StefanKarpinski (Member, Author) commented May 5, 2017

One way to express this could be how you'd like indices to be nested in the output. I.e. you could write nested(A, (2,(1,3))) to transform A so that the output value is indexed as B[j][i,k] where, if the original input was a vector of vectors of vectors, it would be indexed as A[i][j][k]. Indices would implicitly be numbered and any shape of input would be allowed, so the input could be A[i,j,k] or A[i][j,k] or A[i,j][k] or A[i][j][k].

@andyferris (Member)

The B = nested(A, (2,(1,3))) thing seems pretty cool. Would it return a view or a contiguous copy?

Assuming it's a view - one gotcha is that (these days) indices for different array types and dimensions could be of different types, so the (2, (1, 3)) part may need to be recognized as a compile-time constant so that indices(B) (and even the internals of getindex, if different indices use different types of Integer) is type stable. E.g. perhaps in the end you want to create a NestedArray{T, 3, (2, (1, 3))} <: AbstractArray{T, 3} or something, a bit like how PermutedDimsArray stores the permutation as a type parameter (I'm now looking at its outer constructor here and wondering whether it is type stable).

I think I'd also lean towards doing the Array copy in base, with a package like CatViews or GrowableArrays supporting the custom array type.

Right, this functionality seems perfect for being developed outside of Base and merged when we all like the API. To support custom indices, perhaps dense (or full) would be a good generic way to go from the (nested) view to the (stacked) copy? (Unless we are happy to have similar always drop the nesting, in which case copy is natural.)

@StefanKarpinski (Member, Author)

I think the most straightforward approach here would be to optimize these two operations in Base:

julia> vov = [[1,2,3],[4,5,6],[7,8,9]]
3-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [4, 5, 6]
 [7, 8, 9]

julia> reduce(hcat, vov)
3×3 Array{Int64,2}:
 1  4  7
 2  5  8
 3  6  9

julia> mapreduce(transpose, vcat, vov)
3×3 Array{Int64,2}:
 1  2  3
 4  5  6
 7  8  9

Since these operations already produce the desired result, having optimized implementations and then telling people to use them as the idiom for this would work nicely without any risk of breaking anyone's code. There's also no reason not to optimize this. Any more sophisticated functionality can be developed outside of Base in a package.

Summary: the action item here, rather than introduce any new verbs into the standard library, is to write optimized methods of reduce and mapreduce for arrays of arrays with vcat and hcat as reducers.

@oxinabox (Contributor) commented May 9, 2017

While we're at it, can we have hpush! (or, with underscores, push_column!) and vpush! (i.e. push_row!)?

Perhaps that should be a separate issue, but it is closely linked; the implementation is very similar to what needs to be done to make the optimised reduce(hcat, ...).

@andyferris (Member)

There aren't any resizeable multidimensional arrays in Base, are there?

(It would be a good feature for a nested array view/wrapper type, though!)

@oxinabox (Contributor) commented May 9, 2017

There aren't any resizeable multidimensional arrays in Base, is there?

No, but there are resizeable 1D Arrays: push! adds an element to a Vector, and by mixing reshape and push! you can do this in 2D.
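One concrete reading of that trick, as a sketch: grow a flat Vector one column at a time, then do a single reshape at the end (the reshape shares the buffer's memory, so no copy is made).

```julia
# Build a 2×3 matrix column by column using a growable 1D buffer.
buf = Int[]
for col in ([1, 2], [3, 4], [5, 6])
    append!(buf, col)      # push each column's elements onto the buffer
end
M = reshape(buf, 2, 3)     # reinterpret the flat buffer as a Matrix

@assert M == [1 3 5; 2 4 6]
```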

(It would be a good feature for a nested array view/wrapper type, though!)

Sure would.
Make a PR to CatViews.jl; that would kick arse.

@ChrisRackauckas (Member) commented Jul 12, 2021

Again, StackViews is relying on indexing: https://github.com/JuliaArrays/StackViews.jl/blob/master/src/StackViews.jl#L128-L133 . So Array(SV) will work, but it will fail on GPUs, or at least be extremely slow. That's not a general solution. We need something that works on all array types in a way that is optimized for each new array type, which is not something an indexing-based approach will be able to provide.

@antoine-levitt (Contributor) commented Jul 12, 2021

There are two interesting things we could want here: an eager and a lazy version. It seems simpler to have Base implement the eager version, i.e. stack([randn(2), randn(2)]) would be a plain 2×2 array. Lazy versions could be implemented by packages.

@mcabbott (Contributor) commented Jul 12, 2021

My round-up of the half-dozen packages for this is at https://github.com/mcabbott/LazyStack.jl#other-packages . When I last checked, the behaviour of LazyStack.stack on CuArrays was to fall back to an eager method. It should make a lazy container (which relies on indexing) only for things which are backed by Arrays.

@JeffBezanson reopened this Jul 12, 2021
@JeffBezanson added the triage label Jul 12, 2021
@JeffBezanson (Member)

From triage:

  • Yes
  • We should implement the eager version in Base
  • We're OK with requiring all the input arrays to have the same number of dimensions, and returning something with one added dimension. Is that sufficient for most uses in the ecosystem currently?

@JeffBezanson added the help wanted label and removed the triage label Jul 15, 2021
@antoine-levitt (Contributor)

Cool, thanks for discussing it! @mcabbott do you have code for this at the ready? If not I can try to make a PR.

@mcabbott (Contributor) commented Jul 15, 2021

There's code in my PR for the case where the dimensions are ordered as func(As)[I, J] == As[J][I] for any ndims. It's a trivial adaptation of what's there for reduce(hcat, ...). (In fact, IIRC, it allows func(As)[I, 1, 1, J] == As[J][I] with some trivial dims.) It requires all slices to be the same size.

If this new function doesn't take a dims keyword and puts the slices together in a fixed way, then restricting either ndims to 1 seems unnecessary, and prevents stacking e.g. a matrix of matrices. The above ordering is very natural for column-major storage, but if I'm thinking correctly, no multi-dimensional operations in Julia currently privilege this.

If it does have some way to control in which dimensions the slices lie, then that requires more thought. The present eachslice allows one dims which indicates the array dim which becomes the iterator; the upgrade in #32310 allows multiple dims and their order matters. If this function wants to be the inverse of the existing one, then perhaps the rule is:

xs = [rand(3,4) for _ in 1:5];

size(func(xs; dims=3)) == (3, 4, 5)
size(func(xs; dims=1)) == (5, 3, 4)   # pushes other dims
size(func(xs; dims=4)) == (3, 4, 1, 5) # inserts trivial dim

Shouldn't be hard to code this efficiently; I guess it needs elements of both reduce(vcat, ...) and reduce(hcat, ...). Note, BTW, that the convention of mapslices for what dims means is the opposite of that of eachslice.

Maybe #31644 (trying to make reduce(hcat, ::Generator) efficient) also deserves mention. The ideal outcome would be a function which handles that.

@antoine-levitt (Contributor) commented Jul 15, 2021

I didn't consider stacking e.g. matrices of matrices, but you're right, there's nothing preventing stack from supporting this. stack(As)[I,J] == As[J][I] seems eminently reasonable. There's indeed no existing function that does this to my knowledge, but it is consistent with column-major order (and therefore with reshape), with ordinary usage in linear algebra (in which matrices are usually more naturally seen as collections of their columns rather than collections of their rows), and with precedent in packages (e.g. RecursiveArrays).
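A hand-rolled sketch of that indexing rule, for any array of same-size arrays (naive_stack is a hypothetical name, and the scalar indexing makes it slow; an efficient version would copy blocks):

```julia
# Sketch of the rule stack(As)[I, J] == As[J][I]:
# inner indices come first, outer indices last (column-major friendly).
function naive_stack(As::AbstractArray{<:AbstractArray})
    T = eltype(first(As))
    out = Array{T}(undef, size(first(As))..., size(As)...)
    for J in CartesianIndices(As), I in CartesianIndices(As[J])
        out[I, J] = As[J][I]
    end
    return out
end

As = [fill(10i + j, 2) for i in 1:2, j in 1:3]  # 2×3 matrix of length-2 vectors
B = naive_stack(As)
@assert size(B) == (2, 2, 3)
@assert B[1, 2, 3] == As[2, 3][1]
```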

Regarding a specific way of controlling the dimensions, what would be the use case? It seems to me this would be too confusing for people to actually use, and they'd be better off writing their own routines. I know I always have to look up the usage and think about it for a while when using mapslices.

stack(::Generator) would be super nice also. Ideally it should behave the same way as stack(collect(...)), meaning it should support things like stack(randn(2,2) for i=1:2, j=1:2) returning a 4D array.

@mcabbott (Contributor)

way of control the dimensions, what would be the use case?

Sometimes people wish to assemble a matrix from rows; you can do reduce(vcat, transpose([fill(row^2,3) for row in 1:5])), but maybe this would be neater?
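For the rows case, a variant of that idiom which transposes each row before concatenating (broadcast transpose rather than transposing the outer vector) can be checked directly:

```julia
# Assemble a matrix from rows: each 1×3 transposed row is vcat-ed.
rows = [fill(row^2, 3) for row in 1:5]
M = reduce(vcat, transpose.(rows))

@assert size(M) == (5, 3)
@assert M[4, 1] == 16   # the 4th row holds 4^2
```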

My package went for the no-options, everything N-dimensional, version. Since it's lazy, allowing options just means you end up re-writing PermutedDimsArray(stack(...), perm). But for an eager function, maybe it's more important to avoid permutedims. Not all permutations are captured by my dims keyword above (nor by eachslice).

@dorn-gerhard

Following the whole discussion, I see two needed applications for arbitrary arrays of arrays:

  1. Turning an array of arrays into a single md-array with more dimensions (if the subarrays all have the same size):
    A[i, j, k][a, b] -> A[a, b, i, j, k]
    The option of where to locate the indices i, j, k (as proposed three posts earlier), and in which order, could be solved by permutedims or defined by a keyword stack(A, dims = [1,7,5]), ending with A[i, a, b, 1, k, 1, j].
    But I guess shuffling dimensions is a performance issue.

  2. Stacking an array of arrays by appending in one or more dimensions, which for vectors of arrays works with
    vcat and hcat to stack into the first or second dimension.
    reduce(vcat, [rand(2,3) for _ = 1:4]): A[i][a,b] -> B[(a,i),b] = B[(i-1)*len_a + a, b]
    reduce(hcat, [rand(2,3) for _ = 1:4]): A[i][a,b] -> B[a,(b,i)] = B[a, (i-1)*len_b + b]
    Arrays of arrays are treated as vectors of arrays: A[i,j,k][a,b] -> A[:][a,b]
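The shapes described in point 2 can be checked directly, using the same sizes as above:

```julia
# vcat stacks slices into the first dimension, hcat into the second.
A = [rand(2, 3) for _ in 1:4]

@assert size(reduce(vcat, A)) == (8, 3)   # 4 blocks of 2 rows each
@assert size(reduce(hcat, A)) == (2, 12)  # 4 blocks of 3 columns each
```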

This task can be generalized to arbitrary multidimensional arrays and condensing two (or more) indices (columnwise: (a,b) = (b-1)*len_a + a):
e.g. A[i, j, k][a, b] -> A[a, b, i, j, k] -> B[i, (j, b), a, k] = B[i, (b-1)*len_j + j, a, k],
which can be done by B = reshape(permutedims(A, (3,2,4,1,5)), len_i, len_b*len_j, len_a, len_k),
but which is rather slow.

So an efficient function for no. 1 (turning (md)arrays of (md)arrays into md-arrays), which is the latest suggestion for stack,
would be great, and maybe you have efficient suggestions for collapsing indices of md-arrays (e.g. something with :).

@mcabbott (Contributor)

I think your 1. is all that's planned; the open questions are whether to take dims at all, and whether to allow dims=(1,7,5) rather than just dims::Integer. And what to call it, e.g. is colliding with DataFrames.stack OK?

For your 2. there are quite a lot of possible ways to do that in N dims. The notation you use here is almost exactly that of TensorCast.jl, for example:

julia> using TensorCast; A = [rand(2,3) for _ = 1:4];

julia> @cast B[(a,i),b] := A[i][a,b]; B == reduce(vcat, A)
true

@aiqc commented Jul 23, 2021

When I think of "stack" I think about making a df taller via batches of concatenated rows, which isn't really what this is about. This is more of a grouping/nesting within a newly created dimension: e.g. a new 2D array is made with existing 1D arrays nested in it, or a new 3D array is made with existing 2D arrays nested in it.


@antoine-levitt (Contributor)

Yeah, I was thinking this too, actually. I'm not a native English speaker, but it seems to me stack can have two meanings: stack a and b next to each other, and stack a and b on top of each other. Reasonably, stack(rand(2), rand(2)) could be a Vector of size 4 or a 2×2 matrix, so maybe it's bad to pick one?

@goretkin (Contributor) commented Jul 28, 2021

When comparing a definition

function stack(...)

to

function reduce(::typeof(vcat), ...)

is the main benefit of the first that it is easier to avoid method ambiguity errors?

I think it can be a good idea to have abbreviated names for more generic operations, the same way people tend to prefer zeros(T, dims) over fill(zero(T), dims). But zeros is written in terms of fill (well, really fill!), and I would assume we should write stack in terms of reduce(vcat, ...) and add methods to the latter.

@stillyslalom (Contributor)

Another approach: a zero-positional-arg version of cat returning a functor with associated reduce specializations, e.g.

julia> import Base: cat, reduce

julia> struct Concatenate{N}
           dims::NTuple{N,Int}
       end

julia> Concatenate(dim::Int) = Concatenate((dim,))
Concatenate

julia> cat(; dims) = Concatenate(dims)
cat (generic function with 4 methods)

julia> reduce(c::Concatenate, itr::AbstractVector{<:AbstractArray}) = cat(itr...; dims=c.dims)
reduce (generic function with 24 methods)

julia> reduce(cat(dims=3), [[1, 2], [3, 4]])
2×1×2 Array{Int64, 3}:
[:, :, 1] =
 1
 2

[:, :, 2] =
 3
 4

@mcabbott (Contributor) commented Dec 4, 2021

@goretkin

is the main benefit of the first that it is easier to avoid method ambiguity errors?

This doesn't seem to be clearly stated above, but one complaint about reduce(hcat, xs) is that it's not very discoverable. If you don't know that it has special methods, you might assume it concatenates pairwise, one array at a time.

The other complaint is that it doesn't generalise very easily. You can write cat([fill(1,2,3) for _ in 1:4]...; dims=3) as reduce((x...) -> cat(x...; dims=3), xs) (pairwise) or reshape(reduce(hcat, xs), ...) (for this dimension). If you have a matrix of vectors instead, then things are awkward.
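The reshape route mentioned above can be made concrete: for equal-size matrices, an eager equivalent of cat(xs...; dims=3) is one hcat plus one reshape (this is the trick, not a Base API):

```julia
# Emulate cat(xs...; dims=3) for equal-size 2×3 matrices:
# hcat them into a 2×12 matrix, then reinterpret as 2×3×4.
xs = [fill(i, 2, 3) for i in 1:4]
C = reshape(reduce(hcat, xs), 2, 3, 4)

@assert size(C) == (2, 3, 4)
@assert C[:, :, 2] == fill(2, 2, 3)   # each slice is the original matrix
```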

So more likely the reduce(hcat, ...) special case could have its implementation replaced by the general one, rather than the reverse.

@stillyslalom

Another approach: ... reduce(cat(dims=3), [[1, 2], [3, 4]])

This was one of the ideas explored in #37196, and one objection there is that cat(; dims=3) already does something.

@MasonProtter (Contributor) commented Jun 1, 2023

I believe this issue can be closed now that we have #43334
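For reference, the Base.stack that resulted (available since Julia 1.9) follows the stack(As)[I, J] == As[J][I] rule discussed above, with a dims keyword for the slice position:

```julia
# Base.stack on a vector of vectors (requires Julia ≥ 1.9).
vov = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

M = stack(vov)            # each inner vector becomes a column
@assert M == [1 4 7; 2 5 8; 3 6 9]

R = stack(vov; dims = 1)  # each inner vector becomes a row
@assert R == [1 2 3; 4 5 6; 7 8 9]
```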
