
create Accumulate iterator #25766

Open · wants to merge 25 commits into master
Conversation

@simonbyrne (Contributor) commented Jan 26, 2018

This makes it possible to use the collect machinery for determining output types.

Also moves code out of multidimensional.jl (which is a bit of an odd place for it).

To do:

  • get slicing working.
  • array types for pairwise
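
A rough sketch of the iterator shape this PR introduces (assumed form based on the fields quoted later in this thread; the PR's actual definition may differ in details): a lazy wrapper whose elements are the running accumulation, so that collect's widening machinery picks the output element type instead of promote_op.

struct Accumulate{O,V,I}
    op::O    # binary operation, e.g. +
    v0::V    # optional initial value (or a sentinel when absent)
    iter::I  # wrapped iterator
end

# e.g. collect(Accumulate(+, 0, [1, 2, 3])) would lazily produce the running
# sums 1, 3, 6, with the element type chosen by collect.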

@kshyatt added the iteration (Involves iteration or the iteration protocol) label Jan 27, 2018
rcum_promote_type(op, ::Type{Array{T,N}}) where {T,N} = Array{rcum_promote_type(op,T), N}

# accumulate_pairwise slightly slower than accumulate, but more numerically
# stable in certain situations (e.g. sums).
Member:

"stable" is the wrong word here. Use "accurate"


# accumulate_pairwise slightly slower than accumulate, but more numerically
# stable in certain situations (e.g. sums).
# it does double the number of operations compared to accumulate,
Member:

"double the number of op calls" would be more informative

end

function cumsum!(out, v::AbstractVector, dim::Integer)
# we dispatch on the possibility of numerical stability issues
@stevengj (Member), Jan 27, 2018:

rephrase to "on the possibility of roundoff errors"

(Again, this is misusing the term "numerical stability". Even naive summation is backwards stable, it is just less accurate than pairwise summation.)
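
A small self-contained illustration of the accuracy point (not from the PR): naive left-to-right accumulation of Float32 ones stalls once the running sum reaches 2^24, while Base's pairwise summation does not.

x = fill(1.0f0, 2 * 10^7)
foldl(+, x)   # 1.6777216f7: adding 1.0f0 to 2^24 rounds back down to 2^24
sum(x)        # 2.0f7: pairwise summation stays exact for this input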

itrstate, accval = accstate
val, itrstate = next(acc.iter, itrstate)
if accval === uninitialized
accval = reduce_first(acc.op, val)
Member:

So, this is type-unstable even for things like Accumulate(+, itr)?

Contributor Author:

Yes, but the new Union optimisations mean that there is no performance hit.

Member:

As I understand it, the Union optimizations help with cases like this, but retaining type stability will still be faster. Worth Nanosoldiering or otherwise benchmarking to be sure though.

Contributor Author:

In most cases, the only type instability will be in the first call to next, which is typically unrolled, so we should be able to avoid that.
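
A minimal sketch of the "unroll the first next" idea (illustrative only, written against the pre-1.0 start/next/done protocol this PR targets, and omitting the widening that the real collect machinery performs). Handling the first element outside the loop gives the accumulator a concrete type for the remaining iterations:

function naive_accumulate(op, itr)
    s = start(itr)
    done(itr, s) && return Vector{eltype(itr)}()   # empty input
    x, s = next(itr, s)
    acc = Base.reduce_first(op, x)   # accumulator is concretely typed from here on
    out = [acc]
    while !done(itr, s)
        x, s = next(itr, s)
        acc = op(acc, x)
        push!(out, acc)
    end
    return out
end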

@jw3126 (Contributor) commented Jan 28, 2018

There are type stability issues if v0 is different from eltype(iter) (which is not uncommon in the wild):

julia> collect(Base.Accumulate(+, 1., Int[]))
0-element Array{Real,1}
julia> collect(Base.Accumulate(+, 1., Int[1]))
1-element Array{Float64,1}:
 2.0

@simonbyrne changed the title from "WIP: create Accumulate iterator" to "create Accumulate iterator" Jan 30, 2018
@simonbyrne (Contributor Author)

Okay, this should now fix some of the type stability issues.

In summary:

It fixes #25506, and generally makes all the promotion machinery work without falling back on promote_op (as per @martinholters suggestion).

I've also changed it so that cumsum/cumprod now have the same promotion behaviour as sum/prod. This was argued against previously (#18364 (comment)), but now that we use separate reduction operators, the old behaviour is still accessible via accumulate(+,x)/accumulate(*,x).

The main question is whether we want mapaccumulate (#21152)?
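
To illustrate the promotion change described above (hypothetical results under the proposed behaviour, not verified output from this branch): cumsum would widen small integers the way sum does, while accumulate(+, x) would keep the element type and hence the old wrap-around behaviour.

x = Int8[100, 100]
cumsum(x)         # would widen like sum: [100, 200] with eltype Int
accumulate(+, x)  # would keep Int8 and overflow: Int8[100, -56]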

@simonbyrne (Contributor Author)

@nanosoldier runbenchmarks(ALL, vs=":master")

@ararslan (Member)

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@simonbyrne (Contributor Author)

Ah, the performance hit seems to be that copyto! doesn't unroll the first next.

@simonbyrne (Contributor Author)

@nanosoldier runbenchmarks(ALL, vs=":master")

@ararslan (Member)

@nanosoldier runbenchmarks(ALL, vs=":master")

@martinholters (Member)

👍 to the general idea here. I would have loved to review this in more detail, but haven't had the time and it's unlikely that I will have today.

@@ -1407,6 +1407,8 @@ end

@deprecate which(s::Symbol) which(Main, s)

@deprecate accumulate!(op, dest::AbstractArray, args...) accumulate!(dest, op, args...)
Contributor:

Why dest first? Function arguments have highest priority.

Contributor Author:

good point.

test/arrayops.jl Outdated
@inferred accumulate(*, String[])
@test accumulate(*, ['a' 'b'; 'c' 'd'], 1) == ["a" "b"; "ac" "bd"]
@test accumulate(*, ['a' 'b'; 'c' 'd'], 2) == ["a" "ab"; "c" "cd"]
end
Contributor:

Adding

@inferred accumulate(+, 1.0, Int[])
@test eltype(accumulate(+, 1.0, Int[])) == Float64

would be good. (Probably not in this testset.)


function accumulate!(dest, op, v0, X, dim::Integer)
dim > 0 || throw(ArgumentError("dim must be a positive integer"))
axes(A) == axes(B) || throw(DimensionMismatch("shape of source and destination must match"))
Contributor:

What are A and B? They should be dest and X.

Contributor:

I think we also need more accumulate! tests. Historically accumulate! would be implicitly called anyway by each accumulate test so this was not necessary. AFAICT there is only a single accumulate! test now.
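
A hypothetical example of the kind of direct accumulate! test being asked for (written with the op-first argument order of released Julia; the PR at this point had dest first, which was questioned above):

out = similar([1, 2, 3])
accumulate!(+, out, [1, 2, 3])
@test out == [1, 3, 6]

out2 = zeros(3)
cumsum!(out2, [1, 2, 3])
@test out2 == [1.0, 3.0, 6.0]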

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@martinholters (Member)

Looking only at the 1D case for now, I'm getting:

julia> A = rand(100);

julia> @btime Base.accumulate_pairwise(Base.add_sum, undef, $A, Base.HasShape{1}()); # this is what cumsum(A) does
  771.947 ns (9 allocations: 1.06 KiB)

julia> @btime Base._accumulate(Base.add_sum, undef, $A, Base.HasShape{1}(), nothing); # this uses the Accumulate iterator
  175.124 ns (2 allocations: 912 bytes)

For comparison, on master:

julia> @btime cumsum($A);
  147.019 ns (1 allocation: 896 bytes)

Using an element type which does not do pairwise summing:

julia> A = rand(Int, 100);

julia> @btime cumsum($A); # master
  126.825 ns (1 allocation: 896 bytes)

julia> @btime cumsum($A); # this PR
  125.654 ns (2 allocations: 912 bytes)

The overhead due to the Accumulate iterator apparently is not the problem. But the new accumulate_pairwise is. However, it has to be changed to give consistent results (in terms of return type) with the non-pairwise version, I suppose.

So...

Can we just make the API-breaking parts of this change and actually add the Accumulate iterator later?

Doesn't seem to gain us much.

@stevengj (Member)

Can we just fix #25506 in the meantime?

@jw3126 (Contributor) commented Mar 20, 2018

@stevengj #25515 would fix just #25506. I closed it in favor of this PR. Should I reopen it as a fix until this PR lands?

@simonbyrne (Contributor Author)

how about I just keep the original pairwise stuff, since that seems to be the problem? We can always change that later.

The one thing to do in 0.7 is to change the small integer behaviour, since that is the main breaking change.

@martinholters (Member)

This is remarkable:

julia> A=rand(100);

julia> @btime cumsum($A);
  944.120 ns (9 allocations: 1.06 KiB)

julia> Base._similar_for(c::AbstractArray, ::Type{T}, itr, ::Base.HasShape) where {T} = similar(c, T, axes(itr))

julia> @btime cumsum($A);
  666.361 ns (7 allocations: 1.02 KiB)

That's on a different machine than above; the same cumsum on master is about 200 ns here. So still quite a gap, but noting that _similar_for isn't changed in this PR, maybe there's something to be gained elsewhere, too. OTOH, IIUC, before this PR that _similar_for method is only called for EltypeUnknown, HasShape iterators which are not Generators, of which I don't think there are any in Base, right?

@martinholters (Member)

Although everything is type-stable and inferable, if I disable the ifs in accumulate_pairwise by manually predicting them for this case, as in:

function accumulate_pairwise(op, v0, itr, ::Union{HasLength,HasShape{1}})
    i = start(itr)
    #if done(itr,i)
    #    return collect(Accumulate(op, v0, itr))
    #end
    v1,i = next(itr,i)
    y = reduce_first(op,v0,v1)

    Y = _similar_for(1:1, typeof(y), itr, IteratorSize(itr))
    L = linearindices(Y)
    n = length(L)
    j = first(L)
    while true
        Y[j] = y
        if done(itr,i)
            return Y
        end
        y,j,i,wider = _accumulate_pairwise!(op,Y,itr,y,j+1,i,last(L)-j,true)
        #if !wider
            return Y
        #end
        R = promote_typejoin(eltype(Y), typeof(y))
        newY = similar(Y, R)
        copyto!(newY,1,Y,1,j)
        Y = newY
    end
end

I'm getting

julia> @btime cumsum($A);
  197.452 ns (1 allocation: 896 bytes)

which is on par with master.

The while loop might be better written using a recursion, but I'm puzzled as to why the first if condition is also problematic.

@martinholters (Member)

My theory for the effect of disabling the ifs is that it pushes the method across the inlining threshold. And then we suffer from the functions not being specialized on op. By forcing all relevant methods to be either inlined or specialized on op (i.e. (op::F, ...) where F), I obtain:

                                     master                                  this PR with specialization
A = rand(100);  @btime cumsum($A)    154.226 ns (1 allocation: 896 bytes)     154.356 ns (1 allocation: 896 bytes)
A = rand(10^6); @btime cumsum($A)    1.249 ms (2 allocations: 7.63 MiB)       1.284 ms (2 allocations: 7.63 MiB)

Yay! Unfortunately, that doesn't help the more-than-one-dimensional case. I'll look into that...

@simonbyrne, ok if I push my updates to your branch?
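
A sketch of the specialization point made above (general Julia behaviour, not code from this PR): the compiler's heuristics avoid specializing on a Function argument that is only passed through to other calls, and an explicit type parameter forces specialization on the concrete function type.

# may be compiled once, unspecialized on `op`, because `op` is only forwarded:
apply_via(op, x) = sum(op, x)

# the `op::F ... where F` form forces a specialization per concrete function:
apply_via_specialized(op::F, x) where {F} = sum(op, x)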

@simonbyrne (Contributor Author)

Yes, certainly! I won't have time to look at this for a few days, but make what changes you see fit.

@martinholters (Member)

Ok, I could substantially reduce the setup time for the multi-dimensional case and make it type-stable (if the input permits it):

julia> A = ones(1,1);

julia> @btime cumsum($A, dims=1);
  42.639 ns (1 allocation: 96 bytes)

julia> @btime cumsum($A, dims=2);
  42.780 ns (1 allocation: 96 bytes)

Compare with master:

julia> @btime cumsum($A, dims=1);
  33.831 ns (1 allocation: 96 bytes)

julia> @btime cumsum($A, dims=2);
  149.753 ns (4 allocations: 144 bytes)

Unfortunately, the time for the actual computation is still worse than on master by more than a factor of 2:

# this PR
julia> srand(0);

julia> A = rand(100,100);

julia> @btime cumsum($A, dims=1);
  24.235 μs (2 allocations: 78.20 KiB)

julia> @btime cumsum($A, dims=2);
  22.350 μs (2 allocations: 78.20 KiB)
#master
julia> @btime cumsum($A, dims=1);
  8.637 μs (2 allocations: 78.20 KiB)

julia> @btime cumsum($A, dims=2);
  9.056 μs (5 allocations: 78.25 KiB)

@martinholters (Member)

Uh, no, this isn't it. While just iterating over CartesianIndices(axes(X)) is nice from a type-stability point of view, using the no-op loop

function _accumulate!(op::F, dest::AbstractArray{T}, v0, X, dim, inds, st, first_in_dim, dim_delta, widen) where {F,T}
    while !done(inds, st)
        i, st = next(inds, st)
    end
    return dest
end

is still slower than master, although it doesn't compute anything besides incrementing the index.

@simonbyrne (Contributor Author)

I've managed to fix the performance of plain accumulate(+, X)

v0::V
iter::I
end
Accumulate(op, iter) = Accumulate(op, undef, iter) # use `undef` as a sentinel
Sponsor Member:

This is exactly what I was afraid would happen with undef.

Contributor Author:

We can create another type to use as a sentinel: any suggestions for names? (I had originally used uninitialized, which makes slightly more sense).
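
A possible shape for such a sentinel (the names NoInit/noinit are purely illustrative, not a proposal from the PR): a private singleton type avoids overloading the meaning of undef.

struct NoInit end          # hypothetical marker for "no initial value supplied"
const noinit = NoInit()

Accumulate(op, iter) = Accumulate(op, noinit, iter)
hasinit(acc::Accumulate) = !(acc.v0 isa NoInit)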

@JeffBezanson (Sponsor Member)

I will repeat the request to do the small integer type change first, so we can do the rest of this any time.

@mbauman (Sponsor Member) commented Mar 29, 2018

For what it's worth, here are the updated numbers on my spot-checks:

 # master/c2efb04eae                       |   # sb/accumulate/f0bb59dea5            
 julia> srand(0);                          |   julia> srand(0);                      
                                           |                                         
 julia> A = rand(100);                     |   julia> A = rand(100);                 
                                           |                                         
 julia> @btime cumsum($A);                 |   julia> @btime cumsum($A);             
   180.927 ns (1 allocation: 896 bytes)    |     189.005 ns (1 allocation: 896 bytes)
                                           |                                         
 julia> A = rand(10^6);                    |   julia> A = rand(10^6);                
                                           |                                         
 julia> @btime cumsum($A);                 |   julia> @btime cumsum($A);             
   1.463 ms (2 allocations: 7.63 MiB)      |     1.520 ms (2 allocations: 7.63 MiB)  
                                           |                                         
 julia> A = rand(10,10);                   |   julia> A = rand(10,10);               
                                           |                                         
 julia> @btime cumsum($A, dims=1);         |   julia> @btime cumsum($A, dims=1);     
   121.121 ns (1 allocation: 896 bytes)    |     285.664 ns (1 allocation: 896 bytes)
                                           |                                         
 julia> @btime cumsum($A, dims=2);         |   julia> @btime cumsum($A, dims=2);     
   259.957 ns (4 allocations: 944 bytes)   |     287.504 ns (1 allocation: 896 bytes)
                                           |                                         
 julia> A = rand(1000,1000);               |   julia> A = rand(1000,1000);           
                                           |                                         
 julia> @btime cumsum($A, dims=1);         |   julia> @btime cumsum($A, dims=1);     
   1.396 ms (2 allocations: 7.63 MiB)      |     3.046 ms (2 allocations: 7.63 MiB)  
                                           |                                         
 julia> @btime cumsum($A, dims=2);         |   julia> @btime cumsum($A, dims=2);     
   1.108 ms (5 allocations: 7.63 MiB)      |     2.553 ms (2 allocations: 7.63 MiB)  

@ararslan (Member)

@nanosoldier runbenchmarks(ALL, vs=":master")

@JeffBezanson removed the triage (This should be discussed on a triage call) label Apr 5, 2018
@StefanKarpinski (Sponsor Member)

This doesn't need to be triaged anymore, right? Since after #26658 is merged this won't be breaking.


@stevengj (Member) commented Mar 1, 2020

Any update on this?

@JeffBezanson (Sponsor Member)

There is now an Accumulate iterator in Iterators, plus #34656. This could maybe be rebased to use that.
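
For reference, a brief usage sketch of the Iterators.accumulate API mentioned here (available since Julia 1.5):

collect(Iterators.accumulate(+, [1, 2, 3, 4]))            # [1, 3, 6, 10]
collect(Iterators.accumulate(+, [1, 2, 3]; init = 10.0))  # eltype Float64: [11.0, 13.0, 16.0]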
