
Staged bsxfun and other broadcast operations #3100

Merged
5 commits merged into JuliaLang:master from the staged_bsxfun branch on May 21, 2013

Conversation

@toivoh
Contributor

@toivoh toivoh commented May 14, 2013

This patch adds an implementation of bsxfun(op::Function, As::Array...)
that generates inner loop functions specialized on op, the number of arguments
in As, and the number of dimensions of the result.

The machinery is also used to provide the following (see the usage sketch after the list):

  • broadcasting operators .+ .- .* ./ for Array arguments
  • bsxfun!(op::Function, As::Array...), which stores the result in
    As[1]. Edit: removed this one, pending a rewrite with a separate output argument.
  • get_bsxfun(op::Function) and get_bsxfun!(op::Function) to instantiate a
    broadcasting operation from a given per-element operation
  • bsx_getindex(A::AbstractArray, inds::Array...),
    a pointwise broadcasting indexing operation as described in Broadcasting pointwise indexing operator #2591,
    and a matching bsx_setindex!(A::AbstractArray, X::Array, inds::Array...)
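For a flavor of the intended usage, a sketch based on the list above (bsxfun and get_bsxfun as described; the argument conventions may differ in detail):

A = rand(3, 4); b = rand(3)     # b is broadcast along the second dimension
C = bsxfun(+, A, b)             # size(C) == (3, 4); C[i,j] == A[i,j] + b[i]
plus = get_bsxfun(+)            # instantiate a reusable broadcasting +
D = plus(A, b)                  # same result as bsxfun(+, A, b)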

It should be straightforward to extend the functions to work with StridedArray arguments instead of just Array, but I'm uncertain of the proper interface to work with subarrays using manual linear indexing.

The code generates inner loop functions such as

function bsx_plus_inner!(shape::NTuple{2,Int}, arrays::NTuple{2,Array},
                         strides::Matrix{Int}, result::Array)
    @assert (size(strides)==(2,2))
    (n1,n2) = shape
    if (n1==0); return; end
    if (n2==0); return; end
    (A1,A2) = arrays
    k1 = 1
    k2 = 1
    (s1_1,s2_1,s1_2,s2_2) = strides
    k = 1
    for i2 = 1:n2
        for i1 = 1:n1
            begin 
                result[k] = +(A1[k1],A2[k2])
                k += 1
            end
            k1 += s1_1
            k2 += s2_1
        end
        k1 += s1_2
        k2 += s2_2
    end
end

The example above is for bsxfun(+, A, B) with a two-dimensional result.
A linear index is updated using per-array strides, which should make it possible
to accommodate both broadcasting and subarrays.
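To make the stride bookkeeping concrete, here is a worked example (mine, not code from the patch) for bsxfun(+, A, b) with A of size (5,8) and b of size (5,):

shape = (5, 8)               # result shape
strides_A = (1, 5)           # memory strides of the 5x8 array A
strides_b = (1, 0)           # b has extent 1 along dim 2, so its stride there is 0
# convert to loop strides: subtract what the inner loop has already advanced
loop_A = (1, strides_A[2] - strides_A[1]*shape[1])   # (1, 0)
loop_b = (1, strides_b[2] - strides_b[1]*shape[1])   # (1, -5): rewind b for each new column

These values correspond to s1_1 = 1, s1_2 = 0 and s2_1 = 1, s2_2 = -5 in the generated function above.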

On my machine, the timing example from #2991 gives

elapsed time: 0.806272502 seconds  # subcol_forloop(a, b)
elapsed time: 3.1062779 seconds    # subcol_bsxfun(a, b)
elapsed time: 0.936735113 seconds  # .- from this pull request

i.e. the generated .- is only 16% slower than the handcoded example in #2991.

It's a somewhat complex piece of code (that generates code that generates code), and I think that it warrants some discussion. I could polish this forever, but I think that I need some reactions before moving forward. What do you guys think? Would something like this be useful? What parts of it?

Edit: Renamed bsxfun and bsx to broadcast.

@toivoh mentioned this pull request on May 14, 2013
@JeffBezanson
Member

Very very cool!! And at 200 lines of code, admirably compact really. I think it's pretty likely that we will want this.

@StefanKarpinski
Member

Yes, very nice. Thanks for tackling this – it's been hanging over our heads ever since you convinced us it was the right thing to do. How about calling the module and file something less inscrutable, like Broadcast and broadcast.jl? If the function remains functionally identical to the Matlab one (I'm not super familiar with bsxfun), then keeping the name is probably appropriate, but if its behavior changes, we might want to rename it to broadcast, since that's actually meaningful in English.

@johnmyleswhite
Member

As a non-native speaker of Matlab and a native speaker of English, I really prefer the term broadcast over bsxfun.

@ViralBShah
Member

bsxfun is one of the most bizarre matlab names I have come across, and perhaps isequalwithequalnans is the only one that beats it.

@StefanKarpinski
Member

At least you can read isequalwithequalnans and understand what it does. The name bsxfun is problematic in both directions: if you encounter it, you have no idea what it means and I for one can never remember how to spell it – does the "s" come first or the "x"?

@timholy
Member

timholy commented May 15, 2013

I'm excited about this, and as usual you've done a remarkable job.

My main concern was about performance with small-array inputs, where the overhead of computing sizes might be quite noticeable. I did some profiling, using

A = randn(5,8);
b = randn(5);
@sprofile (for i = 1:100000; Bsx.bsxfun(.+, A, b); end)

with the following highlights: out of 1100 total samples,

  • 409 in bsx.jl; broadcast_shape; line: 13;
    • 121 in bsx.jl; broadcast_shape; line: 19
    • 180 in array.jl; size; line: 14 (most of this is due to allocation/garbage collection)
  • 481 in bsx.jl; broadcast_args; line: 63
    • 156 bsx.jl; calc_loop_strides; line: 49
    • 96 bsx.jl; calc_loop_strides; line: 50
    • 114 /tmp/bsx.jl; calc_loop_strides; line: 53

I tried writing an alternative version of broadcast_shape:

function mybroadcast_shape(As::AbstractArray...)
    nd = ndims(As[1])
    for i = 2:length(As)
        nd = max(nd, ndims(As[i]))
    end
    bshape = ones(Int, nd)
    for A in As
        for d = 1:ndims(A)
            n = size(A, d)
            if n != 1
                if bshape[d] == 1
                    bshape[d] = n
                elseif bshape[d] != n
                    error("arrays cannot be broadcast to a common size")
                end
            end
        end
    end
    return tuple(bshape...)
end

The main difference is that this goes to some effort to avoid allocation, but it's actually not that much longer or harder to understand. The results (after warm-up):

julia> @time (for i = 1:10^5; t = Bsx.broadcast_shape(A, b); end)
elapsed time: 0.313923066 seconds

julia> @time (for i = 1:10^5; t = Bsx.mybroadcast_shape(A, b); end)
elapsed time: 0.168526285 seconds

So this might be worthwhile. I suspected it was going to be harder to write a more efficient version of calc_loop_strides, but that turned out not to be the case:

function mycalc_loop_strides(shape::Dims, As::AbstractArray...)
    # squeeze out singleton dimensions in shape
    dims = Array(Int, 0)
    loopshape = Array(Int, 0)
    nd = length(shape)
    sizehint(dims, nd)
    sizehint(loopshape, nd)
    for i = 1:nd
        s = shape[i]
        if s > 1
            push!(dims, i)
            push!(loopshape, s)
        end
    end
    nd = length(loopshape)

    strides = [(size(A, d) > 1 ? stride(A, d) : 0) for A in As, d in dims]
    # convert from regular strides to loop strides
    for k=(nd-1):-1:1, a=1:length(As)
        strides[a, k+1] -= strides[a, k]*loopshape[k]
    end

    tuple(loopshape...), strides
end

This one is admittedly less pretty, but the timing results are surprising:

julia> @time (for i = 1:10^5; shape, strides = Bsx.calc_loop_strides((5,8), A, b); end)
elapsed time: 0.527392012 seconds

julia> @time (for i = 1:10^5; shape, strides = Bsx.mycalc_loop_strides((5,8), A, b); end)
elapsed time: 0.188888371 seconds

So this work seems to suggest we should be able to get a roughly 2x further performance increase for small-array inputs.

@lindahua
Contributor

This is a great piece of work!

@toivoh
Contributor Author

toivoh commented May 15, 2013

Really happy for the great response, everyone!

@StefanKarpinski: Yeah, I've been feeling kinda the same way :)

About the naming, I'm not a big fan of bsxfun either. I figured that someone else understood the name :) So how about renaming bsxfun to broadcast and bsxfun! to broadcast!, and then just replacing bsx with broadcast everywhere, including in the file name? In that case, I guess the same should go for the existing bsxfun as well.

@timholy: Thanks for taking the time to go over those functions! I've incorporated your versions into the pull request. I've primarily been trying to get decent performance with big matrices so far, so there might be more places that could benefit from a gentle hand when it comes to small ones. One possibility to push that aspect even further could be to generate more of the code dependent on the number of arrays and their dimensions, but at the same time that would probably make it less maintainable. It's probably better to see what we can get by simple means first.

@StefanKarpinski
Member

It seems like everyone dislikes the name bsxfun so much that no one even minds breaking compatibility with Matlab. If nothing else, we could have a permanent deprecation for it. I'm inclined to just merge this whenever you feel that it's ready. Any issues will get ironed out as people use it.

@lindahua
Contributor

In MATLAB, bsxfun means Binary Singleton eXpansion and it was supposed to only apply to binary functions -- writing something like bsxfun(f, a, b, c) is not allowed in MATLAB.

If this function in Julia can accept more than two arguments, then bsxfun is no longer an appropriate name. Since the semantics change, using a different name is not breaking compatibility (at least to me).
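For instance, using the eventual broadcast name, the multi-argument form looks something like this (an illustrative sketch, not code from the patch):

a = rand(4); B = rand(1, 3); C = rand(4, 3)
broadcast((x, y, z) -> x*y + z, a, B, C)   # 4x3 result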

@lindahua
Contributor

+1 for broadcast.

@toivoh
Contributor Author

toivoh commented May 15, 2013

Ok, I can do the renaming and take one final look over it, then it should be ready to merge.

One thought: as I wrote it, bsxfun!(+, A, B) adds A and B together and then stores the result into A. Perhaps it would be better to have a separate out argument? Perhaps as a keyword argument out to bsxfun (broadcast)? Maybe I should take bsxfun! out until we have settled on the API.

There's still a number of things that would be nice to add, but those could come in subsequent pull requests:

  • SubArray support
  • a reasonable fallback for general AbstractArray arguments
  • a copy! implementation to aid slow SubMatrix copy #3098

I'm also hoping to find a nice abstraction to the broadcasting machinery, so that users can implement things similar to bsxfun and bsx_getindex with it.

@toivoh
Contributor Author

toivoh commented May 15, 2013

Another question: Is there a trick to make the makefile recognize new test scripts? I added bsx.jl to test/ and added the name to test/runtests.jl, but make test-bsx just gives

make[1]: *** No rule to make target `bsx'.  Stop.

@timholy
Member

timholy commented May 15, 2013

I'm also on board with renaming it to broadcast. I like the idea of the permanent deprecation. I also like the idea of a separate output argument for the ! version. Presumably broadcast! never aliases, so there should be no harm in broadcast!(A, A, B).

@StefanKarpinski
Member

I think the broadcast!(+, A, B, C) version is best as the most general mutating version.

@timholy
Member

timholy commented May 15, 2013

Ah, right, I forgot the operator. How about broadcast!(Z, +, A, B, ...) where Z is the output? Separating it with the operator might clarify things.

@StefanKarpinski
Member

You want the operator first so that you can provide it using the do-block syntax. It also just seems better there, imo.
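For illustration, with the function first the do-block form reads naturally (a sketch, assuming a preallocated dest):

broadcast!(dest, A, B) do a, b
    a + 2b              # equivalent to broadcast!((a, b) -> a + 2b, dest, A, B)
end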

@toivoh
Contributor Author

toivoh commented May 15, 2013

Ok, I think that this should be good to go now. I did the rename (haven't touched the original bsxfun, though) and squashed the commits. I left out broadcast! for now, will put it in when I have a proper implementation.

@timholy
Member

timholy commented May 15, 2013

You want the operator first so that you can provide it using the do-block syntax.

Right. I always forget about that.

@bsxfan

bsxfan commented May 15, 2013

Now I'll have to change my handle to @broadcastfan.

Jokes aside, I'm very happy to see this development. As an exercise in learning Julia, I have been playing around with some operator overloading implementations of forward and reverse-mode automatic differentiation. Broadcasting operators require some care when propagating derivatives through them---you end up having to either broadcast or sum the derivatives. I'm hoping to find a nice way to use @toivoh's solution to do this.
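For a concrete instance of the broadcast-or-sum issue: for z = x .+ y with x of size (3,4) and y of size (3,), a reverse-mode sketch (mine, not ad4julia code) looks like

dz = ones(3, 4)         # incoming gradient of z = x .+ y
dx = dz                 # x already had the full shape: pass through
dy = vec(sum(dz, 2))    # y was broadcast along dim 2: sum back to shape (3,)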

@StefanKarpinski
Member

Now I'll have to change my handle to @broadcastfan.

I thought about that. Still a good inside reference :-)

@ViralBShah
Member

This is really nice, and yes, +1 for calling it broadcast. I suspect that some of this machinery could also be used for devectorization of expressions, which is currently what @lindahua's Devec.jl package does.

It would be nice to also have an update to the manual and std library reference as part of this pull request, so that new users can actually learn and start using this.

@toivoh
Contributor Author

toivoh commented May 17, 2013

@bsxfan: It would definitely be nice to arrange things so that you can reuse the broadcast machinery for AD, and I think it should be quite doable. At least for forward AD, I guess that if you can do AD on f, then you should get AD of broadcast(f, ...) by passing in the AD'd version of f and its arguments instead. Please ask if you have any questions.

The machinery already supports devectorization in the sense that you can pass custom elementwise functions such as f(x,y,z,w) = x*y - z/w. Though the functionality is not yet exposed, it is also possible to inline code directly. I hope that the code in broadcast.jl can be the expert in Base on how to write broadcast loops, and that it can be reused everywhere that functionality is needed. I have not looked closely at the Devectorize package (I see that it is quite capable), but perhaps the broadcast machinery could serve as a back end, or @lindahua and I could join efforts to make this a back end that works for both cases. There must be substantial overlap.

I hope that I can get some time soon to update documentation.

@bsxfan

bsxfan commented May 17, 2013

@toivoh, to stimulate your thinking about updating (in-place) broadcasting operations, here is an example of the kind of thing I need for AD: https://github.com/bsxfan/ad4julia/blob/master/Procrustes.jl
(I'm using this for reverse-mode AD. The broadcasting is just partly implemented.)

As mentioned, I sometimes need to broadcast and sometimes to reduce dimensionality by summing. But the thing that kept me busy for a while was when and how to do the in-place updates.

  • I learnt today that for matrices, D += S is not an in-place operation. The old D is garbage collected and a new one is allocated. I believe that even D[:] += S[:] will do unnecessary copying, because I think it expands to D[:] = D[:] + S[:], in which case I suspect that both D and S may be copied during the getindex [:] operations on the RHS. I settled on a loop: D[i] += S[i] (see the sketch after this list). Timing experiments support this theory: the loop is much faster.
  • D mostly dictates the dimensions of the result, but the type may be promoted. For example, on input D may be real and S complex. Also, size(D) may be expanded from (n,) to (n,1). If the type is promoted, then in-place addition is not possible, but for the size expansion it is.
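A minimal sketch of that loop, with the hypothetical name addto!:

function addto!(D, S)
    @assert size(D) == size(S)
    for i = 1:length(D)
        D[i] += S[i]    # elementwise in-place update, no temporaries
    end
    D
end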

@toivoh
Contributor Author

toivoh commented May 18, 2013

I added broadcast!(f, dest, As...) with an explicit destination argument.
I also made broadcast(f) and broadcast!(f) return the general broadcast functions created for f.
Please tell me if this is too cute. There is no ambiguity, since broadcast(f) with no arguments was meaningless.
(Or should that just have called f()?)

@toivoh
Contributor Author

toivoh commented May 18, 2013

Ok, I realized that it was too cute, since broadcast(f) = f() actually makes sense.
I provided separate broadcast_function(f) and broadcast!_function(f) instead. (I think the underscores are warranted, but feedback is appreciated.)
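Usage then looks like this sketch (assuming the factory functions return the obvious wrappers):

vplus  = broadcast_function(+)     # vplus(A, b) == broadcast(+, A, b)
vplus! = broadcast!_function(+)    # vplus!(dest, A, b) writes the result into dest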

@toivoh
Contributor Author

toivoh commented May 21, 2013

So, what else should I do before this gets merged? Squash the commits again, anything else?

@StefanKarpinski
Member

I think it's fine as is.

@StefanKarpinski added a commit that referenced this pull request on May 21, 2013:
Staged bsxfun and other broadcast operations
@StefanKarpinski merged commit 9718d26 into JuliaLang:master on May 21, 2013
@StefanKarpinski
Member

This is fun:

julia> rand(5,5) .+ [1:5]
5x5 Float64 Array:
 1.87304  1.60196  1.67241  1.29966  1.28922
 2.55065  2.6901   2.31729  2.18837  2.67263
 3.31203  3.44544  3.34805  3.19786  3.40516
 4.90888  4.3661   4.88753  4.14251  4.49846
 5.31663  5.58011  5.70678  5.36226  5.4865

julia> rand(5,5) .+ [1:5]'
5x5 Float64 Array:
 1.28892  2.52568  3.0037   4.18982  5.24107
 1.10544  2.91998  3.52396  4.34637  5.44386
 1.52712  2.58849  3.52277  4.30855  5.22099
 1.39967  2.66229  3.74268  4.64086  5.93869
 1.58634  2.55684  3.78382  4.36793  5.44376

@StefanKarpinski
Member

Unfortunately the ./ operator doesn't seem to do what I'd want it to:

julia> A = rand(5,5);

julia> A ./ sum(A,1)
ERROR: argument dimensions must match
 in promote_shape at operators.jl:161
 in ./ at array.jl:909

julia> A ./ sum(A,2)
ERROR: argument dimensions must match
 in promote_shape at operators.jl:161
 in ./ at array.jl:909

That would be extremely handy for matrix normalization.
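Once this works as intended, the normalization idiom should be simply (a sketch of the expected behavior):

A = rand(5, 5)
colnormed = A ./ sum(A, 1)   # each column sums to 1
rownormed = A ./ sum(A, 2)   # each row sums to 1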

@toivoh
Contributor Author

toivoh commented May 21, 2013

Thanks!
That's weird, it definitely should. So much for thinking that it was enough to test .+. Will look into it.

@JeffBezanson
Member

This can totally replace bsxfun, right? We should get rid of the old one.

@toivoh
Contributor Author

toivoh commented May 21, 2013

@StefanKarpinski: Ah, of course. There's an old method for ./ lying about that is more specific. I'll take a pass through those and see what should be done about them.

@JeffBezanson: I should probably write some kind of fallback to deal with non-strided arrays first. Then we can get rid of the old bsxfun.

@wlbksy
Contributor

wlbksy commented May 26, 2013

1. Since ./ works under several circumstances, would mod support broadcast too?
2. Is Int broadcast not supported?

julia> [1;2;3]./[2 3 4]
ERROR: argument dimensions must match
 in promote_shape at operators.jl:148
 in ./ at array.jl:867

julia> [1;2;3]./[2 3 4.]
3x3 Float64 Array:
 0.5  0.333333  0.25
 1.0  0.666667  0.5
 1.5  1.0       0.75

EDIT: sorry for problem 1, that should not be a problem.

@toivoh
Contributor Author

toivoh commented May 26, 2013

Yeah, I don't see a problem with broadcasting for mod, as long as there is general consensus that it should be broadcasting. I just started with operators that literally begin with ., as it seems most obvious that those are elementwise.

About Int: I have a fix for that in #3201. Since there don't seem to be any objections, I will merge it soon.

@ViralBShah
Member

It would be nice to have broadcast for the boolean operators as well.

@toivoh deleted the staged_bsxfun branch on June 9, 2013
@toivoh
Contributor Author

toivoh commented Jun 9, 2013

Definitely. I think that I need to figure out how to read and write BitArrays with reasonable efficiency in a broadcast loop first, though; I don't want to introduce any major performance regressions along with the broadcast stuff. Will have to look at the BitArray implementation to see what I can do.

My long term plan is to have an abstract type NDIterator{T,N} to iterate over the broadcast grid of a single argument:

it = step1(it)  # take one step along the innermost dimension
it = step2(it)  # take one step along the second dimension, stepping back the first
value = it[]    # read value at the current position
it[] = value    # write value to the current position

There could be different implementations for different array types, and perhaps for special cases, such as an Array that can be iterated over purely by linear indexing.

Then BitArray could supply its own NDIterator that would know how to do the broadcasting efficiently. I'm not sure if there is something simpler that could be done for BitArrays in the meantime.

@toivoh
Contributor Author

toivoh commented Jun 9, 2013

Looking at the BitArray implementation, linear indexing seems to be as efficient as one could hope, so I can probably start off by just plugging in BitArrays as if they were Array{Bool}. That would allow BitArrays to be used as inputs and outputs, and also StridedArrays with a BitArray as parent.

@bsxfan

bsxfan commented Jun 9, 2013

@toivoh Here are two feature requests for your broadcasting tools:

  1. Provide a variant of broadcast!() where the destination is updated rather than just overwritten, with behaviour similar to some of the BLAS functions, so that one can efficiently do calculations like D = D + a .+ b or D = D + a .* b. If it makes sense, you could even delegate some suitable cases to BLAS.
  2. Deprecate scale, because broadcasting can now do the same thing. The only reason to use scale now that we have broadcasting is that, for suitable argument types, scale calls into BLAS, with the hope that that may be faster. If you would automatically call into BLAS for suitable arguments, I think scale could be retired.

@toivoh
Contributor Author

toivoh commented Jun 9, 2013

You can already do this, e.g.:

broadcast!(+, D, D, a, b)                   # for D = D + a .+ b
broadcast!((d,a,b)->(d + a*b), D, D, a, b)  # for D = D + a .* b, not sure if the kernel will be inlined as written

It's perfectly safe to use the destination as a source, but of course there will be some efficiency loss. I could provide a version of broadcast that reuses the destination as a first argument, but I'm not sure what I should call it. Eventually, I would like to make the broadcast machinery less magical so that people can use it for things like this themselves.

I must admit that I am quite unfamiliar with BLAS, so I'm not sure in which kinds of cases it could be used. If someone who is more familiar wants to special-case some broadcast operations to call into BLAS, I could provide some guidance for the broadcast side of things.
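One possible spelling of that reuse-the-destination variant, building on the call above (the name broadcast_update! is hypothetical):

broadcast_update!(f, dest, As...) = broadcast!(f, dest, dest, As...)
broadcast_update!(+, D, a, b)      # computes D = D + a .+ b in place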

@bsxfan

bsxfan commented Jun 12, 2013

@toivoh I've just compared your suggested broadcasting update solution, broadcast!((d,a,b)->(d + a*b), D, D, a, b) to a hand-coded loop and to ger! in BLAS. Unfortunately the broadcast solution is much slower:

julia> D = randn(500,1000); x = randn(500); y = randn(1000);
julia> @elapsedloop 100 broadcast!((d,x,y)->d+x*y, D, x, reshape(y,1,length(y)))
21.070778291000003

Now define: myger(D,x,y) = ((m,n)=size(D);for j=1:n, i=1:m D[i,j] += x[i]*y[j] end; D)

julia> @elapsedloop 100 myger(D,x,y)
0.27179146300000007

And now call into BLAS:

julia> @elapsedloop 100 BlasX.ger!(D,1.0,x,y)
0.028453931000000016


@pao
Member

pao commented Jun 12, 2013

This is probably getting killed by the use of a lambda in broadcast!().

@bsxfan

bsxfan commented Jun 12, 2013

One can achieve a faster broadcast-based solution like this:

julia> @elapsedloop 100 D += x.*reshape(y,1,length(y))
1.635495283000001

I think the main reason why this is still slower than the handcoded loop is that the latter avoids creating a temporary 500 by 1000 matrix to store the RHS.

@dmbates
Member

dmbates commented Jun 12, 2013

@ViralBShah, @andreasnoackjensen I think the ger! function from https://github.com/bsxfan/ad4julia/blob/master/BlasX.jl should be added to Base.LinAlg.BLAS (and a syr! function too, even though it is not in the ad4julia package). Also, perhaps the sympack, symunpack and spr functions from the ad4julia package. One small nit regarding the current implementation is the use of assert for catching invalid arguments. I seem to remember @JeffBezanson saying that assert should be used to signal logic errors, not invalid arguments, but I would be happy to be wrong about that.

@lindahua
Contributor

+1 for adding ger and spr.

@toivoh
Contributor Author

toivoh commented Jun 12, 2013

@bsxfan: can you try

op(d,x,y) = d+x*y
broadcast!(op, ...)

instead of broadcast!((d,x,y)->d+x*y, ...)? Hopefully op will be inlined into the inner loop. Perhaps it can come close to the hand-written loop.

@JeffBezanson
Member

You are correct about assert. An assertion failure means there is a bug in the code containing the assert, not in the caller.

@bsxfan

bsxfan commented Jun 13, 2013

@toivoh Unfortunately, predeclaring op(d,x,y) = d+x*y as you suggest does not help.

@lindahua
Contributor

For this case, unless op is completely inlined, you will get a performance hit. A function call is generally far heavier than just doing one addition and one multiplication, which, in theory, can be done in a single cycle.

@lindahua
Contributor

After several conference deadlines, I now have some time to spend on Julia. I am considering revisiting Devectorize.jl to let it support broadcasting, so that this can be written as

@devec d + x .* y

@toivoh
Contributor Author

toivoh commented Jun 14, 2013

I'm still surprised that it doesn't seem to get inlined. I could probably provide macro versions of broadcast and broadcast! that would make sure to inline the supplied expression.

@lindahua: Do you think that it would make sense to collaborate on a common back end to generate the broadcast loops? I still want to extend the machinery in broadcast.jl to handle more kinds of arrays efficiently, and it would seem a shame to duplicate the effort. But I'm not completely aware of the needs of Devectorize.jl in this respect.
