Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: more efficient ∇getindex #962

Closed
wants to merge 6 commits into from
Closed

Conversation

mcabbott
Copy link
Member

@mcabbott mcabbott commented May 4, 2021

This is a small improvement to the gradient for getindex in the easiest case, by returning a very simple one-nonzero-element array.

It also changes how accum works: adding two of these will produce an Array, but accumulating the next one may as well mutate that. Maybe it is safe to do that in general, I am not 100% sure? Maybe CI will tell us? [It is not, and it didn't, see below]

Note that it will never produce a SparseArray. My take is that if you call getindex once then this OneElement is fine, and if you call it on every element in the array, then accumulating in an Array is optimal; in-between cases where you call it on 1% of the elements seem likely to be very rare, and sparse arrays add all sorts of complication. [This won't apply anymore]

And, are there CuArray concerns? ∇getindex is careful about making a similar zero to write into, but this change is only for scalar indexing, which is (usually) disallowed anyway.

This doesn't give a huge speedup, but it does save a lot of memory. I'm not too sure what the "10089 allocations" here actually are, maybe there is some further trick to remove them? (Might also be useful for someone to time this on a different computer too -- these are M1 mac, which has unusual memory, and Julia 1.7 native.)

f4(x) = sum([x[i]^2 for i in eachindex(x)])
@btime gradient(f4, $(collect(1:1000)))
#  1.997 ms (12829 allocations: 16.05 MiB)  originally
#  1.502 ms (11087 allocations: 8.46 MiB)   with OneElement alone
#  1.520 ms (11830 allocations: 8.31 MiB)   with accum(x::DenseArray,...) alone
#    992.917 μs (10089 allocations: 738.44 KiB)  this PR

@btime gradient(x -> sum(x * x'), $(rand(10,1000)));  # no getindex
#  79.625 μs (30 allocations: 237.06 KiB)  original
#  79.334 μs (28 allocations: 158.86 KiB)  with accum

On the example from #644 -- discussion there had other comparable proposals:

function _evalpoly(x, p)
    N = length(p)
    ex = p[end]
    for i in N-1:-1:1
        ex = muladd(x, ex, p[i])
    end
    ex
end
x, p = rand(), randn(10000);

@btime _evalpoly(x, p);
# 12.416 μs (1 allocation: 16 bytes)

@btime Zygote.gradient(_evalpoly, x, p);
# 170.356 ms (730104 allocations: 1.53 GiB)     original
# 139.957 ms (710106 allocations: 800.45 MiB)   with accum(x::DenseArray,...) alone
# 136.468 ms (740104 allocations: 801.90 MiB)   with OneElement alone
#  87.931 ms (720108 allocations: 38.35 MiB)    with both
#  59.191 ms (720108 allocations: 38.35 MiB)    with accum(..., y::OneElement) too

Xref also #365 (getindex) and #905 (accumulation).

@oxinabox
Copy link
Member

oxinabox commented May 4, 2021

in the longer term this will be resolved once we support ChainRules inplacable thunks.
Til then this might be a good idea.

It also changes how accum works: adding two of these will produce an Array, but accumulating the next one may as well mutate that. Maybe it is safe to do that in general, I am not 100% sure? Maybe CI will tell us?

We can define a custom rule for accum that changes the primal so that it doesn't mutate if you are doing nested AD.

@oxinabox
Copy link
Member

oxinabox commented May 4, 2021

On my computer

julia> versioninfo()
Julia Version 1.6.2-pre.0
Commit dd122918ce* (2021-04-23 21:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin20.3.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

here are the benchmarks

@btime gradient(f4, $(collect(1:1000)))
  2.027 ms (7068 allocations: 15.77 MiB)    current master
  90.018 μs (2073 allocations: 307.52 KiB)  this PR before inplace was removed
  682.171 μs (3071 allocations: 8.04 MiB)    this PR after inplace was removed

@btime gradient(x -> sum(x * x'), $(rand(10,1000)));  # no getindex
  108.439 μs (29 allocations: 237.16 KiB)   current master
  105.164 μs (27 allocations: 158.95 KiB)   this PR before inplace was removed
  108.524 μs (29 allocations: 237.16 KiB)   this PR after inplace was removed

julia> @btime _evalpoly(x, p);
  9.589 μs (1 allocation: 16 bytes)

julia> @btime Zygote.gradient(_evalpoly, x, p);
  942.210 ms (680106 allocations: 1.52 GiB)         current master
  67.668 ms (640110 allocations: 29.20 MiB)         this PR before inplace was removed
  564.833 ms (660106 allocations: 792.75 MiB)    this PR after inplace was removed

Notable differences from yours in the top post:
All allocations are lower (especially in the collect example)
but the allocations are much lower still for this PR.

This PR makes the collect example waaaay faster than what you saw: an order of magnitude faster.

The current maseter timing fro the eval poly example is waaay worse for me.


This is great. We should have been doing this for ages.
This is going to be the single biggest quality of life improvement in quite a while.

@mcabbott
Copy link
Member Author

mcabbott commented May 4, 2021

Those times look great, thanks. I should try on 1.6 + rosetta too, maybe there are other bugs involved.

We can define a custom rule for accum that changes the primal so that it doesn't mutate if you are doing nested AD.

Good point, hadn't thought about nested AD at all. There appear to be no tests, so perhaps that can be looked at in another PR. Nested getindex specifically might also be easier in this case, xref #944 which I have not read closely.

@DhairyaLGandhi
Copy link
Member

Is that correct in general? Aso, could we replicate this with only instead?

src/lib/lib.jl Outdated Show resolved Hide resolved
@oxinabox
Copy link
Member

oxinabox commented May 4, 2021

Is that correct in general?

Is what correct in general?
This looks like a correct inplementation of a n-dimentional one-hot array.
And inplace accumulation of gradients is safe and correct in general due to the face that AD is logically linear (which means each differential is only ever used once so safe to mutate since it is not going to be used anywhere ekse at the point you are calling accum on it, at least during the pullback (less safe if it is something the user is inputting though, since then it might be used more than once.)

Aso, could we replicate this with only instead?

Which bit?

@willtebbutt
Copy link
Member

willtebbutt commented May 5, 2021

Should we be concerned about race conditions?

@DhairyaLGandhi and @MikeInnes know more about this than I do (I know basically nothing), but I would imagine that if you have parallel code somewhere, this modification could be problematic if Zygote winds up trying to accumulate to the same location at the same time (I'm not sure whether this can ever happen in practice)? My impression is that Zygote currently works fine with threads / parallelism generally -- would this change that?

@oxinabox
Copy link
Member

oxinabox commented May 5, 2021

Should we be concerned about race conditions?

I don't think so.
From what I can see Task related things never call accum directly (most pullbacks never call accum). They return something that gets accumulated outside of that.
I suspect also there is something about structure of AD programs that also mean we don't have to worry about it.
Parallelism in reverse must follow parallelism in forward but backwards, which should mean that primals are free from race conditions in forward must have differentials that are free from race conditions in reverse (thoguh that might not apply since we are turning a copy into a mutation).

I think we can assume it is safe and wait for evidence to the contrary (though debugging race conditions sucks)

src/lib/lib.jl Outdated Show resolved Hide resolved
@mcabbott
Copy link
Member Author

mcabbott commented May 8, 2021

Times with rosetta as promised:

julia> @btime gradient(f4, $(collect(1:1000)));
  1.033 ms (7068 allocations: 15.77 MiB)      # tagged
     96.625 μs (2074 allocations: 472.20 KiB) # this PR

julia> @btime gradient(x -> sum(x * x'), $(rand(10,1000)));  # no getindex
  101.084 μs (29 allocations: 237.16 KiB)  # tagged
   96.000 μs (27 allocations: 159.09 KiB)  # this PR

julia> @btime Zygote.gradient(_evalpoly, x, p);
  275.883 ms (680106 allocations: 1.52 GiB)    # tagged
   59.369 ms (640110 allocations: 35.76 MiB)   # this PR

julia> versioninfo()
Julia Version 1.6.0
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Apple M1

Also some here: #905 (comment)

Edit -- without mutation, same computer:

julia> @btime gradient(f4, $(collect(1:1000)));
  433.084 μs (3072 allocations: 8.20 MiB)

julia> @btime Zygote.gradient(_evalpoly, x, p);
  141.938 ms (660106 allocations: 799.31 MiB)

@mcabbott
Copy link
Member Author

mcabbott commented May 9, 2021

It turns out that #905 is the way to produce sort of bug I was a bit worried about. Something is doing Δ -> (Δ, Δ, Δ), which results in accum(x, y, zs...) = accum(accum(x, y), zs...) running with x === zs[1], which leads to:

julia> gradient(x -> sum(abs, x + x + x), ones(2))
([4.0, 4.0],)

julia> ForwardDiff.gradient(x -> sum(abs, x + x + x), ones(2))
2-element Vector{Float64}:
 3.0
 3.0

@oxinabox
Copy link
Member

oxinabox commented May 9, 2021

Yeah fair. Good catch.
I think this would be fixed by having a restriction on pullbacks that says you can't return aliased differentials for non-aliased primal inputs.
Which for practical implementations means that if you are returning more than one instance of the input to your pullback you need to make copies.

Presumably this came out of if a +.

Enforcing this rule against aliasing would add some copies but not as many as inplace accumulation would solve.

I think there might be something in literature about this and a dup operation?

@mcabbott
Copy link
Member Author

mcabbott commented May 9, 2021

Yes that might be enough. I don't know the literature (obviously!) but it's hard to then picture it going wrong. Would there be cases where this hurts, as well as ones where it helps overall?

This policy of no aliasing would probably be necessary for any of the in-place updates planned in ChainRules, too, right? Not just accum.

If that were to be the policy, you could surely hack in a test to check for aliasing, here, I mean pirate something during testing. ChainRules could easily do that too.

Just making the update vararg like x .= accum.(x, ys...) in fact avoids the problem here, but obviously won't in general. Doing that without mutation seems like a good idea, saves a lot of memory in #905. Maybe less in real cases, but e.g. gradient(x -> x[1] + x[2] + x[3], ones(3)) does hit that method.

@mcabbott
Copy link
Member Author

OK, I think I see how to ensure broadcasting takes preventative copies. For times on this page it only costs 10% off the fastest unsafe way, but in #905 (comment) it saves no memory, and is slower than master.

@oxinabox
Copy link
Member

I have updated my timings in #962 (comment)
Which still leaves this PR as a ~2-3x improvement over current master.
I say we should merge this one as is and come back to workout how we make inplace accumulation safe in a follow up.


Yes that might be enough. I don't know the literature (obviously!) but it's hard to then picture it going wrong.

Yeah it is hard to picture it going wrong, and putting differential memory addresses into acorrespondence in quanity with primal feels right.

I don't know the literature that well either, not really.
The literature is a utter mess of stuff that appears super niche to fortran and stuff that is super theoretical.
So much of it is older than closures and talking about AD without using the closure representaion of pullbacks is kinda suffering.
A lot of the more modern literature isn't in literature at all, but in like github issues and twitter threads.
but I recall dup being mentioned in a talk by Tom Minka on formulating AD as message passing.
but not well, so i could be getting confused.
This should show up somewhere in some intersection of linear types, message passing, and effects.
idk though maybe people talking about that don't believe in mutation?

Would there be cases where this hurts, as well as ones where it helps overall?

Yes, cases where you don't do anything with the output that you just copied.
If you don't any up accumulating inplace against it then the copy has been done for no reason.
Ideally we would have copy-on-write arrays to fix this.

This policy of no aliasing would probably be necessary for any of the in-place updates planned in ChainRules, too, right? > Not just accum.

I think you mean not just getindex, all the in-lace updates in ChainRules are pretty much for accum,
and yes it would be needed.

If that were to be the policy, you could surely hack in a test to check for aliasing, here, I mean pirate something during testing. ChainRules could easily do that too.

yeah. We could hack that into wrap_chainrules_output.
and something like that might be what we should do, rather than the rule author having to do it?
I want to think about this more still

@mcabbott
Copy link
Member Author

mcabbott commented May 10, 2021

Re this PR as it stands, one comment is that it does a bit less to preserve the types of arrays, e.g.:

julia> using StaticArrays

julia> gradient(x -> x[1], SA[1,2,3])[1]
3-element Zygote.OneElement{Int64, 1, Tuple{Int64}, Tuple{SOneTo{3}}} with indices SOneTo(3):
 1
 0
 0

julia> gradient(x -> x[1] + x[2], SA[1,2,3])[1]
3-element SizedVector{3, Int64, Vector{Int64}} with indices SOneTo(3):
 1
 1
 0

compared to master:

julia> gradient(x -> x[1] + x[2], SA[1,2,3])[1]   # better than a SizedVector on similar case
3-element MVector{3, Int64} with indices SOneTo(3):
 1
 1
 0

julia> gradient(x -> sum(x .+ x'), SA[1,2,3])[1]  # but many other cases don't preserve this
3×1 Matrix{Int64}:
 6
 6
 6

julia> gradient(x -> sum(abs, x .+ x'), SA[1,2,3])[1]  # and many simple things don't work at all
ERROR: MethodError: no method matching _mapfoldl(::typeof(identity), ::typeof(+), ::Tuple{Int64, Int64}, ::StaticArrays._InitialValue, ::Size{(3, 3)}, ::SMatrix{3, 3, Int64, 9})

I wondered a bit whether OneElement should capture the type of the array, but the method of similar I think you'd want to use when turning back into a dense array doesn't seem to exist, surprisingly?

julia> similar(typeof([1,2]), axes([1,2]))
2-element Vector{Int64}:
 12
 57

julia> similar(typeof(SA[1,2]), axes(SA[1,2]))
2-element MVector{2, Int64} with indices SOneTo(2):
 4585456624
          3

julia> similar(typeof([1,2]), Float64, axes([1,2]))  # with eltype
ERROR: MethodError: no method matching similar(::Type{Vector{Int64}}, ::Type{Float64}, ::Tuple{Base.OneTo{Int64}})

julia> similar([1,2], Float64, axes([1,2]))  # with an instance it's fine
2-element Vector{Float64}:
 0.0
 2.2655161774e-314

julia> similar(typeof(SA[1,2]), Float64, Size(SA[1,2]))  # this exists, but not for Base types
2-element MVector{2, Float64} with indices SOneTo(2):
 2.162767274e-314
 0.0

@oxinabox
Copy link
Member

Are you ready for this to be merged?
Given all existing tests pass, and new ones, this feels safe.

Copy link
Member Author

@mcabbott mcabbott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's do this.

Failure on master ought to be fixed by FluxML/IRTools.jl#86

(Minor aesthetic changes in the review.)

src/lib/array.jl Show resolved Hide resolved
src/lib/array.jl Outdated Show resolved Hide resolved
mcabbott pushed a commit to mcabbott/Zygote.jl that referenced this pull request May 25, 2021
RFC: more efficient `∇getindex`
@mcabbott mcabbott mentioned this pull request May 25, 2021
@mcabbott
Copy link
Member Author

Oh I see. I think we both hit the merge button at the same time, yours succeeded & mine gave a confusing error... which I tried to rebase to fix... but this is merged & can be closed.

@mcabbott mcabbott closed this May 25, 2021
marcoq added a commit to marcoq/Surrogates.jl that referenced this pull request Jun 9, 2021
marcoq added a commit to marcoq/Surrogates.jl that referenced this pull request Jun 9, 2021
@DhairyaLGandhi
Copy link
Member

Most of the work for static arrays is pretty orthogonal to the concerns of accumulation. It seems odd to make future work harder.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants