
Mark some arrays as safe for accumulation #578

Closed
wants to merge 2 commits into from

Conversation

mcabbott
Member

@mcabbott mcabbott commented Aug 25, 2022

As far as I know InplaceableThunk never does anything at present, since gradients are accumulated with +, and this un-thunks before adding.

This PR marks some arrays as safe to mutate, and then mutates them when adding another array, or an InplaceableThunk. The only arrays it marks are ones produced by adding a thunk to something, as this is sure to be a new array. This is no help for an accumulation of 2 terms, but helps with the 3rd and after, and should make accumulating N terms allocate O(1) new arrays instead of O(N).

#539 explores a different rule, assuming any @thunk expands to something safe to mutate. That helps with the 2nd Thunk. Both could be done.

The marker it uses is a new AbstractThunk. This is nice because consumers of gradients must already know to call unthunk if they need an array.
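Roughly, the mechanism is this (a sketch with stand-in names, not the PR's actual code):

```julia
# Sketch of the marker idea: wrap an array known to be freshly allocated,
# so that later accumulation may mutate it in place instead of allocating.
abstract type MaybeThunk end            # stand-in for AbstractThunk

struct AccumThunk{T} <: MaybeThunk      # "this value is safe to mutate"
    value::T
end
unthunk(t::AccumThunk) = t.value
unthunk(x) = x

# Adding two plain arrays allocates a fresh result: mark it.
accum(x::AbstractArray, y::AbstractArray) = AccumThunk(x + y)
# Once marked, accumulate in place: the 3rd and later terms allocate nothing.
accum(x::AccumThunk, y) = AccumThunk(x.value .+= unthunk(y))

g = accum(accum([1.0, 2.0], [3.0, 4.0]), [10.0, 10.0])
unthunk(g)     # only the first + allocated a new array
```

Since the marker is itself a thunk, consumers who already call unthunk get the plain array back as before.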

But in fact there's a small zoo of functions which automatically un-thunk. I'm not sure these are a great idea, as using them without thinking means you are likely to un-thunk twice or N times by accident. And using them on purpose requires memorising what's in the zoo and what's not. This PR adds one more problem: If you were relying on + to un-thunk for you, it will now assume you are accumulating, mutate, and wrap the answer in another thunk.

The other snag is that a rule like that for z = x+y whose pullback passes the same dz to both dx and dy risks wrong answers if it passes this new marker through. (Whereas earlier Thunks were safe.) The rrule in CR does not appear to have this problem, as reshape automatically unthunks (and will be called N times). Rules for .+ and maybe .- may need care... I believe CR's will simply fail here since ndims(@thunk [1,2]) is an error. Perhaps no rrule should ever transmit a thunk unchanged, and this could be automated somewhere?

Maybe this isn't quite safe enough.

What was the plan for InplaceableThunk? Was anything written when it was designed?

@mcabbott mcabbott added the inplace accumulation for things relating to inplace accumulation of gradients label Aug 25, 2022
Comment on lines +322 to +332
julia> @btime Diffractor.gradient(x -> _getindex(x,1), $(rand(128 * 100)));
min 1.012 μs, mean 11.103 μs (2 allocations, 100.05 KiB)

julia> @btime Diffractor.gradient(x -> _getindex(x,1)+_getindex(x,2), $(rand(128 * 100)));
min 7.625 μs, mean 46.941 μs (6 allocations, 300.14 KiB) # unthunk, unthunk, add -- unchanged

julia> @btime Diffractor.gradient(x -> _getindex(x,1)+_getindex(x,2)+_getindex(x,3), $(rand(128 * 100)));
min 16.791 μs, mean 67.720 μs (10 allocations, 500.23 KiB) # before
min 8.625 μs, mean 44.642 μs (6 allocations, 300.14 KiB) # after

min 1.036 μs, mean 12.684 μs (2 allocations, 100.05 KiB) # with stronger assumption, overwrite any thunk
Member Author


Some benchmarks.

Comment on lines 65 to 66
Before ChainRulesCore 1.16, it would guess `true` for most wrappers based on `parent`,
but this is not safe, e.g. it will lead to an error with ReadOnlyArrays.jl.
Member Author


This branch is on top of #577, mostly to make tests pass.

@codecov-commenter

codecov-commenter commented Aug 25, 2022

Codecov Report

Base: 93.11% // Head: 93.11% // No change to project coverage 👍

Coverage data is based on head (f6123ee) compared to base (f6123ee).
Patch has no changes to coverable lines.

❗ Current head f6123ee differs from pull request most recent head 4d48df8. Consider uploading reports for the commit 4d48df8 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #578   +/-   ##
=======================================
  Coverage   93.11%   93.11%           
=======================================
  Files          15       15           
  Lines         901      901           
=======================================
  Hits          839      839           
  Misses         62       62           



@mcabbott
Member Author

BTW, status is that I think it's probably too unsafe to load this onto +.

It would be better to have a dedicated function for accumulation, one that is not caught up in the sloppy pattern of "maybe this will unthunk automatically". Diffractor.accum is such a function, and attaching this behaviour to it would be better. Possibly CRC should own this function.

@ToucheSir
Contributor

How would such a function be distinct from add!!? Is that interface not a good fit for https://github.com/JuliaDiff/ChainRulesCore.jl/pull/578/files#diff-e87adbe51202ce0c86d1adf5ec04bc9857f26830dd7a634aa3b1b13cc15b967bR162-R183?

@mcabbott
Member Author

mcabbott commented Sep 4, 2022

Whatever else it does, I think add!!(x::Array, y::Array) will always mutate x. I think this isn't acceptable for accum as, right now, we do not demand that rrules return something known to be safe to mutate.

Few are actually unsafe, essentially just +, - & broadcasted versions. So perhaps you could fix those and then make this a demand. I'm a bit scared that some other rules elsewhere will break this, e.g. by returning a Tangent containing within it some arrays which alias each other. The rule for + (and any others) would either need to make preventative copies always, or else wrap in some marker like ReadOnlyArray. Either way an explicit opt-out.

The safer approach seems to be opt-in. The result of dx + dx is certainly safe. #539 further regards @thunk as an opt-in. We could encourage @thunk A in rules for any array A you know to be safe (even when the thunk delays no computation.)
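The opt-in could look like this in a rule (a simplified, hypothetical rrule, not one from ChainRules; the gradient array is freshly allocated inside the thunk, so under #539's convention an accumulator may treat it as safe to mutate):

```julia
# Hypothetical rule showing the proposed opt-in: wrap a known-fresh array
# in @thunk, even though almost no computation is actually delayed.
using ChainRulesCore

function ChainRulesCore.rrule(::typeof(sum), x::AbstractVector{<:Real})
    # fill!(similar(x), dy) is a brand-new array, aliasing nothing
    sum_pullback(dy) = (NoTangent(), @thunk(fill!(similar(x), dy)))
    return sum(x), sum_pullback
end
```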

@ToucheSir
Contributor

Thanks for the explanation. Is there something that can be done on the rule level to annotate one as being safe like +? I know the word "linear" has been thrown around a few times, but unsure if that applies here. Or is this not something that can be captured at a rule boundary?

@mcabbott
Member Author

mcabbott commented Sep 4, 2022

Unsafe like +, I'd say. In that it returns gradients for two arguments which may well alias each other.

Done manually? You could wrap both in a ReadOnlyArray. But forgetting this means silent wrong answers.

Done automatically? I messed around in FluxML/Zygote.jl#981 with making the Z-to-CR bridge look for repeated pointers... but I don't know how well this can work. Someone is going to have a rule in the wild which does Tangent{A}(; x = arr, y = dy), Tangent{B}(; p = dp, q = arr'), and mutating that arr is not safe.
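For the manual option, the wrapper need not be elaborate; a minimal sketch (ReadOnlyArrays.jl provides a real, complete one) turns accidental mutation of an aliased gradient into a loud error instead of a silent wrong answer:

```julia
# Minimal read-only array wrapper: reads pass through, writes error.
struct ReadOnly{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    data::A
end
Base.size(a::ReadOnly) = size(a.data)
Base.getindex(a::ReadOnly, i::Int...) = a.data[i...]
Base.setindex!(::ReadOnly, v, i::Int...) =
    error("this gradient may alias another; mutating it is not safe")
```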

@oxinabox
Member

add!!(x,y) may or may not mutate x.
It makes no promise as to if it will or will not.

@oxinabox
Member

But yes, there is a bunch of extra thinking that remains to be done as to whether in-place accumulation is ever safe in the presence of aliasing.
I suspect it's important that a tangent is an alias for another tangent only if the primal is also an alias for another primal.
And if you support mutation, that requires an "if and only if" for aliasing.
And yes, that thinking requires thinking about "linear".

@mcabbott
Member Author

I suspect it's important that a tangent is an alias for another tangent only if the primal is also an alias for another primal.

This strikes me as being quite far from the status quo. Flux allows for shared parameters (and dealing with this is 99% of the headache in Optimisers.jl), but their gradients will almost never be aliased, as they are freshly allocated by *. But gradients elsewhere are (at present) allowed to alias, e.g. if the forward pass is z = hcat(v,w) + vcat(r',s') then the four vectors here could all have gradients which are views of dz. (Not sure the hcat rule does in fact make views; perhaps not yet.)

This PR wants to design around that. Assume any array you get from an rrule's pullback aliases others in horrible ways. But as soon as you add two, you know you have made a new array, and can safely mutate it to add the 3rd.

Can this be safe? What are the edge cases?

  • If some rule relies on + to auto-unthunk, then it may expect to get an Array, not another thunk, afterwards. If that leads to an error, OK, it can be changed to unthunk(dy) manually. If the next step is something else which auto-unthunks, then all will be well.

  • If a rule has a trivial pullback, like z = f(x) with dz -> (NoTangent(), dz), then is it OK for the AccumThunk to pass right through? Accumulation of dz is finished by the time this pullback is called, since all operations on z must happen after f is called on the forward pass. Accumulation of dx happens later, if x is also used in some g(x), but that doesn't matter, right?

  • If the pullback duplicates, dz -> (NoTangent(), dz, dz), then accumulation on dx and dy will interfere with each other. E.g. something like prod(x+y) / first(x) will have a second contribution to dx which should not influence dy. Are there any rules besides + which will do this? Not just produce aliasing, but pass an AccumThunk received from some other rule, unchanged, on to several others? (In fact CR's rule for + does unthunk once, to avoid unthunking several times.)
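The hazard in that last bullet, in miniature (a self-contained toy with a stand-in marker, not the PR's code):

```julia
# A pullback which passes the same marked-safe array to two arguments
# lets one accumulation silently corrupt the other.
struct AccumThunk; value; end          # stand-in for this PR's marker
unthunk(t::AccumThunk) = t.value

dz = AccumThunk([1.0, 2.0])            # marked "safe to mutate"
dx, dy = dz, dz                        # pullback for z = x + y passes it through
unthunk(dx) .+= [10.0, 10.0]           # a second contribution to dx...
unthunk(dy)                            # ...also shows up in dy: [11.0, 12.0]
```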

@mcabbott
Member Author

More broadly, am I right to think the "functional" approach of how rrule works is more-or-less inherited from Zygote?

The alternative would be to allocate all buffers on the forward pass. Then there would never be a need for thunks, you would always update the existing array. I think this is the path taken by Tracker, and if I understand right also by Enzyme. (For scalars, thunks save work on inactive paths, but this would presumably track activity.)

@ToucheSir
Contributor

How would preallocation interact with something like Zygote.jacobian which calls the pullback multiple times? I'm not aware of how ReverseDiff and Enzyme handle that, so perhaps it's a non-issue.

@mcabbott
Member Author

You'd have to zero them all before the second backward pass. (After copying the contents to one slice of the output.)
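That pattern, sketched as a toy (the function f and its hand-written pullback! here are stand-ins, not Tracker's or Zygote's API):

```julia
# One preallocated gradient buffer, zeroed before each backward pass,
# with the contents copied into one row of the jacobian afterwards.
f(x) = [sum(x), prod(x)]
function pullback!(dx, x, i)           # accumulate d(f(x)[i])/dx into dx
    i == 1 && (dx .+= 1)
    i == 2 && (dx .+= prod(x) ./ x)
    return dx
end

function jacobian_buffered(x)
    y = f(x)
    J = zeros(length(y), length(x))
    buf = zero(x)                      # allocated once, up front
    for i in eachindex(y)
        fill!(buf, 0)                  # zero before every backward pass
        pullback!(buf, x, i)
        J[i, :] .= buf                 # copy contents into one slice of output
    end
    return J
end
```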

Tracker will give an error if you accidentally call back! twice, but I don't immediately see how its jacobian handles this.

@ToucheSir
Contributor

ToucheSir commented Sep 21, 2022

It appears Tracker has some capability for auto-accumulation? https://github.com/FluxML/Tracker.jl/blob/7ab871f4e4d6410e98bb1d5f527e512eb912aff8/src/back.jl#L48-L58. Surely it's not as easy as always using += instead of = when writing to buffers in a pullback?
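What "always use += when writing to buffers" amounts to, in miniature (a toy, not Tracker's actual code):

```julia
# When x is used twice in the forward pass, both backward contributions
# must accumulate into the same zero-initialised buffer.
x  = [1.0, 2.0, 3.0]
dx = zeros(3)                 # preallocated, zero-initialised
# forward: y = sum(x) + prod(x); backward, each use of x adds its share:
dx .+= 1.0                    # contribution from sum(x)
dx .+= prod(x) ./ x           # contribution from prod(x)
```

The catch, as the question implies, is the bookkeeping: knowing when a buffer must be zeroed (or freshly allocated) rather than accumulated into.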

Labels
inplace accumulation for things relating to inplace accumulation of gradients
4 participants