
Broadcast views to avoid allocations. #353
Closed · wants to merge 2 commits

Conversation

maleadt (Contributor) commented Aug 31, 2018

KristofferC (Collaborator) commented:
Can the zero allocations be tested?

jrevels (Member) commented Aug 31, 2018

This unfortunately doesn't seem to do the trick.

Here's a simple benchmark (it only tests one of these seed! methods, but should be representative):

using ForwardDiff, BenchmarkTools
x = rand(1000);
cfg = ForwardDiff.GradientConfig(nothing, x);
@btime ForwardDiff.seed!($(cfg.duals), $x, $(cfg.seeds));

Before #351:

julia> @btime ForwardDiff.seed!($(cfg.duals), $x, $(cfg.seeds));
  12.895 ns (0 allocations: 0 bytes)

After #351:

julia> @btime ForwardDiff.seed!($(cfg.duals), $x, $(cfg.seeds));
  1.078 μs (16 allocations: 3.80 KiB)

After this PR:

julia> @btime ForwardDiff.seed!($(cfg.duals), $x, $(cfg.seeds));
  1.032 μs (16 allocations: 3.69 KiB)

Unfortunately, I think the best thing to do at the moment is to revert (EDIT: this part of) #351 so we can tag ForwardDiff v0.9.0 (which is needed for people to get a version of ForwardDiff working on Julia v1.0). Then we can take our time trying to get this working without regressions.

Sorry again, I really should have caught this when reviewing #351 originally...

maleadt (Contributor, author) commented Sep 6, 2018

It turns out the overhead came from the tuple slicing. Avoiding that gets the timings most of the way back down, but there's still some overhead from the broadcast + view allocation:

seed!(duals::AbstractArray{Dual{T,V,N},N1} where N1, x, seed::Partials{N,V}) where {T, V, N} in Main.Loop at /tmp/test.jl:9
  4.115 μs (1 allocation: 96 bytes)
seed!(duals::AbstractArray{Dual{T,V,N},N1} where N1, x, seed::Partials{N,V}) where {T, V, N} in Base.Broadcast at /tmp/test.jl:50
  2.912 μs (1 allocation: 96 bytes)

seed!(duals::AbstractArray{Dual{T,V,N},N1} where N1, x, seeds::Tuple{Vararg{Partials{N,V},N}}) where {T, V, N} in Main.Loop at /tmp/test.jl:17
  81.289 ns (1 allocation: 896 bytes)
seed!(duals::AbstractArray{Dual{T,V,N},N1} where N1, x, seeds::Tuple{Vararg{Partials{N,V},N}}) where {T, V, N} in Base.Broadcast at /tmp/test.jl:56
  126.638 ns (3 allocations: 992 bytes)

seed!(duals::AbstractArray{Dual{T,V,N},N1} where N1, x, index, seed::Partials{N,V}) where {T, V, N} in Main.Loop at /tmp/test.jl:25
  63.850 ns (1 allocation: 96 bytes)
seed!(duals::AbstractArray{Dual{T,V,N},N1} where N1, x, index, seed::Partials{N,V}) where {T, V, N} in Base.Broadcast at /tmp/test.jl:62
  89.895 ns (3 allocations: 192 bytes)

seed!(duals::AbstractArray{Dual{T,V,N},N1} where N1, x, index, seeds::Tuple{Vararg{Partials{N,V},N}}) where {T, V, N} in Main.Loop at /tmp/test.jl:35
  84.639 ns (1 allocation: 896 bytes)
seed!(duals::AbstractArray{Dual{T,V,N},N1} where N1, x, index, seeds::Tuple{Vararg{Partials{N,V},N}}) where {T, V, N} in Base.Broadcast at /tmp/test.jl:70
  121.852 ns (3 allocations: 992 bytes)

seed!(duals::AbstractArray{Dual{T,V,N},N1} where N1, x, index, seeds::Tuple{Vararg{Partials{N,V},N}}, chunksize) where {T, V, N} in Main.Loop at /tmp/test.jl:35
  64.874 ns (1 allocation: 896 bytes)
seed!(duals::AbstractArray{Dual{T,V,N},N1} where N1, x, index, seeds::Tuple{Vararg{Partials{N,V},N}}, chunksize) where {T, V, N} in Base.Broadcast at /tmp/test.jl:70
  135.899 ns (3 allocations: 992 bytes)

https://gist.github.com/maleadt/d794ff3de39c4e5a94e98370288fcf09
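For reference, the two flavors being compared are roughly the following (a simplified sketch with illustrative names like seed_loop! and seed_bcast!, not the exact code from the gist):

using ForwardDiff: Dual, Partials

# Scalar loop: writes each Dual in place, no temporaries.
function seed_loop!(duals::AbstractArray{Dual{T,V,N}}, x,
                    seeds::NTuple{N,Partials{N,V}}) where {T,V,N}
    for i in 1:N
        duals[i] = Dual{T,V,N}(x[i], seeds[i])
    end
    return duals
end

# Broadcast + view: fuses the seeding into one kernel (which is what GPU arrays
# need), but on Julia v1.0 the view wrapper and broadcast machinery are
# heap-allocated, which is where the extra allocations above come from.
function seed_bcast!(duals::AbstractArray{Dual{T,V,N}}, x,
                     seeds::NTuple{N,Partials{N,V}}) where {T,V,N}
    inds = 1:N
    duals[inds] .= Dual{T,V,N}.(view(x, inds), seeds)
    return duals
end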

How performance sensitive is seed!?

jrevels (Member) commented Sep 10, 2018

Nice!

How performance sensitive is seed!?

It gets called in the "inner loop" of the perturbation chunking algorithms behind API functions such as ForwardDiff.gradient!. Timing-wise, the performance impact of seed! depends heavily on the target function and the input/output dimensions. We unfortunately can't let it allocate, since doing so breaks an existing ForwardDiff use case: repeated/rapid computation of small Jacobians, desirable for e.g. real-time control/optimization where memory is limited.
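To make that constraint concrete, a minimal benchmark of the small-gradient use case might look like this (f, x, and the dimensions are illustrative; exact numbers vary by machine):

using ForwardDiff, BenchmarkTools

f(x) = sum(abs2, x)            # stand-in for a small control/optimization objective
x = rand(6)                    # small input, as in the small-Jacobian use case
out = similar(x)
cfg = ForwardDiff.GradientConfig(f, x)

# With a preallocated output and config, the whole call should report
# 0 allocations, but only if seed! itself does not allocate.
@btime ForwardDiff.gradient!($out, $f, $x, $cfg);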

maleadt (Contributor, author) commented Sep 10, 2018

We unfortunately can't let it allocate

Then this can't work until JuliaLang/julia#14955 is resolved.

jrevels (Member) commented Sep 10, 2018

Would it be possible to specialize this implementation on CuArrays' types so they can at least get the desired behavior? The big downside there would obviously be that the overloads would need to live in a separate companion package (unless we'd make ForwardDiff depend on CuArrays, or CuArrays depend on ForwardDiff, neither of which is desirable).

maleadt (Contributor, author) commented Sep 10, 2018

That seems like needlessly duplicating functionality that's actually pretty generic.
What about a HasFastViews trait?
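For concreteness, a Holy-trait style sketch of what that could look like (HasFastViews does not exist anywhere yet; all names here are hypothetical, and seed_loop!/seed_bcast! refer to the sketch in the earlier comment):

struct HasFastViews end
struct HasSlowViews end

# Default assumption: constructing a view of this array type may allocate.
viewspeed(::Type{<:AbstractArray}) = HasSlowViews()
# A GPU array package could then opt in for its own type, e.g.
# viewspeed(::Type{<:CuArray}) = HasFastViews()

# seed! would pick an implementation by dispatching on the trait of the
# destination array, falling back to the scalar loop everywhere else:
seed!(duals, x, seeds) = seed!(viewspeed(typeof(duals)), duals, x, seeds)
seed!(::HasFastViews, duals, x, seeds) = seed_bcast!(duals, x, seeds)  # broadcast + view path
seed!(::HasSlowViews, duals, x, seeds) = seed_loop!(duals, x, seeds)   # allocation-free loop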

jrevels (Member) commented Sep 10, 2018

What about a HasFastViews trait?

Sounds good to me. That still requires a companion package though, right? Or do you mean a trait like this in Base? If the latter, we could avoid having to wait for a 1.1 release by just making a lightweight package that ForwardDiff/CuArrays/etc. could all share as a dependency (e.g. ExtraTraits.jl).

charleskawczynski (Contributor) commented:
Now that JuliaLang/julia#14955 is resolved, can we revive this work? It seems that the solution in #406 yields zero allocations:

julia> using ForwardDiff, BenchmarkTools
[ Info: Precompiling ForwardDiff [f6369f11-7733-5829-9624-2563aa707210]

julia> x = rand(1000);

julia> cfg = ForwardDiff.GradientConfig(nothing, x);

julia> @btime ForwardDiff.seed!($(cfg.duals), $x, $(cfg.seeds));
  55.533 ns (0 allocations: 0 bytes)

charleskawczynski (Contributor) commented:
I think we can close this now that #472 has merged.
