Use `NNlib.bias_act!` #2327

mcabbott · 2023-09-04T22:58:09Z

Uses FluxML/NNlib.jl#457 to speed things up and save memory, up to half the memory for a forward pass. The largest savings in the gradient will be for large batch sizes and for activation functions like identity, relu, and tanh, whose input need not be stored.
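
To illustrate where the forward-pass saving comes from, here is a minimal sketch in plain Julia. The names `unfused_bias_act`, `inplace_bias_act!` and the local `relu` are hypothetical stand-ins written for this example; this is not NNlib's implementation, only the general idea of overwriting the temporary `W * input` instead of allocating a second array of the same size.

```julia
# Hypothetical helper names, for illustration only (not NNlib's code).

# Out-of-place version: the broadcast allocates a new output array.
unfused_bias_act(σ, x, b) = σ.(x .+ b)

# In-place version: overwrite `x`, assumed to be a temporary that nothing else uses.
function inplace_bias_act!(σ, x, b)
    x .= σ.(x .+ b)
    return x
end

relu(v) = max(v, zero(v))  # local definition, to avoid depending on NNlib

W = randn(Float32, 120, 256)
input = randn(Float32, 256, 128)
b = randn(Float32, 120)

expected = unfused_bias_act(relu, W * input, b)

z = W * input                        # temporary result of the matrix product
y = inplace_bias_act!(relu, z, b)    # reuses z's memory for the activations

y === z        # the output is the same array as the temporary
y ≈ expected   # and holds the same values as the out-of-place version
```

The in-place version is only safe because `z` is a fresh temporary; overwriting an array the caller still needs would change their data.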

julia> lenet = Chain(  # from the model zoo
           Conv((5, 5), 1=>6, relu),
           MaxPool((2, 2)),
           Conv((5, 5), 6=>16, relu),
           MaxPool((2, 2)),
           Flux.flatten,
           Dense(256 => 120, relu),
           Dense(120 => 84, relu), 
           Dense(84 => 10),
       );

julia> img = rand32(28, 28, 1, 128);

julia> @btime $lenet($img);
  min 867.875 μs, mean 1.434 ms (160 allocations, 5.60 MiB)  # before
  min 831.500 μs, mean 1.100 ms (149 allocations, 3.31 MiB)  # after

julia> @btime gradient(m -> sum(abs2, m($img)), $lenet);
  min 7.128 ms, mean 10.280 ms (567 allocations, 14.19 MiB)  # before
  min 6.296 ms, mean 6.930 ms (546 allocations, 9.61 MiB)  # after

Closes #2151 which I forgot about.

Edit, now also with Enzyme, for which there is no special code -- it is able to understand the mutation, and benefits slightly. (Why it's slower than Zygote here I don't know, that's EnzymeAD/Enzyme.jl#2069 which is an orthogonal question.)

julia> @btime $lenet($img);
  min 655.583 μs, mean 1.107 ms (160 allocations, 5.60 MiB)  # before
  min 628.458 μs, mean 836.427 μs (149 allocations, 3.31 MiB)  # after

julia> @btime Flux.gradient((m,x) -> sum(abs2, m(x)), $lenet, $img);  # Zygote, as above, different computer
  min 4.979 ms, mean 6.300 ms (558 allocations, 14.18 MiB)
  min 4.759 ms, mean 5.683 ms (541 allocations, 9.61 MiB)

julia> @btime Enzyme.gradient(Reverse, (m,x) -> sum(abs2, m(x)), $lenet, $img);
  min 8.347 ms, mean 9.752 ms (538 allocations, 15.42 MiB)
  min 7.365 ms, mean 8.791 ms (518 allocations, 10.83 MiB)

ToucheSir · 2023-09-05T03:39:51Z

src/layers/conv.jl

  cdims = conv_dims(c, x)
  xT = _match_eltype(c, x)
-  σ.(conv(xT, c.weight, cdims) .+ conv_reshape_bias(c))
+  NNlib.bias_act!(c.σ, conv(xT, c.weight, cdims), conv_reshape_bias(c))


GPUCompiler doesn't like this when c.σ === sigmoid and a bias is set, https://buildkite.com/julialang/flux-dot-jl/builds/4240#018a62b9-4aa7-4a4a-80fe-661494ca9939/351-799. It's not clear to me why Dense would be fine given it uses the same machinery.

Thanks for digging. Error is on

broadcast!(::ComposedFunction{typeof(sigmoid_fast), typeof(+)}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})

where ComposedFunction comes from here:

https://github.com/FluxML/NNlib.jl/blob/1b30040fabadd41efa0d9dde5841b90f9f85cf2d/src/bias_act.jl#L32-L33

Agree it's odd that Dense doesn't hit the same.
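
For readers unfamiliar with the shape of that call, here is a small CPU-only sketch of broadcasting a `ComposedFunction` in place. `sigmoid_demo` is a local stand-in defined for this example, not NNlib's `sigmoid_fast`, and the array sizes are arbitrary; the sketch only shows what `broadcast!(σ ∘ +, x, x, b)` computes, not the GPU compilation failure.

```julia
sigmoid_demo(v) = 1 / (1 + exp(-v))   # local stand-in, not NNlib's sigmoid_fast

x = randn(Float32, 4, 4, 6, 2)        # shaped like a convolution output
b = randn(Float32, 1, 1, 6, 1)        # bias reshaped to broadcast along the channel dimension

expected = sigmoid_demo.(x .+ b)      # reference result, computed out of place

f = sigmoid_demo ∘ +                  # ComposedFunction: f(u, v) == sigmoid_demo(u + v)
broadcast!(f, x, x, b)                # write the result back into x

f isa ComposedFunction
x ≈ expected
```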

I can replicate this issue with just CUDA.jl and NNlib, so we should consider adding some GPU tests for bias_act! on the NNlib side. Interestingly enough normal sigmoid works just fine, so something is strange with sigmoid_fast in particular.

Have a theory now based on more testing. sigmoid_fast also works if one removes the @inline. I think what's happening is that with the @inline, it's being inlined into the body of ComposedFunction too early and preventing ComposedFunction itself from being inlined because its body is now too complex.

Edit: confirmed with Cthulhu. Not sure what the best course of action here would be. Do we rely heavily on the @inline for CPU perf?

Could always override fast_act for GPU arrays. Uglier but preserves CPU performance if there is some gain there.

> Could always override fast_act for GPU arrays

Good point. Allowing this is precisely why fast_act takes a second argument.
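
A rough sketch of that dispatch pattern, in plain Julia with hypothetical names throughout: `choose_act` plays the role of a function that picks an activation based on both the function and the array it will be applied to, and `DeviceLikeArray` is a toy wrapper standing in for a GPU array type. None of this is NNlib's or CUDA.jl's actual code.

```julia
# Hypothetical names, for illustration only.
precise_act(v) = 1 / (1 + exp(-v))
quick_act(v) = 1 / (1 + exp(-v))      # stand-in for a faster variant; same formula here

# Toy array wrapper standing in for a GPU array type.
struct DeviceLikeArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(a::DeviceLikeArray) = size(a.data)
Base.getindex(a::DeviceLikeArray, i::Int...) = a.data[i...]

# Default: leave the activation unchanged.
choose_act(f, ::AbstractArray) = f
# On ordinary arrays, swap in the faster variant.
choose_act(::typeof(precise_act), ::AbstractArray) = quick_act
# On the device-like array, opt out of the swap; the second argument makes this possible.
choose_act(::typeof(precise_act), ::DeviceLikeArray) = precise_act

cpu = rand(Float32, 3, 3)
dev = DeviceLikeArray(cpu)

choose_act(precise_act, cpu) === quick_act
choose_act(precise_act, dev) === precise_act
choose_act(tanh, dev) === tanh
```

Because the array is part of the signature, the opt-out is a single extra method and the CPU path keeps whatever fast variant it had.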

Unfortunately, it looks like this error still persists :(

Rebased to see how it worked with Enzyme etc, but still didn't get around to fixing this error.

Can save a lot of memory but haven't seen much of a speedup out of it.

Is the error solved?

GPU tests currently pass.

Attempting to explicitly trigger this, by testing some gradients with CUDA and sigmoid, I see no errors & no wrong answers.

julia> using Flux, CUDA

julia> mlp = Chain(Flux.flatten, Dense(28^2 => 32, sigmoid), Dense(32 => 10));

julia> img = rand32(28, 28, 1, 128);

julia> lenet = Chain(  # from the model zoo
           Conv((5, 5), 1=>6, sigmoid),
           MaxPool((2, 2)),
           Conv((5, 5), 6=>16, sigmoid),
           MaxPool((2, 2)),
           Flux.flatten,
           Dense(256 => 120, sigmoid),
           Dense(120 => 84, sigmoid),
           Dense(84 => 10),
       );

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 41.608467
 20.979347
  2.015152

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
  0.9354934
 -1.4983172
 -0.6205859
 -0.6315984
  0.6592647
  1.2965859

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp |> cu, img |> cu)[1].layers[2].bias[1:3]
3-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 41.60848
 20.979351
  2.015153

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet |> cu, img |> cu)[1].layers[1].bias
6-element CuArray{Float32, 1, CUDA.DeviceMemory}:
  0.93553036
 -1.498424
 -0.6206611
 -0.63131595
  0.6591014
  1.2970955

julia> @eval Flux begin  # core of this: https://github.com/FluxML/Flux.jl/pull/2327
           function (a::Dense)(x::AbstractVecOrMat)
             _size_check(a, x, 1 => size(a.weight, 2))
             xT = _match_eltype(a, x)  # fixes Float64 input, etc.
             NNlib.bias_act!(a.σ, a.weight * xT, a.bias)  # does σ.(W*x .+ b), with fast paths
           end
           function (c::Conv)(x::AbstractArray)
             _conv_size_check(c, x)
             cdims = conv_dims(c, x)
             xT = _match_eltype(c, x)
             NNlib.bias_act!(c.σ, conv(xT, c.weight, cdims), conv_reshape_bias(c))
           end
       end

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp, img)[1].layers[2].bias[1:3]
3-element Vector{Float32}:
 41.608467
 20.979347
  2.015152

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet, img)[1].layers[1].bias
6-element Vector{Float32}:
  0.9354934
 -1.4983172
 -0.6205859
 -0.6315984
  0.6592647
  1.2965859

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), mlp |> cu, img |> cu)[1].layers[2].bias[1:3]
3-element CuArray{Float32, 1, CUDA.DeviceMemory}:
 41.60848
 20.979351
  2.015153

julia> Flux.gradient((m,x) -> sum(abs2, m(x)), lenet |> cu, img |> cu)[1].layers[1].bias
6-element CuArray{Float32, 1, CUDA.DeviceMemory}:
  0.93553036
 -1.498424
 -0.6206611
 -0.63131595
  0.6591014
  1.2970955

src/layers/basic.jl

src/layers/normalise.jl

codecov · 2024-11-05T08:21:59Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.37%. Comparing base (c86580b) to head (31fd7cf).
Report is 1 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff             @@
##           master    #2327       +/-   ##
===========================================
+ Coverage   33.54%   60.37%   +26.82%     
===========================================
  Files          31       31               
  Lines        1911     1938       +27     
===========================================
+ Hits          641     1170      +529     
+ Misses       1270      768      -502

mcabbott · 2024-11-08T00:31:19Z

Let's do this. If it's a disaster for some reason on 0.15 we can easily revert.

ToucheSir added performance run downstream test labels Sep 4, 2023

ToucheSir reviewed Sep 5, 2023

CarloLucibello reviewed Sep 5, 2023

mcabbott commented Sep 5, 2023

mcabbott force-pushed the bias_act branch from 48d5e45 to 1a3e33e Compare March 19, 2024 19:14

mcabbott force-pushed the bias_act branch from 1a3e33e to 4ab8343 Compare March 30, 2024 19:25

mcabbott force-pushed the bias_act branch from 4ab8343 to bc7b64d Compare November 4, 2024 22:45

mcabbott added this to the v0.15 milestone Nov 6, 2024

mcabbott added 3 commits November 6, 2024 17:34

use NNlib.bias_act

1eaee21

rm comments

mend

8be49a9

add to news

085260a

mcabbott force-pushed the bias_act branch from f8626a1 to 085260a Compare November 6, 2024 22:35

Update src/layers/basic.jl

31fd7cf

Co-authored-by: Carlo Lucibello <carlo.lucibello@gmail.com>

CarloLucibello approved these changes Nov 8, 2024

mcabbott merged commit af1e5fc into FluxML:master Nov 8, 2024
19 of 21 checks passed

mcabbott deleted the bias_act branch November 8, 2024 03:48

ToucheSir Sep 5, 2023

mcabbott Sep 5, 2023

ToucheSir Sep 6, 2023

ToucheSir Sep 6, 2023 •

edited

Loading

darsnack Sep 6, 2023

mcabbott Sep 6, 2023

ToucheSir Apr 2, 2024

mcabbott Apr 2, 2024

CarloLucibello Nov 5, 2024

mcabbott Nov 6, 2024
