Tidy up Maxout #1794

Merged: 4 commits merged into FluxML:master from the maxout branch on Dec 13, 2021
Conversation

mcabbott (Member)

Maxout is from #698. This PR:

  • adds pretty printing
  • changes the explicit constructor to Maxout(layer, layer, layer), rather than taking a tuple, to match other layers (with a deprecation for the old form)
  • adds more examples to the docstring, and combines the two docstrings
  • changes the forward pass not to use mapreduce (see the sketch after this list). I see now this was a performance choice at the time, discussed in Add MaxOut layer #647 (comment), but with Zygote it is much slower.
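For context, a rough sketch of the mapreduce change in the last bullet. This is illustrative, not the PR's actual diff: `layers` stands in for Maxout's stored tuple of layers, and the function names here are made up.

# Before (sketch): pairwise reduction, broadcasting max once per extra layer.
forward_old(layers, x) = mapreduce(f -> f(x), (acc, out) -> max.(acc, out), layers)

# After (sketch): run every layer, then one n-ary broadcast of max,
# which Zygote differentiates more cheaply than the nested pairwise form.
function forward_new(layers, x)
    outs = map(f -> f(x), layers)
    return max.(outs...)
end

Both return the same array for, say, layers = (Dense(5, 7, tanh), Dense(5, 7, tanh)); only the gradient cost differs.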

Before:

julia> using Flux, BenchmarkTools

julia> m3 = Maxout(() -> Dense(5, 7, tanh), 3)
Maxout{Tuple{Dense{typeof(tanh), Matrix{Float32}, Vector{Float32}}, Dense{typeof(tanh), Matrix{Float32}, Vector{Float32}}, Dense{typeof(tanh), Matrix{Float32}, Vector{Float32}}}}((Dense(5, 7, tanh), Dense(5, 7, tanh), Dense(5, 7, tanh)))

julia> x = rand(Float32, 5, 11);

julia> @btime gradient(sum∘m3, $x);
  min 112.792 μs, mean 123.774 μs (930 allocations, 49.09 KiB. GC mean 3.71%)

After:

julia> m3 = Maxout(() -> Dense(5, 7, tanh), 3)
Maxout(
  Dense(5, 7, tanh),                    # 42 parameters
  Dense(5, 7, tanh),                    # 42 parameters
  Dense(5, 7, tanh),                    # 42 parameters
)                   # Total: 6 arrays, 126 parameters, 888 bytes.

julia> x = rand(Float32, 5, 11);

julia> @btime gradient(sum∘m3, $x);
  min 34.541 μs, mean 38.448 μs (493 allocations, 32.48 KiB. GC mean 6.63%)

mcabbott (Member, Author)

In fact, the improvement is less clear at larger sizes. But I think we can optimise the gradient of max.(...) more easily than pairwise mapreduce?

julia> m = Maxout(() -> Dense(50, 70, tanh), 5);

julia> x = rand(Float32, 50, 110);

julia> @btime gradient(sum∘m, $x);
  min 806.917 μs, mean 1.248 ms (1268 allocations, 1.61 MiB. GC mean 7.54%)   # before
  min 782.125 μs, mean 919.414 μs (751 allocations, 1.24 MiB. GC mean 6.52%)  # after

julia> @btime m($x); # forwards
  min 303.500 μs, mean 353.118 μs (28 allocations, 422.41 KiB. GC mean 5.61%)  # before
  min 324.584 μs, mean 367.592 μs (22 allocations, 331.89 KiB. GC mean 4.31%)  # after

ToucheSir (Member)

The stranger result to me is the forward pass. It's not clear why the before version would be faster, and it appears both versions are unrolling the per-element comparisons.

mcabbott (Member, Author) commented Dec 1, 2021

I have been surprised by max before; I think it's very careful about NaN etc. It still seems strange here, though.

julia> x, y, z = (rand(100,100) for _ in 1:99);

julia> @btime $y .= max.($x, 0.0, $z);
  min 8.361 μs, mean 8.406 μs (0 allocations)

julia> @btime $y .= clamp.($x, clamp.(0.0, $z, Inf), Inf);  # just ifelse
  min 2.403 μs, mean 2.443 μs (0 allocations)
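
For illustration (my example, not from the thread), part of what makes Base's max careful is IEEE edge cases that a bare comparison skips:

julia> max(0.0, -0.0)                 # max resolves signed zeros to +0.0
0.0

julia> ifelse(0.0 > -0.0, 0.0, -0.0)  # a plain comparison does not
-0.0

julia> max(1.0, NaN)                  # max propagates NaN
NaN

julia> clamp(1.0, NaN, Inf)           # the ifelse-based clamp does not
1.0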

ToucheSir (Member)

Come to think of it though, max.(...) might not be such a bad idea for the GPU path. I don't have time to run benchmarks just now, but if that pans out, we could make a case for eating the slightly slower forward pass on CPU in exchange for fewer allocations and a faster backward pass.

mcabbott (Member, Author) commented Dec 2, 2021

I was going to suggest reverting to mapreduce for now; this PR can just be the minor tidying-up, and the next one can look more at performance.

I wrote some gradients for max.(x,y,z); it seems possible to make it roughly twice as quick on CPU. But it's more code than I thought it would be, and how many high-performance hand-written kernels do we want in Zygote?
https://gist.github.com/mcabbott/48aa0f4e0ec730d39995ad9917cad8f1
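
The gist has the real code; as a minimal sketch of the underlying idea (my illustration using ChainRulesCore, with the hypothetical name maxb, and only two arguments), the gradient of an elementwise max flows only to the argument that won each comparison:

using ChainRulesCore

# Sketch of a hand-written gradient for max.(x, y); the gist covers
# the three-argument case and CPU-specific optimisations.
maxb(x, y) = max.(x, y)

function ChainRulesCore.rrule(::typeof(maxb), x, y)
    xwins = x .>= y                    # mask of positions where x won (ties go to x)
    function maxb_pullback(Δ)
        Δ = unthunk(Δ)
        dx = ifelse.(xwins, Δ, zero(eltype(Δ)))   # cotangent only where x won
        dy = ifelse.(xwins, zero(eltype(Δ)), Δ)   # y gets the rest
        return (NoTangent(), dx, dy)
    end
    return max.(x, y), maxb_pullback
end

Zygote should pick this rule up via ChainRules, so gradient(sum∘maxb, x, y) bypasses the generic broadcast machinery on the backward pass.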

The last commit is unrelated, but it makes the CPU tests pass on Julia 1.7. It could be cherry-picked if someone wants to make a separate PR. I have not investigated why the result is slightly different.

Two review threads on src/layers/basic.jl (outdated, resolved)
mcabbott and others added 2 commits December 13, 2021 12:04
Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
Co-authored-by: Brian Chen <ToucheSir@users.noreply.github.com>
mcabbott (Member, Author)

bors r+

bors bot (Contributor) commented Dec 13, 2021

Build succeeded.

bors bot merged commit fe803a1 into FluxML:master on Dec 13, 2021.
mcabbott deleted the maxout branch on Dec 13, 2021 at 19:02.