Use muladd for LSTM cell matmuls #2023

Merged 1 commit into master on Sep 4, 2022

Conversation

ToucheSir (Member) commented on Jul 20, 2022

This appears to help significantly on CPU, ref. https://discourse.julialang.org/t/slow-lstm-on-gpu-in-flux/84228. I have not yet benchmarked GPU performance, but it is likely a wash. It would be great to have a single function that computes `x1 * y1 + x2 * y2`; does one exist in LinearAlgebra?
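For context, the change amounts to routing the cell's gate pre-activation through `muladd(A, B, z)`, which computes `A*B + z` in one call instead of a matmul followed by a broadcast add. A minimal sketch (the names `Wi`, `Wh`, `b`, `x`, `h` are illustrative, not the exact Flux internals):

```julia
using LinearAlgebra

# Illustrative LSTM-cell pre-activation; shapes are small for demonstration.
Wi, Wh = randn(4, 3), randn(4, 4)   # input-to-hidden and hidden-to-hidden weights
b = randn(4)                        # bias
x, h = randn(3), randn(4)           # input and previous hidden state

# Before: two matmuls plus broadcast adds, allocating intermediates.
g1 = Wi * x .+ Wh * h .+ b

# After: muladd(A, B, z) == A*B + z, fusing the add into the matmul call.
g2 = muladd(Wi, x, muladd(Wh, h, b))

g1 ≈ g2  # the two formulations agree
```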

mcabbott (Member) commented on Jul 22, 2022

Great if muladd helps something; I think I tried it for some other layers and was disappointed. I guess GPU timing is the final test?

It is a very simple `out .= z; mul!(out, A, B, true, true)`, which you could easily extend to allow `(A*B + C*D) .+ z`, but as far as I know nobody has packaged that up.
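That in-place pattern relies on the five-argument `mul!`, which computes `C = A*B*α + C*β`; with `α = β = true` it adds the product onto the existing contents. A small sketch confirming it matches `muladd`:

```julia
using LinearAlgebra

A, B = randn(3, 4), randn(4, 2)
z = randn(3, 2)

# Out-of-place reference: muladd(A, B, z) == A*B + z.
expected = muladd(A, B, z)

# In-place equivalent: seed `out` with z, then accumulate A*B onto it.
# Five-arg mul! computes out = A*B*α + out*β; here α = β = true.
out = copy(z)
mul!(out, A, B, true, true)

out ≈ expected  # same result, no intermediate A*B allocation
```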

mcabbott (Member) commented on Sep 4, 2022

Timing as in this gist: https://gist.github.com/JLDC/66c8eed6c36cb9cda85ff6404284d841 (from the Discourse thread above), I get on CPU:

julia> main()  # 2nd run:
181.671379 seconds (273.90 k allocations: 49.164 GiB, 90.20% gc time)  # before
141.337037 seconds (257.10 k allocations: 34.292 GiB, 87.77% gc time)  # after

julia> main()  # on a different computer! Julia 1.8, first run
 46.989183 seconds (2.30 M allocations: 49.275 GiB, 36.97% gc time, 11.01% compilation time)  # before
 35.523793 seconds (2.27 M allocations: 34.402 GiB, 14.94% gc time, 14.46% compilation time)  # after

On GPU, inserting `|> gpu` and `CUDA.@time`. Timings are quite noisy; best of a few runs each, same device:

julia> main()
# before
  1.994243 seconds (2.26 M CPU allocations: 213.138 MiB, 7.88% gc time) (61.50 k GPU allocations: 49.165 GiB, 36.31% memmgmt time)
# after
  1.800311 seconds (2.20 M CPU allocations: 213.963 MiB, 4.43% gc time) (53.70 k GPU allocations: 34.287 GiB, 17.56% memmgmt time)

julia> CUDA.device()
CuDevice(3): Tesla P100-PCIE-16GB

mcabbott (Member) left a comment
So we should do this.

And probably write a mulmuladd which calls mul! twice.
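A sketch of what such a `mulmuladd` could look like (hypothetical helper; it does not exist in LinearAlgebra or Flux), following the suggestion of two `mul!` calls, here with the first product allocated out-of-place and the second accumulated in place:

```julia
using LinearAlgebra

# Hypothetical helper: computes A*B + C*D + z with one output allocation,
# accumulating the second product via five-arg mul! (out = C*D + out).
function mulmuladd(A, B, C, D, z)
    out = muladd(A, B, z)        # out = A*B + z
    mul!(out, C, D, true, true)  # out = C*D + out
    return out
end

A, B = randn(3, 4), randn(4, 2)
C, D = randn(3, 5), randn(5, 2)
z = randn(3, 2)

mulmuladd(A, B, C, D, z) ≈ A*B + C*D + z  # agrees with the naive form
```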

@ToucheSir ToucheSir merged commit dedc7ce into master Sep 4, 2022
@ToucheSir ToucheSir deleted the bc/lstm-muladd branch September 4, 2022 03:31