Performance improvements by calling cuDNN API #321
Conversation
cc @denizyuret
@maleadt @denizyuret
Great!
Good catch, I agree.
What is your reasoning here? Making accurate algorithms the default may keep us behind in benchmarks and discourage people from adopting Julia as their deep learning language of choice. Accuracy is almost never important in deep learning (I have worked with 8-bit floats with little degradation in performance), and low accuracy sometimes helps as a regularizer and improves generalization.
As mentioned in #318 my experience with Knet is:
The code is in https://github.com/denizyuret/Knet.jl/blob/master/src/conv.jl
@denizyuret
I feel like Julia and its packages always favor accuracy over performance, so for the sake of consistency we could do the same here. I completely agree with your reasoning that accuracy is not that important for deep learning, so I would be happy to use the most performant code by default, but that is not a decision I want to make by myself.
@ViralBShah what do you think? Can we break with tradition and make the most performant variants of the cuDNN algorithms the default for the sake of deep learning packages? We can add keyword options to choose the more precise algorithms and document them, but I think it is important (especially for new users) that the defaults are fast.
I just added a way of caching the algorithm returned from the Find call.
That's great. About the LRU: my purpose in limiting the number of entries was not to conserve memory; the entries do not take that much space. It was more about not wasting too much time on Find calls if the array sizes keep changing, which LRU will not limit. In practice this probably never happens, so it does not matter much as long as the cache is large enough to handle training common deep models. And I can see the logic of LRU if the user switches between different models during the same session.
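The LRU policy under discussion can be sketched as follows. This is an illustrative Python sketch, not CUDA.jl's actual code: the key shape and the `find` callback (standing in for a `cudnnFindConvolutionForwardAlgorithm`-style benchmarking call) are assumptions.

```python
from collections import OrderedDict

class LRUAlgoCache:
    """Sketch of an LRU cache for cuDNN algorithm choices. Keys would be
    tuples like (element type, input size, conv dims); the expensive
    Find-style call runs only on a miss."""

    def __init__(self, maxsize=100):
        self.maxsize = maxsize
        self._entries = OrderedDict()
        self.find_calls = 0  # how many expensive benchmarking calls were made

    def get(self, key, find):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.find_calls += 1
        algo = find(key)  # stands in for a cudnnFind*-style benchmarking call
        self._entries[key] = algo
        if len(self._entries) > self.maxsize:
            self._entries.popitem(last=False)  # evict least recently used
        return algo
```

Note that if array sizes keep changing, every lookup misses and triggers a Find call regardless of the eviction policy, which is exactly the concern raised in the comment above.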
lib/cudnn/nnlib.jl (outdated)
@@ -67,6 +94,7 @@ fix1d(cdims::DenseConvDims{1,K,C_in,C_out,S,P,D,F}) where {K,C_in,C_out,S,P,D,F}
fix1d(pdims::PoolDims{1,K,S,P,D}) where {K,S,P,D,F} =
    PoolDims{2,(K...,1),(S...,1),(P...,0,0),(D...,1)}((pdims.I..., 1), pdims.C_in)
conv_forward = CircularDict{Tuple, Int32}(100)
Why only cache a limited amount? If you key with not much uniqueness, only (T, size, DenseConvDims), we should have a high enough hit rate. The entries are small, too, and don't keep anything alive.
Cf. my comment here: #321 (comment)
I was not really aiming to conserve memory, since (like you say) the entries do not take that much space.
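For contrast with LRU, the `CircularDict` in the diff above can be read as a fixed-capacity map with insertion-order (FIFO) eviction. A minimal Python analogue, purely illustrative (the name and Julia type come from the diff; the behavior here is an assumption based on the discussion):

```python
from collections import OrderedDict

class CircularDict:
    """Illustrative sketch of a bounded cache: holds at most `capacity`
    entries and evicts the oldest *inserted* one, regardless of how
    recently it was used (FIFO, not LRU)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def __contains__(self, key):
        return key in self._entries

    def __getitem__(self, key):
        return self._entries[key]  # access does not reorder entries

    def __setitem__(self, key, value):
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the oldest insertion
```

The difference from the LRU policy discussed above is that a frequently used entry can still be evicted once enough new keys arrive.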
So I just did the measurements I promised (all measurements are in milliseconds):
The reason that ResNet-50 is slower is that there are lots of memory issues. I believe these could be resolved if we can come up with a better heuristic for the workspace size.
These look very good. Can you give a bit more detail about ResNet-50? Do you think the problem is the workspace size used during algorithm discovery, or during training/inference? Is it running close to the GPU memory limit? About workspace-size heuristics, I use two: (1) limit the maximum workspace size during discovery to a reasonable percentage of available memory and/or a reasonable multiple of the input array size: https://github.com/denizyuret/Knet.jl/blob/12430600ea0782cb16cd1aa5887796a5f0d4359f/src/conv.jl#L538 (2) when picking an algorithm from the Find results, prefer ones with lower memory up to a 10% speed penalty: https://github.com/denizyuret/Knet.jl/blob/12430600ea0782cb16cd1aa5887796a5f0d4359f/src/conv.jl#L617
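The two heuristics can be sketched as follows. This is a hedged Python sketch, not Knet's actual code: the constants are illustrative, and `results` is assumed to be a list of `(time_ms, workspace_bytes, algo_id)` tuples as a Find-style call might return.

```python
def workspace_limit(free_bytes, input_bytes, mem_frac=0.5, input_mult=10):
    """Heuristic 1: cap the workspace considered during algorithm discovery
    at a fraction of available GPU memory and a multiple of the input size.
    (mem_frac and input_mult are illustrative defaults.)"""
    return min(int(mem_frac * free_bytes), input_mult * input_bytes)

def pick_algorithm(results, speed_penalty=0.10):
    """Heuristic 2: among the benchmarked algorithms, prefer the one with
    the smallest workspace whose runtime is within `speed_penalty` (10%
    here) of the fastest."""
    fastest = min(time for time, _, _ in results)
    acceptable = [r for r in results if r[0] <= fastest * (1 + speed_penalty)]
    return min(acceptable, key=lambda r: r[1])[2]  # least memory among acceptable
```

Together these keep discovery from probing algorithms that would never fit, and trade a bounded amount of speed for a potentially large reduction in workspace memory.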
@denizyuret, I was already using your first heuristic but overlooked the second. I have updated my previous comment with new measurements from the experimentation branch.
@gartangh, please be careful when adding bias, activation functions, etc.: I have observed inferior performance relative to handwritten kernels in previous versions of cuDNN, so I would not make these the default without benchmarking.
@denizyuret, as per your request, where "before" is the master branch and "after" is the conv_bias_act branch.
Force-pushed from 9558736 to d7cf913.
That's with all the recent broadcast optimizations, I take it?
Force-pushed from 6dcdf63 to 137ff45.
Interestingly, the cuDNN documentation suggests all Julia broadcastable arrays should work: "Each dimension of the bias tensor A must match the corresponding dimension of the destination tensor C or must be equal to 1." We should file a bug report :)
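The quoted rule is ordinary broadcasting restricted to equal ranks. A quick sketch of the check (the helper name and the NCHW shapes are purely illustrative):

```python
def cudnn_bias_compatible(bias_shape, dest_shape):
    """Check the rule quoted from the cuDNN docs: each dimension of the
    bias tensor must either match the corresponding destination dimension
    or be 1. Assumes both descriptors have the same rank, as cuDNN tensor
    descriptors do."""
    if len(bias_shape) != len(dest_shape):
        return False
    return all(b == d or b == 1 for b, d in zip(bias_shape, dest_shape))
```

By this rule a per-channel bias such as `(1, C, 1, 1)` against an NCHW destination `(N, C, H, W)` should be accepted, which is what the bug report below is about.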
@denizyuret, here we go: https://developer.nvidia.com/nvidia_bug/3084210
No, there is no such rule or "tradition". Packages should do whatever's best for them, and authors should make independent decisions about what is best.
This is indeed very promising, thank you! I'm still interested in diving deeper into ResNet; is there a profile of the call graph with this branch? Also, it seems that all activations are now constant time; can we verify this against cuDNN implementations elsewhere?
Also, are there updates to the numbers posted in #321 (comment)?
@JeffBezanson, I am selecting the fast algorithms over the accurate ones by default now.
@DhairyaLGandhi, here you go (I made some changes to my benchmark setup, so the numbers are not exactly the same as in #321 (comment)). All numbers are still in milliseconds, and this was tested on an NVIDIA V100 16GB.
@DhairyaLGandhi, the problem with ResNet-50 is gone after the changes to my benchmark setup.
I'd like to run some benchmarks but could not find the right combination of CUDA, GPUArrays, NNlib, etc. that works with your branch (checking out the masters of all the others did not work). @gartangh, can you advise?
CUDA v1.2.0 (#conv_bias_act from https://github.com/gartangh/CUDA.jl)
bump |
Force-pushed from a1e83ee to c7307d7.
Force-pushed from c7307d7 to 2ad6e4c.
bors try
Needs the different branch from NNlib, otherwise we wouldn't get the benefit, I would imagine.
There's more than just the
try
Build succeeded:
Force-pushed from 5ba060e to f3b62e3.
bors r+
Build succeeded: |
Depends on:
FluxML/NNlib.jl#228