-
-
Notifications
You must be signed in to change notification settings - Fork 611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RNN design for efficient CUDNN usage #1365
Comments
Great write-up @jeremiedb! Would love to see this added to Flux.jl |
I think, for performance and ease-of-use reasons, it is desirable to have a container type for handling batches of variable-length sequences similar to pytorch PackedSequence or tensorflow RaggedTensor. Otherwise, the user needs to store a vector of lengths and write logic to keep the model from back-propagating loss through the padding. The RNN kernel itself also can't skip unnecessary computation if the data doesn't have the padding information. |
CUDNN RNN operator does support the handling the varying seq lengths by proviving the a vector of Int of length = batch_size that indicates the valid length of each sequence. In any case, I would expect that at the end, the user will always need to provide the seq_length information in some way. @pshashk Other than from bringing more high level API usability, are you seeing benefits from a specific RNN container vs adding a seq_length element to the usual data iterator? |
For prior art, see https://github.com/mbauman/RaggedArrays.jl and https://github.com/SciML/RecursiveArrayTools.jl. |
As much as I like the explicitness of the current approach, I guess that this would also alleviate some ONNX export/import headaches: Would a natural consequence be that one also wants to do broadcasting over more than the batch dimension in Dense to allow them to be stacked after FusedRnns as well? I think I saw a PR about that long ago... Can backwards compatibility be handled by dispatching on the number of array dimensions? Perhaps not so nice due to type inference and surprising interop with other layers, but I guess a sentinel value like |
One advantage is the option to implement padding-aware methods for other layers. Convolution, normalization, and pooling applied after RNN may introduce the dependence between weights and the amount of padding, which is rarely desirable. I'm not sure if there is enough demand for that to live in Flux and not some complimentary package. |
Making a clean transition to a 3D array is fine but have we checked for correctness of the gradients? Further, it was done this way to match cudnn style, can we verify the performance considerations ? We should be still be able to use the single time step API to use cudnn if it works better, but will have to think about whether this is sound. |
@DhairyaLGandhi Maybe there's a confusion, but I'm fairly confident about the CUDNN approach being 3D based, as discussed here and also evidenced from the CUDA.jl rnn api: https://github.com/JuliaGPU/CUDA.jl/blob/fc690e20a90a1211f91d561c3bfc010957381c12/lib/cudnn/rnn.jl#L120 or directly at CUDNN API doc My original concern came from what appeared to be unreliable support of RNNs on GPUs and thought that starting off a known reference such as the 3D CUDNN could help laying out a more robust/performant foundation. It seems though that Knet did managed to get performing RNN based on the iterative sequence approach. Also, here JuliaGPU/CUDA.jl#343, Knet's author discussed having a more extensive CUDNN functions support. I guess best scenario would be to have the option to get vanilla RNNs through 3D and CUDNN dispatch, while keeping the flexibility of the 2D approach for adhoc stuff. But first priority I think should be to have functional RNNs on GPUs, That's what I tried to explore in above PR, by keeping the original 2D design and dropping cudnn. |
1367: RNN update to drop CUDNN, fix LSTM bug and output type stability r=CarloLucibello a=jeremiedb PR related to #1114 #1360 #1365 Some experiment for RNN handling. Hidden state of each cell structure was dropped as they weren't needed (AFAIK, only needed for size inference for CUDNN, but bias size could be used as a substitute to cells' `h` there as well). Looked to drop dependence on CUDNN entirely, so it's a pure Flux/CUDA.jl. File `src/cuda/curnnjl` no longer used. No modifications were made to the cell computations. Initial test seems to show decent performance, but yet to benchmark. Pending issue: despite having dropped completely the CUDNN dependency, there's still an instability issue that seems present when running on GPU. This is illustrated in the test at lines 1-50 of file `test\rnn-test-jdb.jl`. If that test runs on CPU, it goes well thorugh the 100 iterations. However, the same on GPU will thow NAs after couple dozens of iterations. My only hypothesis so far: when performing the iteration over the sequence through `m.(x)` or `map(rnn, x)`, is the order of the execution safe? Ie: is it possible that there isn't a `sync()` on the CUDA side between those seq steps, which may mess up the state? ### PR Checklist - [x] Tests are added - [ ] Entry in NEWS.md - [ ] Documentation, if applicable - [ ] Final review from `@dhairyagandhi96` (for API changes). Co-authored-by: jeremiedb <jeremie_db@hotmail.com> Co-authored-by: jeremie.db <jeremie.db@evovest.com>
Given the fact that we have severe correctness issues with explicit differentiation (#2185) due to the fact that |
With #2500 RNN, GRU and LSTM take |
With the current Flux RNN design, where each data batch is a Vector of
length = seq_length
and whose elements are of size[features, batch_size]
, I'm afraid it misses some performance opportunities.In particular, CUDNN RNN function is designed to directly handle an array of size
[features, batch_size, seq_length]
and apply the entire RNN chain (vanilla RNN, GRU, LSTM) on the full sequence. Plus, the CUDNN operator supports the stacking of multiple layers, unidirectionnel and bidirectional, as well as a vector specifying sequence length.The current Flux RNN design goes over the sequence, one timestep as a time, through the
m.(x)
(ormap(m, x)
) guidance indicated in the docs. This seems to effectively translates in a single timestep at ta time call to the underlying CUDNN: https://github.com/JuliaGPU/CUDA.jl/blob/fc690e20a90a1211f91d561c3bfc010957381c12/lib/cudnn/rnn.jl#L111, where seq_Length of 1 is hard coded.Also, a single layer is assumed:
Flux.jl/src/cuda/curnn.jl
Line 13 in 98e7222
layers
optional argument is left to its defaut value of 1: (https://github.com/JuliaGPU/CUDA.jl/blob/fc690e20a90a1211f91d561c3bfc010957381c12/lib/cudnn/rnn.jl#L42).From the approach found in other DL frameworks, for example in MXNet, although a step by step approach is supported, a high performant fused RNN is also available: https://mxnet.apache.org/versions/1.6/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.RNN.
Such operator works on data shaped `features X batch_size X seq_length.
It looks like the CUDA.jl /https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cudnn/rnn.jl, is almost there to support the arbitrary sequence length, as well as to allow the use of the bidirectionnal, seq_length and clipping options.
To take adavantage of such backend, it seems like moving away from the Vector of sequences to the 3D array representation would be needed. I think it would make sense as it's fairly intuitive to consider each batch as a single array. With such 3D array, using the traditionnal non-fused approach could be done for example through mapslices which wouldn't be a big departure from current thinking:
And a high performance RNN could be accessible with something along those lines:
Is such direction for RNN sound? I may well have overlooked considerations particular to Flux, but I think it would be greatly beneficial to bring both:
@AzamatB @DhairyaLGandhi
The text was updated successfully, but these errors were encountered: