Overview of the fastest CPU RNNs implementation #228
Yes, I second this feature request!
I have the GRU Cell forward and backprop mostly working and tested. I tried to implement the optimizations mentioned by Silicon Valley AI Lab/Baidu Research here and asked for clarification, because their GRU variant 4 claim of "more speed, same memory usage" seems to actually be "more speed, more memory usage". I will probably implement forward, backward and inference primitives for all RNNs (all layers?), as there are huge gains to be had if we can re-use/destroy the input tensors, or at least the intermediate gates, during inference when there is no need for backprop.
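For reference, here is a minimal sketch of a single GRU cell forward step in the original-paper formulation (NumPy with placeholder weight names, not the Arraymancer API):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Minimal sketch (NumPy, hypothetical names) of one GRU cell forward step,
# original-paper formulation. x: [batch, input], h: [batch, hidden].
def gru_cell_forward(x, h, Wr, Wz, Wn, Ur, Uz, Un, br, bz, bn):
    r = sigmoid(x @ Wr + h @ Ur + br)        # reset gate
    z = sigmoid(x @ Wz + h @ Uz + bz)        # update gate
    n = np.tanh(x @ Wn + (r * h) @ Un + bn)  # candidate: reset applied to h before the recurrent matmul
    return (1.0 - z) * n + z * h             # one common convention for the interpolation
```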
Tracking more implementations. There is an ongoing rewrite of MxNet CPU RNNs using fused kernels:
They can serve as a reference benchmark. I also noticed that there is experimental RNN Cell support in MKL DNN, introduced in oneapi-src/oneDNN@f35779d. Not too sure how it relates to oneapi-src/oneDNN#218.
The GRU Cell forward, backward and inference passes are fully implemented with tests in #231. Now I'm implementing the full GRU (in a fused manner; one common fusion is sketched below), however some questions are still unresolved:
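To make the fused approach concrete, a hedged sketch (NumPy, hypothetical names, not the actual Arraymancer primitives): the input-side projections have no sequential dependency, so they can be hoisted into one large GEMM for the whole sequence, and with the CuDNN-style reset-gate placement the recurrent projections of all three gates also collapse into a single GEMM per step.

```python
import numpy as np

def gru_forward_fused(X, h0, Wi, Ui, bi, bh):
    """X: [seq, batch, input]; Wi: [input, 3*hidden]; Ui: [hidden, 3*hidden];
    bi, bh: [3*hidden]. Assumed gate order: reset, update, candidate."""
    seq_len, batch, _ = X.shape
    hidden = h0.shape[1]
    # Input projections are time-independent: one big GEMM for the whole
    # sequence and all three gates.
    Xp = (X.reshape(seq_len * batch, -1) @ Wi + bi).reshape(seq_len, batch, 3 * hidden)
    h = h0
    outputs = []
    for t in range(seq_len):
        Hp = h @ Ui + bh                       # recurrent part stays sequential, but one GEMM for all 3 gates
        xr, xz, xn = np.split(Xp[t], 3, axis=1)
        hr, hz, hn = np.split(Hp, 3, axis=1)
        r = 1.0 / (1.0 + np.exp(-(xr + hr)))   # reset gate
        z = 1.0 / (1.0 + np.exp(-(xz + hz)))   # update gate
        n = np.tanh(xn + r * hn)               # candidate: CuDNN-style reset applied after the matmul
        h = (1.0 - z) * n + z * h
        outputs.append(h)
    return np.stack(outputs), h
```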
@mratsim I don't quite understand the weight reuse between CPU and GPU. Do you mean that weights trained on GPU cannot be applied on CPU just because the CPU and GPU implementations have different equations? If so, how does TensorFlow handle this situation? AFAIK, TensorFlow uses different equations from CuDNN but it has also integrated CuDNN.
As the Nvidia and Baidu blogs said, CuDNN's equations are more friendly for optimization. But I'm wondering if there are any accuracy differences between these two sets of equations.
Thanks for dropping by. Regarding weight reuse, Keras plainly prevents sharing CuDNN and CPU weights, and they reimplemented a CPU version compatible with CuDNN.
Now, in the grand scheme of things, I suppose the weights can actually be re-used and the first couple of batches will act like transfer learning/domain adaptation does for CNNs. Regarding accuracy, Baidu's and Nvidia's tests showed that there is almost no accuracy difference. This paper even showed 3 much more radical variants that only take into account the last hidden state, and 2 of them performed just as well as the fully gated GRU. Equations are from the Wikipedia article.
Regarding time-major speed, that was indeed my feeling. For variable-length inputs, I suppose we have to wait for CuDNN 8.
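To make the incompatibility concrete, a hedged illustration (NumPy, hypothetical names): the two formulations differ only in the candidate hidden state, through where the reset gate is applied, so the same recurrent weights produce different results.

```python
import numpy as np

# The candidate state is the only place the two GRU formulations differ:
# reset applied before vs after the recurrent matmul.
rng = np.random.default_rng(0)
batch, hidden = 4, 8
h  = rng.standard_normal((batch, hidden))   # previous hidden state
xn = rng.standard_normal((batch, hidden))   # stands in for Wn @ x + bias
Un = rng.standard_normal((hidden, hidden))  # recurrent weights of the candidate
r  = 1.0 / (1.0 + np.exp(-rng.standard_normal((batch, hidden))))  # some reset activation

n_paper = np.tanh(xn + (r * h) @ Un)    # original paper / Keras CPU GRU: reset h, then project
n_cudnn = np.tanh(xn + r * (h @ Un))    # CuDNN variant: project h, then reset
print(np.abs(n_paper - n_cudnn).max())  # non-zero: weights are not directly interchangeable
```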
* Implement forward pass of GRU Cell - RFC #228
* Rename previous implementation "inference" and use original paper names
* Add forward GRU Cell with weights saving
* linear_backward doesn't need bias to get gradBias
* Add GRU cell backpropagation + tests
New paper, LSTM benchmarks of deep learning frameworks: https://arxiv.org/pdf/1806.01818.pdf
RNNs, and particularly LSTM and GRU, have made a significant contribution to deep learning applications.
They are the default go-to tool for natural language processing, are heavily explored in reinforcement learning, and are used in many combined vision+text tasks and in time-series prediction (though in competition with WaveNets).
The CuDNN implementation is already heavily optimized; the CPU implementation should be as fast as possible as well.
General overview
Tensorflow
Readable implementation
"Unreadable" C++ implementations (static graphs)
Benchmarks
Unfortunately, only GPU benchmarks are available:
Optimized implementations
Note on biases and equations
The various implementations do not agree on the biases or on the equations chosen.
To allow loading weights on both CPU and GPU, it would be best to use the same equations as CuDNN.
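To make the bias point concrete, a hedged sketch (NumPy, placeholder names, not the Arraymancer API): CuDNN keeps a separate input bias and recurrent bias per gate. For the reset and update gates the two biases simply add and can be folded into one, but in the candidate gate the recurrent bias sits inside the reset-gated term, so a single-bias formulation cannot reproduce it exactly.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Hedged sketch (NumPy, hypothetical names) of the CuDNN double-bias layout.
def cudnn_gru_cell(x, h, Wr, Wz, Wn, Ur, Uz, Un, bWr, bWz, bWn, bUr, bUz, bUn):
    r = sigmoid(x @ Wr + bWr + h @ Ur + bUr)        # foldable: a single bias bWr + bUr is equivalent
    z = sigmoid(x @ Wz + bWz + h @ Uz + bUz)        # foldable: a single bias bWz + bUz is equivalent
    n = np.tanh(x @ Wn + bWn + r * (h @ Un + bUn))  # not foldable: bUn is scaled by the reset gate
    return (1.0 - z) * n + z * h
```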
List of relevant issues: