Overview of the fastest CPU RNNs implementation #228

mratsim · 2018-05-10T10:17:51Z

RNNs, and particularly LSTMs and GRUs, have made a significant contribution to deep learning applications.

They are the default go-to tool for natural language processing, and are heavily explored in reinforcement learning, many combined visual+text tasks and time-series prediction (though in competition with WaveNets).

The CuDNN implementation is already heavily optimized; however, the CPU implementation should be as fast as possible as well.

General overview

GRU Paper
CS231n 2017 - lecture 10
Colah tutorial
Towards Data Science

Tensorflow vs PyTorch/CuDNN
Tensorflow

r = sigmoid(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) 
z = sigmoid(W_{iz} x + b_{iz} + W_{hz} h + b_{hz})
n = tanh(W_{in} x + b_{in} + W_{hn} (r * h) + b_{hn})
h' = (1 - z) * n + z * h

PyTorch equations

r = sigmoid(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) 
z = sigmoid(W_{iz} x + b_{iz} + W_{hz} h + b_{hz})
n = tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn}))
h' = (1 - z) * n + z * h

Note that in the paper the equations are:

r = sigmoid(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) 
z = sigmoid(W_{iz} x + b_{iz} + W_{hz} h + b_{hz})
n = tanh(W_{in} x + b_{in} + W_{hn} (r * h) + b_{hn})
h' = (1 - z) * h + z * n
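
The last line is the only place where the paper listing differs from the Tensorflow listing: the update gate z multiplies the other term. Since sigmoid(-a) = 1 - sigmoid(a), negating the update-gate pre-activation (weights and biases) switches between the two conventions, so this is a naming difference rather than a different model. A minimal scalar sketch, with illustrative function names that do not come from any framework:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def blend_framework(z, n, h):
    # h' = (1 - z) * n + z * h  (Tensorflow / PyTorch listings)
    return (1.0 - z) * n + z * h

def blend_paper(z, n, h):
    # h' = (1 - z) * h + z * n  (paper listing)
    return (1.0 - z) * h + z * n

a = 0.7             # arbitrary update-gate pre-activation
n, h = 0.25, -0.5   # arbitrary candidate and previous hidden state

out_framework = blend_framework(sigmoid(a), n, h)
out_paper = blend_paper(sigmoid(-a), n, h)  # same gate, negated pre-activation
```

Both calls produce the same new hidden state up to floating-point rounding.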

And CuDNN

i_t = σ(W_i x_t + R_i h_{t-1} + b_{Wi} + b_{Ri})
r_t = σ(W_r x_t + R_r h_{t-1} + b_{Wr} + b_{Rr})
h'_t = tanh(W_h x_t + r_t ◦ (R_h h_{t-1} + b_{Rh}) + b_{Wh})
h_t = (1 - i_t) ◦ h'_t + i_t ◦ h_{t-1}
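
The practical difference between these listings is where the reset gate is applied: before the recurrent matrix product (W_{hn} (r * h)) or after it (r * (W_{hn} h + b_{hn}), as in the PyTorch and CuDNN forms, where CuDNN's i_t plays the role of z). A minimal pure-Python sketch of one GRU step with both placements follows; the helper names and the weight values are made up for illustration, and the dictionary keys simply mirror the symbols in the equations above.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vs):
    return [sum(t) for t in zip(*vs)]

def mul(u, v):
    return [a * b for a, b in zip(u, v)]

def gru_step(x, h, p, reset_after):
    """One GRU step; reset_after selects the PyTorch/CuDNN placement of r."""
    def pre(g):  # W_{ig} x + b_{ig} + W_{hg} h + b_{hg}
        return add(matvec(p["W_i" + g], x), p["b_i" + g],
                   matvec(p["W_h" + g], h), p["b_h" + g])
    r = [sigmoid(a) for a in pre("r")]
    z = [sigmoid(a) for a in pre("z")]
    if reset_after:
        hidden = mul(r, add(matvec(p["W_hn"], h), p["b_hn"]))   # r * (W_hn h + b_hn)
    else:
        hidden = add(matvec(p["W_hn"], mul(r, h)), p["b_hn"])   # W_hn (r * h) + b_hn
    n = [math.tanh(a) for a in add(matvec(p["W_in"], x), p["b_in"], hidden)]
    return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

# Arbitrary illustrative parameters (input size 2, hidden size 2).
p = {
    "W_ir": [[0.1, 0.2], [0.3, 0.4]], "W_hr": [[0.5, -0.5], [0.2, 0.1]],
    "W_iz": [[0.2, -0.1], [0.0, 0.3]], "W_hz": [[0.1, 0.1], [-0.2, 0.4]],
    "W_in": [[0.4, 0.1], [-0.2, 0.3]], "W_hn": [[0.5, -0.3], [0.8, 0.1]],
}
p.update({"b_" + s + g: [0.1, -0.1] for s in "ih" for g in "rzn"})

x, h = [1.0, -1.0], [0.5, -0.5]
h_before = gru_step(x, h, p, reset_after=False)
h_after = gru_step(x, h, p, reset_after=True)
```

With the same weights the two placements give different hidden states, which is why weights trained with one form cannot simply be loaded into an implementation of the other.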

Readable implementation

WildML - GRU
Pure Numpy GRU implementation used by Intel Nervana Neon for testing
Neon GRU test suite
Neon implementation
PyTorch implementation
CuTorch fused RNN implementation
Torch implementation of LSTM by jcjohnson
Official Torch RNN
Theano RNN
Lasagne/Theano official implementation
Tensorflow tutorial for GRU
MXNet high-level API

"Unreadable" C++ implementations (static graphs)

Caffe RNN
PaddlePaddle LSTM and GRU
Tensorflow LSTM and GRU
Mxnet

Benchmarks

Unfortunately, only GPU benchmarks are available:

Optimized implementations

GRU4Rec in Theano, apparently this was 170x faster than Tensorflow code
Nvidia on how to optimize RNNs and paper.
Baidu Research:
- in-depth part 1
  - Combine across timesteps the multiplications by weights
  - Gemm NN and Gemm TN do not have the same speed (including for CUBLAS)
- part 2 on Graph optimization
  - Concatenation across timesteps and gates
  - Moving the Reset Gate
  - Saving activation
- Persistent RNNs for small batches with weights in GPU registers
  - Paper
  - blog
  - keynote.
Yandex (Russian Search Engine) Faster-RNNLM
- Focuses on the One Billion Word Benchmark and can process about 250k words per second with 8 threads at 3.3 GHz
Paper with 3 variants of GRU with fewer parameters, Rahul Dey and Fathi M. Salem
- See also Wikipedia
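
To make the Baidu items on combining multiplications across timesteps and concatenating across gates concrete, here is a small sketch under my own assumptions: nested lists stand in for tensors, and `matmul`, `transpose` and `fused_block` are illustrative helpers, not functions from any of the libraries listed. It only covers the input-side products, because the recurrent products depend on the previous hidden state and cannot be batched over time in the same way.

```python
def matmul(a_mat, b_mat):
    """Multiply nested-list matrices: (m x k) @ (k x n)."""
    cols = list(zip(*b_mat))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in a_mat]

def transpose(m):
    return [list(col) for col in zip(*m)]

T, B, F, H = 3, 2, 2, 2  # timesteps, batch size, features, hidden size

# Time-major input: xs[t] has shape [B, F] (arbitrary values).
xs = [[[0.1 * (t + 1) * (b + 1) + 0.01 * f for f in range(F)]
       for b in range(B)] for t in range(T)]

W_ir = [[0.1, 0.2], [0.3, 0.4]]    # each gate weight has shape [H, F]
W_iz = [[-0.1, 0.5], [0.2, -0.3]]
W_in = [[0.7, -0.6], [0.0, 0.9]]
gate_weights = (W_ir, W_iz, W_in)

# Naive: 3 * T small matrix products, one per gate and timestep.
naive = [[matmul(x_t, transpose(W)) for W in gate_weights] for x_t in xs]

# Fused: stack the gates into [3H, F] and the timesteps into [T*B, F],
# then do a single larger product of shape [T*B, 3H].
W_all = [row for W in gate_weights for row in W]
X_all = [row for x_t in xs for row in x_t]
fused = matmul(X_all, transpose(W_all))

def fused_block(t, g):
    """Rows of timestep t, columns of gate g (0 = r, 1 = z, 2 = n)."""
    return [row[g * H:(g + 1) * H] for row in fused[t * B:(t + 1) * B]]
```

Each `fused_block(t, g)` holds the same values as `naive[t][g]`; the gain on real hardware comes from handing the BLAS one large GEMM instead of many small ones.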

Note on biases and equations

The various implementations do not agree on the biases or on the equations chosen.

WildML has 1 bias per equation, Keras and Neon too.
Chainer, Torch and CuDNN have 2 biases.

To allow loading weights on both CPU and GPU, it would be best to use the same equations as CuDNN.
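
A small sketch of why the one-bias and two-bias layouts are not always interchangeable, assuming the CuDNN form of the equations. For the gates, the two biases only ever appear as a sum, so they can be folded into one. For the candidate state, the recurrent bias sits inside the product with the reset gate, so folding it changes the result. Scalars stand in for one hidden unit, and all names and values below are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def merge_biases(b_w, b_r):
    """Fold the input-side and recurrent-side biases of one gate into a single bias."""
    return [a + b for a, b in zip(b_w, b_r)]

# Scalar stand-ins for W x_t and R h_{t-1} of one hidden unit (arbitrary values).
wx, rh = 0.3, -0.2
b_w, b_r = 0.05, 0.15

# Gates: only b_w + b_r matters, so merging is lossless.
two_bias_gate = sigmoid(wx + rh + b_w + b_r)
one_bias_gate = sigmoid(wx + rh + merge_biases([b_w], [b_r])[0])

# Candidate state, CuDNN form: b_r is scaled by the reset gate r, b_w is not.
r = 0.4
two_bias_candidate = math.tanh(wx + r * (rh + b_r) + b_w)
merged_candidate = math.tanh(wx + r * rh + (b_w + b_r))
```

The gate values agree, while the two candidate values differ whenever r is not 1, so a single-bias implementation cannot represent CuDNN weights exactly unless it keeps the recurrent candidate bias separate.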

List of relevant issues:

PyTorch forum: Redundant biases for LSTM
Keras: weights on GPU cannot be reused on CPU and solutions (i.e. redoing a CPU layer):
- LSTM
- GRU


sclee15 · 2018-05-11T12:03:10Z

Yes.. I second for this feature!

mratsim · 2018-05-12T15:43:29Z

I have GRU Cells forward and backprop mostly working and tested.

I tried to implement the optimizations mentioned by Silicon Valley AI Lab/Baidu Research here and asked for clarification, because their GRU variant 4 claim of "more speed, same memory usage" seems to actually be "more speed, more memory usage".

svail/diff_graphs#2

I will probably implement forward, backward and inference primitives for all RNNs (all layers?), as there are huge gains to be had if we can re-use/destroy the input tensors, or at least the intermediate gates, during inference when there is no need for backprop.
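
A rough sketch of the forward/inference split, to show the idea rather than any actual API: the training path has to return the gates for backprop, while the inference path can overwrite the hidden-state buffer and keep nothing. Biases are omitted for brevity, the reset gate is applied after the recurrent product, and every name below is hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gates(x, h, p):
    """Compute r, z, n (reset gate applied after the recurrent product, no biases)."""
    wx = {g: matvec(p["W_i" + g], x) for g in "rzn"}
    uh = {g: matvec(p["W_h" + g], h) for g in "rzn"}
    r = [sigmoid(a + b) for a, b in zip(wx["r"], uh["r"])]
    z = [sigmoid(a + b) for a, b in zip(wx["z"], uh["z"])]
    n = [math.tanh(a + ri * b) for a, ri, b in zip(wx["n"], r, uh["n"])]
    return r, z, n

def gru_forward(x, h, p):
    """Training path: returns the new state plus what backprop needs."""
    r, z, n = gates(x, h, p)
    h_new = [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]
    return h_new, (r, z, n, h)

def gru_inference_inplace(x, h, p):
    """Inference path: overwrites h and saves no gates."""
    _, z, n = gates(x, h, p)
    for k in range(len(h)):
        h[k] = (1 - z[k]) * n[k] + z[k] * h[k]
    return h

p = {
    "W_ir": [[0.1, -0.2], [0.3, 0.4]], "W_hr": [[0.5, 0.1], [-0.3, 0.2]],
    "W_iz": [[0.2, 0.1], [0.0, -0.3]], "W_hz": [[0.1, 0.1], [0.2, -0.4]],
    "W_in": [[0.4, 0.1], [-0.2, 0.3]], "W_hn": [[0.5, -0.3], [0.8, 0.1]],
}
x, h0 = [1.0, -1.0], [0.5, -0.5]

h_train, cache = gru_forward(x, h0, p)
h_buf = list(h0)
h_infer = gru_inference_inplace(x, h_buf, p)
```

Both paths compute the same hidden state; only the training path allocates and keeps the extra gate buffers.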

mratsim · 2018-05-13T10:02:27Z

Tracking more implementations.

There is an ongoing rewrite of MxNet CPU RNNs using Fused kernels:

LSTM - [WIP][MXNET-107] Fused LSTM implementation for CPU apache/mxnet#10104
GRU - [MXNET-107]Fused GRU implementation for CPU apache/mxnet#10311
LSTM inference-only - Cpu lstm inference apache/mxnet#9977

They can serve as a reference benchmark.

I also noticed that there is experimental RNN Cell support in MKL DNN introduced here oneapi-src/oneDNN@f35779d. Not too sure how it relates to oneapi-src/oneDNN#218

mratsim · 2018-05-13T22:10:11Z

The GRU Cell, forward, backward and inference are fully implemented with tests in #231.

Now I'm implementing GRU (in a fused manner); however, some questions are unresolved:

Which default to choose between [time/sequence, batch, features] and [batch, time/sequence, features]?
- PyTorch is time major and can switch with batch_first = true
- Keras and Tensorflow are batch major and can switch with time_major = true
- CuDNN was time-major; I'm not sure if it still is, as the documentation for CuDNN 7.1 is unclear. For example, cudnnFindRNNForwardInferenceAlgorithmEx mentions:
  
  xDesc
  Input. An array of fully packed tensor descriptors describing the input to each recurrent iteration (one descriptor per iteration). The first dimension (batch size) of the tensors may decrease from element n to element n+1 but may not increase. Each tensor descriptor must have the same second dimension (vector length).
  
  but the forum questions here show that the current situation is confusing:
- In many cases it makes sense to be time-major, since the batch output and hidden state are computed over time. Also, slicing a time-major tensor over time would still give us a contiguous tensor and the best performance. However, for machine translation you want to process a batch of sentences of varying length, hence batch-major makes more sense.
How to deal with variable-length sequences (for example sentences for machine translation)? PyTorch's pack_padded_sequence and pad_packed_sequence are generating a lot of questions and are probably not the best way to go about this.
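
The contiguity argument for time-major can be checked with plain offset arithmetic. The sketch below assumes a dense row-major buffer and uses made-up helper names; it lists the flat offsets touched when reading one timestep for the whole batch in each layout.

```python
def offset_time_major(t, b, f, T, B, F):
    # Layout [T, B, F]: time is the slowest-moving dimension.
    return (t * B + b) * F + f

def offset_batch_major(t, b, f, T, B, F):
    # Layout [B, T, F]: batch is the slowest-moving dimension.
    return (b * T + t) * F + f

def timestep_offsets(offset_fn, t, T, B, F):
    """Flat offsets of every element of timestep t, for the whole batch."""
    return [offset_fn(t, b, f, T, B, F) for b in range(B) for f in range(F)]

def is_contiguous(offsets):
    return all(nxt - cur == 1 for cur, nxt in zip(offsets, offsets[1:]))

T, B, F = 4, 3, 2
tm = timestep_offsets(offset_time_major, 1, T, B, F)   # [6, 7, 8, 9, 10, 11]
bm = timestep_offsets(offset_batch_major, 1, T, B, F)  # [2, 3, 10, 11, 18, 19]
```

In the time-major layout the slice for one timestep is a single contiguous block that can be passed to a GEMM directly, whereas the batch-major slice is strided across the batch dimension.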

pengzhao-intel · 2018-05-14T14:19:49Z

@mratsim, @TaoLv can answer parts of your questions based on the principles of our CPU implementation.

TaoLv · 2018-05-14T15:12:34Z

@mratsim I don't quite understand the weight reuse between cpu and gpu. Do you mean weights trained on gpu cannot be applied to cpu just because the cpu and gpu implementations use different equations? If so, how does tensorflow handle this situation? AFAIK, tensorflow uses different equations from cudnn, but it has also integrated cudnn.
For the input data layout, I guess time major will show better performance on both cpu and gpu. Actually, mxnet will perform a reshape for batch major input. https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/rnn/rnn_cell.py#L677 However, I think this kind of reshape or layout change can be hidden in cpp code or the dnn library for better performance.
For variable length input, I have no idea how a framework can perform highly efficient parallel computation on those packed inputs if they are not well aligned.
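
One common way to keep variable-length input aligned, sketched here under my own assumptions and not taken from any framework's source: sort the sequences by decreasing length and, in a time-major layout, shrink the batch at each timestep. This matches the xDesc constraint quoted earlier in the thread, where the batch size may decrease from one iteration to the next but may not increase. The function name is illustrative.

```python
def batch_sizes_per_timestep(lengths):
    """Number of still-active sequences at each timestep.

    `lengths` must be sorted in decreasing order, so that the active
    sequences at timestep t are always the first rows of the batch.
    """
    if any(a < b for a, b in zip(lengths, lengths[1:])):
        raise ValueError("lengths must be sorted in decreasing order")
    max_len = lengths[0] if lengths else 0
    return [sum(1 for n in lengths if n > t) for t in range(max_len)]

lengths = [5, 3, 3, 1]                      # four sentences of varying length
sizes = batch_sizes_per_timestep(lengths)   # [4, 3, 3, 1, 1]
```

At timestep t only the first `sizes[t]` rows are processed, so each step is still one dense matrix product, just on a smaller batch.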

TaoLv · 2018-05-14T15:23:20Z

As nv and baidu's blogs said, cudnn's equations are more friendly for optimization. But I'm wondering if there are any accuracy differences between these two sets of equations.

mratsim · 2018-05-14T16:54:23Z

@TaoLv @pengzhao-intel

Thanks for dropping by. Regarding weight reuse, Keras plainly prevents sharing CuDNN and CPU weights, and they reimplemented a CPU version compatible with CuDNN.

Keras: weights on GPU cannot be reused on CPU and solutions (i.e. redoing a CPU layer):

LSTM

GRU

Now in the grand scheme of things, I suppose they can actually be re-used and the first couple batches will act like transfer learning/domain adaptation for CNNs.

Regarding accuracy, Baidu's and Nvidia's tests showed that there is almost no accuracy difference. This paper even showed 3 much more radical variants that only took into account the last hidden state, and 2 of them performed just as well as the fully gated GRU. Equations from the Wikipedia article.

variant 1:
variant 2:
variant 3 (worse accuracy per batch but can do more timesteps in the same CPU time)

Regarding time-major speed, it was indeed my feeling.

For variable-length inputs, I suppose we have to wait for CuDNN 8.

mratsim · 2018-05-15T13:42:11Z

A quick survey last Sunday among Kaggle data scientists (including masters and grandmasters) shows that batch-major is favored 4-0 (there is one vote by me in both sections to ease voting).

- Implement forward pass of GRU Cell - RFC #228
- Rename previous implementation "inference" and use original paper names
- Add forward GRU Cell with weights saving
- linear_backward doesn't need bias to get gradBias
- Add GRU cell backpropagation + tests

mratsim · 2018-08-31T18:29:02Z

New paper LSTM benchmarks of deep learning frameworks: https://arxiv.org/pdf/1806.01818.pdf

mratsim added optimization key feature state-of-the-art research labels May 10, 2018

mratsim added a commit that referenced this issue May 10, 2018

Implement forward pass of GRU Cell - RFC #228

c2b9748

This was referenced May 10, 2018

[WIP] RNN - Implement fast Gated Recurrent Unit (GRU) #231

Merged

rnn training example? oneapi-src/oneDNN#218

Closed

mratsim added NLP Vision labels Aug 24, 2018

This was referenced Aug 31, 2018

Batch GEMM #101

Open

bidirectional RNNs #271

Open

[WIP] RNN fused GRU primitive #272

Merged

mratsim closed this as completed Sep 23, 2018

ngimel mentioned this issue Jan 9, 2019

Correct the docstring of GRU for update gate pytorch/pytorch#15875

Closed

mratsim mentioned this issue May 22, 2020

Image Classification example? #458

Open
