
[MXNET-107] Add Fused Vanilla RNN and dropout for CPU #11399

Merged · 1 commit merged into apache:master on Jun 26, 2018

Conversation

@lihaofd (Contributor) commented on Jun 26, 2018

Description

This PR adds a fused vanilla RNN (tanh/relu) operator for CPU, along with dropout support for the fused GRU/LSTM/vRNN operators.
@pengzhao-intel, @TaoLv

Feature changes

New features

  • Single-layer/multi-layer and unidirectional/bidirectional vanilla RNN (tanh/relu), including both forward and backward computation (a usage sketch follows this list).
  • Dropout support for GRU/LSTM/vRNN.
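
For illustration, here is a minimal usage sketch through the existing mx.rnn.FusedRNNCell front end (assuming the MXNet 1.x symbolic API; shapes follow the benchmark settings below, and this code is not taken from the PR itself):

```python
import mxnet as mx

seq_len, batch_size, input_size, hidden_size = 300, 20, 800, 800

data = mx.sym.Variable('data')  # layout TNC: (seq_len, batch, feature)
cell = mx.rnn.FusedRNNCell(hidden_size, num_layers=1, mode='rnn_tanh',
                           bidirectional=True, prefix='rnn_')
out, _ = cell.unroll(seq_len, inputs=data, layout='TNC', merge_outputs=True)

exe = out.simple_bind(mx.cpu(), data=(seq_len, batch_size, input_size))
exe.forward(is_train=False,
            data=mx.nd.random.uniform(shape=(seq_len, batch_size, input_size)))
# bidirectional output concatenates both directions: (300, 20, 1600)
print(exe.outputs[0].shape)
```

mode='rnn_relu' selects the relu variant; before this change, the linked issue (#10870) tracked the missing CPU support for these two vanilla modes.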

Unit-test changes

  • Add new test cases in tests/python/unittest/test_operator.py.
  • Update the test cases in example/rnn/bucketing/cudnn_rnn_bucketing.py.
  • Check consistency with the original RNNCell implementation (sketched below).
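
The consistency check follows the usual unroll-and-compare pattern. A hedged sketch of the idea (the authoritative version is the new test case in test_operator.py):

```python
import numpy as np
import mxnet as mx

# unfuse() expands the fused cell into an equivalent stack of plain RNNCells.
T, N, I, H = 5, 4, 8, 8
data = mx.sym.Variable('data')

fused = mx.rnn.FusedRNNCell(H, num_layers=2, mode='rnn_tanh', prefix='rnn_')
stack = fused.unfuse()

y1, _ = fused.unroll(T, data, layout='NTC', merge_outputs=True)
y2, _ = stack.unroll(T, data, layout='NTC', merge_outputs=True)

mod1 = mx.mod.Module(y1, label_names=None, context=mx.cpu())
mod2 = mx.mod.Module(y2, label_names=None, context=mx.cpu())
mod1.bind(data_shapes=[('data', (N, T, I))], label_shapes=None)
mod2.bind(data_shapes=[('data', (N, T, I))], label_shapes=None)

# Initialize the fused module, then translate its single packed weight blob
# into the per-gate weights the non-fused stack expects.
mod1.init_params()
args, auxs = mod1.get_params()
args = stack.pack_weights(fused.unpack_weights(args))
mod2.set_params(args, auxs)

batch = mx.io.DataBatch(data=[mx.nd.random.uniform(shape=(N, T, I))], label=[])
mod1.forward(batch, is_train=False)
mod2.forward(batch, is_train=False)
np.testing.assert_allclose(mod1.get_outputs()[0].asnumpy(),
                           mod2.get_outputs()[0].asnumpy(),
                           rtol=1e-3, atol=1e-4)
```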

Performance

We benchmarked FusedRNN against the non-fused RNNCell on a local Skylake-8180 machine (2 sockets, 56 cores), using MKL as the BLAS library.
The input sizes follow the DS2 default parameters (seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800).
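
For context, throughput numbers of this kind can be collected with a simple timing loop; the following is a sketch under assumed settings, not the benchmark script actually used:

```python
import time
import mxnet as mx

T, N, I, H = 300, 20, 800, 800  # DS2-style sizes from above

data = mx.sym.Variable('data')
cell = mx.rnn.FusedRNNCell(H, num_layers=1, mode='rnn_tanh', prefix='rnn_')
out, _ = cell.unroll(T, inputs=data, layout='TNC', merge_outputs=True)

exe = out.simple_bind(mx.cpu(), data=(T, N, I))
x = mx.nd.random.uniform(shape=(T, N, I))

for _ in range(5):                      # warm-up
    exe.forward(is_train=False, data=x)
mx.nd.waitall()

iters = 20
start = time.time()
for _ in range(iters):
    exe.forward(is_train=False, data=x)
mx.nd.waitall()                         # wait for async execution to finish
print('samples/sec: %.2f' % (iters * N / (time.time() - start)))
```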

Layer = 1, bidirectional = False

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell, non-fused (Tanh, CPU) | 492.61 | 198.02 |
| this PR, FusedRNN (Tanh, CPU) | 952.38 | 318.98 |
| speedup | 1.93x | 1.61x |

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell, non-fused (Relu, CPU) | 277.78 | 104.17 |
| this PR, FusedRNN (Relu, CPU) | 740.74 | 177 |
| speedup | 2.67x | 1.7x |

Layer = 5, bidirectional = True

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell, non-fused (Tanh, CPU) | 38.91 | 22.73 |
| rnn.RNNCell (Tanh, cuda) | 47.85 | 26.95 |
| rnn.RNNCell (Tanh, cudnn) | 208.33 | 81.63 |
| this PR, FusedRNN (Tanh, CPU) | 104.17 | 34.01 |
| speedup, this PR vs RNNCell (Tanh, CPU) | 267.7% | 149.7% |
| speedup, this PR vs RNNCell (Tanh, cuda) | 217.7% | 126.2% |
| speedup, this PR vs RNNCell (Tanh, cudnn) | 50% | 41.7% |

| API | Inference throughput (fwd, samples/sec) | Training throughput (fwd + bwd, samples/sec) |
| --- | --- | --- |
| rnn.RNNCell, non-fused (Relu, CPU) | 40.73 | 22.6 |
| rnn.RNNCell (Relu, cuda) | 52.91 | 26.81 |
| rnn.RNNCell (Relu, cudnn) | 206.83 | 82.64 |
| this PR, FusedRNN (Relu, CPU) | 134.23 | 35.97 |
| speedup, this PR vs RNNCell (Relu, CPU) | 329.5% | 159.2% |
| speedup, this PR vs RNNCell (Relu, cuda) | 253.7% | 134.2% |
| speedup, this PR vs RNNCell (Relu, cudnn) | 64.9% | 43.5% |

Convergence Curves

We tested the convergence of fused GRU/LSTM (dropout = 0.5) on CPU (Skylake-8180, 2 sockets, 56 cores) and GPU (P100) using example/rnn/bucketing/cudnn_rnn_bucketing.py.
Test configuration: layers = 3, batch_size = 32, num-embed = 800, num-hidden = 800, num-epochs = 20.
[Convergence curves attached: gru_dropout, lstm_dropout]
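
The dropout path exercised by these runs corresponds to building the fused cell with a non-zero dropout between stacked layers. A minimal sketch (the bucketing script itself wires this through its command-line flags; the sequence length of 35 here is an arbitrary illustrative bucket size):

```python
import mxnet as mx

# 3-layer fused LSTM with dropout = 0.5 applied between stacked layers,
# mirroring the convergence-test configuration above (num-hidden = 800).
data = mx.sym.Variable('data')
cell = mx.rnn.FusedRNNCell(800, num_layers=3, mode='lstm',
                           dropout=0.5, prefix='lstm_')
outputs, states = cell.unroll(35, inputs=data, layout='TNC',
                              merge_outputs=True)
```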

@szha: resolves #10870, #10872

@lihaofd lihaofd requested a review from szha as a code owner June 26, 2018 00:41
@szha szha self-assigned this Jun 26, 2018
@TaoLv (Member) commented on Jun 26, 2018

Please remove [WIP] from the title and add the JIRA number to it. https://issues.apache.org/jira/browse/MXNET-107

@lihaofd lihaofd changed the title [WIP] Add Fused Vanilla RNN and dropout [MXNET-107] Add Fused Vanilla RNN and dropout Jun 26, 2018
@lihaofd lihaofd changed the title [MXNET-107] Add Fused Vanilla RNN and dropout [MXNET-107] Add Fused Vanilla RNN and dropout for CPU Jun 26, 2018
@piiswrong piiswrong merged commit 0538ad9 into apache:master Jun 26, 2018
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
Development

Successfully merging this pull request may close these issues:

  • RNN operator should support rnn_tanh and rnn_relu mode on CPU