[MXNET-107]Fused GRU implementation for CPU #10311
Conversation
@sherry-zhang @yajiedesign @TaoLv
Looks good.
@piiswrong @yajiedesign I think we should start the review with #10104, since most of the design and implementation in this PR follows #10104, and #10104 has enabled all the RNN-related unit tests.
update my forked branch
… has nothing to do with this PR and will recover it once the issue is passed
@@ -93,6 +93,41 @@ def test_lstm_bidirectional():
    check_rnn_consistency(stack, fused, T, N, I, H)
    check_rnn_consistency(fused, stack, T, N, I, H)

@with_seed()
def test_gru_sym():
@szha @piiswrong May I know how to check the gradient of the weights and biases in the test case? It seems they are not verified in the check_rnn_consistency function.
You can get the gradient with block.weight.grad().
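For reference, a minimal sketch of that suggestion, assuming a Gluon GRU layer; the layer size and input shape below are illustrative rather than taken from the PR's test code:

```python
import mxnet as mx
from mxnet import autograd, gluon

# Illustrative sizes; input shape follows the default 'TNC' layout (seq_len, batch, input_size).
layer = gluon.rnn.GRU(hidden_size=50, num_layers=1)
layer.initialize()

x = mx.nd.random.uniform(shape=(10, 4, 20))
with autograd.record():
    out = layer(x)
out.backward()

# Every parameter (i2h/h2h weights and biases) exposes its gradient via .grad(),
# so gradient weights and biases can be compared between fused and stacked cells.
for name, param in layer.collect_params().items():
    print(name, param.grad().shape)
```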
@@ -0,0 +1,235 @@
# Licensed to the Apache Software Foundation (ASF) under one
Don't copy the whole file. Rename cudnn_lstm_bucketing.py to cudnn_rnn_bucketing.py and add a switch for the RNN mode instead.
fixed, thanks!
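A hypothetical sketch of such a mode switch; the flag names and defaults below are illustrative, not the actual interface of cudnn_rnn_bucketing.py:

```python
import argparse
import mxnet as mx

parser = argparse.ArgumentParser(description='RNN bucketing example')
# Hypothetical flag: select the fused RNN mode instead of hard-coding LSTM.
parser.add_argument('--mode', type=str, default='lstm',
                    choices=['lstm', 'gru', 'rnn_tanh', 'rnn_relu'],
                    help='cell type passed to the fused RNN cell')
parser.add_argument('--num-hidden', type=int, default=800)
parser.add_argument('--num-layers', type=int, default=3)
args = parser.parse_args()

# The rest of the script builds the cell from the selected mode.
cell = mx.rnn.FusedRNNCell(args.num_hidden, num_layers=args.num_layers, mode=args.mode)
```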
I missed one thing in the previous PRs: it looks like backward doesn't support req == kAddTo? It also looks like req[kParams] isn't even checked.
Please also check for kNullOp and skip the work when necessary.
@piiswrong, kNullOp checking is added. Please help review it. Thanks.
You need to check for kNullOp for kData and kState too. When req is kNullOp, nothing should be written to the corresponding output array; it's not enough to just skip filling it with 0.
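A minimal, self-contained sketch of the request semantics being asked for, using a simplified stand-in for MXNet's OpReqType (the names kNullOp/kWriteTo/kAddTo match the real enum, the rest is illustrative):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Simplified stand-in for MXNet's OpReqType (the real enum also has kWriteInplace).
enum OpReq { kNullOp, kWriteTo, kAddTo };

// Write a computed gradient into `out` honoring the request type. This is the
// pattern the review asks for: skip entirely on kNullOp, zero-fill only for
// kWriteTo, and accumulate without clearing for kAddTo.
void WriteGrad(OpReq req, const std::vector<float>& grad, std::vector<float>* out) {
  if (req == kNullOp) return;                                      // caller does not want this gradient
  if (req == kWriteTo) std::fill(out->begin(), out->end(), 0.0f);  // overwrite: clear first
  for (size_t i = 0; i < grad.size(); ++i) (*out)[i] += grad[i];   // accumulate into the buffer
}

int main() {
  std::vector<float> dw = {1.0f, 2.0f};    // pre-existing gradient, e.g. from another operator
  WriteGrad(kAddTo, {0.5f, 0.5f}, &dw);    // kAddTo must preserve the old contents
  std::printf("%.1f %.1f\n", dw[0], dw[1]);  // prints: 1.5 2.5
  return 0;
}
```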
src/operator/rnn-inl.h
Outdated
@@ -474,6 +495,9 @@ class RNNOp : public Operator{
    CHECK(dw.CheckContiguous());
    CHECK(dhx.CheckContiguous());
    CHECK(dy.CheckContiguous());
    if (req[rnn_enum::kParams] != kAddTo && req[rnn_enum::kParams] != kNullOp) {
Isn't the gradient still going to be overwritten by the backward kernel later?
src/operator/rnn_impl.h
Outdated
const int omp_threads = mxnet::engine::OpenMP::Get()->GetRecommendedOMPThreadCount();
#pragma omp parallel for num_threads(omp_threads)
for (int i = 0; i < D * H * 3 * H; ++i) {
  dwh[i] = 0;
This is still overwriting the gradient even when req[kParams] == kAddTo?
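A sketch of the guard this comment implies, reusing the names from the diff above; the variable `req_params` standing for the kParams request is hypothetical, and only the overwrite case should clear the buffer:

```cpp
// Illustrative guard: when req[kParams] is kAddTo, the existing contents of dwh
// must be preserved, so the OpenMP zero-fill loop runs only for kWriteTo.
if (req_params == kWriteTo) {
  const int omp_threads = mxnet::engine::OpenMP::Get()->GetRecommendedOMPThreadCount();
  #pragma omp parallel for num_threads(omp_threads)
  for (int i = 0; i < D * H * 3 * H; ++i) {
    dwh[i] = 0;
  }
}
```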
@ThomasDelteil, we will look into the issue.
* Add GRU Support and Test Case
* skip the gpu test case that has nothing to do with RNN GRU
* fix robust bug for gru backward
* fix bug for unifying weight parameter
* add GRU multiple layer and bidirection support with test case
* fix test case bug
* fix test case bug
* fix bug for memory issue
* fix bug for bidirection
* rebase code and fix bug for memory corruption issue
* fix gpu compile issue
* fix bug and enable some test cases
* fix robust bug
* trigger the build to check if quantize-gpu case is covered
* trigger the build to check if MKLDNN+GPU case is covered
* disable failed gpu test case of MKLDNN_UTIL_FUNC-MemFormat because it has nothing to do with this PR and will recover it once the issue is passed
* skip failed test_reduce test case temporarily as it has nothing to do with RNN
* enable several test cases
* retrigger the build
* rebase code from lstm
* rebase code for resolve conflict
* add gru code after resolve conflict
* fix bug for resolve conflict
* add Fused GRU code with test case
* retrigger the build
* add GetRecommendedOMPThreadCount for omp
* fix conflict issue
* add gru relate code
* fix bug for code
* update code for gru
* retrigger the build
* fix code about gru condition
* enhance test case to test gradient weights and bias
* fix bug for test case
* fix bug for test case
* fix bug about dropout condition and test case
* fix bug for test case
* fix bug for test case
* retrigger the build
* rebase code
* add gru code
* fix issues about namespace, removing define and memcpy
* retrigger the build
* fix issues and add cudnn_gru_bucketing.py test case
* retrigger the build
* update cudnn_rnn_bucketing.py test case
* update cudnn_rnn_bucketing.py test case
* update cudnn_rnn_bucketing.py test case
* add check for req[kParams] and kAddTo from cudnn_rnn-inl.h
* retrigger the build
* retrigger the build
* retrigger the build
* add kNullOp check
* retrigger the build
* update kNullOp support and test case for both GRU and LSTM
* update kAddToOp support for both GRU and LSTM
Description
This PR follows the operator registration approach of #10104 and adds a fused GRU operator for CPU.
@pengzhao-intel, @TaoLv , @sherry-zhang
Feature changes
New features
Unit-test changes
Performance
We tested the performance of sym.RNN and rnn.GRUCell on a local Skylake-6148 machine with 2 sockets and 40 cores, using MKL as the BLAS library.
The test input sizes come from the DS2 default parameters (seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800). A benchmark sketch is shown after the configurations below.
Layer=1 bidirectional = False
Layer=5 bidirectional = True
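A minimal sketch of such a benchmark, assuming the symbolic mx.rnn cell API and the fused CPU GRU kernel added by this PR series; the iteration count and comparison setup are illustrative, not the exact script behind the numbers above:

```python
import time
import mxnet as mx

# DS2-style sizes from the description above.
seq_length, batch_size, input_size, hidden_size = 300, 20, 800, 800

def benchmark(cell, name, n_iter=10):
    data = mx.sym.Variable('data')
    output, _ = cell.unroll(seq_length, inputs=data, layout='NTC', merge_outputs=True)
    loss = mx.sym.MakeLoss(output)
    exe = loss.simple_bind(mx.cpu(), data=(batch_size, seq_length, input_size))
    x = mx.nd.random.uniform(shape=(batch_size, seq_length, input_size))
    exe.forward(is_train=True, data=x)   # warm-up
    exe.backward()
    mx.nd.waitall()
    start = time.time()
    for _ in range(n_iter):
        exe.forward(is_train=True, data=x)
        exe.backward()
    mx.nd.waitall()
    print('%s: %.1f ms per fwd+bwd' % (name, (time.time() - start) * 1000 / n_iter))

# Fused GRU (lowers to sym.RNN with mode='gru') vs. an explicitly unrolled GRUCell.
benchmark(mx.rnn.FusedRNNCell(hidden_size, num_layers=1, mode='gru', prefix='fused_'),
          'fused sym.RNN GRU')
benchmark(mx.rnn.GRUCell(hidden_size, prefix='gru_'), 'unrolled rnn.GRUCell')
```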
Convergence Curve
We tested the convergence of the fused GRU on a CPU Skylake-8180 machine with 2 sockets and 56 cores, and on a P100 GPU, using example/rnn/bucketing/cudnn_rnn_bucketing.py.
The test configuration is: layers = 3, batch_size = 32, num-embed = 800, num-hidden = 800, num-epochs = 20.
Checklist