[MXNET-107]Fused GRU implementation for CPU #10311
Conversation
@sherry-zhang @yajiedesign @TaoLv
Looks good.
@piiswrong @yajiedesign I think we should start the review with #10104, since most of the design and implementation in this PR follows #10104, and #10104 has enabled all the RNN-related unit tests.
update my forked branch
… has nothing to do with this PR and will recover it once the issue is passed
@@ -93,6 +93,41 @@ def test_lstm_bidirectional():
    check_rnn_consistency(stack, fused, T, N, I, H)
    check_rnn_consistency(fused, stack, T, N, I, H)

@with_seed()
def test_gru_sym():
@szha @piiswrong May I know how to check the gradient of the weights and biases in the test case? It seems they are not verified in the check_rnn_consistency function.
You can get the gradient with block.weight.grad().
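For reference, a minimal sketch of that suggestion, assuming a Gluon GRU layer; the layer size and input shape below are illustrative rather than taken from the PR's test code:

```python
import mxnet as mx
from mxnet import autograd, gluon

# Illustrative sizes; input shape follows the default 'TNC' layout (seq_len, batch, input_size).
layer = gluon.rnn.GRU(hidden_size=50, num_layers=1)
layer.initialize()

x = mx.nd.random.uniform(shape=(10, 4, 20))
with autograd.record():
    out = layer(x)
out.backward()

# Every parameter (i2h/h2h weights and biases) exposes its gradient via .grad(),
# so gradient weights and biases can be compared between fused and stacked cells.
for name, param in layer.collect_params().items():
    print(name, param.grad().shape)
```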
@@ -0,0 +1,235 @@
# Licensed to the Apache Software Foundation (ASF) under one
Don't copy the whole file. Rename cudnn_lstm_bucketing.py to cudnn_rnn_bucketing.py and add a switch for the RNN mode instead.
fixed, thanks!
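A hypothetical sketch of such a mode switch; the flag names and defaults below are illustrative, not the actual interface of cudnn_rnn_bucketing.py:

```python
import argparse
import mxnet as mx

parser = argparse.ArgumentParser(description='RNN bucketing example')
# Hypothetical flag: select the fused RNN mode instead of hard-coding LSTM.
parser.add_argument('--mode', type=str, default='lstm',
                    choices=['lstm', 'gru', 'rnn_tanh', 'rnn_relu'],
                    help='cell type passed to the fused RNN cell')
parser.add_argument('--num-hidden', type=int, default=800)
parser.add_argument('--num-layers', type=int, default=3)
args = parser.parse_args()

# The rest of the script builds the cell from the selected mode.
cell = mx.rnn.FusedRNNCell(args.num_hidden, num_layers=args.num_layers, mode=args.mode)
```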
I missed one thing in the previous PRs: it looks like backward doesn't support req == kAddTo? It also looks like req[kParams] isn't even checked.
Please also check for kNullOp and skip the work when necessary.
@piiswrong, kNullOp checking is added. Please help review it. Thanks.
You need to check for kNullOp for kData and kState too. When req is kNullOp, nothing should be written to the corresponding output array; it's not enough to just skip filling it with 0.
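A minimal, self-contained sketch of the request semantics being asked for, using a simplified stand-in for MXNet's OpReqType (the names kNullOp/kWriteTo/kAddTo match the real enum, the rest is illustrative):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Simplified stand-in for MXNet's OpReqType (the real enum also has kWriteInplace).
enum OpReq { kNullOp, kWriteTo, kAddTo };

// Write a computed gradient into `out` honoring the request type. This is the
// pattern the review asks for: skip entirely on kNullOp, zero-fill only for
// kWriteTo, and accumulate without clearing for kAddTo.
void WriteGrad(OpReq req, const std::vector<float>& grad, std::vector<float>* out) {
  if (req == kNullOp) return;                                      // caller does not want this gradient
  if (req == kWriteTo) std::fill(out->begin(), out->end(), 0.0f);  // overwrite: clear first
  for (size_t i = 0; i < grad.size(); ++i) (*out)[i] += grad[i];   // accumulate into the buffer
}

int main() {
  std::vector<float> dw = {1.0f, 2.0f};    // pre-existing gradient, e.g. from another operator
  WriteGrad(kAddTo, {0.5f, 0.5f}, &dw);    // kAddTo must preserve the old contents
  std::printf("%.1f %.1f\n", dw[0], dw[1]);  // prints: 1.5 2.5
  return 0;
}
```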
src/operator/rnn-inl.h
Outdated
@@ -474,6 +495,9 @@ class RNNOp : public Operator{
    CHECK(dw.CheckContiguous());
    CHECK(dhx.CheckContiguous());
    CHECK(dy.CheckContiguous());
    if (req[rnn_enum::kParams] != kAddTo && req[rnn_enum::kParams] != kNullOp) {
Isn't the gradient still going to be overwritten by the backward kernel later?
src/operator/rnn_impl.h
Outdated
const int omp_threads = mxnet::engine::OpenMP::Get()->GetRecommendedOMPThreadCount();
#pragma omp parallel for num_threads(omp_threads)
for (int i = 0; i < D * H * 3 * H; ++i) {
  dwh[i] = 0;
This is still overwriting the gradient even when req[kParams] == kAddTo?
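A sketch of the guard this comment implies, reusing the names from the diff above; the variable `req_params` standing for the kParams request is hypothetical, and only the overwrite case should clear the buffer:

```cpp
// Illustrative guard: when req[kParams] is kAddTo, the existing contents of dwh
// must be preserved, so the OpenMP zero-fill loop runs only for kWriteTo.
if (req_params == kWriteTo) {
  const int omp_threads = mxnet::engine::OpenMP::Get()->GetRecommendedOMPThreadCount();
  #pragma omp parallel for num_threads(omp_threads)
  for (int i = 0; i < D * H * 3 * H; ++i) {
    dwh[i] = 0;
  }
}
```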
@ThomasDelteil, we will look into the issue.
* Add GRU Support and Test Case
* skip the gpu test case that has nothing to do with RNN GRU
* fix robust bug for gru backward
* fix bug for unifying weight parameter
* add GRU multiple layer and bidirection support with test case
* fix test case bug
* fix test case bug
* fix bug for memory issue
* fix bug for bidirection
* rebase code and fix bug for memory corruption issue
* fix gpu compile issue
* fix bug and enable some test cases
* fix robust bug
* trigger the build to check if quantize-gpu case is covered
* trigger the build to check if MKLDNN+GPU case is covered
* disable failed gpu test case of MKLDNN_UTIL_FUNC-MemFormat because it has nothing to do with this PR and will recover it once the issue is passed
* skip failed test_reduce test case temporarily as it has nothing to do with RNN
* enable several test cases
* retrigger the build
* rebase code from lstm
* rebase code for resolve conflict
* add gru code after resolve conflict
* fix bug for resolve conflict
* add Fused GRU code with test case
* retrigger the build
* add GetRecommendedOMPThreadCount for omp
* fix conflict issue
* add gru relate code
* fix bug for code
* update code for gru
* retrigger the build
* fix code about gru condition
* enhance test case to test gradient weights and bias
* fix bug for test case
* fix bug for test case
* fix bug about dropout condition and test case
* fix bug for test case
* fix bug for test case
* retrigger the build
* rebase code
* add gru code
* fix issues about namespace, removing define and memcpy
* retrigger the build
* fix issues and add cudnn_gru_bucketing.py test case
* retrigger the build
* update cudnn_rnn_bucketing.py test case
* update cudnn_rnn_bucketing.py test case
* update cudnn_rnn_bucketing.py test case
* add check for req[kParams] and kAddTo from cudnn_rnn-inl.h
* retrigger the build
* retrigger the build
* retrigger the build
* add kNullOp check
* retrigger the build
* update kNullOp support and test case for both GRU and LSTM
* update kAddToOp support for both GRU and LSTM
Description
This PR follows the operator registration approach of #10104 and adds a fused GRU operator for CPU.
@pengzhao-intel, @TaoLv , @sherry-zhang
Feature changes
New features
Unit-test changes
Performance
We tested the performance of sym.RNN and rnn.GRUCell on a local Skylake-6148 machine with 2 sockets and 40 cores, using MKL as the BLAS library.
The test input sizes come from the DS2 default parameters (seq_length = 300, batch_size = 20, input_size = 800, hidden_size = 800). A benchmark sketch is shown after the configurations below.
Layer=1 bidirectional = False
Layer=5 bidirectional = True
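A minimal sketch of such a benchmark, assuming the symbolic mx.rnn cell API and the fused CPU GRU kernel added by this PR series; the iteration count and comparison setup are illustrative, not the exact script behind the numbers above:

```python
import time
import mxnet as mx

# DS2-style sizes from the description above.
seq_length, batch_size, input_size, hidden_size = 300, 20, 800, 800

def benchmark(cell, name, n_iter=10):
    data = mx.sym.Variable('data')
    output, _ = cell.unroll(seq_length, inputs=data, layout='NTC', merge_outputs=True)
    loss = mx.sym.MakeLoss(output)
    exe = loss.simple_bind(mx.cpu(), data=(batch_size, seq_length, input_size))
    x = mx.nd.random.uniform(shape=(batch_size, seq_length, input_size))
    exe.forward(is_train=True, data=x)   # warm-up
    exe.backward()
    mx.nd.waitall()
    start = time.time()
    for _ in range(n_iter):
        exe.forward(is_train=True, data=x)
        exe.backward()
    mx.nd.waitall()
    print('%s: %.1f ms per fwd+bwd' % (name, (time.time() - start) * 1000 / n_iter))

# Fused GRU (lowers to sym.RNN with mode='gru') vs. an explicitly unrolled GRUCell.
benchmark(mx.rnn.FusedRNNCell(hidden_size, num_layers=1, mode='gru', prefix='fused_'),
          'fused sym.RNN GRU')
benchmark(mx.rnn.GRUCell(hidden_size, prefix='gru_'), 'unrolled rnn.GRUCell')
```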
Convergence Curve
We tested the convergence of the fused GRU on a CPU Skylake-8180 machine with 2 sockets and 56 cores, and on a P100 GPU, using example/rnn/bucketing/cudnn_rnn_bucketing.py.
The test configuration is: layers = 3, batch_size = 32, num-embed = 800, num-hidden = 800, num-epochs = 20.
Checklist