
Cpu lstm inference #9977

Merged: 26 commits merged into apache:master from the cpu-lstm branch on Mar 10, 2018

Conversation

@Jerryzcn (Contributor) commented Mar 3, 2018

Description

CPU LSTM inference kernel. This is around 9.5x faster than the gluon LSTM cell.

Verified on a speech recognition task, as well as in the unit tests.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@pengzhao-intel (Contributor):

@Jerryzcn It's great to explore the full CPU power for the RNN cell.

FYI, we have implemented a fused LSTM op for CPU locally, including both inference and training, and the new op is registered with NNVM.

I think we can cooperate to merge the code. @TaoLv @sherry-zhang
We can open a PR with our code against your repo so that the whole LSTM solution can be ready at the same time.

What's your opinion?

@Jerryzcn (Contributor, Author) commented Mar 3, 2018

@pengzhao-intel This is great; we can definitely collaborate. The reason I am sending this PR is for one of our own projects, and we would like to have something to use ASAP. Do you have a timeline for the fused LSTM op?

@reminisce (Contributor):

It's great to have the CPU version implemented. We are deprecating operator implementations that use the legacy interface. It would be better if you could refactor the code to use the NNVM interface for the operator implementation. One example is https://github.com/apache/incubator-mxnet/blob/master/src/operator/nn/convolution-inl.h#L155

@pengzhao-intel (Contributor):

@Jerryzcn Thanks for the info. So, I think it's better to merge this PR as-is.

Our LSTM/GRU will be ready this month and we will submit the code separately for review :)

@szha (Member) commented Mar 4, 2018

@pengzhao-intel will the Elman RNN be part of your PR? cuDNN currently supports it, so we support it as part of the RNN layers.

@TaoLv (Member) commented Mar 5, 2018

@szha To make it easier to review, we would like to split the whole RNN implementation on CPU into several PRs. First, we will submit code for single-layer, unidirectional LSTM/GRU. Then, multi-layer and bidirectional support will be added for LSTM/GRU. Vanilla RNN (maybe Elman RNN in your words) will be supported after we finish LSTM/GRU. Actually, we have implemented a fused vanilla RNN, but I think integrating it into mxnet should be a lower priority compared with LSTM/GRU.

@szha What is your opinion? We can set a detailed plan for these PRs if needed.
@pengzhao-intel Correct me if I missed anything.

@szha (Member) commented Mar 5, 2018

@TaoLv Sounds good. What timeline are we looking at for feature parity with cuDNN?

@TaoLv (Member) commented Mar 5, 2018

@szha The team needs to have an internal discussion about it and will get back to you with a detailed plan soon.
BTW, do you know anybody who can help refactor the existing RNN operator with the NNVM interfaces? It seems that some of the GPU code needs to be changed for that.

@szha (Member) commented Mar 5, 2018

Pinging @piiswrong for coordination.

@piiswrong (Contributor):

We don't have to do the refactor now. CPU support is more important.

@@ -214,6 +214,7 @@ def __iter__(self):
worker.start()
workers.append(worker)

idx = -1
Member:

?

Contributor Author:

idx might be referenced before assignment when I run pylint.

@CodingCat (Contributor):

Hi, the community has passed a vote on associating code changes with JIRA (https://lists.apache.org/thread.html/ab22cf0e35f1bce2c3bf3bec2bc5b85a9583a3fe7fd56ba1bbade55f@%3Cdev.mxnet.apache.org%3E).

We have updated the guidelines for contributors at https://cwiki.apache.org/confluence/display/MXNET/Development+Process. Please ensure that you have created a JIRA issue at https://issues.apache.org/jira/projects/MXNET/issues/ to describe your work in this pull request, and include the JIRA title in your PR as "[MXNET-xxxx] your title", where MXNET-xxxx is the JIRA id.

Thanks!

size *= 2;
} else {
size += (layerNum - 1) * rnn_single_param_size(hiddenSize, hiddenSize, mode);
size += (layerNum - 1) * rnn_single_param_size(hiddenSize, hiddenSize,
mode);
Contributor:

You are just reformatting the code here?

Contributor Author:

Yes.

CHECK_EQ(y.CheckContiguous(), true);

if (ctx.is_train)
LOG(FATAL) << "only inference mode is available for cpu at the moment.";
Contributor:

You can do CHECK(!ctx.is_train) << "..."

in_data[rnn_enum::kState], out_data[rnn_enum::kOut], out_grad[rnn_enum::kOut]};
std::vector<int> dep = {in_data[rnn_enum::kData],
in_data[rnn_enum::kParams], in_data[rnn_enum::kState],
out_data[rnn_enum::kOut], out_grad[rnn_enum::kOut]};
Contributor:

I'm not sure why you want to change the code in this function. It seems you just reorganize the code a little bit.

@Jerryzcn (Contributor, Author) Mar 7, 2018:

It exceeds the 80-character-per-line limit.

Contributor:

The coding style in mxnet allows up to 100 characters per line, so the original code is fine.

if (param_.mode == rnn_enum::kLstm)
param_.lstm_q_ = true;
else
param_.lstm_q_ = false;
Contributor:

It seems this check can be merged into the switch-case statement above.
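
For illustration, a minimal sketch of what that merge might look like, assuming the preceding switch dispatches on param_.mode (the other case labels and their bodies here are placeholders, not the actual code):

// Hypothetical sketch: set lstm_q_ inside the existing switch on param_.mode
// instead of in a separate if/else afterwards.
switch (param_.mode) {
  case rnn_enum::kLstm:
    // ... existing kLstm handling ...
    param_.lstm_q_ = true;
    break;
  default:
    // ... existing handling for the other RNN modes ...
    param_.lstm_q_ = false;
    break;
}

Equivalently, a single assignment param_.lstm_q_ = (param_.mode == rnn_enum::kLstm); would also avoid the separate if/else.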

@szha self-assigned this Mar 8, 2018
@@ -114,7 +120,8 @@ struct RNNParam : public dmlc::Parameter<RNNParam> {

DMLC_DECLARE_FIELD(p).set_default(0.)
.set_range(0, 1)
.describe("Dropout probability, fraction of the input that gets dropped out at training time");
.describe("Dropout probability, fraction of the input that gets dropped"
"out at training time");
Member:

Remove this change. The length of this line is less than 100.
BTW, why do some parameters still not have descriptions, like pkeep_ and lstm_q_?

Contributor:

pkeep_ and lstm_q_ are used in cudnn_rnn-inl.h.

CHECK_EQ(y.CheckContiguous(), true);

CHECK(!ctx.is_train) << "only inference mode is available"
"for cpu at the moment.";
Member:

How about checking this at the front of this function?

model = mx.gluon.nn.Sequential()
with model.name_scope():
model.add(mx.gluon.rnn.LSTM(2, num_layers=6, bidirectional=True))
model.initialize(mx.init.One())
Member:

Could you also test the consistency between CPU and GPU, with the same random weights and random inputs?

@Jerryzcn (Contributor, Author) Mar 9, 2018:

Will it break the CPU tests? It might be too much of an effort.

Member:

Under the current input and weights, your test would still pass even if the weights were iterated backwards. Unfortunately it's not in an acceptable state.

CHECK_EQ(x.CheckContiguous(), true);
CHECK_EQ(w.CheckContiguous(), true);
CHECK_EQ(hx.CheckContiguous(), true);
CHECK_EQ(y.CheckContiguous(), true);
Member:

CHECK(x.CheckContiguous());
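
For completeness, a sketch of the same simplification applied to all four checks quoted above; dmlc's CHECK macro already reports the failed condition, so the explicit comparison with true adds nothing:

CHECK(x.CheckContiguous());
CHECK(w.CheckContiguous());
CHECK(hx.CheckContiguous());
CHECK(y.CheckContiguous());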

private:
RNNParam param_;

virtual void LSTMFusedElementWiseCPUOps(const Tensor<cpu, 2, DType> &i2h_y,
Contributor:

Why virtual?

int64_t f = i + h_channel;
int64_t c = i + h_channel * 2;
int64_t o = i + h_channel * 3;
h2h_y[j][i] += i2h_y[j][i];
Contributor:

Too many overloaded operator[] calls and temporary Tensor objects are generated here. At least you can cache h2h_y[j], i2h_y[j], etc. for each loop iteration.

@Jerryzcn (Contributor, Author) Mar 9, 2018:

Tried it, but I did not notice any difference in runtime. I think the indexing probably does not generate a new tensor object here; multiple [] are probably implemented as a single dereference operation rather than multiple ones. I suspect that assigning it to a local variable will actually use one of the registers to hold the pointer to the object, which may actually slow down the process.

Contributor:

When you use data[i], where data is a 2D tensor, it returns a temporary 1D Tensor object and then calls that 1D tensor's operator[]. You would not notice much runtime improvement after making the change unless the program spends a long time in this loop, and the improvement could be dwarfed by other factors that are the major bottlenecks. At least, it's not good practice to write C++ code like this.

Contributor Author:

Okay, but it seems that inside mshadow all the ops are implemented using multiple []:
https://github.com/dmlc/mshadow/blob/master/mshadow/tensor_cpu-inl.h#L380
I will probably access dptr_ directly.

Contributor:

I don't have a strong opinion on this. If you could use dptr_, that's the best for performance because it saves function calls and temporary tensor object creation, but it could hurt code readability and defeat the purpose of OO.

I think the rule of thumb here is to avoid temporary tensor creation and destruction while keeping the code readable. So it's okay to use operator[] on a 1D Tensor, since it only returns values, and to cache the temporary tensor created by calling operator[] on a 2D tensor.
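
A minimal sketch of the row-caching pattern being suggested, reusing the names from the reviewed snippet (i2h_y, h2h_y, h_channel, batch index j); the surrounding LSTM kernel and the rest of the gate math are assumed, not shown:

// Cache the 1D rows once per batch element j instead of invoking the 2D
// operator[] (which builds a temporary 1D Tensor) for every scalar access.
Tensor<cpu, 1, DType> i2h_row = i2h_y[j];
Tensor<cpu, 1, DType> h2h_row = h2h_y[j];
for (int64_t i = 0; i < h_channel; ++i) {
  int64_t f = i + h_channel;      // forget-gate offset
  int64_t c = i + h_channel * 2;  // cell-candidate offset
  int64_t o = i + h_channel * 3;  // output-gate offset
  h2h_row[i] += i2h_row[i];       // operator[] on a 1D Tensor returns a scalar reference
  // ... the remaining gate computations (f, c, o) would index the cached rows the same way ...
}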

model = mx.gluon.nn.Sequential()
with model.name_scope():
model.add(mx.gluon.rnn.LSTM(2, num_layers=6, bidirectional=True))
model.initialize(mx.init.One())
Contributor:

model.initialize(mx.init.One())
y = model(x).asnumpy()

mx.test_utils.assert_almost_equal(y, np.array([[[0.72045636, 0.72045636, 0.95215213, 0.95215213],
Contributor:

Why are there hardcoded numbers?

@Jerryzcn (Contributor, Author) Mar 9, 2018:

Moved the hardcoded numbers to a constant.
+ (layer - num_dir) * (h2h_w_size * num_dir + h2h_w_size));
Tensor<cpu, 2, DType> i2h_w(w.Slice(start, start + (layer < num_dir ?
(in_channel * fused_h_ch) : num_dir * h2h_w_size)).dptr_,
i2h_w_shape);
Contributor:

Why slice? I think w.dptr_ + start is the same as w.Slice(start, start + (layer < num_dir ? (in_channel * fused_h_ch) : num_dir * h2h_w_size)).dptr_.
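
A hedged sketch of that suggestion, assuming w is a contiguous 1D weight Tensor (as the scalar Slice bounds imply), so the slice's data pointer is just an offset into w:

// Equivalent view built without the intermediate Slice object:
Tensor<cpu, 2, DType> i2h_w(w.dptr_ + start, i2h_w_shape);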

int64_t ji;
#pragma omp parallel for private(ji)
for (ji = 0; ji < batch_size * h_channel; ji++) {
int64_t j = ji / h_channel; // batch dim
Contributor:

Is it OK to write batch_size * h_channel in the condition expression? It will be recomputed on every iteration.
Also, ++ji is better than ji++.

Contributor Author:

You mean move it out of the condition expression?

int64_t ji;
#pragma omp parallel for private(ji)
for (ji = 0; ji < batch_size * h_channel; ji++) {
int64_t j = ji / h_channel; // batch dim
Contributor:

No need to make ji private if you define it inside the for loop, like this:

#pragma omp parallel for
for (int64_t ji = 0; ... ; ...)
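
Putting both review suggestions together (hoist the loop bound out of the condition, declare the index in the for statement), a hedged sketch of what the loop could look like; the body is only indicative of the reviewed element-wise code, not the actual kernel:

// Compute the bound once, declare the index in the loop header, and let
// OpenMP parallelize over the flattened (batch, channel) range.
const int64_t total = batch_size * h_channel;
#pragma omp parallel for
for (int64_t ji = 0; ji < total; ++ji) {
  const int64_t j = ji / h_channel;  // batch dim
  const int64_t i = ji % h_channel;  // hidden-channel dim (assumed from context)
  // ... element-wise LSTM gate updates for (j, i) ...
}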

@szha (Member) left a comment:

Please add consistency tests between CPU and GPU (cuDNN) using random weights and random inputs, with dropout off.

@Jerryzcn (Contributor, Author) commented Mar 9, 2018

@szha I think https://github.com/apache/incubator-mxnet/blob/master/tests/python/gpu/test_operator_gpu.py#L1527 checks for consistency, although the inputs are ones.

@szha merged commit 13ae4d1 into apache:master Mar 10, 2018
@szha (Member) commented Mar 10, 2018

Since this change is only useful for inference, the RNN layer still needs to remain a Block. Once the backward pass is in place, we will be able to change it to a HybridBlock.

@Jerryzcn deleted the cpu-lstm branch March 10, 2018 09:15
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
* fix autograd import path

* cpu lstm working

* remove fatal log

* add simple unittest
remove redundant log
enable openmp

* fused input2hidden gemm

* fix lint

* fix pylint

* fix windows build error

* fix gluon rnn interface

* Update dataloader.py

* address cr

* address cr

* fix import

* revert some cosmetic change

* fix typo

* remove newline

* rm virtual
mv hardcoded number to constant

* address cr
add tests

* simplify test

* fix test

* fix tests

* change magic number scope
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018