Optimize AddTakeGrad Tensor Sum #17906

ElaineBao · 2020-03-25T13:07:15Z

Description

AddTakeGrad is used in the backward pass of the embedding operator. It originally used tensor-level summation, which is very slow. Replacing the tensor-level summation with element-wise summation makes this function faster (about a 6X speedup on a dummy example).
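To illustrate the two summation styles, here is a minimal NumPy sketch. It is not the mshadow code touched by this PR. The function names `add_take_grad_rowwise` and `add_take_grad_elementwise` are hypothetical, and the sketch assumes the operation reduces to `dst[index[y]] += src[y]` for every row `y`, with repeated indices accumulating. It only shows that both styles produce the same gradient. In pure Python the element-wise loop is the slower one, so the sketch says nothing about the C++ speedup.

```python
import numpy as np


def add_take_grad_rowwise(dst, index, src):
    """Accumulate one whole row per step (stand-in for tensor-level summation)."""
    for y in range(index.shape[0]):
        dst[index[y]] += src[y]
    return dst


def add_take_grad_elementwise(dst, index, src):
    """Accumulate one scalar per step (stand-in for element-wise summation)."""
    for y in range(index.shape[0]):
        row = index[y]
        for x in range(src.shape[1]):
            dst[row, x] += src[y, x]
    return dst


# Index 2 appears twice, so rows 0 and 2 of src both accumulate into dst[2].
index = np.array([2, 0, 2])
src = np.arange(12, dtype=np.float32).reshape(3, 4)

by_row = add_take_grad_rowwise(np.zeros((3, 4), dtype=np.float32), index, src)
by_elem = add_take_grad_elementwise(np.zeros((3, 4), dtype=np.float32), index, src)
print(np.array_equal(by_row, by_elem))
```

Both functions update `dst` in place and return it, so a caller can pass an existing gradient buffer to accumulate into.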

@xinyu-intel @zixuanweeei @TaoLv please review.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Feature1, tests, (and when applicable, API doc)
Feature2, tests, (and when applicable, API doc)

Comments

If this change is a backward incompatible change, why must this change be made.
Interesting edge cases to note here

mxnet-bot · 2020-03-25T13:07:19Z

Hey @ElaineBao, thanks for submitting the PR.
Once your PR is ready for CI checks, invoke the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [unix-cpu, edge, centos-gpu, windows-cpu, miscellaneous, sanity, unix-gpu, windows-gpu, clang, website, centos-cpu]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

ElaineBao · 2020-03-25T13:07:53Z

@mxnet-bot run ci [all]

leezu · 2020-03-25T17:11:50Z

Is tensor-level summation slow as the compiler fails to optimize the mshadow implementation? Or what is the reason?

ElaineBao · 2020-03-26T01:07:08Z

> Is tensor-level summation slow as the compiler fails to optimize the mshadow implementation? Or what is the reason?

Not quite sure about the reason, but I think it may be related to temporary memory allocation

ElaineBao · 2020-03-26T01:07:44Z

@mxnet-bot run ci [unix-gpu, windows-gpu]

mxnet-bot · 2020-03-26T01:07:51Z

Jenkins CI successfully triggered : [windows-gpu, unix-gpu]

leezu · 2020-03-26T01:17:40Z

@ElaineBao do you expect this to be true at other places where tensor-level summation is used? Should these places be checked / fixed too?

ElaineBao · 2020-03-26T01:30:50Z

@leezu I can't say that every tensor-level summation is slow; after all, I haven't run all the cases.

But changing tensor-level summation to element-wise summation increases the amount of code and makes it less readable, so unless there is a known efficiency issue, I think it's better to leave the code unchanged and keep using tensor-level summation.

TaoLv · 2020-03-26T01:46:59Z

@ElaineBao Could you please share a benchmarking script so we can verify the effect of this optimization? opperf may help: https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf.

ElaineBao · 2020-03-26T02:18:27Z

@TaoLv OK, I'll work on it

TaoLv · 2020-04-04T14:48:16Z

The CI issue should already be addressed. Please rebase your PR and resolve the comments. Thanks.

ElaineBao · 2020-04-04T16:07:26Z

Sorry for the late reply.
I tried to use opperf, but it didn't work; it threw an error when I ran it:

#   File "/incubator-mxnet/benchmark/opperf/rules/default_params.py", line 606, in <module>
#     "axis_shape": DEFAULT_AXIS_SHAPE,
# NameError: name 'DEFAULT_AXIS_SHAPE' is not defined

So I used the MXNet profiler to validate the performance instead, which I think is also reasonable.
The script is as follows:

import random
import pandas as pd
import mxnet as mx
import numpy as np
from sklearn.model_selection import train_test_split

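batch_size = 1000
num_epoch = 5
model_prefix = 'drivethru_attention_d'
n_plus = 522    # vocabulary size of the embedding (input_dim)
total = 40000   # number of synthetic records
profiling = True

# Build a synthetic dataset: each record holds 5 random ids and a binary label.
records = []
for _ in range(total):
    pluids = [random.randint(0, n_plus - 1) for _ in range(5)]
    label = random.randint(0, 1)
    records.append((pluids, label))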

data = pd.DataFrame(records,
                    columns=['pluids','label'])
train, test = train_test_split(data, test_size=0.1, random_state=100)

X_train = mx.io.NDArrayIter(data={'pluids': np.array(train['pluids'].values.tolist(), dtype=int)},
                            label={'output_label': train['label'].values},
                            batch_size=batch_size,
                            shuffle=True)
X_eval = mx.io.NDArrayIter(data={'pluids': np.array(test['pluids'].values.tolist(), dtype=int)},
                            label={'output_label': test['label'].values},
                            batch_size=batch_size,
                            shuffle=True)
y_true = mx.symbol.Variable('output_label')


pluids = mx.symbol.Variable('pluids')
plu_embed = mx.symbol.Embedding(data=pluids, input_dim=n_plus, output_dim=50, name='plu_embed')

fc1 = mx.symbol.FullyConnected(data=plu_embed, num_hidden=int(n_plus), name='fc1')
rec_model = mx.symbol.SoftmaxOutput(data=fc1, label=y_true, name='output')

mod = mx.mod.Module(symbol=rec_model,
                    data_names=['pluids'],
                    label_names=['output_label'],
                    context=[mx.cpu()])
# Enable the profiler and aggregate per-operator statistics.
mx.profiler.set_config(profile_symbolic=True, profile_imperative=True, profile_memory=False,
                       profile_api=True, filename='profile.json', aggregate_stats=True)
mx.profiler.set_state('run')

mod.fit(train_data=X_train,
        num_epoch=num_epoch,
        initializer=mx.init.Xavier(rnd_type="gaussian"),
        optimizer='adagrad',
        eval_metric=['accuracy'],
        validation_metric=['accuracy', mx.metric.TopKAccuracy(3)],
        eval_data=X_eval,
        batch_end_callback=mx.callback.Speedometer(batch_size, 2))

mx.profiler.set_state('stop')
print(mx.profiler.dumps())

The profiler results are as follows.

Before optimizing _backward_Embedding:

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_backward_Embedding                   180        2854.3340          12.3320          29.1350          15.8574
_mul_scalar                          1620         527.4130           0.0030           3.3110           0.3256
_backward_FullyConnected              180         162.2140           0.7430           1.6510           0.9012
SoftmaxOutput                         200         129.6620           0.1250           1.2650           0.6483
FullyConnected                        200         110.0570           0.2340          42.0660           0.5503
argmax                                200          49.5320           0.1840           0.4930           0.2477
broadcast_add                        1080          31.0420           0.0040           2.9860           0.0287
Embedding                             200          25.0530           0.0240           3.8110           0.1253
_backward_SoftmaxOutput               180          19.0860           0.0560           0.8680           0.1060
square                                540          18.5240           0.0030           2.8510           0.0343
sqrt                                  540          17.3870           0.0060           0.9440           0.0322
DeleteVariable                       3532          11.3070           0.0020           0.0330           0.0032
broadcast_sub                         540           8.2790           0.0040           0.0440           0.0153
broadcast_div                         540           7.4970           0.0050           0.0730           0.0139
_plus_scalar                          555           6.5850           0.0040           0.0650           0.0119
SetValueOp                              8           5.8160           0.0050           5.6540           0.7270
CopyCPU2CPU                           448           4.1040           0.0020           0.1030           0.0092
ResourceParallelRandomSetSeed               1           3.8440           3.8440           3.8440           3.8440
WaitForVar                            220           1.3150           0.0040           0.0120           0.0060
Cast                                   33           1.1590           0.0080           0.2680           0.0351
_random_normal                          2           0.7270           0.1210           0.6060           0.3635
_zeros                                  6           0.2650           0.0070           0.0720           0.0442
_div_scalar                            15           0.1400           0.0050           0.0190           0.0093
SetupExec                               6           0.0150           0.0010           0.0060           0.0025
_full                                   1           0.0060           0.0060           0.0060           0.0060

After optimizing _backward_Embedding:

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_mul_scalar                          1620         451.0970           0.0030           3.2960           0.2785
_backward_FullyConnected              180         195.9230           0.7440           2.3720           1.0885
SoftmaxOutput                         200         156.3020           0.1080           1.2910           0.7815
FullyConnected                        200         136.2320           0.2300          43.5920           0.6812
argmax                                200          54.5550           0.1710           0.4960           0.2728
_backward_SoftmaxOutput               180          39.8900           0.0570           0.8930           0.2216
Embedding                             200          27.0910           0.0270           3.1330           0.1355
broadcast_add                        1080          24.6370           0.0040           0.6560           0.0228
_backward_Embedding                   180          21.5230           0.0970           0.4120           0.1196
sqrt                                  540          20.1840           0.0060           0.1300           0.0374
square                                540          19.2420           0.0040           2.9200           0.0356
DeleteVariable                       3532          13.1160           0.0010           0.1310           0.0037
broadcast_sub                         540          11.0550           0.0040           0.0980           0.0205
broadcast_div                         540           9.3750           0.0050           0.1110           0.0174
_plus_scalar                          555           8.2280           0.0040           0.1140           0.0148
SetValueOp                              8           5.9090           0.0050           5.7620           0.7386
CopyCPU2CPU                           448           4.2760           0.0030           0.1040           0.0095
ResourceParallelRandomSetSeed               1           3.8370           3.8370           3.8370           3.8370
Cast                                   33           1.2800           0.0090           0.2670           0.0388
WaitForVar                            195           1.2160           0.0040           0.0180           0.0062
_random_normal                          2           0.7190           0.1200           0.5990           0.3595
_zeros                                  6           0.2710           0.0060           0.0790           0.0452
_div_scalar                            15           0.2610           0.0050           0.0790           0.0174
SetupExec                               6           0.0150           0.0020           0.0050           0.0025
_full                                   1           0.0070           0.0070           0.0070           0.0070
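As a quick check on the two tables above, the arithmetic below compares the `_backward_Embedding` rows. The figures are copied from the profiler output; the variable names are only illustrative. The result applies to this particular run and model, not to embedding workloads in general.

```python
# Figures copied from the _backward_Embedding rows of the two profiler tables.
before_total_ms = 2854.3340  # total time before the optimization
after_total_ms = 21.5230     # total time after the optimization
calls = 180                  # "Total Count" is the same in both runs

# Ratio of total times, and the average time saved per backward call.
speedup = before_total_ms / after_total_ms
saved_per_call_ms = (before_total_ms - after_total_ms) / calls

print(f"speedup: {speedup:.1f}x")
print(f"time saved per call: {saved_per_call_ms:.2f} ms")
```

Because the call count is the same in both runs, the ratio of the average times in the tables gives the same speedup as the ratio of the totals.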

ElaineBao · 2020-04-05T02:23:53Z

@mxnet-bot run ci [unix-gpu, unix-cpu]

mxnet-bot · 2020-04-05T02:24:00Z

Jenkins CI successfully triggered : [unix-gpu, unix-cpu]

TaoLv · 2020-04-05T14:56:20Z

@ElaineBao Thank you. Impressive speedup!

TaoLv · 2020-04-05T15:01:00Z

> Sorry for the late reply.
> I tried to use opperf, but it didn't work; it threw an error when I ran it:
> File "/incubator-mxnet/benchmark/opperf/rules/default_params.py", line 606, in <module>
> "axis_shape": DEFAULT_AXIS_SHAPE,
> NameError: name 'DEFAULT_AXIS_SHAPE' is not defined

FYI, @ChaiBapchya.

leezu · 2020-04-05T18:14:39Z

Opperf to be fixed by #17894

xinyu-intel mentioned this pull request Mar 30, 2020

Enable embedding backward parallel #17736

Closed

7 tasks

ElaineBao requested review from aaronmarkham, leezu, marcoabreu and szha as code owners April 4, 2020 15:50

Optimize AddTakeGrad Tensor Sum

794a324

ElaineBao force-pushed the opt-embedding-bwd branch from 3cd7560 to 794a324 Compare April 4, 2020 15:56

TaoLv approved these changes Apr 5, 2020


leezu merged commit c3c76a8 into apache:master Apr 7, 2020

mk-61 pushed a commit to mk-61/incubator-mxnet that referenced this pull request Apr 7, 2020

Optimize AddTakeGrad Tensor Sum (apache#17906)

886b90b

ElaineBao deleted the opt-embedding-bwd branch April 14, 2020 00:42

ElaineBao added a commit to ElaineBao/incubator-mxnet that referenced this pull request Apr 14, 2020

Optimize AddTakeGrad Tensor Sum (apache#17906)

ef88c12

ElaineBao mentioned this pull request Apr 14, 2020

[v1.x] backport Optimize AddTakeGrad Tensor Sum (#17906) to v1.x #18045

Merged

7 tasks

pengzhao-intel pushed a commit that referenced this pull request Apr 15, 2020

Optimize AddTakeGrad Tensor Sum (#17906) (#18045)

3f920ae
