Use env var to enforce safe accumulation in ReduceAxesCompute #14830
Conversation
@mxnet-label-bot add [pr-work-in-progress, Operator, Backend]
Force-pushed from 5383e18 to 125d646
Some minor comments, otherwise LGTM. Thanks!
docs/faq/env_var.md
Outdated
* MXNET_ENFORCE_SAFE_ACCUMULATION
- Values: 0(false) or 1(true) ```(default=0)```
- If this variable is set, the accumulation will enter the safe mode, meaning accumulation is done in a data type of higher precision than
the input data type, leading to more accurate results with a possible performance loss and backward compatibility loss.
Can we mention the list of operators supporting this env_var?
Also, can you mention what casting happens, i.e., float16->float32, etc.?
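A minimal numpy sketch (illustration only, not part of the PR) of why fp16 inputs benefit from a higher-precision accumulator, i.e. the float16->float32 casting asked about above:

```python
# Illustration only (not from the PR): why fp16 inputs need a higher-precision
# accumulator such as fp32.
import numpy as np

vals = np.full(100000, 0.1, dtype=np.float16)   # true sum is ~9997.6

# Naive accumulation in fp16: once the running sum reaches 256, adding 0.1
# is smaller than half an fp16 ulp and gets rounded away, so the sum stalls.
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)

# Accumulation in fp32 stays accurate; the result can be cast back to the
# fp16 output dtype at the end.
acc32 = np.float32(0.0)
for v in vals:
    acc32 += np.float32(v)

print(acc16)              # 256.0 -- far below the true sum
print(acc32)              # ~9997.6
print(np.float16(acc32))  # 10000.0, the nearest representable fp16 value
```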
Also I am concerned about the float32 -> float64 casting as described in #14722.
In our offline discussion @haojin2 and @eric-haibin-lin seemed to suggest that with float64 accumulation the model now exposes a higher loss, which is the correct value, and that the earlier model which yields a lower loss is actually wrong. I am not sure I understand this despite our long conversation (sorry!).
Also I found another old PR which cast the accumulation variable to a higher type; however, the [comment](dmlc/mshadow#236 (comment)) seems to suggest float32->float32 is sufficient (I checked with @cjolivier01 for the reasoning, but unfortunately he does not remember).
My suggestion is to retain float32->float32 unless you have a convincing argument.
@nswamy I don't see a concrete reason why you would want an fp32 accumulation type for fp32 inputs if the user sets MXNET_ENFORCE_SAFE_ACCUMULATION=1. The only reason I see from the mshadow PR is to keep backward compatibility.
If you use fp32 accumulation type for fp32 inputs, it is NOT safe accumulation.
One simple test case where fp32 accumulation is not sufficient for safe accumulation:
# fp32 accumulation
>>> import numpy as np
>>> val1 = np.float32(3e38)
>>> val2 = np.float32(-3e38)
>>> sum = np.float32(3e38)
>>> sum + val1 + val2
inf
>>> sum64 = np.float64(3e38)
>>> sum64 + val1 + val2
3e+38
>>> np.float32(sum64 + val1 + val2)
3e+38
@nswamy Although I agree that using fp32 accumulation for fp32 inputs may be sufficient in many cases:
- As @eric-haibin-lin said, you cannot claim that such accumulation is safe.
- Usually a user is not 100% sure that his/her data will not cause overflow during any accumulation, so we want to provide an option for users who want to be safe even when using the fp32 data type.
@anirudh2290 there's no guarantee that being more precise gives you a model with better accuracy / lower perplexity.
I understand that there is no guarantee of better accuracy, but can it be worse, and what are the reasons for that? This is what seems to be happening in @nswamy's example.
Yeah, it's possible that the model gets worse. The accuracy could go either way.
In this case we need to change the env variable documentation: "leading to more accurate results with a possible performance loss and backward compatibility loss."
What does backward compatibility loss mean? Backward incompatibility?
Thanks for the change, I have a few comments.
Force-pushed from e517a36 to 699db5f
Force-pushed from c0a7d6a to 8c9394c
LGTM. @anirudh2290 @nswamy any other concerns? We need the fix before the 1.5 code freeze.
Has the doc been fixed? There is no guarantee of accuracy improvement with safe accumulation. The env variable doc says to turn it on for better accuracy.
@anirudh2290 The doc is only saying that the ACCUMULATIONS themselves will be more accurate; it's making no claims about the accuracy of models.
Since this is a global variable affecting many operators, it can easily be interpreted as accumulation across all operators increasing the accuracy of the model. At least add a note that turning this env variable on may not necessarily improve the accuracy of the model.
@anirudh2290 Fundamentally nothing would guarantee any benefit for all models. One would design a model or configure the system in a way that he/she believes would benefit his/her goal most, and we, as a lower-level library, are simply presenting the choices for them (with correct implementation and abundant doc) rather than guaranteeing anything for them.
I'm not seeing any significant difference between the results with this env var on or off (expected behavior in my opinion), and final results from both trials seem to be within a good range. @nswamy Would you please also try this on your machine to see if it's the case?
Sorry that this is a long reply; there was too much info that I wanted to include.
Adding a new operator to your model is different from switching to safe accumulation on your exact model. Also, if you can point me to an env var doc which affects more than one operator and makes claims about accuracy or performance that may be misleading, I am happy to revisit it. The doc says this: "If this variable is set, the accumulation will enter the safe mode, meaning accumulation is done in a data type of higher precision than the input data type, leading to more accurate results with a possible performance loss and backward compatibility loss." It's natural for a user to wonder: "More accurate results of what? Which operators support safe accumulation? This is a global switch, so I am assuming it should impact the accuracy of many of our operators. So if I turn it on, the accuracy of my model may get better or, in the worst case, remain the same, but not worsen." This is not the case, and I want us to clarify it. I don't think I am making an unreasonable ask here (just add a simple note: "This switch may not necessarily improve the accuracy of your overall model and in certain cases can make it worse"). By the way, your switch currently doesn't control safe accumulation in the softmax operator, which is the one mentioned in the issue opened by Naveen, and that is probably why you are not able to reproduce the issue. I am sure in the future this switch will control the softmax operator too.
@anirudh2290
We can see that:
Out of curiosity I did the following experiment on my p2.8xlarge:
Built from source at the commit of the softmax PR:
So here I'm actually seeing an INCREASE in final model performance after adding in the softmax PR, with all other variables controlled... @nswamy Can you perform exactly the same experiment on your machine to see if this is the case? Your previous experiments were not controlling all other variables, so the picture may be blurred to some extent...
@anirudh2290 The extra disclaimer that you asked for is now added.
@anirudh2290 @nswamy was using 2 different nightly builds, and that would introduce extra commits (more than 1 commit was merged on 2/20) in addition to the softmax one. My experiments were purely testing the effect of adding only the softmax PR. I would consider mine a more controlled experiment than the one in the issue.
@anirudh2290 BTW there are 2 approvals already; I'll merge the PR once the build passes.
…#14830)
* add env var to control accumulation option
* trial
* try with ci change
* recover all test_operator
* add gpu coverage
* fix order
* address comments
* extra disclaimer on model accuracies
Description
As title.
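A rough usage sketch of the new switch (illustrative only; it assumes `sum` is among the reduce ops covered by this change and that the variable is read when the operator executes):

```python
# Hypothetical usage sketch, not taken from this PR's tests. The env var is
# set before importing mxnet to be safe about when it is read.
import os
os.environ["MXNET_ENFORCE_SAFE_ACCUMULATION"] = "1"   # default is 0

import numpy as np
import mxnet as mx

x = mx.nd.full((100000,), 0.1, dtype=np.float16)
# With safe accumulation on, the reduction is accumulated in a type of higher
# precision than fp16, so the result should be close to the true sum (~10000)
# rather than an fp16-accumulated value that stalls or overflows.
print(x.sum())
```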
Comments
@eric-haibin-lin
Flakiness checked on GPU: