This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Use env var to enforce safe accumulation in ReduceAxesCompute #14830

Merged: 8 commits merged into apache:master from safe_acc_envvar on May 17, 2019

Conversation

@haojin2 (Contributor) commented Apr 29, 2019

Description

As title.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • New environment variable to control accumulation behavior (see the usage sketch after this list).
  • Corresponding test coverage.
  • Corresponding docs.
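
A minimal usage sketch (editorial, not part of the PR diff; it assumes the MXNET_SAFE_ACCUMULATION variable name that appears in the test logs later in this thread):

```python
import os

# The switch must be set before MXNet executes any reductions; "1" enables safe mode.
os.environ["MXNET_SAFE_ACCUMULATION"] = "1"

import mxnet as mx

x = mx.nd.ones((8, 1024), dtype='float16')
# With the switch on, reductions such as sum/norm are expected to accumulate in a
# higher-precision type (e.g. fp32 for fp16 inputs) before casting back to the
# output dtype, so large intermediate sums are less likely to overflow.
y = x.sum(axis=1)
print(y.dtype)  # still float16: only the intermediate accumulation is promoted
```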

Comments

@eric-haibin-lin
Flakiness checked on GPU:

MXNET_TEST_COUNT=10000 nosetests tests/python/gpu/test_operator_gpu.py:test_norm
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=292245351 to reproduce.
.
----------------------------------------------------------------------
Ran 1 test in 256.972s

OK

@haojin2 self-assigned this Apr 29, 2019
@roywei (Member) commented Apr 29, 2019

@mxnet-label-bot add[pr-work-in-progress, Operator, Backend]

@marcoabreu added the Backend, Operator, and pr-work-in-progress labels Apr 29, 2019
src/operator/tensor/broadcast_reduce_op.h (review thread; outdated, resolved)
@haojin2 requested a review from anirudh2290 May 1, 2019 05:16
@anirudh2290 dismissed their stale review May 7, 2019 18:23

addressed

@eric-haibin-lin (Member) left a comment

some minor comments. otherwise LGTM. Thanks!

tests/python/unittest/test_operator.py (review thread; outdated, resolved)
* MXNET_ENFORCE_SAFE_ACCUMULATION
- Values: 0(false) or 1(true) ```(default=0)```
- If this variable is set, the accumulation will enter the safe mode, meaning accumulation is done in a data type of higher precision than
the input data type, leading to more accurate results with a possible performance loss and backward compatibility loss.
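
(Editorial illustration, not part of the PR: a minimal NumPy sketch of what accumulating in a higher-precision type buys, here fp16 inputs with an fp32 accumulator.)

```python
import numpy as np

x = np.full(10000, 10.0, dtype=np.float16)  # true sum is 100000, above fp16's ~65504 max

print(x.sum(dtype=np.float16))  # inf: accumulating in the input dtype overflows
print(x.sum(dtype=np.float32))  # 100000.0: the higher-precision accumulator keeps the value
```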
Member

Can we mention the list of operators supporting this env_var?

Member

Also can you mention what casting happens, i.e., float16->float32, etc.?

Member

Also I am concerned about the float32 -> float64 casting as described in #14722.
In our offline discussion @haojin2 and @eric-haibin-lin seem to allude that somehow, now with float64 casting for accumulation, it is exposing a higher loss which is the correct value, and the earlier model which yields a lower loss is actually wrong. I am not sure I understand this despite our long conversation (sorry!).
Also I found another old PR which casted the accumulation variable to a higher type; however, this comment (dmlc/mshadow#236 (comment)) seems to suggest float32 -> float32 is sufficient (I checked with @cjolivier01 for the reasoning; he does not remember unfortunately).
My suggestion is to retain float32 -> float32 unless you have a convincing argument.

@eric-haibin-lin (Member) May 7, 2019

@nswamy I don't see a concrete reason why you would want an fp32 accumulation type for fp32 inputs if the user sets MXNET_ENFORCE_SAFE_ACCUMULATION=1. The only reason I see from the mshadow PR is to keep backward compatibility.
If you use an fp32 accumulation type for fp32 inputs, it is NOT safe accumulation.

One simple test case where fp32 accumulation is not sufficient for safe accumulation:

# fp32 accumulation
>>> import numpy as np
>>> val1 = np.float32(3e38)
>>> val2 = np.float32(-3e38)
>>> sum = np.float32(3e38)
>>> sum + val1 + val2
inf
>>> sum64 = np.float64(3e38)
>>> sum64 + val1 + val2
3e+38
>>> np.float32(sum64 + val1 + val2)
3e+38
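
(Editorial sketch extending the test case above: a hedged NumPy illustration of the dtype promotion being discussed, fp16 accumulating in fp32 and fp32 in fp64. The exact mapping here is assumed from the thread, not taken from the PR diff.)

```python
import numpy as np

# Assumed promotion map for safe accumulation, as discussed in this thread.
SAFE_ACC_DTYPE = {
    np.dtype(np.float16): np.float32,
    np.dtype(np.float32): np.float64,
    np.dtype(np.float64): np.float64,
}

def safe_sum(x):
    """Accumulate in the promoted dtype, then cast back to the input dtype."""
    acc = x.sum(dtype=SAFE_ACC_DTYPE[x.dtype])
    return acc.astype(x.dtype)

vals = np.array([3e38, 3e38, -3e38], dtype=np.float32)
print(vals.sum(dtype=np.float32))  # inf: the fp32 running sum overflows
print(safe_sum(vals))              # 3e+38: fp64 accumulation, cast back to fp32
```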

Contributor Author

@nswamy Although I agree that using fp32 accumulation for fp32 data may be sufficient in many cases:

  1. As @eric-haibin-lin said, you cannot claim that such accumulation is safe.
  2. Usually a user is not 100% sure that his/her data will not cause overflow during any accumulation, so we want to provide an option for users who want to be safe even if they are using the fp32 data type.

Member

@anirudh2290 there's no guarantee that being more precise gives you a model with better accuracy / lower perplexity.

Member

I understand that there is no guarantee of getting better accuracy, but can it be worse, and what are the reasons for that? This is what seems to be happening in @nswamy's example.

Member

Yeah, it's possible that the model gets worse. The accuracy could go either way.

Member

In this case we need to change the env variable documentation: "leading to more accurate results with a possible performance loss and backward compatibility loss."

@eric-haibin-lin (Member) May 9, 2019

What does backward compatibility loss mean? backward incompatibility?

@nswamy (Member) left a comment

Thanks for the change, I have a few comments.

docs/faq/env_var.md (review thread; outdated, resolved)

@nswamy (Member) commented May 7, 2019

#14760

@haojin2 force-pushed the safe_acc_envvar branch 2 times, most recently from e517a36 to 699db5f on May 9, 2019 18:57
@haojin2 requested review from nswamy and eric-haibin-lin and removed the review request for nswamy May 9, 2019 18:58
@haojin2 force-pushed the safe_acc_envvar branch 7 times, most recently from c0a7d6a to 8c9394c on May 15, 2019 06:38
@eric-haibin-lin (Member) left a comment

LGTM. @anirudh2290 @nswamy any other concerns? We need the fix before the 1.5 code freeze.

@anirudh2290 (Member)

Has the doc been fixed? There is no guarantee of accuracy improvement with safe accumulation. The env variable doc says turn it on for better accuracy.

@haojin2 (Contributor, Author) commented May 16, 2019

@anirudh2290 The doc only says that the ACCUMULATIONS themselves will be more accurate; it makes no claims about the accuracy of models.

@anirudh2290 (Member)

Since this is a global variable affecting many operators, it can easily be interpreted as accumulation amongst all operators increasing the accuracy of the model. At least add a note that turning this env variable on may not necessarily improve the accuracy of the model.

@haojin2 (Contributor, Author) commented May 16, 2019

@anirudh2290 Fundamentally nothing would guarantee any benefit for all models. One would design a model or configure the system in the way that he/she believes would benefit his/her goal most, and we, as a lower-level library, are simply presenting the choices (with correct implementation and abundant doc) rather than guaranteeing anything.
I think this piece of doc is serving its sole purpose: describing what will happen to accumulations if it's turned on, i.e. the accumulations will be done in a higher-precision type. If we add such a "disclaimer" to this very env var, do we then also do that for all other features/operators/env vars? For example, is adding "Using a BatchNorm layer in your model may not necessarily improve the accuracy of it" to BatchNorm's doc really giving extra information to the users?
I do understand where you're coming from about the influence this env var may have on the final accuracy of a model, so regarding the issue #14722 that actually triggered this fix, I tried the same benchmark script on my end on a p2.8xlarge instance, tested on the branch of this very PR (also, there's a bug in the script that prevents it from running with py3, and my fix is here: awslabs/deeplearning-benchmark#70):

ubuntu@ip-162-32-28-44:~/deeplearning-benchmark$ export MXNET_SAFE_ACCUMULATION=1
ubuntu@ip-162-32-28-44:~/deeplearning-benchmark$ python word_language_model/word_language_model.py --gpus 8 --nhid 650 --emsize 650 --dropout 0.5 --epochs 40 --data word_language_model/data/ptb. --mode imperative --kvstore device
INFO:root:[Epoch 0] time cost 23.01s, valid loss 6.56, valid ppl 703.38
INFO:root:test loss 6.53, test ppl 683.61
INFO:root:[Epoch 1] time cost 19.86s, valid loss 6.16, valid ppl 473.41
INFO:root:test loss 6.13, test ppl 459.76
INFO:root:[Epoch 2] time cost 20.12s, valid loss 5.75, valid ppl 313.67
INFO:root:test loss 5.72, test ppl 304.20
INFO:root:[Epoch 3] time cost 19.85s, valid loss 5.57, valid ppl 262.83
INFO:root:test loss 5.54, test ppl 254.41
INFO:root:[Epoch 4] time cost 19.90s, valid loss 5.46, valid ppl 234.12
INFO:root:test loss 5.43, test ppl 228.92
INFO:root:[Epoch 5] time cost 19.96s, valid loss 5.31, valid ppl 202.62
INFO:root:test loss 5.28, test ppl 195.97
INFO:root:[Epoch 6] time cost 19.80s, valid loss 5.24, valid ppl 187.75
INFO:root:test loss 5.21, test ppl 182.98
INFO:root:[Epoch 7] time cost 19.93s, valid loss 5.16, valid ppl 174.70
INFO:root:test loss 5.13, test ppl 169.26
INFO:root:[Epoch 8] time cost 19.92s, valid loss 5.11, valid ppl 166.22
INFO:root:test loss 5.08, test ppl 161.02
INFO:root:[Epoch 9] time cost 19.88s, valid loss 5.07, valid ppl 158.64
INFO:root:test loss 5.03, test ppl 153.36
INFO:root:[Epoch 10] time cost 19.93s, valid loss 5.03, valid ppl 153.05
INFO:root:test loss 5.00, test ppl 147.75
INFO:root:[Epoch 11] time cost 19.87s, valid loss 4.99, valid ppl 147.12
INFO:root:test loss 4.95, test ppl 141.37
INFO:root:[Epoch 12] time cost 19.84s, valid loss 4.98, valid ppl 146.02
INFO:root:test loss 4.94, test ppl 140.14
INFO:root:[Epoch 13] time cost 19.86s, valid loss 4.94, valid ppl 139.35
INFO:root:test loss 4.89, test ppl 133.62
INFO:root:[Epoch 14] time cost 19.87s, valid loss 4.91, valid ppl 136.01
INFO:root:test loss 4.87, test ppl 130.37
INFO:root:[Epoch 15] time cost 19.83s, valid loss 4.90, valid ppl 133.98
INFO:root:test loss 4.86, test ppl 129.10
INFO:root:[Epoch 16] time cost 19.92s, valid loss 4.90, valid ppl 133.75
INFO:root:test loss 4.85, test ppl 127.97
INFO:root:[Epoch 17] time cost 19.92s, valid loss 4.87, valid ppl 130.34
INFO:root:test loss 4.83, test ppl 125.31
INFO:root:[Epoch 18] time cost 19.93s, valid loss 4.86, valid ppl 129.32
INFO:root:test loss 4.82, test ppl 123.99
INFO:root:[Epoch 19] time cost 19.89s, valid loss 4.84, valid ppl 126.92
INFO:root:test loss 4.81, test ppl 122.15
INFO:root:[Epoch 20] time cost 19.93s, valid loss 4.83, valid ppl 125.82
INFO:root:test loss 4.79, test ppl 120.79
INFO:root:[Epoch 21] time cost 19.91s, valid loss 4.84, valid ppl 126.45
INFO:root:[Epoch 22] time cost 20.03s, valid loss 4.81, valid ppl 122.33
INFO:root:test loss 4.76, test ppl 117.16
INFO:root:[Epoch 23] time cost 19.94s, valid loss 4.80, valid ppl 122.05
INFO:root:test loss 4.76, test ppl 116.82
INFO:root:[Epoch 24] time cost 19.94s, valid loss 4.80, valid ppl 121.71
INFO:root:test loss 4.76, test ppl 116.42
INFO:root:[Epoch 25] time cost 19.80s, valid loss 4.80, valid ppl 121.15
INFO:root:test loss 4.75, test ppl 115.90
INFO:root:[Epoch 26] time cost 19.74s, valid loss 4.80, valid ppl 121.27
INFO:root:[Epoch 27] time cost 19.98s, valid loss 4.80, valid ppl 121.00
INFO:root:test loss 4.75, test ppl 115.62
INFO:root:[Epoch 28] time cost 19.84s, valid loss 4.79, valid ppl 120.81
INFO:root:test loss 4.75, test ppl 115.50
INFO:root:[Epoch 29] time cost 19.95s, valid loss 4.79, valid ppl 120.73
INFO:root:test loss 4.75, test ppl 115.44
INFO:root:[Epoch 30] time cost 19.87s, valid loss 4.79, valid ppl 120.61
INFO:root:test loss 4.75, test ppl 115.37
INFO:root:[Epoch 31] time cost 19.89s, valid loss 4.79, valid ppl 120.48
INFO:root:test loss 4.75, test ppl 115.22
INFO:root:[Epoch 32] time cost 19.90s, valid loss 4.79, valid ppl 120.31
INFO:root:test loss 4.75, test ppl 115.10
INFO:root:[Epoch 33] time cost 19.84s, valid loss 4.79, valid ppl 120.31
INFO:root:test loss 4.75, test ppl 115.09
INFO:root:[Epoch 34] time cost 19.87s, valid loss 4.79, valid ppl 120.30
INFO:root:test loss 4.75, test ppl 115.05
INFO:root:[Epoch 35] time cost 19.93s, valid loss 4.79, valid ppl 120.22
INFO:root:test loss 4.74, test ppl 114.98
INFO:root:[Epoch 36] time cost 19.86s, valid loss 4.79, valid ppl 120.15
INFO:root:test loss 4.74, test ppl 114.92
INFO:root:[Epoch 37] time cost 19.84s, valid loss 4.79, valid ppl 120.06
INFO:root:test loss 4.74, test ppl 114.85
INFO:root:[Epoch 38] time cost 19.89s, valid loss 4.79, valid ppl 120.01
INFO:root:test loss 4.74, test ppl 114.81
INFO:root:[Epoch 39] time cost 19.73s, valid loss 4.79, valid ppl 119.95
INFO:root:test loss 4.74, test ppl 114.75
INFO:root:Best test loss 4.74, test ppl 114.75
ubuntu@ip-162-32-28-44:~/deeplearning-benchmark$ export MXNET_SAFE_ACCUMULATION=0
ubuntu@ip-162-32-28-44:~/deeplearning-benchmark$ python word_language_model/word_language_model.py --gpus 8 --nhid 650 --emsize 650 --dropout 0.5 --epochs 40 --data word_language_model/data/ptb. --mode imperative --kvstore device
INFO:root:[Epoch 0] time cost 22.86s, valid loss 6.56, valid ppl 703.38
INFO:root:test loss 6.53, test ppl 683.61
INFO:root:[Epoch 1] time cost 19.97s, valid loss 6.16, valid ppl 473.41
INFO:root:test loss 6.13, test ppl 459.76
INFO:root:[Epoch 2] time cost 19.79s, valid loss 5.75, valid ppl 313.67
INFO:root:test loss 5.72, test ppl 304.20
INFO:root:[Epoch 3] time cost 19.73s, valid loss 5.57, valid ppl 262.83
INFO:root:test loss 5.54, test ppl 254.41
INFO:root:[Epoch 4] time cost 19.75s, valid loss 5.46, valid ppl 234.12
INFO:root:test loss 5.43, test ppl 228.92
INFO:root:[Epoch 5] time cost 19.75s, valid loss 5.31, valid ppl 202.62
INFO:root:test loss 5.28, test ppl 195.97
INFO:root:[Epoch 6] time cost 19.78s, valid loss 5.24, valid ppl 187.75
INFO:root:test loss 5.21, test ppl 182.98
INFO:root:[Epoch 7] time cost 19.77s, valid loss 5.16, valid ppl 174.70
INFO:root:test loss 5.13, test ppl 169.26
INFO:root:[Epoch 8] time cost 19.75s, valid loss 5.11, valid ppl 166.22
INFO:root:test loss 5.08, test ppl 161.02
INFO:root:[Epoch 9] time cost 19.68s, valid loss 5.07, valid ppl 158.64
INFO:root:test loss 5.03, test ppl 153.36
INFO:root:[Epoch 10] time cost 19.73s, valid loss 5.03, valid ppl 153.05
INFO:root:test loss 5.00, test ppl 147.75
INFO:root:[Epoch 11] time cost 19.78s, valid loss 4.99, valid ppl 147.12
INFO:root:test loss 4.95, test ppl 141.37
INFO:root:[Epoch 12] time cost 19.72s, valid loss 4.98, valid ppl 146.02
INFO:root:test loss 4.94, test ppl 140.14
INFO:root:[Epoch 13] time cost 19.76s, valid loss 4.94, valid ppl 139.35
INFO:root:test loss 4.89, test ppl 133.62
INFO:root:[Epoch 14] time cost 19.74s, valid loss 4.91, valid ppl 136.01
INFO:root:test loss 4.87, test ppl 130.37
INFO:root:[Epoch 15] time cost 19.76s, valid loss 4.90, valid ppl 133.98
INFO:root:test loss 4.86, test ppl 129.10
INFO:root:[Epoch 16] time cost 19.76s, valid loss 4.90, valid ppl 133.75
INFO:root:test loss 4.85, test ppl 127.97
INFO:root:[Epoch 17] time cost 19.73s, valid loss 4.87, valid ppl 130.34
INFO:root:test loss 4.83, test ppl 125.31
INFO:root:[Epoch 18] time cost 19.78s, valid loss 4.86, valid ppl 129.32
INFO:root:test loss 4.82, test ppl 123.99
INFO:root:[Epoch 19] time cost 19.76s, valid loss 4.84, valid ppl 126.92
INFO:root:test loss 4.81, test ppl 122.15
INFO:root:[Epoch 20] time cost 19.79s, valid loss 4.83, valid ppl 125.82
INFO:root:test loss 4.79, test ppl 120.79
INFO:root:[Epoch 21] time cost 19.76s, valid loss 4.84, valid ppl 126.45
INFO:root:[Epoch 22] time cost 19.82s, valid loss 4.81, valid ppl 122.33
INFO:root:test loss 4.76, test ppl 117.16
INFO:root:[Epoch 23] time cost 19.78s, valid loss 4.80, valid ppl 122.05
INFO:root:test loss 4.76, test ppl 116.82
INFO:root:[Epoch 24] time cost 19.83s, valid loss 4.80, valid ppl 121.71
INFO:root:test loss 4.76, test ppl 116.42
INFO:root:[Epoch 25] time cost 19.87s, valid loss 4.80, valid ppl 121.15
INFO:root:test loss 4.75, test ppl 115.90
INFO:root:[Epoch 26] time cost 19.71s, valid loss 4.80, valid ppl 121.27
INFO:root:[Epoch 27] time cost 19.91s, valid loss 4.80, valid ppl 121.00
INFO:root:test loss 4.75, test ppl 115.62
INFO:root:[Epoch 28] time cost 19.89s, valid loss 4.79, valid ppl 120.81
INFO:root:test loss 4.75, test ppl 115.50
INFO:root:[Epoch 29] time cost 19.80s, valid loss 4.79, valid ppl 120.73
INFO:root:test loss 4.75, test ppl 115.44
INFO:root:[Epoch 30] time cost 19.70s, valid loss 4.79, valid ppl 120.61
INFO:root:test loss 4.75, test ppl 115.37
INFO:root:[Epoch 31] time cost 19.78s, valid loss 4.79, valid ppl 120.48
INFO:root:test loss 4.75, test ppl 115.22
INFO:root:[Epoch 32] time cost 19.69s, valid loss 4.79, valid ppl 120.31
INFO:root:test loss 4.75, test ppl 115.10
INFO:root:[Epoch 33] time cost 19.78s, valid loss 4.79, valid ppl 120.31
INFO:root:test loss 4.75, test ppl 115.09
INFO:root:[Epoch 34] time cost 19.88s, valid loss 4.79, valid ppl 120.30
INFO:root:test loss 4.75, test ppl 115.05
INFO:root:[Epoch 35] time cost 19.70s, valid loss 4.79, valid ppl 120.22
INFO:root:test loss 4.74, test ppl 114.98
INFO:root:[Epoch 36] time cost 19.70s, valid loss 4.79, valid ppl 120.15
INFO:root:test loss 4.74, test ppl 114.92
INFO:root:[Epoch 37] time cost 19.86s, valid loss 4.79, valid ppl 120.06
INFO:root:test loss 4.74, test ppl 114.85
INFO:root:[Epoch 38] time cost 19.65s, valid loss 4.79, valid ppl 120.01
INFO:root:test loss 4.74, test ppl 114.81
INFO:root:[Epoch 39] time cost 19.75s, valid loss 4.79, valid ppl 119.95
INFO:root:test loss 4.74, test ppl 114.75
INFO:root:Best test loss 4.74, test ppl 114.75
ubuntu@ip-162-32-28-44:~/deeplearning-benchmark$ python
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
>>> mx
<module 'mxnet' from '/home/ubuntu/3-mxnet/python/mxnet/__init__.py'>
>>> quit()
ubuntu@ip-162-32-28-44:~/deeplearning-benchmark$ cd ..
ubuntu@ip-162-32-28-44:~$ cd 3-mxnet/
ubuntu@ip-162-32-28-44:~/3-mxnet$ git branch
  master
* safe_acc_envvar

I'm not seeing any significant difference between the results with this env var on or off (expected behavior in my opinion), and the final results from both trials seem to be within a good range. @nswamy Would you please also try this on your machine to see if that's the case?
Back to what @anirudh2290 requested, I think such an extra disclaimer is not needed because:

  1. The doc for the env var already serves its purpose (describing the effect of the env var) well.
  2. There's no claim in the doc that some values of this env var would lead to more accurate models.
  3. The claimed possible accuracy loss is not reproducible on my end.

Sorry for the long reply; there was a lot of info I wanted to include.
Hao

@anirudh2290 (Member) commented May 16, 2019

Adding a new operator to your model is different from switching to safe accumulation on your exact model. Also, if you can point me to an env var doc which affects more than one operator and makes claims about accuracy or performance that may be misleading, I am happy to revisit it. The doc says this: "If this variable is set, the accumulation will enter the safe mode, meaning accumulation is done in a data type of higher precision than the input data type, leading to more accurate results with a possible performance loss and backward compatibility loss." It's natural for a user to wonder: "More accurate results of what? Which operators support safe accumulation? This is a global switch, so I am assuming this should impact the accuracy of many of our operators. So if I turn it on, the accuracy of my model may get better or, worst case, remain the same, but not worsen." This is not the case, and I want us to clarify it. I don't think I am making an unreasonable ask here (just add a simple note: "This switch may not necessarily improve accuracy of your overall model and in certain cases can make it worse"). By the way, your switch currently doesn't control safe accumulation for the softmax operator, which is mentioned in the issue opened by Naveen, and probably that is why you are not able to reproduce the issue. I am sure in the future this switch will control the softmax operator too.

@haojin2 (Contributor, Author) commented May 16, 2019

@anirudh2290
Firstly, the "accurate" in the doc describes the accumulation; NO claims were made about its influence on the final accuracy/performance of any models built with MXNet with this env var turned on. Please make sure you do not factor in any assumed context that a random MXNet user may not have when reviewing this piece of doc.
Thanks for pointing out that softmax is still always done in the safe mode at this moment, which led me to revisit the original issue. Here's a table that lists the final test losses obtained from running this training script:

| Version        | Test Loss | Safe Softmax? |
|----------------|-----------|---------------|
| 1.5.0b20190220 | 4.79      | No            |
| 1.5.0b20190221 | 5.18      | Yes           |
| 1.5.0b20190313 | 4.90      | Yes           |
| This PR        | 4.74      | Yes           |

We can see that:

  1. The final test loss has some variance even with the softmax's behavior unchanged.
  2. The final test loss can move to a higher or lower value when you change the behavior of softmax.

From the 2 facts above, I think the increase in loss observed in the original issue may just be random variance, which does not necessarily need a "fix".
I can definitely add this extra line, or be more specific that "accurate" describes the accumulation and not any models built with MXNet, if that's what you're looking for to get this PR merged ASAP for the 1.5 code freeze. Still, my original doc makes no claims about the influence on the final accuracy of models in the first place, so I would consider such a disclaimer nice-to-have but not really necessary.

@haojin2 (Contributor, Author) commented May 16, 2019

Out of curiosity I did the following experiment on my p2.8xlarge:

  1. Checkout the commit before the softmax PR, fresh build from source, then run the script.
  2. Checkout the commit of the softmax PR, fresh build from source, then run the script.

Here are the logs.
Built from source at the commit before the softmax PR:
ubuntu@ip-162-32-28-44:~/deeplearning-benchmark$ python word_language_model/word_language_model.py --gpus 8 --nhid 650 --emsize 650 --dropout 0.5 --epochs 40 --data word_language_model/data/ptb. --mode imperative --kvstore device
INFO:root:[Epoch 0] time cost 30.57s, valid loss 6.43, valid ppl 618.67
INFO:root:test loss 6.39, test ppl 596.96
INFO:root:[Epoch 1] time cost 28.15s, valid loss 6.05, valid ppl 424.32
INFO:root:test loss 6.03, test ppl 416.74
INFO:root:[Epoch 2] time cost 28.66s, valid loss 5.76, valid ppl 317.45
INFO:root:test loss 5.74, test ppl 310.12
INFO:root:[Epoch 3] time cost 28.61s, valid loss 5.60, valid ppl 270.85
INFO:root:test loss 5.58, test ppl 264.98
INFO:root:[Epoch 4] time cost 28.92s, valid loss 5.44, valid ppl 229.80
INFO:root:test loss 5.40, test ppl 221.53
INFO:root:[Epoch 5] time cost 28.94s, valid loss 5.33, valid ppl 207.46
INFO:root:test loss 5.31, test ppl 202.24
INFO:root:[Epoch 6] time cost 29.13s, valid loss 5.26, valid ppl 193.18
INFO:root:test loss 5.24, test ppl 188.84
INFO:root:[Epoch 7] time cost 28.76s, valid loss 5.19, valid ppl 178.78
INFO:root:test loss 5.16, test ppl 174.57
INFO:root:[Epoch 8] time cost 29.33s, valid loss 5.13, valid ppl 169.77
INFO:root:test loss 5.11, test ppl 165.58
INFO:root:[Epoch 9] time cost 28.92s, valid loss 5.09, valid ppl 162.16
INFO:root:test loss 5.06, test ppl 158.30
INFO:root:[Epoch 10] time cost 29.29s, valid loss 5.03, valid ppl 153.41
INFO:root:test loss 5.00, test ppl 147.82
INFO:root:[Epoch 11] time cost 29.02s, valid loss 5.01, valid ppl 149.68
INFO:root:test loss 4.97, test ppl 144.52
INFO:root:[Epoch 12] time cost 29.12s, valid loss 4.99, valid ppl 146.27
INFO:root:test loss 4.95, test ppl 141.74
INFO:root:[Epoch 13] time cost 29.10s, valid loss 4.95, valid ppl 141.57
INFO:root:test loss 4.92, test ppl 136.56
INFO:root:[Epoch 14] time cost 29.19s, valid loss 4.93, valid ppl 139.02
INFO:root:test loss 4.90, test ppl 134.21
INFO:root:[Epoch 15] time cost 29.02s, valid loss 4.92, valid ppl 137.63
INFO:root:test loss 4.89, test ppl 132.71
INFO:root:[Epoch 16] time cost 29.45s, valid loss 4.90, valid ppl 134.44
INFO:root:test loss 4.86, test ppl 128.75
INFO:root:[Epoch 17] time cost 28.85s, valid loss 4.87, valid ppl 130.48
INFO:root:test loss 4.83, test ppl 124.94
INFO:root:[Epoch 18] time cost 29.18s, valid loss 4.87, valid ppl 130.76
INFO:root:[Epoch 19] time cost 29.32s, valid loss 4.85, valid ppl 127.34
INFO:root:test loss 4.80, test ppl 121.90
INFO:root:[Epoch 20] time cost 29.29s, valid loss 4.84, valid ppl 126.82
INFO:root:test loss 4.80, test ppl 121.36
INFO:root:[Epoch 21] time cost 28.72s, valid loss 4.84, valid ppl 126.15
INFO:root:test loss 4.79, test ppl 120.70
INFO:root:[Epoch 22] time cost 29.30s, valid loss 4.83, valid ppl 125.70
INFO:root:test loss 4.79, test ppl 120.15
INFO:root:[Epoch 23] time cost 29.05s, valid loss 4.83, valid ppl 125.46
INFO:root:test loss 4.79, test ppl 119.92
INFO:root:[Epoch 24] time cost 29.18s, valid loss 4.83, valid ppl 124.62
INFO:root:test loss 4.78, test ppl 119.24
INFO:root:[Epoch 25] time cost 29.04s, valid loss 4.83, valid ppl 124.73
INFO:root:[Epoch 26] time cost 29.33s, valid loss 4.82, valid ppl 124.55
INFO:root:test loss 4.78, test ppl 118.98
INFO:root:[Epoch 27] time cost 28.93s, valid loss 4.82, valid ppl 124.35
INFO:root:test loss 4.78, test ppl 118.82
INFO:root:[Epoch 28] time cost 29.26s, valid loss 4.82, valid ppl 124.18
INFO:root:test loss 4.78, test ppl 118.66
INFO:root:[Epoch 29] time cost 28.90s, valid loss 4.82, valid ppl 124.09
INFO:root:test loss 4.78, test ppl 118.56
INFO:root:[Epoch 30] time cost 29.43s, valid loss 4.82, valid ppl 123.99
INFO:root:test loss 4.77, test ppl 118.47
INFO:root:[Epoch 31] time cost 28.97s, valid loss 4.82, valid ppl 123.91
INFO:root:test loss 4.77, test ppl 118.44
INFO:root:[Epoch 32] time cost 29.27s, valid loss 4.82, valid ppl 123.66
INFO:root:test loss 4.77, test ppl 118.21
INFO:root:[Epoch 33] time cost 28.87s, valid loss 4.82, valid ppl 123.63
INFO:root:test loss 4.77, test ppl 118.19
INFO:root:[Epoch 34] time cost 29.28s, valid loss 4.82, valid ppl 123.71
INFO:root:[Epoch 35] time cost 29.16s, valid loss 4.82, valid ppl 123.58
INFO:root:test loss 4.77, test ppl 118.09
INFO:root:[Epoch 36] time cost 29.39s, valid loss 4.82, valid ppl 123.54
INFO:root:test loss 4.77, test ppl 118.05
INFO:root:[Epoch 37] time cost 29.09s, valid loss 4.82, valid ppl 123.55
INFO:root:[Epoch 38] time cost 29.43s, valid loss 4.82, valid ppl 123.47
INFO:root:test loss 4.77, test ppl 117.99
INFO:root:[Epoch 39] time cost 29.04s, valid loss 4.82, valid ppl 123.44
INFO:root:test loss 4.77, test ppl 117.97
INFO:root:Best test loss 4.77, test ppl 117.97
ubuntu@ip-162-32-28-44:~/deeplearning-benchmark$ cd ..
ubuntu@ip-162-32-28-44:~$ cd incubator-mxnet/
ubuntu@ip-162-32-28-44:~/incubator-mxnet$ git status
HEAD detached at f9c436be2
nothing to commit, working tree clean
ubuntu@ip-162-32-28-44:~/incubator-mxnet$ python
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
>>> mx
<module 'mxnet' from '/home/ubuntu/incubator-mxnet/python/mxnet/__init__.py'>
>>> quit()
ubuntu@ip-162-32-28-44:~/incubator-mxnet$ git show
commit f9c436be2689ac809aab5422b41f2fd768e8c4bc (HEAD)
Author: Przemyslaw Tredak <ptrendx@gmail.com>
Date:   Tue Feb 19 16:03:36 2019 -0800

    Fix req=null in SliceLikeBackward (#14209)

diff --git a/src/operator/tensor/matrix_op-inl.h b/src/operator/tensor/matrix_op-inl.h
index 97c4fa556..28ed4215e 100644
--- a/src/operator/tensor/matrix_op-inl.h
+++ b/src/operator/tensor/matrix_op-inl.h
@@ -1389,13 +1389,15 @@ void SliceLikeBackward(const nnvm::NodeAttrs& attrs,
   CHECK_EQ(inputs.size(), 1U);
   CHECK_EQ(outputs.size(), 2U);
   CHECK_EQ(req.size(), 2U);
-  if (req[0] == kNullOp) return;
   using namespace mshadow;
   Stream<xpu>* s = ctx.get_stream<xpu>();
+  if (req[1] != kNullOp && req[1] != kAddTo) {
+    Fill(s, outputs[1], req[1], 0);  // Second input not relavant to gradients.
+  }
+  if (req[0] == kNullOp) return;
   const TBlob& ograd = inputs[0];
   const TBlob& igrad = outputs[0];
   const SliceLikeParam& param = nnvm::get<SliceLikeParam>(attrs.parsed);
-  Fill(s, outputs[1], req[1], 0);  // Second input not relavant to gradients.
   if (req[0] == kWriteTo) {
     Fill(s, igrad, req[0], 0);
   } else if (req[0] == kWriteInplace) {

Built from source at the commit of the softmax PR:

ubuntu@ip-162-32-28-44:~/deeplearning-benchmark$ python word_language_model/word_language_model.py --gpus 8 --nhid 650 --emsize 650 --dropout 0.5 --epochs 40 --data word_language_model/data/ptb. --mode imperative --kvstore device
INFO:root:[Epoch 0] time cost 30.52s, valid loss 6.60, valid ppl 737.45
INFO:root:test loss 6.57, test ppl 714.20
INFO:root:[Epoch 1] time cost 28.03s, valid loss 6.13, valid ppl 461.04
INFO:root:test loss 6.10, test ppl 446.37
INFO:root:[Epoch 2] time cost 28.50s, valid loss 5.81, valid ppl 332.27
INFO:root:test loss 5.77, test ppl 320.43
INFO:root:[Epoch 3] time cost 28.28s, valid loss 5.58, valid ppl 264.90
INFO:root:test loss 5.54, test ppl 254.72
INFO:root:[Epoch 4] time cost 28.83s, valid loss 5.41, valid ppl 224.44
INFO:root:test loss 5.38, test ppl 217.38
INFO:root:[Epoch 5] time cost 28.54s, valid loss 5.34, valid ppl 208.28
INFO:root:test loss 5.30, test ppl 201.23
INFO:root:[Epoch 6] time cost 28.98s, valid loss 5.24, valid ppl 188.28
INFO:root:test loss 5.21, test ppl 182.98
INFO:root:[Epoch 7] time cost 28.58s, valid loss 5.18, valid ppl 177.91
INFO:root:test loss 5.15, test ppl 172.79
INFO:root:[Epoch 8] time cost 29.14s, valid loss 5.14, valid ppl 170.55
INFO:root:test loss 5.10, test ppl 164.79
INFO:root:[Epoch 9] time cost 28.61s, valid loss 5.08, valid ppl 160.70
INFO:root:test loss 5.05, test ppl 155.34
INFO:root:[Epoch 10] time cost 28.87s, valid loss 5.03, valid ppl 153.40
INFO:root:test loss 5.00, test ppl 148.69
INFO:root:[Epoch 11] time cost 28.71s, valid loss 5.00, valid ppl 149.02
INFO:root:test loss 4.97, test ppl 144.28
INFO:root:[Epoch 12] time cost 28.87s, valid loss 4.97, valid ppl 143.34
INFO:root:test loss 4.93, test ppl 137.89
INFO:root:[Epoch 13] time cost 28.80s, valid loss 4.95, valid ppl 140.72
INFO:root:test loss 4.91, test ppl 136.31
INFO:root:[Epoch 14] time cost 29.21s, valid loss 4.93, valid ppl 137.89
INFO:root:test loss 4.89, test ppl 132.43
INFO:root:[Epoch 15] time cost 28.84s, valid loss 4.91, valid ppl 135.74
INFO:root:test loss 4.87, test ppl 130.02
INFO:root:[Epoch 16] time cost 29.02s, valid loss 4.89, valid ppl 133.00
INFO:root:test loss 4.85, test ppl 127.47
INFO:root:[Epoch 17] time cost 28.74s, valid loss 4.87, valid ppl 130.48
INFO:root:test loss 4.83, test ppl 124.78
INFO:root:[Epoch 18] time cost 28.89s, valid loss 4.86, valid ppl 129.04
INFO:root:test loss 4.82, test ppl 124.39
INFO:root:[Epoch 19] time cost 28.88s, valid loss 4.85, valid ppl 127.15
INFO:root:test loss 4.81, test ppl 122.78
INFO:root:[Epoch 20] time cost 29.10s, valid loss 4.83, valid ppl 124.98
INFO:root:test loss 4.79, test ppl 120.01
INFO:root:[Epoch 21] time cost 28.94s, valid loss 4.83, valid ppl 125.68
INFO:root:[Epoch 22] time cost 29.27s, valid loss 4.81, valid ppl 122.71
INFO:root:test loss 4.76, test ppl 116.87
INFO:root:[Epoch 23] time cost 28.86s, valid loss 4.81, valid ppl 122.14
INFO:root:test loss 4.76, test ppl 116.37
INFO:root:[Epoch 24] time cost 29.21s, valid loss 4.80, valid ppl 121.99
INFO:root:test loss 4.76, test ppl 116.17
INFO:root:[Epoch 25] time cost 28.77s, valid loss 4.80, valid ppl 121.56
INFO:root:test loss 4.75, test ppl 115.79
INFO:root:[Epoch 26] time cost 29.35s, valid loss 4.80, valid ppl 121.45
INFO:root:test loss 4.75, test ppl 115.69
INFO:root:[Epoch 27] time cost 28.77s, valid loss 4.80, valid ppl 121.28
INFO:root:test loss 4.75, test ppl 115.41
INFO:root:[Epoch 28] time cost 29.18s, valid loss 4.79, valid ppl 120.86
INFO:root:test loss 4.75, test ppl 115.07
INFO:root:[Epoch 29] time cost 28.79s, valid loss 4.79, valid ppl 120.70
INFO:root:test loss 4.74, test ppl 114.90
INFO:root:[Epoch 30] time cost 29.17s, valid loss 4.79, valid ppl 120.60
INFO:root:test loss 4.74, test ppl 114.86
INFO:root:[Epoch 31] time cost 28.73s, valid loss 4.79, valid ppl 120.13
INFO:root:test loss 4.74, test ppl 114.50
INFO:root:[Epoch 32] time cost 29.13s, valid loss 4.79, valid ppl 119.99
INFO:root:test loss 4.74, test ppl 114.25
INFO:root:[Epoch 33] time cost 28.82s, valid loss 4.79, valid ppl 120.21
INFO:root:[Epoch 34] time cost 29.09s, valid loss 4.78, valid ppl 119.59
INFO:root:test loss 4.73, test ppl 113.82
INFO:root:[Epoch 35] time cost 28.73s, valid loss 4.78, valid ppl 119.63
INFO:root:[Epoch 36] time cost 29.12s, valid loss 4.78, valid ppl 119.59
INFO:root:[Epoch 37] time cost 28.74s, valid loss 4.78, valid ppl 119.56
INFO:root:test loss 4.73, test ppl 113.79
INFO:root:[Epoch 38] time cost 28.81s, valid loss 4.78, valid ppl 119.54
INFO:root:test loss 4.73, test ppl 113.76
INFO:root:[Epoch 39] time cost 28.74s, valid loss 4.78, valid ppl 119.51
INFO:root:test loss 4.73, test ppl 113.73
INFO:root:Best test loss 4.73, test ppl 113.73
ubuntu@ip-162-32-28-44:~/deeplearning-benchmark$ cd ../incubator-mxnet
ubuntu@ip-162-32-28-44:~/incubator-mxnet$ python
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet as mx
>>> mx
<module 'mxnet' from '/home/ubuntu/incubator-mxnet/python/mxnet/__init__.py'>
>>> quit()
ubuntu@ip-162-32-28-44:~/incubator-mxnet$ git show
commit 862cbc67aacf81990b8c885847686a4c3c734cd3 (HEAD)
Author: Sheng Zha <szha@users.noreply.github.com>
Date:   Wed Feb 20 16:37:12 2019 -0800

    softmax for fp16 with fp32 accumulator (#14098)
    
    * softmax for fp16 with fp32 accumulator
    
    * return AType in kernel
    
    * add dtype
    
    * kernel
    
    * grad use in-out only when dtype override
    
    * simplify infer type
    
    * address comments

So here I'm actually seeing an INCREASE in final model performance after adding in the softmax PR, with all other variables controlled... @nswamy Can you perform the exact same experiment on your machine to see if this is the case? Your previous experiments were not controlling all other variables, so the picture may be blurred to some extent...
@anirudh2290 @eric-haibin-lin FYI

@haojin2 (Contributor, Author) commented May 16, 2019

@anirudh2290 The extra disclaimer that you asked for is now added.

@anirudh2290 (Member)

@haojin2 Thanks for the change and trying to reproduce the original issue. I will let @nswamy comment on it since you guys seem to be seeing different things.

> Your previous experiments were not controlling all other variables

Which variables specifically do you mean?

@haojin2 (Contributor, Author) commented May 16, 2019

@anirudh2290 @nswamy was using 2 different nightly builds, and that would introduce extra commits (more than one commit was merged on 2/20) in addition to the softmax one. My experiments were purely testing the effect of only adding the softmax PR, so I would consider mine a more controlled experiment than the one in the issue.

@haojin2 (Contributor, Author) commented May 16, 2019

@anirudh2290 BTW there are 2 approvals already; I'll merge the PR once the build passes.

@haojin2 merged commit 5aa62d8 into apache:master May 17, 2019
@haojin2 deleted the safe_acc_envvar branch May 17, 2019 20:48
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
…#14830)

* add env var to control accumulation option

* trial

* try with ci change

* recover all test_operator

* add gpu coverage

* fix order

* address comments

* extra disclaimer on model accuracies
Labels
Backend, Operator, pr-work-in-progress
7 participants