Use env var to enforce safe accumulation in ReduceAxesCompute #14830
Conversation
@mxnet-label-bot add [pr-work-in-progress, Operator, Backend]
Force-pushed from 5383e18 to 125d646
Some minor comments, otherwise LGTM. Thanks!
docs/faq/env_var.md
Outdated
* MXNET_ENFORCE_SAFE_ACCUMULATION
- Values: 0(false) or 1(true) ```(default=0)```
- If this variable is set, the accumulation will enter the safe mode, meaning accumulation is done in a data type of higher precision than
the input data type, leading to more accurate results with a possible performance loss and backward compatibility loss.
Can we mention the list of operators supporting this env_var?
Also, can you mention what casting happens, i.e., float16->float32, etc.?
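A minimal numpy sketch (illustration only, not part of the PR) of why fp16 inputs benefit from a higher-precision accumulator, i.e. the float16->float32 casting asked about above:

```python
# Illustration only (not from the PR): why fp16 inputs need a higher-precision
# accumulator such as fp32.
import numpy as np

vals = np.full(100000, 0.1, dtype=np.float16)   # true sum is ~9997.6

# Naive accumulation in fp16: once the running sum reaches 256, adding 0.1
# is smaller than half an fp16 ulp and gets rounded away, so the sum stalls.
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)

# Accumulation in fp32 stays accurate; the result can be cast back to the
# fp16 output dtype at the end.
acc32 = np.float32(0.0)
for v in vals:
    acc32 += np.float32(v)

print(acc16)              # 256.0 -- far below the true sum
print(acc32)              # ~9997.6
print(np.float16(acc32))  # 10000.0, the nearest representable fp16 value
```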
Also I am concerned about the float32 -> float64 casting as described in #14722.
In our offline discussion @haojin2 and @eric-haibin-lin seemed to suggest that with float64 accumulation the model now exposes a higher loss, which is the correct value, and that the earlier model which yields a lower loss is actually wrong. I am not sure I understand this despite our long conversation (sorry!).
Also I found another old PR which cast the accumulation variable to a higher type; however, the [comment](dmlc/mshadow#236 (comment)) seems to suggest float32->float32 is sufficient (I checked with @cjolivier01 for the reasoning, but unfortunately he does not remember).
My suggestion is to retain float32->float32 unless you have a convincing argument.
@nswamy I don't see a concrete reason why you would want an fp32 accumulation type for fp32 inputs if the user sets MXNET_ENFORCE_SAFE_ACCUMULATION=1. The only reason I see from the mshadow PR is to keep backward compatibility.
If you use fp32 accumulation type for fp32 inputs, it is NOT safe accumulation.
One simple test case where fp32 accumulation is not sufficient for safe accumulation:
# fp32 accumulation
>>> import numpy as np
>>> val1 = np.float32(3e38)
>>> val2 = np.float32(-3e38)
>>> sum = np.float32(3e38)
>>> sum + val1 + val2
inf
>>> sum64 = np.float64(3e38)
>>> sum64 + val1 + val2
3e+38
>>> np.float32(sum64 + val1 + val2)
3e+38
@nswamy Although I agree that using fp32 accumulation for fp32 inputs may be sufficient in many cases:
- As @eric-haibin-lin said, you cannot claim that such accumulation is safe.
- Usually a user is not 100% sure that his/her data will not cause overflow during any accumulation, so we want to provide an option for users who want to be safe even when using the fp32 data type.
@anirudh2290 there's no guarantee that being more precise gives you a model with better accuracy / lower perplexity.
I understand that there is no guarantee of better accuracy, but can it be worse, and what are the reasons for that? This is what seems to be happening in @nswamy's example.
Yeah, it's possible that the model gets worse. The accuracy could go either way.
In this case we need to change the env variable documentation: "leading to more accurate results with a possible performance loss and backward compatibility loss."
What does backward compatibility loss mean? Backward incompatibility?
Thanks for the change, I have a few comments.
Force-pushed from e517a36 to 699db5f
Force-pushed from c0a7d6a to 8c9394c
LGTM. @anirudh2290 @nswamy any other concerns? We need the fix before the 1.5 code freeze.
Has the doc been fixed? There is no guarantee of accuracy improvement with safe accumulation. The env variable doc says to turn it on for better accuracy.
@anirudh2290 The doc is only saying that the ACCUMULATIONS themselves will be more accurate; it's making no claims about the accuracy of models.
Since this is a global variable affecting many operators, it can easily be interpreted as accumulation across all operators increasing the accuracy of the model. At least add a note that turning this env variable on may not necessarily improve the accuracy of the model.
@anirudh2290 Fundamentally nothing would guarantee any benefit for all models. One would design a model or configure the system in a way that he/she believes would benefit his/her goal most, and we, as a lower-level library, are simply presenting the choices for them (with correct implementation and abundant doc) rather than guaranteeing anything for them.
I'm not seeing any significant difference between the results with this env var on or off (expected behavior in my opinion), and final results from both trials seem to be within a good range. @nswamy Would you please also try this on your machine to see if it's the case?
Sorry that this is a long reply; there was too much info that I wanted to include.
Adding a new operator to your model is different from switching to safe accumulation on your exact model. Also, if you can point me to an env var doc which affects more than one operator and makes claims about accuracy or performance that may be misleading, I am happy to revisit it. The doc says this: "If this variable is set, the accumulation will enter the safe mode, meaning accumulation is done in a data type of higher precision than the input data type, leading to more accurate results with a possible performance loss and backward compatibility loss." It's natural for a user to wonder: "More accurate results of what? Which operators support safe accumulation? This is a global switch, so I am assuming it should impact the accuracy of many of our operators. So if I turn it on, the accuracy of my model may get better or, in the worst case, remain the same, but not worsen." This is not the case, and I want us to clarify it. I don't think I am making an unreasonable ask here (just add a simple note: "This switch may not necessarily improve the accuracy of your overall model and in certain cases can make it worse"). By the way, your switch currently doesn't control safe accumulation in the softmax operator, which is the one mentioned in the issue opened by Naveen, and that is probably why you are not able to reproduce the issue. I am sure in the future this switch will control the softmax operator too.
@anirudh2290
We can see that:
Out of curiosity I did the following experiment on my p2.8xlarge:
Built from source at the commit of the softmax PR:
So here I'm actually seeing an INCREASE in final model performance after adding in the softmax PR, with all other variables controlled... @nswamy Can you perform exactly the same experiment on your machine to see if this is the case? Your previous experiments were not controlling all other variables, so the picture may be blurred to some extent...
@anirudh2290 The extra disclaimer that you asked for is now added.
@anirudh2290 @nswamy was using 2 different nightly builds, and that would introduce extra commits (more than 1 commit was merged on 2/20) in addition to the softmax one. My experiments were purely testing the effect of adding only the softmax PR. I would consider mine a more controlled experiment than the one in the issue.
@anirudh2290 BTW there are 2 approvals already; I'll merge the PR once the build passes.
…#14830)
* add env var to control accumulation option
* trial
* try with ci change
* recover all test_operator
* add gpu coverage
* fix order
* address comments
* extra disclaimer on model accuracies
Description
As title.
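A rough usage sketch of the new switch (illustrative only; it assumes `sum` is among the reduce ops covered by this change and that the variable is read when the operator executes):

```python
# Hypothetical usage sketch, not taken from this PR's tests. The env var is
# set before importing mxnet to be safe about when it is read.
import os
os.environ["MXNET_ENFORCE_SAFE_ACCUMULATION"] = "1"   # default is 0

import numpy as np
import mxnet as mx

x = mx.nd.full((100000,), 0.1, dtype=np.float16)
# With safe accumulation on, the reduction is accumulated in a type of higher
# precision than fp16, so the result should be close to the true sum (~10000)
# rather than an fp16-accumulated value that stalls or overflows.
print(x.sum())
```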
Comments
@eric-haibin-lin
Flakiness checked on GPU: