Mixed-precision training with both torch_xla or torch.autocast #523
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
We restore that before merging.
autocast_context.__enter__()
yield
autocast_context.__exit__(*sys.exc_info())
Wouldn't it be simpler to just do:
- autocast_context.__enter__()
- yield
- autocast_context.__exit__(*sys.exc_info())
+ with torch.autocast(dtype=torch.bfloat16, device_type="cuda", **autocast_kwargs):
+     yield
Or is the linter complaining?
Also, why cuda?
- I took that part from the accelerate library. I guess it could work.
- It is as suggested by the AWS Neuron documentation.
If it comes from the doc, then there must be a reason.
Yes:
"The device type is CUDA because we are using CUDA’s list of BF16 compatible operations as mentioned above."
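For reference, a minimal sketch of the pattern being discussed, assuming a generator-based context manager around torch.autocast (the function and keyword names are illustrative, not the PR's exact code):

```python
import contextlib

import torch


@contextlib.contextmanager
def autocast_bf16(**autocast_kwargs):
    # device_type="cuda" because, as quoted above, autocast reuses CUDA's
    # list of BF16-compatible operations on the XLA device.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16, **autocast_kwargs):
        yield
```

Compared with calling __enter__ and __exit__ by hand, the with statement guarantees that the autocast context is exited even if an exception propagates through the yield.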
# It is important to set the environment variables before initializing the process group,
# otherwise they will be ignored by the Neuron compiler.
set_common_neuron_cc_flags()
if os.environ.get("ACCELERATE_USE_AMP", "false") == "true":
    set_neuron_cc_flags_for_torch_amp()
Is there a place in the code where you restore the env? If not, maybe consider having a singleton class to do that: upon instantiation it stores the original cc flags, and on deletion it restores them. Then you wrap all your changes under context calls, and on startup you get a ref to the singleton. When all contexts have returned, all refs to the singleton are released and the env is restored.
Maybe it is too involved, but I just realized that whenever you call the training code, the cc flags will be completely unusable for inference.
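A minimal sketch of the save/restore idea described here (not code from this PR; the function name is hypothetical):

```python
import contextlib
import os


@contextlib.contextmanager
def preserve_neuron_cc_flags():
    # Snapshot the current compiler flags (None if the variable is unset).
    original = os.environ.get("NEURON_CC_FLAGS")
    try:
        yield
    finally:
        # Put the original value back, or drop the variable if it was unset.
        if original is None:
            os.environ.pop("NEURON_CC_FLAGS", None)
        else:
            os.environ["NEURON_CC_FLAGS"] = original
```

As the reply below explains, restoring the variable would not change much in practice, since the Neuron compiler only picks the flags up when the process group is initialized.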
So, actually the NEURON_CC_FLAGS need to be set before initializing the process group. Once that is done, they will never change for the Neuron compiler (during this runtime).
Currently the NeuronState is only used for training, so I don't think it will be an issue. And the original environment will not be affected, only the environment for the current process.
I looked at the AWS documentation and they don't seem to care much about restoring the env variables either.
Forget about my comment.
In any case, as explained, it does not really change anything: once the process group has been initialized, we cannot change the environment for the Neuron compiler. That is also the reason why I changed some of the ways we set the flags. Any flag that is model-dependent cannot be set by optimum-neuron, because by the time we have the model, we usually have already initialized the process group.
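To illustrate the ordering constraint, here is a hedged sketch (not code from this PR): the helper body and the example flag are assumptions, and the XLA backend registration follows the usual torch_xla pattern.

```python
import os

import torch
import torch_xla.distributed.xla_backend  # registers the "xla" backend with torch.distributed


def set_common_neuron_cc_flags():
    # Illustrative stand-in for the helper shown in the diff above; the real
    # implementation may set different flags. The point is that it mutates
    # NEURON_CC_FLAGS before the process group exists.
    flags = os.environ.get("NEURON_CC_FLAGS", "")
    os.environ["NEURON_CC_FLAGS"] = f"{flags} --retry_failed_compilation".strip()


# 1. Compiler-related environment variables are set first...
set_common_neuron_cc_flags()

# 2. ...because the Neuron compiler reads them when the process group is
#    initialized; changing them afterwards has no effect for this runtime.
torch.distributed.init_process_group(backend="xla")
```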
What does this PR do?
There are two ways to cast to bfloat16:
1. The torch_xla casting system, via the environment variables XLA_DOWNCAST_BF16 or XLA_USE_BF16.
2. The torch.autocast feature.

The first approach was already supported; this PR adds support for the second approach.
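As an illustration of the two approaches (a sketch, not the PR's API; in practice you would pick one of the two, and model/batch are placeholders):

```python
import os

import torch

# Approach 1: torch_xla's automatic casting. The environment variable has to be
# set before torch_xla initializes the device, typically at process startup or
# through the launcher environment.
os.environ["XLA_DOWNCAST_BF16"] = "1"


# Approach 2: PyTorch's autocast context around the forward/loss computation.
def forward_bf16(model, batch):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        return model(**batch)
```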
It also fixes issues related to how the NEURON_CC_FLAGS can be set. If they are set too late (e.g. after the process group initialization), they will be ignored by the compiler. This PR makes sure we set them at the right time.