Mixed-precision training with both torch_xla or torch.autocast #523

Merged: michaelbenayoun merged 22 commits from mixed_precision into main on Apr 3, 2024

Conversation

@michaelbenayoun (Member) commented Mar 21, 2024

What does this PR do?

There are two ways to cast to bfloat16:

  • Use the torch_xla casting system via the environment variables XLA_DOWNCAST_BF16 or XLA_USE_BF16.
  • Use the native torch.autocast feature.

The first approach was already supported; this PR adds support for the second.
It also fixes issues related to how we set the NEURON_CC_FLAGS: if they are set too late (e.g. after the process group initialization), they are ignored by the compiler. This PR makes sure we set them at the right time.
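
For illustration, here is a minimal sketch of the two approaches (not code from this PR; the model, tensors, and flag values are placeholders):

```python
import os

# Approach 1: torch_xla's casting system, driven by environment variables.
# These must be set before torch_xla starts compiling.
os.environ["XLA_USE_BF16"] = "1"  # or os.environ["XLA_DOWNCAST_BF16"] = "1"

# Approach 2: the native torch.autocast context manager.
import torch

model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)
# device_type="cuda" mirrors the snippet discussed below; on a CPU-only
# machine autocast warns and falls back to fp32, but still runs.
with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
    y = model(x)
```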

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@michaelbenayoun changed the title from “Small change for mixed-precision training” to “Mixed-precision training with both torch_xla or torch.autocast” on Mar 22, 2024
Member Author

We'll restore that before merging.

@michaelbenayoun marked this pull request as ready for review on April 2, 2024 10:29
Comment on lines +560 to +562
```python
autocast_context.__enter__()
yield
autocast_context.__exit__(*sys.exc_info())
```
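
For context, these lines presumably live inside a contextlib.contextmanager-decorated helper; a minimal self-contained sketch, with the function name and autocast arguments assumed:

```python
import contextlib
import sys

import torch

@contextlib.contextmanager
def autocast():  # hypothetical name; the real helper lives in optimum-neuron
    autocast_context = torch.autocast(dtype=torch.bfloat16, device_type="cuda")
    autocast_context.__enter__()
    yield
    # On a clean exit, sys.exc_info() is (None, None, None), which matches
    # the signature that __exit__ expects.
    autocast_context.__exit__(*sys.exc_info())
```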
Collaborator

Wouldn't it be simpler to just do:

Suggested change:

```diff
-autocast_context.__enter__()
-yield
-autocast_context.__exit__(*sys.exc_info())
+with torch.autocast(dtype=torch.bfloat16, device_type="cuda", **autocast_kwargs):
+    yield
```

Or is the linter complaining?

Collaborator

Also, why cuda?

Member Author

  1. I took that part from the accelerate library. I guess it could work.
  2. It is what the AWS Neuron documentation suggests.

Collaborator

If it comes from the doc, then there must be a reason.

Member Author

Yes.

From the doc:

> The device type is CUDA because we are using CUDA’s list of BF16 compatible operations as mentioned above.
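
A quick way to see that op list in action (this demo uses device_type="cpu" so it runs anywhere; the PR itself passes "cuda" for the reason quoted above):

```python
import torch

a = torch.randn(4, 4)  # fp32 input
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = torch.mm(a, a)  # mm is on the autocast-eligible op list
print(out.dtype)  # torch.bfloat16
```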

```python
# It is important to set the environment variables before initializing the
# process group, otherwise they will be ignored by the Neuron compiler.
set_common_neuron_cc_flags()
if os.environ.get("ACCELERATE_USE_AMP", "false") == "true":
    set_neuron_cc_flags_for_torch_amp()
```
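
A hypothetical sketch of what such helpers could look like; the specific compiler flags here are assumptions for illustration, not necessarily the ones set in this PR. The point the discussion makes is only about ordering: these calls must happen before the process group is initialized.

```python
import os

def set_common_neuron_cc_flags():
    # Hypothetical: flags shared by every training run.
    flags = os.environ.get("NEURON_CC_FLAGS", "")
    os.environ["NEURON_CC_FLAGS"] = f"{flags} --model-type=transformer".strip()

def set_neuron_cc_flags_for_torch_amp():
    # Hypothetical: disable the compiler's own auto-casting so that
    # torch.autocast remains the single source of truth for precision.
    flags = os.environ.get("NEURON_CC_FLAGS", "")
    os.environ["NEURON_CC_FLAGS"] = f"{flags} --auto-cast=none".strip()
```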
Collaborator

Is there a place in the code where you restore the env? If not, maybe consider having a singleton class to do that: upon instantiation it stores the original cc flags, and on deletion it restores them. Then you wrap all your changes under context calls, and on startup you get a ref to the singleton.
When all contexts have returned, all refs to the singleton are released and the env is restored.
Maybe it is too involved, but I just realized that whenever you call the training code, the cc flags will be completely unusable for inference.
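
A rough sketch of the suggested pattern (all names hypothetical, not code from this PR):

```python
import os

class NeuronCCFlagsGuard:
    # Hypothetical singleton: remembers the original NEURON_CC_FLAGS and
    # restores them once the last context has exited.
    _instance = None
    _refcount = 0

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._original = os.environ.get("NEURON_CC_FLAGS")
        return cls._instance

    def __enter__(self):
        type(self)._refcount += 1
        return self

    def __exit__(self, *exc_info):
        type(self)._refcount -= 1
        if type(self)._refcount == 0:
            if self._original is None:
                os.environ.pop("NEURON_CC_FLAGS", None)
            else:
                os.environ["NEURON_CC_FLAGS"] = self._original
        return False
```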

Member Author

So, actually, the NEURON_CC_FLAGS need to be set before initializing the process group. Once that is done, they will never change for the Neuron compiler (during this runtime).

Currently the NeuronState is only used for training, so I don't think it will be an issue. And the original environment will not be affected, only the environment of the current process.

Collaborator

I looked at the AWS documentation, and they don't seem to care much about restoring environment variables either.
Forget about my comment.

Member Author

In any case, as explained, it does not really change anything: once the process group has been initialized, we cannot change the environment for the Neuron compiler. That is also why I changed some of the ways we set the flags. Any flag that is model-dependent cannot be set by optimum-neuron, because by the time we have the model, the process group has usually already been initialized.
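
A condensed illustration of that ordering constraint (the flags and distributed setup are assumptions; launcher-provided env vars such as RANK and WORLD_SIZE are presumed set, and only the ordering is the point):

```python
import os

import torch.distributed as dist
import torch_xla.distributed.xla_backend  # registers the "xla" backend

# Model-independent flags: safe to set here, before the process group exists.
os.environ["NEURON_CC_FLAGS"] = "--auto-cast=none"  # placeholder flag

dist.init_process_group("xla")

# Too late: from the Neuron compiler's point of view, changes made from now
# on (e.g. model-dependent flags) are ignored for this runtime.
os.environ["NEURON_CC_FLAGS"] += " --model-dependent-flag"  # no effect
```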

@michaelbenayoun merged commit 3005c77 into main on Apr 3, 2024
10 of 11 checks passed
@michaelbenayoun deleted the mixed_precision branch on April 3, 2024 13:02