Implement Automatic Mixed Precision with GradScaler to Address NaN Loss Issues #13
Conversation
Merge AMP via GradScaler into ECT.
@@ -78,6 +78,7 @@ def convert(self, value, param, ctx):
@click.option('--fp16', help='Enable mixed-precision training', metavar='BOOL', type=bool, default=False, show_default=True)
@click.option('--tf32', help='Enable tf32 for A100/H100 training speed', metavar='BOOL', type=bool, default=False, show_default=True)
@click.option('--ls', help='Loss scaling', metavar='FLOAT', type=click.FloatRange(min=0, min_open=True), default=1, show_default=True)
@click.option('--enable_gradscaler', help='Enable torch.cuda.amp.GradScaler, NOTE overwriting loss_scale set by --ls', metavar='BOOL', type=bool, default=False, show_default=True)
Hi Zixiang @aiihn,
Thanks for your neat PR!
Would it be better to use a short abbreviation like amp as the option name? AMP already stands for Automatic Mixed Precision.
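For illustration, a minimal, hypothetical sketch of what the renamed option could look like; the name --amp follows the suggestion above, and the surrounding command is scaffolding for the example rather than code from this PR:

```python
import click

@click.command()
# Hypothetical rename of --enable_gradscaler, following the review suggestion above.
@click.option('--amp', help='Enable torch.cuda.amp.GradScaler (automatic mixed precision); overrides the loss scale set by --ls',
              metavar='BOOL', type=bool, default=False, show_default=True)
def main(amp):
    click.echo(f'AMP enabled: {amp}')

if __name__ == '__main__':
    main()
```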
if enable_gradscaler:
    if 'gradscaler_state' in data:
        dist.print0(f'Loading GradScaler state from "{resume_state_dump}"...')
        # Training works even without loading the GradScaler state_dict, but restoring it can improve reproducibility.
Gotcha. Thanks for the comments!
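For context, a rough sketch of how GradScaler state can be saved and restored across resumes; the checkpoint key gradscaler_state and the names data and resume_state_dump mirror the snippet above, while the model, optimizer, and file path are placeholders:

```python
import torch

net = torch.nn.Linear(4, 4)                          # placeholder model
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()
resume_state_dump = 'training-state.pt'              # placeholder checkpoint path

# Saving: keep the scaler's state next to the usual training state.
data = {
    'net': net.state_dict(),
    'optimizer': optimizer.state_dict(),
    'gradscaler_state': scaler.state_dict(),         # current scale, growth tracker, etc.
}
torch.save(data, resume_state_dump)

# Resuming: restore the scaler only if its state was checkpointed.
data = torch.load(resume_state_dump)
if 'gradscaler_state' in data:
    scaler.load_state_dict(data['gradscaler_state'])
```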
    scaler.step(optimizer)
    scaler.update()
else:
    # Update weights.
The TODO is unclear to me as well; per Claude, it still seems useful and compatible.
It's fine to remove my commented-out code for the LR ramp-up.
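For reference, the standard autocast + GradScaler training step that the quoted snippet follows looks roughly like this; it is a minimal sketch with a placeholder model and random data, assuming a CUDA device, not the ECT training loop itself:

```python
import torch

device = torch.device('cuda')                        # GradScaler targets CUDA training
net = torch.nn.Linear(16, 1).to(device)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(8, 16, device=device)
    y = torch.randn(8, 1, device=device)

    optimizer.zero_grad(set_to_none=True)
    # The forward pass runs in fp16 where it is safe to do so.
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(net(x), y)

    # Scale the loss so small fp16 gradients do not underflow to zero.
    scaler.scale(loss).backward()
    # step() unscales the gradients and skips the update if they contain inf/NaN.
    scaler.step(optimizer)
    # update() lowers the scale after a skipped step and raises it after a run of clean steps.
    scaler.update()
```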
Hi @aiihn, thank you again for your PR! I had another AMP implementation that could also be helpful for ECT. I'll check it out later and test it.
Links for reference:
Cheers,
Description
This pull request addresses the issue of NaN losses occurring during mixed-precision training with --fp16 enabled (#12).

Key Changes

- Added torch.cuda.amp.GradScaler to dynamically adjust loss scaling (see the sketch after this list).
- GradScaler will override the loss_scale set manually by --ls.
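Below is a small illustration (not code from the PR) of the knobs that control this dynamic adjustment, using PyTorch's documented defaults:

```python
import torch

# GradScaler adapts its scale instead of using a fixed --ls value:
# when scaled gradients contain inf/NaN, the optimizer step is skipped and the
# scale is multiplied by backoff_factor; after growth_interval consecutive clean
# steps, the scale is multiplied by growth_factor.
scaler = torch.cuda.amp.GradScaler(
    init_scale=2.0 ** 16,   # starting loss scale
    growth_factor=2.0,      # scale *= 2 after `growth_interval` clean steps
    backoff_factor=0.5,     # scale *= 0.5 whenever inf/NaN gradients appear
    growth_interval=2000,   # clean steps required before the scale grows
)
print(scaler.get_scale())   # 65536.0 before any updates (when CUDA is available)
```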
Usage

Use --fp16=True along with --enable_gradscaler=True. For example, below is the mixed-precision training command modified from run_ecm_1hour.sh.

The FID records obtained using the above command are shown in the following images: