FloatingPointError: Minimum loss scale reached (0.0001). #1

Open · Kritarth384 opened this issue Nov 15, 2024 · 1 comment

Kritarth384 commented Nov 15, 2024

I am encountering a FloatingPointError between steps 43K and 48K while training the LGR-SMoE model on the OPUS-100 dataset for a total of 200K steps. The error halts training; the training script and the complete traceback are included below. I would appreciate any guidance on resolving this.

Script

function train_single_node() {
    python train.py \
        $data_args \
        --max-tokens 10384 \
        --share-all-embeddings \
        --encoder-normalize-before \
        --decoder-normalize-before \
        --optimizer adam \
        --adam-betas '(0.9, 0.98)' \
        --clip-norm 1.0 \
        --lr 0.0005 \
        --warmup-updates 4000 \
        --lr-scheduler inverse_sqrt \
        --dropout 0.1 \
        --attention-dropout 0.1 \
        --num-workers-valid 0 \
        --max-update 200000 \
        --ddp-backend ${ddp} \
        --user-dir ./smoe \
        --best-checkpoint-metric ppl \
        --log-format tqdm \
        --log-interval 100 \
        --fp16 \
        --record-token-expert \
        $save_args \
        $model_args \
        $moe_args
}
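
For reference, fairseq's fp16 trainer exposes a few dynamic loss-scaling options that are commonly adjusted when training hits this floor: --fp16-init-scale, --fp16-scale-window, --fp16-scale-tolerance, and --min-loss-scale (the 0.0001 threshold named in the error). A minimal sketch of flags that could be added inside train_single_node, e.g. right after --fp16; the values are illustrative, not a confirmed fix for this setup:

--fp16-init-scale 128 \
--fp16-scale-window 256 \
--fp16-scale-tolerance 0.25 \
--min-loss-scale 1e-6 \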

Traceback

  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/trainer.py", line 885, in train_step                                                       
    grad_norm = self.clip_grad_norm(self.cfg.optimization.clip_norm)                                                                             
  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/trainer.py", line 1183, in clip_grad_norm                                                  
    return self.optimizer.clip_grad_norm(clip_norm, aggregate_norm_fn=None)                                                                      
  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/optim/fp16_optimizer.py", line 426, in clip_grad_norm                                      
    self.scaler.check_overflow(grad_norm_cpu)                                                                                                    
  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/optim/dynamic_loss_scaler.py", line 61, in check_overflow                                  
    # raise FloatingPointError(                                                                                                                  
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.
 
 
 Traceback (most recent call last):                                                                                                               
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap                        
    fn(i, *args)                                                                                                                                 
  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/distributed/utils.py", line 354, in distributed_main                                       
    main(cfg, **kwargs)                                                                                                                          
  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq_cli/train.py", line 191, in main                                                           
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr, max_epoch)                                                                  
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner                                                                                    
    return func(*args, **kwds)                                                                                                                   
  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq_cli/train.py", line 309, in train                                                          
    log_output = trainer.train_step(samples)                                                                                                     
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner                                                                                    
    return func(*args, **kwds)                                          
  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/trainer.py", line 915, in train_step                                                       
    self.task.train_step(                                               
  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/tasks/fairseq_task.py", line 484, in train_step
    loss, sample_size, logging_output = criterion(model, sample)                                                                                 
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                             
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)                                
  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/criterions/fairseq_criterion.py", line 185, in forward
    loss, inner_loss, moe_loss, lang_loss, moe_metadata, sample_size, logging_output = self.compute_loss(model, sample, reduce=reduce)
  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/criterions/fairseq_criterion.py", line 195, in compute_loss
    net_output, inner_loss, sample_size, logging_output = self.compute_inner_loss(model, sample)                                                 
  File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/criterions/moe_cross_entropy.py", line 16, in compute_inner_loss
    net_output = model(**sample["net_input"], tgt_lang_id=sample["tgt_lang_id"])                                                                 
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                             
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
    return inner()                  
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
    result = forward_call(*args, **kwargs)                              
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1104, in 
forward                             
    outputs = self.module(*args, **kwargs)                              
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                             
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
    return inner()                  
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
    result = forward_call(*args, **kwargs)                              
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/fairscale/nn/misc/flatten_params_wrapper.py", line 447, in forward
    return self.module(*inputs, **kwinputs)                             
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                             
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
    return inner()                  
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
    result = forward_call(*args, **kwargs)                              
  File "/data/kritarth/Subtitles/Lingual-SMoE/smoe/smoe.py", line 105, in forward                                                                
    encoder_out = self.encoder(                                         
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                             
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
    return inner()                  
  File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1820, in inner
    var = next(v for v in var.values() if isinstance(v, torch.Tensor))                                                                           
StopIteration
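
Since the scaler only aborts after repeated overflows have driven the scale down to the floor, the training log should show when the collapse started. A hedged diagnostic sketch, assuming stdout was captured to a file (train.log is a hypothetical path, and the grep patterns may need adjusting to this fork's exact log wording):

# loss_scale is reported with the regular training stats; the pattern is kept
# loose because the exact formatting depends on --log-format
grep -Eo "loss_scale[^,|]*" train.log | tail -n 20
# each halving of the scale is preceded by a gradient-overflow notice
grep -c "overflow detected" train.log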

ZhaoCinyu (Contributor) commented Nov 24, 2024

I did not observe this small loss scale during my training. The total training loss is ~4.5 at 43k steps and ~4 at 200k steps. Here is my training command on 8 GPUs:

train.py $data_args \
    --max-tokens 8192 \
    --share-all-embeddings \
    --encoder-normalize-before \
    --decoder-normalize-before \
    --optimizer adam \
    --adam-betas "(0.9, 0.98)" \
    --clip-norm 1.0 \
    --lr 0.0005 \
    --warmup-updates 4000 \
    --lr-scheduler inverse_sqrt \
    --dropout 0.1 \
    --attention-dropout 0.1 \
    --num-workers-valid 0 \
    --max-update 200000 \
    --ddp-backend fully_sharded \
    --user-dir ./smoe \
    --best-checkpoint-metric ppl \
    --log-format simple \
    --log-interval 100 \
    --fp16 \
    --save-dir output/hmoe \
    --wandb-project moe_mmt \
    --validate-interval-updates 2000 \
    --save-interval-updates 5000 \
    --keep-interval-updates 1 \
    --no-epoch-checkpoints \
    --no-last-checkpoints \
    --no-save-optimizer-state-on-training-finished \
    --arch smoe \
    --moe-gating-use-fp32 \
    --moe-second-expert-policy all \
    --moe-normalize-expert-grad sqrt_world_size \
    --criterion moe_cross_entropy \
    --moe-gate-loss-wt 0.05 \
    --moe-gate-loss-combine-method sum \
    --moe-batch-prioritized-routing \
    --use-moe-pad-mask \
    --moe-freq 2 \
    --moe-expert-count 32 \
    --add-lang-loss \
    --hmoe-gate \
    --task-mlp-path assets/task_mlp_weight.pt

You can try training on a smaller dataset first to verify the training process.
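
A hedged sketch of that sanity check, subsampling a single language pair and re-binarizing it with fairseq-preprocess. Every path and file name below is hypothetical and should be adapted to how the OPUS-100 data was actually preprocessed (e.g. a shared multilingual dictionary instead of per-language dictionaries):

# all paths are placeholders; reuse the existing dictionaries so the small run
# matches the full vocabulary
mkdir -p data/opus100_small
head -n 100000 data/opus100/train.de-en.de > data/opus100_small/train.de-en.de
head -n 100000 data/opus100/train.de-en.en > data/opus100_small/train.de-en.en
fairseq-preprocess \
    --source-lang de --target-lang en \
    --trainpref data/opus100_small/train.de-en \
    --validpref data/opus100/valid.de-en \
    --srcdict data/opus100-bin/dict.de.txt \
    --tgtdict data/opus100-bin/dict.en.txt \
    --destdir data/opus100_small-bin \
    --workers 8
# then point $data_args at data/opus100_small-bin and rerun the command above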
