I am encountering a FloatingPointError between steps 43K and 48K while training the LGR-SMoE model on the OPUS-100 dataset for a total of 200K steps. The error halts training; I've included my training script and the complete traceback below. I would appreciate any guidance on resolving this.
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/trainer.py", line 885, in train_step
grad_norm = self.clip_grad_norm(self.cfg.optimization.clip_norm)
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/trainer.py", line 1183, in clip_grad_norm
return self.optimizer.clip_grad_norm(clip_norm, aggregate_norm_fn=None)
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/optim/fp16_optimizer.py", line 426, in clip_grad_norm
self.scaler.check_overflow(grad_norm_cpu)
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/optim/dynamic_loss_scaler.py", line 61, in check_overflow
raise FloatingPointError(
FloatingPointError: Minimum loss scale reached (0.0001). Your loss is probably exploding. Try lowering the learning rate, using gradient clipping or increasing the batch size.
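For context, here is my rough mental model of why this error is raised. It is only a minimal sketch of dynamic loss scaling written in my own words, not the actual fairseq code; the floor corresponds to fairseq's `--min-loss-scale` option, which defaults to 1e-4 (the 0.0001 in the message above).

```python
import math


class ToyLossScaler:
    """Minimal dynamic loss scaler, loosely modeled on fp16 training loops."""

    def __init__(self, init_scale=128.0, min_loss_scale=1e-4):
        self.loss_scale = init_scale
        self.min_loss_scale = min_loss_scale

    def check_overflow(self, grad_norm):
        # An inf/nan gradient norm means the scaled fp16 gradients overflowed.
        if math.isinf(grad_norm) or math.isnan(grad_norm):
            # Back off: halve the scale and retry the step with a smaller scale.
            self.loss_scale /= 2.0
            if self.loss_scale < self.min_loss_scale:
                # Once the scale cannot be reduced any further, training stops,
                # which is the FloatingPointError shown in the traceback above.
                raise FloatingPointError(
                    f"Minimum loss scale reached ({self.min_loss_scale}). "
                    "Your loss is probably exploding."
                )
            return True  # overflow: skip this optimizer step
        return False


# Repeated overflows keep halving the scale until the floor is hit.
scaler = ToyLossScaler(init_scale=128.0)
try:
    while True:
        scaler.check_overflow(float("inf"))
except FloatingPointError as e:
    print(e)
```

So, as I understand it, the error itself is just the scaler refusing to shrink the scale below the floor after repeated inf/nan gradients; the underlying question is why the gradients start overflowing around step 43K.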
Traceback (most recent call last):
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/distributed/utils.py", line 354, in distributed_main
main(cfg, **kwargs)
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq_cli/train.py", line 191, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr, max_epoch)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq_cli/train.py", line 309, in train
log_output = trainer.train_step(samples)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/trainer.py", line 915, in train_step
self.task.train_step(
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/tasks/fairseq_task.py", line 484, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/criterions/fairseq_criterion.py", line 185, in forward
loss, inner_loss, moe_loss, lang_loss, moe_metadata, sample_size, logging_output = self.compute_loss(model, sample, reduce=reduce)
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/criterions/fairseq_criterion.py", line 195, in compute_loss
net_output, inner_loss, sample_size, logging_output = self.compute_inner_loss(model, sample)
File "/data/kritarth/Subtitles/Lingual-SMoE/fairseq/criterions/moe_cross_entropy.py", line 16, in compute_inner_loss
net_output = model(**sample["net_input"], tgt_lang_id=sample["tgt_lang_id"])
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
return inner()
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
result = forward_call(*args, **kwargs)
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1104, in
forward
outputs = self.module(*args, **kwargs)
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
return inner()
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
result = forward_call(*args, **kwargs)
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/fairscale/nn/misc/flatten_params_wrapper.py", line 447, in forward
return self.module(*inputs, **kwinputs)
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
return inner()
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1790, in inner
result = forward_call(*args, **kwargs)
File "/data/kritarth/Subtitles/Lingual-SMoE/smoe/smoe.py", line 105, in forward
encoder_out = self.encoder(
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1844, in _call_impl
return inner()
File "/data/kritarth/Subtitles/lingSMOE/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1820, in inner
var = next(v for v in var.values() if isinstance(v, torch.Tensor))
StopIteration
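In case it helps with diagnosis, I can check which parameters first get non-finite gradients once an overflow is reported. A generic helper along these lines (the name `find_bad_grads` is my own, not something from this repo, and under FSDP the reported names may be the flattened parameter wrappers):

```python
import torch


def find_bad_grads(model):
    """Return the names of parameters whose gradients contain inf/nan.

    Generic debugging helper; call it right after loss.backward() on a step
    where the loss scaler reports an overflow.
    """
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad


# Tiny usage example on a toy module with a deliberately broken gradient.
if __name__ == "__main__":
    layer = torch.nn.Linear(4, 4)
    out = layer(torch.randn(2, 4)).sum() * float("inf")  # force a non-finite value
    out.backward()
    print(find_bad_grads(layer))  # expect ['weight', 'bias']
```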
I did not observe this loss-scale issue during my training: the total training loss was ~4.5 at 43K steps and ~4 at 200K steps. Here is my training command on 8 GPUs: