"Loss is nan, stopping train" appears regularly #26
Comments
@YizJia What is your PyTorch version? I strongly recommend using the same version as in requirements.txt.
BTW, are you using the default configuration file (
Thank you for your reply. I have now found the problem and solved it.
I am glad to hear that.
I followed the steps in the README, set up the directory structure, and trained the model. However, training keeps failing in a strange way, as shown in the log excerpt below.
----OUTPUT----
Epoch: [5] [1660/2241] eta: 0:08:56 lr: 0.003000 loss: 2.2882 (2.4257) loss_proposal_cls: 0.0818 (0.0915) loss_proposal_reg: 1.2728 (1.4000) loss_box_cls: 0.1167 (0.1311) loss_box_reg: 0.1667 (0.1707) loss_box_reid: 0.4618 (0.5611) loss_rpn_reg: 0.0283 (0.0344) loss_rpn_cls: 0.0317 (0.0369) time: 0.9248 data: 0.0005 max mem: 24005
Loss is nan, stopping training
{'loss_proposal_cls': tensor(0.0837, device='cuda:0', grad_fn=), 'loss_proposal_reg': tensor(1.3923, device='cuda:0', grad_fn=), 'loss_box_cls': tensor(0.1187, device='cuda:0', grad_fn=), 'loss_box_reg': tensor(0.1719, device='cuda:0', grad_fn=), 'loss_box_reid': tensor(nan, device='cuda:0', grad_fn=), 'loss_rpn_reg': tensor(0.0457, device='cuda:0', grad_fn=), 'loss_rpn_cls': tensor(0.0226, device='cuda:0', grad_fn=)}
This happens after a fixed number of iterations, and the "Loss is nan, stopping training" error is very consistent. For example, after 5 epochs it appears after the 1160th batch of the 6th epoch, whether training starts from epoch 0 or is continued with the --resume option.
The error occurs and stops training regardless of whether the model is trained on an RTX A6000, RTX A5000, or Tesla V100 32GB, and regardless of whether the batch size and learning rate are scaled proportionally.
I used --resume to train for 20 epochs and observed that the NaN appeared in loss_box_reid every time.
This looks like a bug in the code, but I am not sure what causes it or how to fix it.
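To narrow this down, here is a minimal sketch of how the failing term could be picked out of the loss dictionary automatically. It assumes the dictionary has the same shape as the one printed in the log above; the helper name find_nonfinite_losses is hypothetical and not part of the repository, and the values are copied from the log as plain floats instead of CUDA tensors.

```python
import math


def find_nonfinite_losses(loss_dict):
    """Return the names of loss terms whose value is NaN or infinite.

    Works for plain floats and for single-element tensors, because both
    can be converted with float(value).
    """
    return [
        name
        for name, value in loss_dict.items()
        if not math.isfinite(float(value))
    ]


# Values taken from the log above (hypothetical stand-in for the real tensors).
losses = {
    "loss_proposal_cls": 0.0837,
    "loss_proposal_reg": 1.3923,
    "loss_box_cls": 0.1187,
    "loss_box_reg": 0.1719,
    "loss_box_reid": float("nan"),
    "loss_rpn_reg": 0.0457,
    "loss_rpn_cls": 0.0226,
}

bad_terms = find_nonfinite_losses(losses)
print(bad_terms)  # ['loss_box_reid']
```

Calling something like this right before the NaN check would let the batch that triggers the problem be logged or saved for inspection. PyTorch's torch.autograd.set_detect_anomaly(True) may also help locate the operation that produces the NaN, although it slows training down noticeably.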