Training Loss Error #1
Comments
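The data-preparation scripts are fine. Judging from the loss values, this looks like a gradient explosion. My environment is 2x 1080 Ti; I suggest trying a larger batch size or a lower learning rate.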
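One way to pick a lower learning rate when training on fewer GPUs is the linear scaling heuristic, which keeps the ratio of learning rate to total batch size constant. This is a common rule of thumb, not something stated in this thread. The sketch below assumes, purely for illustration, that the peak lr of 0.008 seen in the log was tuned for 2 GPUs with 2 images each; the helper name `scaled_lr` and the batch sizes are hypothetical.

```python
def scaled_lr(base_lr, base_batch, num_gpus, imgs_per_gpu):
    """Linear scaling heuristic: keep lr / total_batch_size constant.

    base_lr and base_batch describe the reference setup the config was
    tuned for (assumed values, not taken from the repository).
    """
    if base_batch <= 0 or num_gpus <= 0 or imgs_per_gpu <= 0:
        raise ValueError("batch sizes and GPU counts must be positive")
    total_batch = num_gpus * imgs_per_gpu
    return base_lr * total_batch / base_batch


# Hypothetical reference: lr 0.008 for 2 GPUs x 2 images = batch of 4.
# Training on a single GPU with 2 images halves the batch, so the lr halves too.
single_gpu_lr = scaled_lr(base_lr=0.008, base_batch=4, num_gpus=1, imgs_per_gpu=2)
print(single_gpu_lr)
```

The resulting value would then go into the `lr` field of the optimizer settings in the training config. If the loss still diverges, lowering the rate further is a reasonable next step.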
@j18567260 Hello, did you run into this problem during training?
use mmcv==0.4.0
@ahaheng @WeiZongqi @j18567260
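During single-GPU training, the loss values become abnormal at iteration 450/38298 of the first epoch, and roughly 400 batches later training fails with a bbox error. Is there a problem with the dataset-generation code?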
2021-04-08 14:26:24,187 - INFO - Epoch [1][400/38298] lr: 0.00692, eta: 4 days, 5:44:23, time: 0.798, data_time: 0.030, memory: 4559, loss_rpn_cls: 0.3144, loss_rpn_bbox: 0.1515, s0.rbbox_loss_cls: 0.4531, s0.rbbox_acc: 92.7930, s0.rbbox_loss_bbox: 0.5924, s1.rbbox_loss_cls: 0.2729, s1.rbbox_acc: 95.7763, s1.rbbox_loss_bbox: 0.1261, loss: 1.9104
2021-04-08 14:27:03,769 - INFO - Epoch [1][450/38298] lr: 0.00746, eta: 4 days, 5:38:36, time: 0.792, data_time: 0.027, memory: 4559, loss_rpn_cls: 1.1329, loss_rpn_bbox: 3.9938, s0.rbbox_loss_cls: 0.3731, s0.rbbox_acc: 94.3412, s0.rbbox_loss_bbox: 0.3807, s1.rbbox_loss_cls: 0.3043, s1.rbbox_acc: 95.6289, s1.rbbox_loss_bbox: 0.0709, loss: 6.2557
2021-04-08 14:27:43,211 - INFO - Epoch [1][500/38298] lr: 0.00799, eta: 4 days, 5:31:41, time: 0.789, data_time: 0.030, memory: 4559, loss_rpn_cls: 4606.1320, loss_rpn_bbox: 411.3401, s0.rbbox_loss_cls: 3.2093, s0.rbbox_acc: 95.5460, s0.rbbox_loss_bbox: 1.5279, s1.rbbox_loss_cls: 3.1706, s1.rbbox_acc: 96.1591, s1.rbbox_loss_bbox: 0.0434, loss: 5025.4235
2021-04-08 14:28:28,953 - INFO - Epoch [1][550/38298] lr: 0.00800, eta: 4 days, 6:53:34, time: 0.915, data_time: 0.139, memory: 4804, loss_rpn_cls: 2756.4105, loss_rpn_bbox: 772.6908, s0.rbbox_loss_cls: 2.1745, s0.rbbox_acc: 93.6437, s0.rbbox_loss_bbox: 1.4731, s1.rbbox_loss_cls: 2.1121, s1.rbbox_acc: 93.7372, s1.rbbox_loss_bbox: 0.0369, loss: 3534.8980
2021-04-08 14:29:06,726 - INFO - Epoch [1][600/38298] lr: 0.00800, eta: 4 days, 6:20:04, time: 0.755, data_time: 0.020, memory: 4804, loss_rpn_cls: 22709.9212, loss_rpn_bbox: 2486.1367, s0.rbbox_loss_cls: 12.9946, s0.rbbox_acc: 94.3856, s0.rbbox_loss_bbox: 5.9148, s1.rbbox_loss_cls: 17.1465, s1.rbbox_acc: 94.4870, s1.rbbox_loss_bbox: 0.2170, loss: 25232.3292
2021-04-08 14:29:44,850 - INFO - Epoch [1][650/38298] lr: 0.00800, eta: 4 days, 5:55:45, time: 0.762, data_time: 0.033, memory: 4804, loss_rpn_cls: 112816.2896, loss_rpn_bbox: 60489.4470, s0.rbbox_loss_cls: 92.8384, s0.rbbox_acc: 83.4779, s0.rbbox_loss_bbox: 22.8278, s1.rbbox_loss_cls: 109.9155, s1.rbbox_acc: 83.5014, s1.rbbox_loss_bbox: 1.9195, loss: 173533.2350
2021-04-08 14:30:21,997 - INFO - Epoch [1][700/38298] lr: 0.00800, eta: 4 days, 5:24:08, time: 0.743, data_time: 0.020, memory: 4804, loss_rpn_cls: 17055451.9681, loss_rpn_bbox: 28939940.7239, s0.rbbox_loss_cls: 1998.5494, s0.rbbox_acc: 85.2132, s0.rbbox_loss_bbox: 360.3875, s1.rbbox_loss_cls: 2116.1039, s1.rbbox_acc: 85.2131, s1.rbbox_loss_bbox: 42.7391, loss: 45999909.1724
2021-04-08 14:30:58,686 - INFO - Epoch [1][750/38298] lr: 0.00800, eta: 4 days, 4:52:00, time: 0.734, data_time: 0.020, memory: 4804, loss_rpn_cls: 610655327.4717, loss_rpn_bbox: 877621758.9251, s0.rbbox_loss_cls: 4449.5225, s0.rbbox_acc: 84.1080, s0.rbbox_loss_bbox: 1000.2603, s1.rbbox_loss_cls: 4128.5755, s1.rbbox_acc: 81.5336, s1.rbbox_loss_bbox: 79.6218, loss: 1488286705.3515
2021-04-08 14:31:36,322 - INFO - Epoch [1][800/38298] lr: 0.00800, eta: 4 days, 4:32:51, time: 0.753, data_time: 0.031, memory: 4804, loss_rpn_cls: 3780733863668.9443, loss_rpn_bbox: 4202940688343.9102, s0.rbbox_loss_cls: 191388.0878, s0.rbbox_acc: 83.0183, s0.rbbox_loss_bbox: 89285.3539, s1.rbbox_loss_cls: 480514.5821, s1.rbbox_acc: 83.1815, s1.rbbox_loss_bbox: 2610.6784, loss: 7983675289758.8213
Traceback (most recent call last):
File "tools/train.py", line 97, in <module>
main()
File "tools/train.py", line 93, in main
logger=logger)
File "/home/gen/PycharmProjects/CG-Net-master/mmdet/apis/train.py", line 61, in train_detector
_non_dist_train(model, dataset, cfg, validate=validate)
File "/home/gen/PycharmProjects/CG-Net-master/mmdet/apis/train.py", line 219, in _non_dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/mmcv/runner/runner.py", line 384, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/mmcv/runner/runner.py", line 283, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/home/gen/PycharmProjects/CG-Net-master/mmdet/apis/train.py", line 39, in batch_processor
losses = model(**data)
File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/gen/PycharmProjects/CG-Net-master/mmdet/models/detectors/base_new.py", line 95, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/home/gen/PycharmProjects/CG-Net-master/mmdet/models/detectors/RoITransformer.py", line 223, in forward_train
gt_labels[i])
File "/home/gen/PycharmProjects/CG-Net-master/mmdet/core/bbox/assigners/max_iou_assigner_rbbox.py", line 73, in assign
raise ValueError('No gt or bboxes')
ValueError: No gt or bboxes
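To locate the first step at which the loss takes off in a log like the one above, a small script can scan the text log instead of reading it by eye. This is a minimal sketch, assuming each line ends with `loss: <value>` as in the pasted output; the function name `first_divergence` and the jump threshold `factor=3.0` are hypothetical choices, and the sample lines are abbreviated copies of the log above.

```python
import re

# Captures epoch, iteration and the final total loss of one log line.
LINE_RE = re.compile(
    r"Epoch \[(\d+)\]\[(\d+)/\d+\].*?\bloss: ([0-9.eE+-]+|nan|inf)\s*$"
)


def first_divergence(lines, factor=3.0):
    """Return (epoch, iteration, loss) for the first logged step whose total
    loss is NaN or more than `factor` times the previous logged loss.
    Returns None if no such step is found."""
    prev = None
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip traceback lines and anything that is not a loss line
        epoch, iteration = int(match.group(1)), int(match.group(2))
        loss = float(match.group(3))
        is_nan = loss != loss
        if prev is not None and (is_nan or loss > factor * prev):
            return epoch, iteration, loss
        prev = loss
    return None


# Abbreviated lines from the log above (most fields omitted).
sample = [
    "2021-04-08 14:26:24,187 - INFO - Epoch [1][400/38298] lr: 0.00692, loss_rpn_cls: 0.3144, loss: 1.9104",
    "2021-04-08 14:27:03,769 - INFO - Epoch [1][450/38298] lr: 0.00746, loss_rpn_cls: 1.1329, loss: 6.2557",
    "2021-04-08 14:27:43,211 - INFO - Epoch [1][500/38298] lr: 0.00799, loss_rpn_cls: 4606.1320, loss: 5025.4235",
]
print(first_divergence(sample))
```

On these sample lines the function flags iteration 450, where the total loss jumps from 1.9104 to 6.2557, which matches the point reported in the question. A step flagged this early, while the learning rate is still ramping up, points toward the optimization settings rather than the final `No gt or bboxes` error alone.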