Training Loss Error #1

Open
j18567260 opened this issue Apr 8, 2021 · 4 comments

@j18567260

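During single-GPU training, the loss becomes abnormal at iteration 450/38298 of the first epoch and keeps growing; about 400 batches later, training fails with a bbox error. Is there a problem with the dataset generation code?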
2021-04-08 14:26:24,187 - INFO - Epoch [1][400/38298] lr: 0.00692, eta: 4 days, 5:44:23, time: 0.798, data_time: 0.030, memory: 4559, loss_rpn_cls: 0.3144, loss_rpn_bbox: 0.1515, s0.rbbox_loss_cls: 0.4531, s0.rbbox_acc: 92.7930, s0.rbbox_loss_bbox: 0.5924, s1.rbbox_loss_cls: 0.2729, s1.rbbox_acc: 95.7763, s1.rbbox_loss_bbox: 0.1261, loss: 1.9104
2021-04-08 14:27:03,769 - INFO - Epoch [1][450/38298] lr: 0.00746, eta: 4 days, 5:38:36, time: 0.792, data_time: 0.027, memory: 4559, loss_rpn_cls: 1.1329, loss_rpn_bbox: 3.9938, s0.rbbox_loss_cls: 0.3731, s0.rbbox_acc: 94.3412, s0.rbbox_loss_bbox: 0.3807, s1.rbbox_loss_cls: 0.3043, s1.rbbox_acc: 95.6289, s1.rbbox_loss_bbox: 0.0709, loss: 6.2557
2021-04-08 14:27:43,211 - INFO - Epoch [1][500/38298] lr: 0.00799, eta: 4 days, 5:31:41, time: 0.789, data_time: 0.030, memory: 4559, loss_rpn_cls: 4606.1320, loss_rpn_bbox: 411.3401, s0.rbbox_loss_cls: 3.2093, s0.rbbox_acc: 95.5460, s0.rbbox_loss_bbox: 1.5279, s1.rbbox_loss_cls: 3.1706, s1.rbbox_acc: 96.1591, s1.rbbox_loss_bbox: 0.0434, loss: 5025.4235
2021-04-08 14:28:28,953 - INFO - Epoch [1][550/38298] lr: 0.00800, eta: 4 days, 6:53:34, time: 0.915, data_time: 0.139, memory: 4804, loss_rpn_cls: 2756.4105, loss_rpn_bbox: 772.6908, s0.rbbox_loss_cls: 2.1745, s0.rbbox_acc: 93.6437, s0.rbbox_loss_bbox: 1.4731, s1.rbbox_loss_cls: 2.1121, s1.rbbox_acc: 93.7372, s1.rbbox_loss_bbox: 0.0369, loss: 3534.8980
2021-04-08 14:29:06,726 - INFO - Epoch [1][600/38298] lr: 0.00800, eta: 4 days, 6:20:04, time: 0.755, data_time: 0.020, memory: 4804, loss_rpn_cls: 22709.9212, loss_rpn_bbox: 2486.1367, s0.rbbox_loss_cls: 12.9946, s0.rbbox_acc: 94.3856, s0.rbbox_loss_bbox: 5.9148, s1.rbbox_loss_cls: 17.1465, s1.rbbox_acc: 94.4870, s1.rbbox_loss_bbox: 0.2170, loss: 25232.3292
2021-04-08 14:29:44,850 - INFO - Epoch [1][650/38298] lr: 0.00800, eta: 4 days, 5:55:45, time: 0.762, data_time: 0.033, memory: 4804, loss_rpn_cls: 112816.2896, loss_rpn_bbox: 60489.4470, s0.rbbox_loss_cls: 92.8384, s0.rbbox_acc: 83.4779, s0.rbbox_loss_bbox: 22.8278, s1.rbbox_loss_cls: 109.9155, s1.rbbox_acc: 83.5014, s1.rbbox_loss_bbox: 1.9195, loss: 173533.2350
2021-04-08 14:30:21,997 - INFO - Epoch [1][700/38298] lr: 0.00800, eta: 4 days, 5:24:08, time: 0.743, data_time: 0.020, memory: 4804, loss_rpn_cls: 17055451.9681, loss_rpn_bbox: 28939940.7239, s0.rbbox_loss_cls: 1998.5494, s0.rbbox_acc: 85.2132, s0.rbbox_loss_bbox: 360.3875, s1.rbbox_loss_cls: 2116.1039, s1.rbbox_acc: 85.2131, s1.rbbox_loss_bbox: 42.7391, loss: 45999909.1724
2021-04-08 14:30:58,686 - INFO - Epoch [1][750/38298] lr: 0.00800, eta: 4 days, 4:52:00, time: 0.734, data_time: 0.020, memory: 4804, loss_rpn_cls: 610655327.4717, loss_rpn_bbox: 877621758.9251, s0.rbbox_loss_cls: 4449.5225, s0.rbbox_acc: 84.1080, s0.rbbox_loss_bbox: 1000.2603, s1.rbbox_loss_cls: 4128.5755, s1.rbbox_acc: 81.5336, s1.rbbox_loss_bbox: 79.6218, loss: 1488286705.3515
2021-04-08 14:31:36,322 - INFO - Epoch [1][800/38298] lr: 0.00800, eta: 4 days, 4:32:51, time: 0.753, data_time: 0.031, memory: 4804, loss_rpn_cls: 3780733863668.9443, loss_rpn_bbox: 4202940688343.9102, s0.rbbox_loss_cls: 191388.0878, s0.rbbox_acc: 83.0183, s0.rbbox_loss_bbox: 89285.3539, s1.rbbox_loss_cls: 480514.5821, s1.rbbox_acc: 83.1815, s1.rbbox_loss_bbox: 2610.6784, loss: 7983675289758.8213
Traceback (most recent call last):
  File "tools/train.py", line 97, in <module>
    main()
  File "tools/train.py", line 93, in main
    logger=logger)
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/apis/train.py", line 61, in train_detector
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/apis/train.py", line 219, in _non_dist_train
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/mmcv/runner/runner.py", line 384, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/mmcv/runner/runner.py", line 283, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/apis/train.py", line 39, in batch_processor
    losses = model(**data)
  File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/gen/anaconda3/envs/cgnet/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/models/detectors/base_new.py", line 95, in forward
    return self.forward_train(img, img_meta, **kwargs)
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/models/detectors/RoITransformer.py", line 223, in forward_train
    gt_labels[i])
  File "/home/gen/PycharmProjects/CG-Net-master/mmdet/core/bbox/assigners/max_iou_assigner_rbbox.py", line 73, in assign
    raise ValueError('No gt or bboxes')
ValueError: No gt or bboxes
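
To pin down where the loss first departs from its trend, the text log above can be scanned automatically. This is a minimal sketch that assumes the log keeps the line format shown in this issue (total `loss:` as the last field). `first_divergence` and its `factor` threshold are hypothetical names chosen for illustration, not part of mmcv or mmdet, and the sample lines are abbreviated copies of the log above.

```python
import math
import re

# Matches lines like "... Epoch [1][450/38298] ... loss: 6.2557" (total loss is the last field).
LOG_LINE = re.compile(r"Epoch \[(\d+)\]\[(\d+)/\d+\].*\bloss: ([-+0-9.eE]+|nan|inf)\s*$")


def first_divergence(lines, factor=3.0):
    """Return (epoch, iteration, loss) for the first logged step whose total loss
    is NaN or exceeds `factor` times the previously logged loss, else None."""
    prev = None
    for line in lines:
        m = LOG_LINE.search(line)
        if m is None:
            continue  # skip traceback lines and anything that is not a loss record
        epoch, iteration, loss = int(m.group(1)), int(m.group(2)), float(m.group(3))
        if prev is not None and (math.isnan(loss) or loss > factor * prev):
            return epoch, iteration, loss
        prev = loss
    return None


# Abbreviated copies of three log lines from this issue.
sample = [
    "2021-04-08 14:26:24,187 - INFO - Epoch [1][400/38298] lr: 0.00692, loss_rpn_cls: 0.3144, loss_rpn_bbox: 0.1515, loss: 1.9104",
    "2021-04-08 14:27:03,769 - INFO - Epoch [1][450/38298] lr: 0.00746, loss_rpn_cls: 1.1329, loss_rpn_bbox: 3.9938, loss: 6.2557",
    "2021-04-08 14:27:43,211 - INFO - Epoch [1][500/38298] lr: 0.00799, loss_rpn_cls: 4606.1320, loss_rpn_bbox: 411.3401, loss: 5025.4235",
]

print(first_divergence(sample))  # (1, 450, 6.2557)
```

On these lines the jump is flagged at iteration 450, where `loss_rpn_bbox` rises from 0.1515 to 3.9938 while the learning rate is still warming up.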

@WeiZongqi
Owner

WeiZongqi commented Apr 14, 2021

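The data preparation script is fine. Judging from the loss values, this looks like a gradient explosion. My environment is two 1080 Ti GPUs; I suggest trying a larger batch size or a smaller learning rate.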
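
One common heuristic for choosing a smaller learning rate when moving from two GPUs to one is the linear scaling rule: scale the learning rate in proportion to the total batch size. The sketch below assumes that the 0.008 post-warmup rate seen in the log was tuned for the 2-GPU setup and that the per-GPU batch size is 2 on both machines. Neither assumption is confirmed in this thread, and `scaled_lr` is a hypothetical helper, not part of mmdet.

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: keep lr / total_batch_size constant."""
    if base_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch / base_batch


# Assumed values (not confirmed in this thread):
imgs_per_gpu = 2                   # hypothetical per-GPU batch size
base_lr = 0.008                    # post-warmup lr shown in the log above
two_gpu_batch = 2 * imgs_per_gpu   # total batch on the 2-GPU setup
one_gpu_batch = 1 * imgs_per_gpu   # total batch on a single GPU

print(f"{scaled_lr(base_lr, two_gpu_batch, one_gpu_batch):.4f}")  # prints 0.0040
```

Under these assumptions the single-GPU run would start from 0.004 rather than 0.008. If the loss still diverges, lowering it further is a reasonable next step.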

@kongyan66

@j18567260 Hello, did you run into this error during training?
TypeError: logger must be a logging.Logger object, but got <class 'str'>

@hengseuer

> @j18567260 Hello, did you run into this error during training?
> TypeError: logger must be a logging.Logger object, but got <class 'str'>

Use mmcv==0.4.0.

@kongyan66

kongyan66 commented May 28, 2021

@ahaheng @WeiZongqi @j18567260
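The environment issue is resolved. I am using the DOTA dataset converted to COCO format, but training now raises this error:
[image]
Has anyone else run into this?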
