loss=nan #14

Open
meixiangzhang opened this issue Aug 16, 2023 · 2 comments

@meixiangzhang

During training, the loss was normal at first, but later it became nan. I am using the URPC2019 dataset in VOC format, and the environment is configured according to your instructions. Apart from the dataset format, I have not modified any other configuration files. How can I solve this problem?

--------------------
2023-08-16 19:47:44,175 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2023-08-16 19:47:44,175 - mmdet - INFO - Checkpoints will be saved to D:\ZMX\Boosting-R-CNN-master\work_dirs\boosting_rcnn_r50_pafpn_1x_voc by HardDiskBackend.
2023-08-16 19:48:46,645 - mmdet - INFO - Epoch [1][50/951]  lr: 4.945e-04, eta: 3:56:35, time: 1.249, data_time: 0.801, memory: 6129, loss_rpn_cls: 0.5699, loss_rpn_bbox: 0.4639, loss_rpn_iou: 0.6679, loss_bbox: 0.3773, loss_cls: 1.0322, acc: 82.9473, loss: 3.1113, grad_norm: 36.3708
2023-08-16 19:49:05,882 - mmdet - INFO - Epoch [1][100/951]  lr: 9.940e-04, eta: 2:34:02, time: 0.385, data_time: 0.003, memory: 6129, loss_rpn_cls: 0.5379, loss_rpn_bbox: 0.4408, loss_rpn_iou: 0.6699, loss_bbox: 0.4450, loss_cls: 0.5393, acc: 94.2852, loss: 2.6328, grad_norm: 25.0444
2023-08-16 19:49:25,151 - mmdet - INFO - Epoch [1][150/951]  lr: 1.494e-03, eta: 2:06:21, time: 0.385, data_time: 0.003, memory: 6129, loss_rpn_cls: 0.7055, loss_rpn_bbox: 0.4066, loss_rpn_iou: 0.6846, loss_bbox: 0.6769, loss_cls: 0.5754, acc: 93.0977, loss: 3.0490, grad_norm: 18.9466
2023-08-16 19:49:44,496 - mmdet - INFO - Epoch [1][200/951]  lr: 1.993e-03, eta: 1:52:25, time: 0.387, data_time: 0.003, memory: 6129, loss_rpn_cls: 0.6782, loss_rpn_bbox: 0.3794, loss_rpn_iou: 0.6740, loss_bbox: 0.6967, loss_cls: 0.4901, acc: 93.7217, loss: 2.9185, grad_norm: 15.8338
2023-08-16 19:50:03,819 - mmdet - INFO - Epoch [1][250/951]  lr: 2.493e-03, eta: 1:43:54, time: 0.386, data_time: 0.003, memory: 6129, loss_rpn_cls: 0.6850, loss_rpn_bbox: 0.3607, loss_rpn_iou: 0.6641, loss_bbox: 0.6738, loss_cls: 0.4112, acc: 94.6426, loss: 2.7948, grad_norm: 15.0501
2023-08-16 19:50:23,153 - mmdet - INFO - Epoch [1][300/951]  lr: 2.992e-03, eta: 1:38:08, time: 0.387, data_time: 0.003, memory: 6129, loss_rpn_cls: 0.7968, loss_rpn_bbox: 0.3522, loss_rpn_iou: 0.6599, loss_bbox: 0.6060, loss_cls: 0.3809, acc: 94.4990, loss: 2.7958, grad_norm: 12.3667
2023-08-16 19:50:42,446 - mmdet - INFO - Epoch [1][350/951]  lr: 3.492e-03, eta: 1:33:54, time: 0.386, data_time: 0.003, memory: 6129, loss_rpn_cls: 0.6889, loss_rpn_bbox: 0.3408, loss_rpn_iou: 0.6569, loss_bbox: 0.6401, loss_cls: 0.3920, acc: 94.2471, loss: 2.7187, grad_norm: 11.3766
2023-08-16 19:51:01,799 - mmdet - INFO - Epoch [1][400/951]  lr: 3.991e-03, eta: 1:30:40, time: 0.387, data_time: 0.003, memory: 6129, loss_rpn_cls: 0.6298, loss_rpn_bbox: 0.3567, loss_rpn_iou: 0.6685, loss_bbox: 0.6711, loss_cls: 0.3629, acc: 94.6396, loss: 2.6889, grad_norm: 10.2118
2023-08-16 19:51:21,118 - mmdet - INFO - Epoch [1][450/951]  lr: 4.491e-03, eta: 1:28:04, time: 0.386, data_time: 0.003, memory: 6129, loss_rpn_cls: 0.7109, loss_rpn_bbox: 0.3316, loss_rpn_iou: 0.6526, loss_bbox: 0.6230, loss_cls: 0.3773, acc: 94.5322, loss: 2.6955, grad_norm: 9.5380
2023-08-16 19:51:39,808 - mmdet - INFO - Epoch [1][500/951]  lr: 4.990e-03, eta: 1:25:42, time: 0.374, data_time: 0.003, memory: 6129, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_rpn_iou: nan, loss_bbox: nan, loss_cls: nan, acc: 85.8550, loss: nan, grad_norm: nan
2023-08-16 19:51:53,656 - mmdet - INFO - Epoch [1][550/951]  lr: 5.000e-03, eta: 1:22:06, time: 0.277, data_time: 0.003, memory: 6129, loss_rpn_cls: nan, loss_rpn_bbox: nan, loss_rpn_iou: nan, loss_bbox: nan, loss_cls: nan, acc: 12.9807, loss: nan, grad_norm: nan

@mousecpn
Owner


That's super weird. But I have noticed that the VOC-format URPC dataset often causes problems. My suggestions are:
(1) Check whether the problem comes from the dataset: locate the image that produces the nan, and check whether the failing batches always contain the same image.
(2) Try gradient clipping (grad_clip); a config sketch follows this list.
(3) Convert the VOC-format annotations to COCO format.
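
For reference, a minimal sketch of suggestion (2) in an MMDetection 2.x-style config. The max_norm and norm_type values here are illustrative defaults, not values taken from this repo's configs:

    # Minimal sketch: enable gradient clipping via the standard MMDetection
    # optimizer_config hook. max_norm=35, norm_type=2 are common illustrative
    # values (assumptions), not defaults from the Boosting R-CNN configs.
    optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))

Clipping the gradient norm usually stops one bad batch from blowing up the weights, though it does not fix broken annotations in the dataset itself.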

@meixiangzhang
Author


Thank you for your suggestions. I converted the dataset to COCO format and it is now training properly.
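
For anyone who hits the same problem, below is a rough, self-contained sketch of converting Pascal VOC XML annotations to a COCO-style JSON file. The paths and the URPC class list are assumptions for illustration, and recent MMDetection versions also ship a Pascal VOC converter script, which may be the safer route if it is available in your checkout. The sketch also drops zero-area boxes, a common source of nan losses:

    # Rough sketch: convert Pascal VOC XML annotations to a COCO-style JSON.
    # Paths and the class list below are hypothetical placeholders.
    import json
    import os
    import xml.etree.ElementTree as ET

    CLASSES = ["holothurian", "echinus", "scallop", "starfish"]  # example URPC classes

    def voc_to_coco(ann_dir, out_json):
        images, annotations = [], []
        ann_id = 1
        for img_id, fname in enumerate(sorted(os.listdir(ann_dir)), start=1):
            if not fname.endswith(".xml"):
                continue
            root = ET.parse(os.path.join(ann_dir, fname)).getroot()
            size = root.find("size")
            images.append(dict(
                id=img_id,
                file_name=root.findtext("filename"),
                width=int(size.findtext("width")),
                height=int(size.findtext("height")),
            ))
            for obj in root.findall("object"):
                name = obj.findtext("name")
                if name not in CLASSES:
                    continue
                bb = obj.find("bndbox")
                x1, y1 = float(bb.findtext("xmin")), float(bb.findtext("ymin"))
                x2, y2 = float(bb.findtext("xmax")), float(bb.findtext("ymax"))
                w, h = x2 - x1, y2 - y1
                if w <= 0 or h <= 0:  # skip degenerate boxes, a common nan source
                    continue
                annotations.append(dict(
                    id=ann_id, image_id=img_id,
                    category_id=CLASSES.index(name) + 1,
                    bbox=[x1, y1, w, h], area=w * h, iscrowd=0,
                ))
                ann_id += 1
        categories = [dict(id=i + 1, name=c) for i, c in enumerate(CLASSES)]
        with open(out_json, "w") as f:
            json.dump(dict(images=images, annotations=annotations,
                           categories=categories), f)

    # Hypothetical usage: point at the VOC Annotations folder of the train split.
    voc_to_coco("Annotations", "urpc_train_coco.json")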
