🐛 Bug
I can train the standard Faster/Mask R-CNN with R-50/FPN on FP16, but I failed to run it with RetinaNet (the sigmoid focal loss is the problem).
I guess there could be two options:
1. Implement a Half-based CUDA kernel (e.g. make the kernel's dispatch cover Half as well as float).
2. Use apex to handle the FP32/FP16 conversion during the forward/backward pass, as sketched below.
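A minimal sketch of option 2 (an assumption on my side, not code from this repo): wrap the stock `SigmoidFocalLoss` from `maskrcnn_benchmark.layers` so that Half logits are upcast to FP32 before they reach the float-only CUDA kernel; the `FP32SigmoidFocalLoss` name is mine.

```python
import torch
from maskrcnn_benchmark.layers import SigmoidFocalLoss


class FP32SigmoidFocalLoss(torch.nn.Module):
    """Hypothetical wrapper: run the float-only CUDA focal loss under amp O1."""

    def __init__(self, gamma=2.0, alpha=0.25):
        super().__init__()
        self.loss = SigmoidFocalLoss(gamma, alpha)

    def forward(self, logits, targets):
        # Under apex O1 the head outputs Half; upcast before the kernel
        # so the loss is computed and accumulated in FP32.
        return self.loss(logits.float(), targets)
```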
I found that the current Sigmoid Focal Loss CUDA code doesn't support FP16 (Half).
For example, I could train the standard FasterRCNN-R50-FPN, but RetinaNet-R50-FPN (which uses the sigmoid focal loss function) failed on FP16 (using the O1 opt level).
When I use the Sigmoid Focal Loss CPU version, it runs OK for both the original maskrcnn-benchmark RetinaNet models and FCOS's RetinaNet models.
BUT the estimated training time is almost one month(!), and after some iterations only the sigmoid focal loss (loss_cls) becomes NaN:
2019-08-16 17:42:57,158 maskrcnn_benchmark.trainer INFO: Start training
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16384.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1024.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1024.0
2019-08-16 17:45:22,396 maskrcnn_benchmark.trainer INFO: eta: 30 days, 6:08:34 iter: 20 loss: 4.3898 (5.7831) loss_centerness: 0.6876 (0.7018) loss_cls: 1.0846 (1.0754) loss_reg: 2.5395 (4.0059) time: 8.5586 (7.2618) data: 0.0029 (0.0810) lr: 0.003333 max mem: 6133
2019-08-16 17:46:51,064 maskrcnn_benchmark.trainer INFO: eta: 24 days, 8:41:46 iter: 40 loss: 3.3856 (4.6105) loss_centerness: 0.6687 (0.6854) loss_cls: 1.0689 (1.0852) loss_reg: 1.6177 (2.8399) time: 3.0172 (5.8476) data: 0.0025 (0.0420) lr: 0.003333 max mem: 6133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 512.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 256.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 128.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 64.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 32.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 16.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125
2019-08-16 17:48:29,613 maskrcnn_benchmark.trainer INFO: eta: 23 days, 1:59:50 iter: 60 loss: nan (nan) loss_centerness: 0.6633 (0.6788) loss_cls: nan (nan) loss_reg: 1.5505 (2.3987) time: 4.5355 (5.5409) data: 0.0030 (0.0291) lr: 0.003333 max mem: 6133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.03125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.015625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0078125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00390625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.001953125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0009765625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00048828125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.00048828125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.000244140625
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.0001220703125
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.103515625e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.103515625e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.0517578125e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.52587890625e-05
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.62939453125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.814697265625e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9073486328125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.9073486328125e-06
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.5367431640625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.76837158203125e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.76837158203125e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.384185791015625e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1920928955078125e-07
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.960464477539063e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
2019-08-16 17:49:48,975 maskrcnn_benchmark.trainer INFO: eta: 21 days, 10:39:17 iter: 80 loss: nan (nan) loss_centerness: 0.6606 (0.6748) loss_cls: nan (nan) loss_reg: 1.5613 (2.1857) time: 3.1732 (5.1477) data: 0.0024 (0.0225) lr: 0.003333 max mem: 6133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9802322387695312e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4901161193847656e-08
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.450580596923828e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.725290298461914e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.725290298461914e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.862645149230957e-09
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.313225746154785e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.313225746154785e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.656612873077393e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.3283064365386963e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1641532182693481e-10
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.820766091346741e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9103830456733704e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.9103830456733704e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4551915228366852e-11
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.275957614183426e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.637978807091713e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8189894035458565e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.8189894035458565e-12
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 9.094947017729282e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.547473508864641e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2737367544323206e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.2737367544323206e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1368683772161603e-13
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.684341886080802e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14
2019-08-16 17:51:34,847 maskrcnn_benchmark.trainer INFO: eta: 21 days, 13:32:36 iter: 100 loss: nan (nan) loss_centerness: 0.6631 (0.6728) loss_cls: nan (nan) loss_reg: 1.5312 (2.0573) time: 7.3970 (5.1769) data: 0.0026 (0.0186) lr: 0.003333 max mem: 6133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.842170943040401e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.4210854715202004e-14
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.105427357601002e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 7.105427357601002e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.552713678800501e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7763568394002505e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.7763568394002505e-15
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.881784197001252e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.440892098500626e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.440892098500626e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.220446049250313e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1102230246251565e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.1102230246251565e-16
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.551115123125783e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.7755575615628914e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3877787807814457e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3877787807814457e-17
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.938893903907228e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.469446951953614e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.469446951953614e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.734723475976807e-18
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.673617379884035e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.673617379884035e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.336808689942018e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.168404344971009e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.168404344971009e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0842021724855044e-19
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 5.421010862427522e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.710505431213761e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.710505431213761e-20
2019-08-16 17:52:37,750 maskrcnn_benchmark.trainer INFO: eta: 20 days, 3:39:49 iter: 120 loss: nan (nan) loss_centerness: 0.6604 (0.6710) loss_cls: nan (nan) loss_reg: 1.5168 (1.9788) time: 2.9350 (4.8383) data: 0.0030 (0.0161) lr: 0.003333 max mem: 6133
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.3552527156068805e-20
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.776263578034403e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 6.776263578034403e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 3.3881317890172014e-21
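For context on the log above: apex's dynamic loss scaler skips the optimizer step and halves the scale whenever it sees inf/NaN gradients. A simplified model of that behavior (illustrative only, not apex's actual code) shows why the scale collapses toward zero here: once `loss_cls` is NaN, every step overflows, so the scale is halved indefinitely.

```python
class DynamicLossScaler:
    """Simplified model of apex.amp dynamic loss scaling (illustrative only)."""

    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow):
        if found_overflow:
            # Source of the repeated "Gradient overflow. Skipping step,
            # loss scaler 0 reducing loss scale to ..." lines: the step
            # is skipped and the scale is cut in half.
            self.scale /= 2.0
            self._good_steps = 0
            return False  # skip optimizer.step()
        self._good_steps += 1
        if self._good_steps % self.growth_interval == 0:
            self.scale *= 2.0  # periodically try a larger scale again
        return True
```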
To Reproduce
After installing this repo, I just run the following command:
python3 -m torch.distributed.launch --nproc_per_node=$NGPUS /home/ktai01/maskrcnn-benchmark/tools/train_net.py --config-file /home/ktai01/maskrcnn-benchmark/configs/retinanet/retinanet_R-50-FPN_P5_1x.yaml MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN 1000 DTYPE "float16"
Traceback (most recent call last):
File "/home/ktai01/maskrcnn-benchmark/tools/train_net.py", line 191, in
main()
File "/home/ktai01/maskrcnn-benchmark/tools/train_net.py", line 184, in main
model = train(cfg, args.local_rank, args.distributed)
File "/home/ktai01/maskrcnn-benchmark/tools/train_net.py", line 85, in train
arguments,
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 71, in do_train
loss_dict = model(images, targets)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 376, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
proposals, proposal_losses = self.rpn(images, features, targets)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 131, in forward
return self._forward_train(anchors, box_cls, box_regression, targets)
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 138, in _forward_train
anchors, box_cls, box_regression, targets
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/retinanet/loss.py", line 77, in call
labels
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, kwargs)
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/layers/sigmoid_focal_loss.py", line 68, in forward
loss = loss_func(logits, targets, self.gamma, self.alpha)
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/layers/sigmoid_focal_loss.py", line 19, in forward
logits, targets, num_classes, gamma, alpha
RuntimeError: "SigmoidFocalLoss_forward" not implemented for 'Half' (operator() at /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cuda/SigmoidFocalLoss_cuda.cu:139)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fdc14f1b441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fdc14f1ad7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: + 0x4917f (0x7fdbae35c17f in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: SigmoidFocalLoss_forward_cuda(at::Tensor const&, at::Tensor const&, int, float, float) + 0x606 (0x7fdbae35c7f5 in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: SigmoidFocalLoss_forward(at::Tensor const&, at::Tensor const&, int, float, float) + 0x64 (0x7fdbae32cb44 in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: + 0x28fcf (0x7fdbae33bfcf in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #6: + 0x25291 (0x7fdbae338291 in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: /usr/bin/python3() [0x5030d5]
frame #8: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #9: /usr/bin/python3() [0x504c28]
frame #10: /usr/bin/python3() [0x58644b]
frame #11: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #12: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7fdc11ec5481 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #13: /usr/bin/python3() [0x502d6f]
frame #14: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #15: _PyFunction_FastCallDict + 0xf5 (0x501945 in /usr/bin/python3)
frame #16: /usr/bin/python3() [0x591461]
frame #17: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #18: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
frame #19: /usr/bin/python3() [0x504c28]
frame #20: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
frame #21: /usr/bin/python3() [0x591461]
frame #22: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #23: /usr/bin/python3() [0x54d4e2]
frame #24: _PyObject_FastCallKeywords + 0x19c (0x5a730c in /usr/bin/python3)
frame #25: /usr/bin/python3() [0x503073]
frame #26: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #27: _PyFunction_FastCallDict + 0xf5 (0x501945 in /usr/bin/python3)
frame #28: /usr/bin/python3() [0x591461]
frame #29: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #30: /usr/bin/python3() [0x54d4e2]
frame #31: _PyObject_FastCallKeywords + 0x19c (0x5a730c in /usr/bin/python3)
frame #32: /usr/bin/python3() [0x503073]
frame #33: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #34: /usr/bin/python3() [0x502209]
frame #35: /usr/bin/python3() [0x502f3d]
frame #36: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #37: /usr/bin/python3() [0x504c28]
frame #38: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
frame #39: /usr/bin/python3() [0x591461]
frame #40: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #41: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
frame #42: /usr/bin/python3() [0x504c28]
frame #43: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
frame #44: /usr/bin/python3() [0x591461]
frame #45: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #46: /usr/bin/python3() [0x54d4e2]
frame #47: _PyObject_FastCallKeywords + 0x19c (0x5a730c in /usr/bin/python3)
frame #48: /usr/bin/python3() [0x503073]
frame #49: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #50: /usr/bin/python3() [0x504c28]
frame #51: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
frame #52: /usr/bin/python3() [0x591461]
frame #53: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #54: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
frame #55: /usr/bin/python3() [0x504c28]
frame #56: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
frame #57: /usr/bin/python3() [0x591461]
frame #58: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #59: /usr/bin/python3() [0x54d4e2]
frame #60: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #61: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
frame #62: /usr/bin/python3() [0x504c28]
frame #63: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
Traceback (most recent call last):
File "/home/ktai01/maskrcnn-benchmark/tools/train_net.py", line 191, in
main()
File "/home/ktai01/maskrcnn-benchmark/tools/train_net.py", line 184, in main
model = train(cfg, args.local_rank, args.distributed)
File "/home/ktai01/maskrcnn-benchmark/tools/train_net.py", line 85, in train
arguments,
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 71, in do_train
loss_dict = model(images, targets)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 376, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/detector/generalized_rcnn.py", line 50, in forward
proposals, proposal_losses = self.rpn(images, features, targets)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 131, in forward
return self._forward_train(anchors, box_cls, box_regression, targets)
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/retinanet/retinanet.py", line 138, in _forward_train
anchors, box_cls, box_regression, targets
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/modeling/rpn/retinanet/loss.py", line 77, in call
labels
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, kwargs)
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/layers/sigmoid_focal_loss.py", line 68, in forward
loss = loss_func(logits, targets, self.gamma, self.alpha)
File "/home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/layers/sigmoid_focal_loss.py", line 19, in forward
logits, targets, num_classes, gamma, alpha
RuntimeError: "SigmoidFocalLoss_forward" not implemented for 'Half' (operator() at /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/csrc/cuda/SigmoidFocalLoss_cuda.cu:139)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7fd51e596441 in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7fd51e595d7a in /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so)
frame #2: + 0x4917f (0x7fd4b872217f in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #3: SigmoidFocalLoss_forward_cuda(at::Tensor const&, at::Tensor const&, int, float, float) + 0x606 (0x7fd4b87227f5 in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #4: SigmoidFocalLoss_forward(at::Tensor const&, at::Tensor const&, int, float, float) + 0x64 (0x7fd4b86f2b44 in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #5: + 0x28fcf (0x7fd4b8701fcf in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #6: + 0x25291 (0x7fd4b86fe291 in /home/ktai01/maskrcnn-benchmark/maskrcnn_benchmark/_C.cpython-36m-x86_64-linux-gnu.so)
frame #7: /usr/bin/python3() [0x5030d5]
frame #8: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #9: /usr/bin/python3() [0x504c28]
frame #10: /usr/bin/python3() [0x58644b]
frame #11: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #12: THPFunction_apply(_object*, _object*) + 0x6b1 (0x7fd51edb3481 in /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_python.so)
frame #13: /usr/bin/python3() [0x502d6f]
frame #14: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #15: _PyFunction_FastCallDict + 0xf5 (0x501945 in /usr/bin/python3)
frame #16: /usr/bin/python3() [0x591461]
frame #17: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #18: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
frame #19: /usr/bin/python3() [0x504c28]
frame #20: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
frame #21: /usr/bin/python3() [0x591461]
frame #22: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #23: /usr/bin/python3() [0x54d4e2]
frame #24: _PyObject_FastCallKeywords + 0x19c (0x5a730c in /usr/bin/python3)
frame #25: /usr/bin/python3() [0x503073]
frame #26: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #27: _PyFunction_FastCallDict + 0xf5 (0x501945 in /usr/bin/python3)
frame #28: /usr/bin/python3() [0x591461]
frame #29: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #30: /usr/bin/python3() [0x54d4e2]
frame #31: _PyObject_FastCallKeywords + 0x19c (0x5a730c in /usr/bin/python3)
frame #32: /usr/bin/python3() [0x503073]
frame #33: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #34: /usr/bin/python3() [0x502209]
frame #35: /usr/bin/python3() [0x502f3d]
frame #36: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #37: /usr/bin/python3() [0x504c28]
frame #38: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
frame #39: /usr/bin/python3() [0x591461]
frame #40: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #41: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
frame #42: /usr/bin/python3() [0x504c28]
frame #43: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
frame #44: /usr/bin/python3() [0x591461]
frame #45: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #46: /usr/bin/python3() [0x54d4e2]
frame #47: _PyObject_FastCallKeywords + 0x19c (0x5a730c in /usr/bin/python3)
frame #48: /usr/bin/python3() [0x503073]
frame #49: _PyEval_EvalFrameDefault + 0x449 (0x506859 in /usr/bin/python3)
frame #50: /usr/bin/python3() [0x504c28]
frame #51: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
frame #52: /usr/bin/python3() [0x591461]
frame #53: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #54: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
frame #55: /usr/bin/python3() [0x504c28]
frame #56: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
frame #57: /usr/bin/python3() [0x591461]
frame #58: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #59: /usr/bin/python3() [0x54d4e2]
frame #60: PyObject_Call + 0x3e (0x59ebbe in /usr/bin/python3)
frame #61: _PyEval_EvalFrameDefault + 0x1807 (0x507c17 in /usr/bin/python3)
frame #62: /usr/bin/python3() [0x504c28]
frame #63: _PyFunction_FastCallDict + 0x2de (0x501b2e in /usr/bin/python3)
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 235, in
main()
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', '/home/ktai01/maskrcnn-benchmark/tools/train_net.py', '--local_rank=0', '--config-file', '/home/ktai01/maskrcnn-benchmark/configs/retinanet/retinanet_R-50-FPN_P5_1x.yaml', 'MODEL.RPN.FPN_POST_NMS_TOP_N_TRAIN', '1000', 'DTYPE', 'float16']' returned non-zero exit status 1.
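For a quicker check than the full training command, a minimal repro (my own sketch; the shapes are arbitrary and only the Half dtype matters) triggers the same error on a CUDA build of this repo:

```python
import torch
from maskrcnn_benchmark.layers import SigmoidFocalLoss

loss_fn = SigmoidFocalLoss(gamma=2.0, alpha=0.25)
logits = torch.randn(8, 80, device="cuda").half()          # FP16 logits, as under amp O1
targets = torch.randint(1, 81, (8,), device="cuda").int()  # class labels in 1..80
loss_fn(logits, targets)
# RuntimeError: "SigmoidFocalLoss_forward" not implemented for 'Half'
```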
Environment
PyTorch version: 1.1.0
OS: Ubuntu 18
How you installed PyTorch: pip
Python version: 3.6
CUDA/cuDNN version: 10.0 / 7.4
GPU models and configuration: GTX 1080