why can't run testing on GPU 1? #1775

Closed
johnnylecy opened this issue Dec 9, 2019 · 7 comments · Fixed by #2098

@johnnylecy

There are 2 GPUs in my computer. When I run testing on GPU 0, everything is normal, but when I run testing on GPU 1, I get the error below.
I use the high-level APIs for testing images like this:
from mmdet.apis import init_detector, inference_detector

gpu_id = 1
device_id = 'cuda:' + str(gpu_id)  # 'cuda:1'
net = init_detector(config_file, checkpoint_file, device=device_id)
... ...
predict_result = inference_detector(net, im_file)

Can someone help me, please? Thank you.
error:
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=371 error=77 : an illegal memory access was encountered0:09, 2.57it/s]
Process Process-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "evaluate_rcnn_base_mulscale_with_clsfy_mulprocess-4_stride_half.py", line 263, in eval_net
evaluate(net, sub_dir, img_name, img_result_dir, det_result_txt)
File "evaluate_rcnn_base_mulscale_with_clsfy_mulprocess-4_stride_half.py", line 202, in evaluate
predict_result = inference_detector(net, im_file)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/apis/inference.py", line 86, in inference_detector
result = model(return_loss=False, rescale=True, **data)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in call
result = self.forward(*input, **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/core/fp16/decorators.py", line 49, in new_func
return old_func(*args, **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/models/detectors/base.py", line 119, in forward
return self.forward_test(img, img_meta, **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/models/detectors/base.py", line 102, in forward_test
return self.simple_test(imgs[0], img_metas[0], **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/models/detectors/two_stage.py", line 273, in simple_test
x, img_meta, proposal_list, self.test_cfg.rcnn, rescale=rescale)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/models/detectors/test_mixins.py", line 49, in simple_test_bboxes
x[:len(self.bbox_roi_extractor.featmap_strides)], rois)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in call
result = self.forward(*input, **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/core/fp16/decorators.py", line 127, in new_func
return old_func(*args, **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/models/roi_extractors/single_level.py", line 106, in forward
roi_feats[inds] = roi_feats_t
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCGeneral.cpp:371
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fb69c895813 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x16126 (0x7fb69eaeb126 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x16b11 (0x7fb69eaebb11 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7fb69c885f0d in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: + 0x4af752 (0x7fb68a3f6752 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x4af796 (0x7fb68a3f6796 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #50: __libc_start_main + 0xf5 (0x7fb6b1428b15 in /lib64/libc.so.6)

johnnylecy changed the title from "why can't run test on GPU 1?" to "why can't run testing on GPU 1?" on Dec 9, 2019
@ZwwWayne (Collaborator) commented Dec 9, 2019

Hi @johnnylecy ,
This might be because the rois and the features are not on the same device. Could you check the devices of the features and the rois? You could also try calling torch.cuda.set_device at the beginning of your script, or print each related variable's device to see whether they match (see the sketch below).

We will also try to reproduce this bug.
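
For reference, here is a minimal sketch of both suggestions, reusing config_file, checkpoint_file and im_file from the snippet above (illustrative only, not a verified fix):

import torch
from mmdet.apis import init_detector, inference_detector

gpu_id = 1
torch.cuda.set_device(gpu_id)           # make GPU 1 the current CUDA device first
device_id = 'cuda:' + str(gpu_id)

net = init_detector(config_file, checkpoint_file, device=device_id)
print(next(net.parameters()).device)    # device of the model's weights, expected: cuda:1

predict_result = inference_detector(net, im_file)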

@johnnylecy (Author)

hi @ZwwWayne

How do I check or print the device of the features and the rois?

@ZwwWayne (Collaborator)

print(rois.device) should work.
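
For example, one way to do this is a small, purely hypothetical debug helper (not part of mmdetection) dropped into mmdet/models/roi_extractors/single_level.py, the file in the traceback, and called just before the line roi_feats[inds] = roi_feats_t; a sketch:

def report_devices(feats, rois):
    # temporary debug output: everything should report the same device, e.g. cuda:1
    print('rois:', rois.device)
    for i, feat in enumerate(feats):
        print('feat {}:'.format(i), feat.device)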

@Liys0558 commented Jan 3, 2020

I have the same problem.

@zr526799544

I don't know why this happens, but I know how to work around it: use CUDA_VISIBLE_DEVICES instead of setting device_id = 'cuda:' + str(gpu_id), so the model runs testing on the GPU you want (see the sketch below).
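
Concretely, that workaround could look roughly like this: hide GPU 0 from the process before CUDA is initialized, so that physical GPU 1 becomes the only visible device and is addressed as cuda:0. A sketch, reusing the names from the original snippet:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'   # must be set before torch/CUDA is initialized

from mmdet.apis import init_detector, inference_detector

# with only physical GPU 1 visible, it is exposed to this process as cuda:0
net = init_detector(config_file, checkpoint_file, device='cuda:0')
predict_result = inference_detector(net, im_file)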

@intgogo commented Jan 16, 2020

same problem!!!

hellock added the bug label on Feb 2, 2020
yhcao6 mentioned this issue on Feb 15, 2020
@hellock (Member) commented Feb 16, 2020

Should have been fixed.
