why can't run testing on GPU 1? #1775

Closed
johnnylecy opened this issue Dec 9, 2019 · 7 comments · Fixed by #2098

@johnnylecy

There are 2 GPUs in my computer. When I run testing on GPU 0, everything is normal, but when I run testing on GPU 1, I get the error below.
I use the high-level APIs for testing images like this:
from mmdet.apis import init_detector, inference_detector

gpu_id = 1
device_id = 'cuda:' + str(gpu_id)  # 'cuda:1'
net = init_detector(config_file, checkpoint_file, device=device_id)
... ...
predict_result = inference_detector(net, im_file)

Can someone help me, please? Thank you.
error:
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCGeneral.cpp line=371 error=77 : an illegal memory access was encountered0:09, 2.57it/s]
Process Process-1:
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "evaluate_rcnn_base_mulscale_with_clsfy_mulprocess-4_stride_half.py", line 263, in eval_net
evaluate(net, sub_dir, img_name, img_result_dir, det_result_txt)
File "evaluate_rcnn_base_mulscale_with_clsfy_mulprocess-4_stride_half.py", line 202, in evaluate
predict_result = inference_detector(net, im_file)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/apis/inference.py", line 86, in inference_detector
result = model(return_loss=False, rescale=True, **data)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in call
result = self.forward(*input, **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/core/fp16/decorators.py", line 49, in new_func
return old_func(*args, **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/models/detectors/base.py", line 119, in forward
return self.forward_test(img, img_meta, **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/models/detectors/base.py", line 102, in forward_test
return self.simple_test(imgs[0], img_metas[0], **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/models/detectors/two_stage.py", line 273, in simple_test
x, img_meta, proposal_list, self.test_cfg.rcnn, rescale=rescale)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/models/detectors/test_mixins.py", line 49, in simple_test_bboxes
x[:len(self.bbox_roi_extractor.featmap_strides)], rois)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 541, in call
result = self.forward(*input, **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/core/fp16/decorators.py", line 127, in new_func
return old_func(*args, **kwargs)
File "/data/nas/workspace/jupyter/mmdetection-master/mmdet/models/roi_extractors/single_level.py", line 106, in forward
roi_feats[inds] = roi_feats_t
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /pytorch/aten/src/THC/THCGeneral.cpp:371
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered (insert_events at /pytorch/c10/cuda/CUDACachingAllocator.cpp:569)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fb69c895813 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0x16126 (0x7fb69eaeb126 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: + 0x16b11 (0x7fb69eaebb11 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x4d (0x7fb69c885f0d in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: + 0x4af752 (0x7fb68a3f6752 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x4af796 (0x7fb68a3f6796 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #50: __libc_start_main + 0xf5 (0x7fb6b1428b15 in /lib64/libc.so.6)

johnnylecy changed the title from "why can't run test on GPU 1?" to "why can't run testing on GPU 1?" on Dec 9, 2019
@ZwwWayne (Collaborator) commented Dec 9, 2019

Hi @johnnylecy ,
This might be because the rois and the features are not on the same device. Could you check the devices of the features and the rois? You could also try calling torch.cuda.set_device at the beginning of your script, or print each related variable's device to see whether they match (see the sketch below).

We will also try to reproduce this bug.
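
For reference, here is a minimal sketch of both suggestions, reusing config_file, checkpoint_file and im_file from the snippet above (illustrative only, not a verified fix):

import torch
from mmdet.apis import init_detector, inference_detector

gpu_id = 1
torch.cuda.set_device(gpu_id)           # make GPU 1 the current CUDA device first
device_id = 'cuda:' + str(gpu_id)

net = init_detector(config_file, checkpoint_file, device=device_id)
print(next(net.parameters()).device)    # device of the model's weights, expected: cuda:1

predict_result = inference_detector(net, im_file)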

@johnnylecy (Author)

hi @ZwwWayne

How do I check or print the device of the features and the rois?

@ZwwWayne (Collaborator)

print(rois.device) should work.
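
For example, one way to do this is a small, purely hypothetical debug helper (not part of mmdetection) dropped into mmdet/models/roi_extractors/single_level.py, the file in the traceback, and called just before the line roi_feats[inds] = roi_feats_t; a sketch:

def report_devices(feats, rois):
    # temporary debug output: everything should report the same device, e.g. cuda:1
    print('rois:', rois.device)
    for i, feat in enumerate(feats):
        print('feat {}:'.format(i), feat.device)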

@Liys0558 commented Jan 3, 2020

I have the same problem.

@zr526799544

I don't know why this happens, but I know how to work around it: use CUDA_VISIBLE_DEVICES instead of setting device_id = 'cuda:' + str(gpu_id), so the model runs testing on the GPU you want (see the sketch below).
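
Concretely, that workaround could look roughly like this: hide GPU 0 from the process before CUDA is initialized, so that physical GPU 1 becomes the only visible device and is addressed as cuda:0. A sketch, reusing the names from the original snippet:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'   # must be set before torch/CUDA is initialized

from mmdet.apis import init_detector, inference_detector

# with only physical GPU 1 visible, it is exposed to this process as cuda:0
net = init_detector(config_file, checkpoint_file, device='cuda:0')
predict_result = inference_detector(net, im_file)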

@intgogo commented Jan 16, 2020

same problem!!!

hellock added the bug label on Feb 2, 2020
yhcao6 mentioned this issue on Feb 15, 2020
@hellock (Member) commented Feb 16, 2020

Should have been fixed.
