Hi,
I've been trying to run the example code on the maps dataset:
python main.py --dataset=maps --num_gpu=4
I'm running this on 4 K80 GPUs, and I get the NCCL-related error below. Any suggestions on what could be causing this and how to fix it?
pix2pix processing: 100%|#######################| 1096/1096 [00:00<00:00, 178591.97it/s]
pix2pix processing: 100%|#######################| 1096/1096 [00:00<00:00, 213732.43it/s]
[] MODEL dir: logs/maps_2017-10-26_20-36-34
[] PARAM path: logs/maps_2017-10-26_20-36-34/params.json
0%| | 0/500000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 41, in <module>
    main(config)
  File "main.py", line 33, in main
    trainer.train()
  File "/home/nbserver/DiscoGAN-pytorch/trainer.py", line 193, in train
    x_AB = self.G_AB(x_A).detach()
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 59, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 64, in replicate
    return replicate(module, device_ids)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast(devices)(*params)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/_functions.py", line 19, in forward
    outputs = comm.broadcast_coalesced(inputs, self.target_gpus)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/comm.py", line 54, in broadcast_coalesced
    results = broadcast(_flatten_tensors(chunk), devices)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/comm.py", line 24, in broadcast
    nccl.broadcast(tensors)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 182, in broadcast
    comm = communicator(inputs)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 133, in communicator
    _communicators[key] = NcclCommList(devices)
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 106, in __init__
    check_error(lib.ncclCommInitAll(self, len(devices), int_array(devices)))
  File "/usr/local/lib/python2.7/dist-packages/torch/cuda/nccl.py", line 118, in check_error
    raise NcclError(status)
torch.cuda.nccl.NcclError: System Error (2)
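In case it helps narrow things down: if I'm reading NCCL's headers right, status 2 is ncclSystemError, i.e. a system call failed during communicator setup, though I may be wrong about that. The traceback dies in the broadcast that DataParallel uses to replicate the model across GPUs, so here is a minimal sketch (my own repro attempt, not DiscoGAN code) that exercises the same torch.cuda.comm.broadcast path without any of the training code:

import torch
import torch.cuda.comm as comm

# Broadcast one small tensor from GPU 0 to all visible GPUs. This is the
# same collective that DataParallel's replicate step performs on the model
# parameters, so it should fail with the same NcclError if NCCL itself is
# broken on this machine.
devices = list(range(torch.cuda.device_count()))
x = torch.randn(16, 16).cuda(devices[0])
copies = comm.broadcast(x, devices)
print([c.get_device() for c in copies])  # expect [0, 1, 2, 3]

If this snippet fails the same way, I assume the problem is in my NCCL/driver setup rather than in this repo.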