
multi gpu error #23

Closed
yja1 opened this issue Apr 27, 2018 · 6 comments

Comments


yja1 commented Apr 27, 2018

```
CUDA_VISIBLE_DEVICES=6,7 python train.py --PCB --batchsize 60 --name PCB-64 --train_all
```

I use multiple GPUs, so I added some code:

```
if torch.cuda.device_count() > 1 and use_gpu:
    model_wrapped = nn.DataParallel(model).cuda()
    model = model_wrapped
```

but I get an error in the forward pass:

```
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/THCTensorCopy.cu:204
```

layumi (Owner) commented Apr 28, 2018

Hi @yja1
Could you try removing this line and running again?
https://github.com/layumi/Person_reID_baseline_pytorch/blob/master/train.py#L50

yja1 (Author) commented Apr 28, 2018

I have already deleted that line when using multiple GPUs.

layumi (Owner) commented Apr 28, 2018

This is my code. You may refer to it and modify your code accordingly.

```
model = torch.nn.DataParallel(model, device_ids=gpu_ids).cuda()

ignored_params = list(map(id, model.module.model.fc.parameters())) + list(map(id, model.module.classifier.parameters()))
base_params = filter(lambda p: id(p) not in ignored_params, model.parameters())

# Observe that all parameters are being optimized
optimizer_ft = optim.SGD([
    {'params': base_params, 'lr': 0.01},
    {'params': model.module.model.fc.parameters(), 'lr': 0.1},
    {'params': model.module.classifier.parameters(), 'lr': 0.1}
], momentum=0.9, weight_decay=5e-4, nesterov=True)
```
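The snippet above filters by Python object identity: any parameter whose `id()` was collected in `ignored_params` is excluded from the base group, so it can receive its own learning rate. A minimal stand-alone sketch of the pattern, with plain objects standing in for parameter tensors (no torch required):

```python
# id()-based parameter filtering: collect the ids of parameters that get
# a higher learning rate, then keep everything else as base_params.
fc_params = [object(), object()]          # stand-ins for fc-layer tensors
classifier_params = [object()]            # stand-in for a classifier tensor
backbone_params = [object(), object()]    # stand-ins for backbone tensors
all_params = fc_params + classifier_params + backbone_params

ignored_params = list(map(id, fc_params)) + list(map(id, classifier_params))
base_params = [p for p in all_params if id(p) not in ignored_params]

print(len(base_params))  # 2: only the backbone stand-ins remain
```

Note that in Python 3 `filter()` returns a lazy iterator, so materializing the result as a list (as in this sketch) avoids surprises if the sequence is consumed more than once.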

xujian0 commented Nov 5, 2018

> This is my code. You may refer to it and modify your code accordingly.
>
> ```
> model = torch.nn.DataParallel(model, device_ids=gpu_ids).cuda()
>
> ignored_params = list(map(id, model.module.model.fc.parameters())) + list(map(id, model.module.classifier.parameters()))
> base_params = filter(lambda p: id(p) not in ignored_params, model.parameters())
>
> # Observe that all parameters are being optimized
> optimizer_ft = optim.SGD([
>     {'params': base_params, 'lr': 0.01},
>     {'params': model.module.model.fc.parameters(), 'lr': 0.1},
>     {'params': model.module.classifier.parameters(), 'lr': 0.1}
> ], momentum=0.9, weight_decay=5e-4, nesterov=True)
> ```

I viewed your code above and changed train.py line 303 and the lines below it to this:
```
model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()
ignored_params = list(map(id, model.module.model.fc.parameters() ))
ignored_params += (list(map(id, model.module.classifier0.parameters() ))
+list(map(id, model.module.classifier1.parameters() ))
+list(map(id, model.module.classifier2.parameters() ))
+list(map(id, model.module.classifier3.parameters() ))
+list(map(id, model.module.classifier4.parameters() ))
+list(map(id, model.module.classifier5.parameters() ))
#+list(map(id, model.classifier6.parameters() ))
#+list(map(id, model.classifier7.parameters() ))
)

base_params = filter(lambda p: id(p) not in ignored_params, model.parameters())

optimizer_ft = optim.SGD([
         {'params': base_params, 'lr': 0.01},
         {'params': model.module.model.fc.parameters(), 'lr': 0.1},
         {'params': model.module.classifier0.parameters(), 'lr': 0.1},
         {'params': model.module.classifier1.parameters(), 'lr': 0.1},
         {'params': model.module.classifier2.parameters(), 'lr': 0.1},
         {'params': model.module.classifier3.parameters(), 'lr': 0.1},
         {'params': model.module.classifier4.parameters(), 'lr': 0.1},
         {'params': model.module.classifier5.parameters(), 'lr': 0.1},
         #{'params': model.classifier6.parameters(), 'lr': 0.01},
         #{'params': model.classifier7.parameters(), 'lr': 0.01}
     ], weight_decay=5e-4, momentum=0.9, nesterov=True)
```

so that I can use two GPUs for PCB. When I test the trained model, I get an error like this:
```
RuntimeError: Error(s) in loading state_dict for PCB:
	Missing key(s) in state_dict: "model.conv1.weight", "model.bn1.weight", "model.bn1.bias", "model.bn1.running_mean", "model.bn1.running_var", 
```
and this is my command:
```
python test.py --PCB --name PCB-32
```
I was wondering if this error has something to do with the way I trained the model.
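The six copy-pasted classifier lines in the snippet above can also be collected with a loop over `getattr`. Here is a runnable sketch of that pattern, with dummy stand-ins for the PCB model so it does not need torch (with the real model you would use `model.module` and `optim.SGD` as above):

```python
class Head:
    """Stand-in for an nn.Module holding two parameter tensors."""
    def __init__(self):
        self._params = [object(), object()]
    def parameters(self):
        return list(self._params)

class PCBStandIn:
    """Stand-in for model.module: a backbone plus classifier0..classifier5."""
    def __init__(self):
        self.backbone = Head()
        for i in range(6):
            setattr(self, 'classifier%d' % i, Head())
    def parameters(self):
        params = list(self.backbone.parameters())
        for i in range(6):
            params += getattr(self, 'classifier%d' % i).parameters()
        return params

module = PCBStandIn()
ignored_ids = set()
param_groups = []
for i in range(6):                        # one group per part classifier
    head = getattr(module, 'classifier%d' % i)
    ignored_ids.update(map(id, head.parameters()))
    param_groups.append({'params': head.parameters(), 'lr': 0.1})
base_params = [p for p in module.parameters() if id(p) not in ignored_ids]
param_groups.insert(0, {'params': base_params, 'lr': 0.01})

print(len(param_groups))               # 7: base group + six classifier groups
print(len(param_groups[0]['params']))  # 2: only the backbone parameters
```

This avoids editing six nearly identical lines whenever the number of parts changes.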

layumi (Owner) commented Nov 5, 2018

Hi @xujian0
When you test the model, you also need to wrap it and access the inner network through model.module in test.py. In particular, add this line:

```
model = torch.nn.DataParallel(model, device_ids=gpu_ids).cuda()
```
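An alternative (a sketch, not from the repository) is to strip the `module.` prefix that DataParallel prepends to every state_dict key, so a checkpoint saved from a wrapped model loads into a plain, unwrapped one:

```python
from collections import OrderedDict

def strip_module_prefix(state_dict):
    """Remove the 'module.' prefix that DataParallel prepends to every
    key, so a checkpoint saved from a wrapped model loads into a plain one."""
    out = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith('module.'):
            key = key[len('module.'):]
        out[key] = value
    return out

# Keys as saved from a DataParallel-wrapped PCB model (values elided here).
saved = OrderedDict([('module.model.conv1.weight', 0),
                     ('module.model.bn1.weight', 1)])
print(list(strip_module_prefix(saved)))
# ['model.conv1.weight', 'model.bn1.weight']
```

With a real checkpoint you would call something like `model.load_state_dict(strip_module_prefix(torch.load(path)))`.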

xujian0 commented Nov 6, 2018

> Hi @xujian0
> When you test the model, you also need to add model.module in the test.py.
> Especially, this line
>
> `model = torch.nn.DataParallel(model, device_ids=gpu_ids).cuda()`

Thanks for your reply, and it solved my problem!
