
multi gpu error #23

Closed
yja1 opened this issue Apr 27, 2018 · 6 comments

Comments


yja1 commented Apr 27, 2018

```
CUDA_VISIBLE_DEVICES=6,7 python train.py --PCB --batchsize 60 --name PCB-64 --train_all
```

I use multiple GPUs, so I added some code:

```
if torch.cuda.device_count() > 1 and use_gpu:
    model_wrapped = nn.DataParallel(model).cuda()
    model = model_wrapped
```

but I get an error in the forward pass:

```
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1513368888240/work/torch/lib/THC/THCTensorCopy.cu:204
```

layumi (Owner) commented Apr 28, 2018

Hi @yja1
Could you try removing this line and running again?
https://github.com/layumi/Person_reID_baseline_pytorch/blob/master/train.py#L50

yja1 (Author) commented Apr 28, 2018

I have already deleted that line when using multiple GPUs.

layumi (Owner) commented Apr 28, 2018

This is my code. You may refer to it and modify your code accordingly.

```
model = torch.nn.DataParallel(model, device_ids=gpu_ids).cuda()

ignored_params = list(map(id, model.module.model.fc.parameters())) + list(map(id, model.module.classifier.parameters()))
base_params = filter(lambda p: id(p) not in ignored_params, model.parameters())

# Observe that all parameters are being optimized
optimizer_ft = optim.SGD([
    {'params': base_params, 'lr': 0.01},
    {'params': model.module.model.fc.parameters(), 'lr': 0.1},
    {'params': model.module.classifier.parameters(), 'lr': 0.1}
], momentum=0.9, weight_decay=5e-4, nesterov=True)
```
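The snippet above filters by Python object identity: any parameter whose `id()` was collected in `ignored_params` is excluded from the base group, so it can receive its own learning rate. A minimal stand-alone sketch of the pattern, with plain objects standing in for parameter tensors (no torch required):

```python
# id()-based parameter filtering: collect the ids of parameters that get
# a higher learning rate, then keep everything else as base_params.
fc_params = [object(), object()]          # stand-ins for fc-layer tensors
classifier_params = [object()]            # stand-in for a classifier tensor
backbone_params = [object(), object()]    # stand-ins for backbone tensors
all_params = fc_params + classifier_params + backbone_params

ignored_params = list(map(id, fc_params)) + list(map(id, classifier_params))
base_params = [p for p in all_params if id(p) not in ignored_params]

print(len(base_params))  # 2: only the backbone stand-ins remain
```

Note that in Python 3 `filter()` returns a lazy iterator, so materializing the result as a list (as in this sketch) avoids surprises if the sequence is consumed more than once.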

xujian0 commented Nov 5, 2018

> This is my code. You may refer to it and modify your code accordingly.
>
> ```
> model = torch.nn.DataParallel(model, device_ids=gpu_ids).cuda()
>
> ignored_params = list(map(id, model.module.model.fc.parameters())) + list(map(id, model.module.classifier.parameters()))
> base_params = filter(lambda p: id(p) not in ignored_params, model.parameters())
>
> # Observe that all parameters are being optimized
> optimizer_ft = optim.SGD([
>     {'params': base_params, 'lr': 0.01},
>     {'params': model.module.model.fc.parameters(), 'lr': 0.1},
>     {'params': model.module.classifier.parameters(), 'lr': 0.1}
> ], momentum=0.9, weight_decay=5e-4, nesterov=True)
> ```

I viewed your code above and changed train.py line 303 and the lines below it to this:
```
model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()
ignored_params = list(map(id, model.module.model.fc.parameters() ))
ignored_params += (list(map(id, model.module.classifier0.parameters() ))
+list(map(id, model.module.classifier1.parameters() ))
+list(map(id, model.module.classifier2.parameters() ))
+list(map(id, model.module.classifier3.parameters() ))
+list(map(id, model.module.classifier4.parameters() ))
+list(map(id, model.module.classifier5.parameters() ))
#+list(map(id, model.classifier6.parameters() ))
#+list(map(id, model.classifier7.parameters() ))
)

base_params = filter(lambda p: id(p) not in ignored_params, model.parameters())

optimizer_ft = optim.SGD([
         {'params': base_params, 'lr': 0.01},
         {'params': model.module.model.fc.parameters(), 'lr': 0.1},
         {'params': model.module.classifier0.parameters(), 'lr': 0.1},
         {'params': model.module.classifier1.parameters(), 'lr': 0.1},
         {'params': model.module.classifier2.parameters(), 'lr': 0.1},
         {'params': model.module.classifier3.parameters(), 'lr': 0.1},
         {'params': model.module.classifier4.parameters(), 'lr': 0.1},
         {'params': model.module.classifier5.parameters(), 'lr': 0.1},
         #{'params': model.classifier6.parameters(), 'lr': 0.01},
         #{'params': model.classifier7.parameters(), 'lr': 0.01}
     ], weight_decay=5e-4, momentum=0.9, nesterov=True)
```

so that I can use two GPUs for PCB. When I test the trained model, I get an error like this:
```
RuntimeError: Error(s) in loading state_dict for PCB:
	Missing key(s) in state_dict: "model.conv1.weight", "model.bn1.weight", "model.bn1.bias", "model.bn1.running_mean", "model.bn1.running_var", 
```
and this is my command:
```
python test.py --PCB --name PCB-32
```
I was wondering if this error has something to do with the way I trained the model.
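The six copy-pasted classifier lines in the snippet above can also be collected with a loop over `getattr`. Here is a runnable sketch of that pattern, with dummy stand-ins for the PCB model so it does not need torch (with the real model you would use `model.module` and `optim.SGD` as above):

```python
class Head:
    """Stand-in for an nn.Module holding two parameter tensors."""
    def __init__(self):
        self._params = [object(), object()]
    def parameters(self):
        return list(self._params)

class PCBStandIn:
    """Stand-in for model.module: a backbone plus classifier0..classifier5."""
    def __init__(self):
        self.backbone = Head()
        for i in range(6):
            setattr(self, 'classifier%d' % i, Head())
    def parameters(self):
        params = list(self.backbone.parameters())
        for i in range(6):
            params += getattr(self, 'classifier%d' % i).parameters()
        return params

module = PCBStandIn()
ignored_ids = set()
param_groups = []
for i in range(6):                        # one group per part classifier
    head = getattr(module, 'classifier%d' % i)
    ignored_ids.update(map(id, head.parameters()))
    param_groups.append({'params': head.parameters(), 'lr': 0.1})
base_params = [p for p in module.parameters() if id(p) not in ignored_ids]
param_groups.insert(0, {'params': base_params, 'lr': 0.01})

print(len(param_groups))               # 7: base group + six classifier groups
print(len(param_groups[0]['params']))  # 2: only the backbone parameters
```

This avoids editing six nearly identical lines whenever the number of parts changes.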

layumi (Owner) commented Nov 5, 2018

Hi @xujian0
When you test the model, you also need to wrap it and access the inner network through model.module in test.py. In particular, add this line:

```
model = torch.nn.DataParallel(model, device_ids=gpu_ids).cuda()
```
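An alternative (a sketch, not from the repository) is to strip the `module.` prefix that DataParallel prepends to every state_dict key, so a checkpoint saved from a wrapped model loads into a plain, unwrapped one:

```python
from collections import OrderedDict

def strip_module_prefix(state_dict):
    """Remove the 'module.' prefix that DataParallel prepends to every
    key, so a checkpoint saved from a wrapped model loads into a plain one."""
    out = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith('module.'):
            key = key[len('module.'):]
        out[key] = value
    return out

# Keys as saved from a DataParallel-wrapped PCB model (values elided here).
saved = OrderedDict([('module.model.conv1.weight', 0),
                     ('module.model.bn1.weight', 1)])
print(list(strip_module_prefix(saved)))
# ['model.conv1.weight', 'model.bn1.weight']
```

With a real checkpoint you would call something like `model.load_state_dict(strip_module_prefix(torch.load(path)))`.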

xujian0 commented Nov 6, 2018

> Hi @xujian0
> When you test the model, you also need to add model.module in the test.py.
> Especially, this line
>
> `model = torch.nn.DataParallel(model, device_ids=gpu_ids).cuda()`

Thanks for your reply, and it solved my problem!
