
RuntimeError: cuda runtime error (2) : out of memory #328

Closed

mhusseinsh opened this issue Jul 17, 2018 · 5 comments

@mhusseinsh

Hello,
When I set --resize_or_crop none, I get the error below.
My images are not that big (800x600), and I am running on a 16 GB GPU.
create web directory ./checkpoints/carla2kitti_cyclegan/web...
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7fd2831a5b90>> ignored
Traceback (most recent call last):
File "train.py", line 32, in
model.optimize_parameters()
File "/mnt/DTAA_data/DTAA/code/z637177/pytorch-CycleGAN-and-pix2pix-master/models/cycle_gan_model.py", line 138, in optimize_parameters
self.forward()
File "/mnt/DTAA_data/DTAA/code/z637177/pytorch-CycleGAN-and-pix2pix-master/models/cycle_gan_model.py", line 85, in forward
self.rec_B = self.netG_A(self.fake_A)
File "/home/adm.Z637177/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/adm.Z637177/.local/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 112, in forward
return self.module(*inputs[0], **kwargs[0])
File "/home/adm.Z637177/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/mnt/DTAA_data/DTAA/code/z637177/pytorch-CycleGAN-and-pix2pix-master/models/networks.py", line 186, in forward
return self.model(input)
File "/home/adm.Z637177/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/adm.Z637177/.local/lib/python2.7/site-packages/torch/nn/modules/container.py", line 91, in forward
input = module(input)
File "/home/adm.Z637177/.local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/adm.Z637177/.local/lib/python2.7/site-packages/torch/nn/modules/instancenorm.py", line 50, in forward
self.training or not self.track_running_stats, self.momentum, self.eps)
File "/home/adm.Z637177/.local/lib/python2.7/site-packages/torch/nn/functional.py", line 1245, in instance_norm
eps=eps)
File "/home/adm.Z637177/.local/lib/python2.7/site-packages/torch/onnx/init.py", line 57, in wrapper
return fn(*args, **kwargs)
File "/home/adm.Z637177/.local/lib/python2.7/site-packages/torch/nn/functional.py", line 1233, in _instance_norm
training=use_input_stats, momentum=momentum, eps=eps)
File "/home/adm.Z637177/.local/lib/python2.7/site-packages/torch/nn/functional.py", line 1194, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

cici33 commented Jul 19, 2018

I have the same problem, did you solve it?

@taesungp (Collaborator)

Did you try to train or test? 800x600 is more than 7 times larger than a 256x256 image (800x600 / (256x256) ≈ 7.32), so the memory requirement will be very high.
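
For reference, a quick back-of-the-envelope check of that ratio (assuming, as a rough rule of thumb, that activation memory for a fully convolutional model scales roughly with the number of input pixels):

```python
# Back-of-the-envelope check of the pixel-count ratio quoted above.
# For a fully convolutional generator, activation memory during training
# grows roughly linearly with the number of input pixels.
full_frame = 800 * 600          # pixels in one 800x600 image
train_crop = 256 * 256          # pixels in one 256x256 training crop
print(full_frame / train_crop)  # ~7.32, so roughly 7x the activation memory
```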

One approach to save memory is to train on cropped images using --resize_or_crop resize_and_crop, and then generate the images at test time by loading only one generator network using --model test --resize_or_crop none. I think 800x600 can be handled this way.

If you still run into an out-of-memory error, you can try reducing the network size.
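
To make "reducing the network size" concrete, here is a minimal sketch with a toy conv net (not the repo's actual ResNet generator) showing how shrinking the base channel width cuts the parameter count and, with it, the intermediate activations that dominate training memory:

```python
import torch
import torch.nn as nn

def tiny_generator(ngf):
    # A toy conv stack (conv / InstanceNorm / ReLU), loosely in the spirit of
    # an image-to-image generator; only the base width `ngf` changes below.
    return nn.Sequential(
        nn.Conv2d(3, ngf, kernel_size=7, padding=3),
        nn.InstanceNorm2d(ngf),
        nn.ReLU(inplace=True),
        nn.Conv2d(ngf, ngf * 2, kernel_size=3, stride=2, padding=1),
        nn.InstanceNorm2d(ngf * 2),
        nn.ReLU(inplace=True),
        nn.Conv2d(ngf * 2, 3, kernel_size=7, padding=3),
    )

for ngf in (64, 32):  # halving the base width halves the activation channels
    g = tiny_generator(ngf)
    n_params = sum(p.numel() for p in g.parameters())
    with torch.no_grad():
        out = g(torch.randn(1, 3, 256, 256))
    print("ngf=%d: %d parameters, output %s" % (ngf, n_params, tuple(out.shape)))
```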

@mhusseinsh (Author)

Hello @taesung89,
Thanks for your helpful reply.

Did you try to train or test?

I was training.

800x600 is more than 7 times larger than a 256x256 image (800x600 / (256x256) ≈ 7.32), so the memory requirement will be very high.

Yes, this is another problem I have. I am already working on a server that has many GPUs, each with 16 GB. When I choose a single GPU, it is allocated but not fully utilized: only 4 GB out of the 16 are used. Is there a way to fully utilize the GPU? And accordingly, if I choose multiple GPUs, only one of them is allocated. You can refer to my issue here: #327

One approach to save memory is to train on cropped images using --resize_or_crop resize_and_crop, and then generate the images at test time by loading only one generator network using --model test --resize_or_crop none. I think 800x600 can be handled this way.

Exactly, this is what I did. I did a resize and crop (as in the original implementation), and then during testing I ran on the full images, and it worked.

One last question: what is your opinion on my case of training on rectangular 800x600 images? Do you think resizing to 286x286 and then cropping to 256x256 is a good idea, or should I skip the resizing and only crop square patches, or should I resize to a smaller rectangular image instead of a square and then do the random cropping?
What would you recommend?
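
For concreteness, the three options above could be written with torchvision transforms roughly as below; this is just an illustrative sketch (the repo implements its own --resize_or_crop pipeline internally), assuming 800x600 inputs:

```python
from torchvision import transforms

# 1) Squash to 286x286 (aspect ratio is not preserved), then random-crop 256x256.
resize_square_then_crop = transforms.Compose([
    transforms.Resize((286, 286)),
    transforms.RandomCrop(256),
])

# 2) Skip the resize entirely: random-crop 256x256 patches from the 800x600 image.
crop_only = transforms.RandomCrop(256)

# 3) Shrink while keeping the aspect ratio (shorter side -> 286, i.e. about 381x286),
#    then random-crop 256x256.
resize_keep_aspect_then_crop = transforms.Compose([
    transforms.Resize(286),
    transforms.RandomCrop(256),
])
```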

snlee81 commented Sep 3, 2018

Hello @taesung89,

I also have a similar problem, since I don't have enough GPU memory for my 512 x 512 images. I did exactly what you suggested for training (loadSize = 512, fineSize = 256) and for testing (loadSize = 512, fineSize = 512). My question is a little different from the above: how does G work on a larger test image (512 x 512) even though it was trained on smaller crops (256 x 256 in this case)?

When testing, is there any kind of upsampling involved? I may be missing something. For now, the results from G look visually OK.

Thank you in advance.

junyanz (Owner) commented Sep 3, 2018

The G is a fully convolutional network (FCN). It does not require the same image size for training and test. See the original FCN paper and slides for more details.
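
A minimal sketch of that point: the toy generator below (conv / InstanceNorm / ReLU only; not the repo's actual architecture) runs unchanged on 256x256 and 512x512 inputs, with no extra upsampling step needed at test time:

```python
import torch
import torch.nn as nn

# A fully convolutional toy "generator": every layer only looks at local
# neighborhoods, so the same weights work for any (large enough) input size.
g = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=7, padding=3),
    nn.InstanceNorm2d(16),
    nn.ReLU(inplace=True),
    nn.Conv2d(16, 3, kernel_size=7, padding=3),
)

with torch.no_grad():
    for size in (256, 512):
        y = g(torch.randn(1, 3, size, size))
        print(size, tuple(y.shape))  # output spatial size matches the input
# prints: 256 (1, 3, 256, 256) and 512 (1, 3, 512, 512)
```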

junyanz closed this as completed Jan 4, 2019