Error during training. RuntimeError: CUDA out of memory. #46
---
2GB of VRAM is very little, especially since Windows itself is probably using a chunk of it. Ideally you'd need around 6GB/8GB at the minimum to train a RAVE model with batch size 8. You could try making your batch size 1 just to see if training is possible at all.
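For anyone trying that, here is a minimal sketch of what batch size 1 means at the DataLoader level. This is plain PyTorch, not RAVE's actual data pipeline, and the dummy tensor shapes are illustrative only:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for an audio dataset: 128 mono clips of 65536 samples each.
dataset = TensorDataset(torch.randn(128, 1, 65536))

# batch_size=1 minimizes per-step activation memory, at the cost of slower,
# noisier training; it is a sanity check, not a recommendation.
train_loader = DataLoader(dataset, batch_size=1, shuffle=True)

for (batch,) in train_loader:
    print(batch.shape)  # torch.Size([1, 1, 65536])
    break
```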
---
Hi there! Before anything else, I would like to apologize and say that I am not a programmer, so many of my questions may be silly. I am running Windows 10 with a 2GB GPU (Nvidia GTX 940M) and an Intel i7-7800U.
Like others in this forum, if I set num_workers > 0 I get a "can't pickle Environment objects" error during training. This happens even when setting num_workers to 4, as the warning below suggests (I made these changes in the train_rave.py and train_prior.py files, not in the DataLoader.py file). When num_workers is set to 0, I get the error about GPU memory shown below.
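The usual cause of that pickling error is that Windows starts DataLoader workers with the spawn method, so the whole dataset object is pickled into each worker process, and an open lmdb Environment cannot be pickled. A common workaround is to open the handle lazily inside each worker; a minimal sketch follows, where LazyLMDBDataset, its env_path argument, and the key format are illustrative and not RAVE's actual dataset code:

```python
import lmdb
from torch.utils.data import Dataset

class LazyLMDBDataset(Dataset):
    """Illustrative only: opens its lmdb handle lazily so the dataset
    object can be pickled into spawned DataLoader workers on Windows."""

    def __init__(self, env_path):
        self.env_path = env_path
        self.env = None  # opened per process, never pickled

    def _ensure_env(self):
        if self.env is None:
            self.env = lmdb.open(self.env_path, readonly=True, lock=False)
        return self.env

    def __len__(self):
        with self._ensure_env().begin() as txn:
            return txn.stat()["entries"]

    def __getitem__(self, index):
        with self._ensure_env().begin() as txn:
            return txn.get(f"{index:08d}".encode())  # raw bytes, for the sketch
```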
I've tried decreasing the batch size (from 8 to 4) and clearing the cache [torch.cuda.empty_cache()] as recommended in other threads about the same problem, but I still get the same issue. I find it strange, since the RAVE article said the model could run on low-performance CPUs and laptops.
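One reason torch.cuda.empty_cache() rarely helps in this situation: it only returns cached, currently unused blocks to the driver, and it cannot free tensors the model still references. A small illustration, assuming a CUDA device is available:

```python
import torch

x = torch.randn(1024, 1024, device="cuda")  # ~4 MiB, still referenced
torch.cuda.empty_cache()                    # frees only unused cached blocks
print(torch.cuda.memory_allocated())        # x is still counted as allocated

del x                                       # drop the last reference first...
torch.cuda.empty_cache()                    # ...then the block can be returned
print(torch.cuda.memory_allocated())        # now zero (or close to it)
```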
Any help will be much appreciated!
```
C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:240: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:   0%|          | 0/82 [00:00<?, ?it/s]Traceback (most recent call last):
  File "C:\Users\agustin\RAVE\train_rave.py", line 155, in <module>
    trainer.fit(model, train, val)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1234, in _run
    results = self._run_stage()
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 90, in advance
    outputs = self.manual_loop.run(split_batch, batch_idx)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\optimization\manual_loop.py", line 115, in advance
    training_step_output = self.trainer._call_strategy_hook("training_step", *step_kwargs.values())
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1763, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\strategies\strategy.py", line 333, in training_step
    return self.model.training_step(*args, **kwargs)
  File "C:\Users\agustin\RAVE\rave\model.py", line 585, in training_step
    distance = distance + self.distance(x, y)
  File "C:\Users\agustin\RAVE\rave\model.py", line 511, in distance
    lin = sum(list(map(self.lin_distance, x, y)))
  File "C:\Users\agustin\RAVE\rave\model.py", line 501, in lin_distance
    return torch.norm(x - y) / torch.norm(x)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 2.00 GiB total capacity; 1.04 GiB already allocated; 5.93 MiB free; 1.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
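The last line of that error points at one knob worth trying before giving up: the max_split_size_mb option of PYTORCH_CUDA_ALLOC_CONF, which reduces allocator fragmentation when reserved memory far exceeds allocated memory. A minimal sketch of setting it; the 128 value is an arbitrary guess, and the variable must be set before CUDA is first initialized:

```python
import os

# Must be set before torch initializes CUDA, e.g. at the very top of
# train_rave.py or in the shell before launching; 128 is a guess to tune.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the variable is set, to be safe
```

That said, fragmentation tuning cannot conjure memory that is not there; on a 2GB card, batch size 1 (as suggested above) is the more realistic test.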