Error during training. RuntimeError: CUDA out of memory. #46
---
2GB of VRAM is very little, especially since Windows itself is probably using a chunk of it. Ideally you'd need around 6GB/8GB at the minimum to train a RAVE model with batch size 8. You could try making your batch size 1 just to see if training is possible at all.
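For anyone trying that, here is a minimal sketch of what batch size 1 means at the DataLoader level. This is plain PyTorch, not RAVE's actual data pipeline, and the dummy tensor shapes are illustrative only:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for an audio dataset: 128 mono clips of 65536 samples each.
dataset = TensorDataset(torch.randn(128, 1, 65536))

# batch_size=1 minimizes per-step activation memory, at the cost of slower,
# noisier training; it is a sanity check, not a recommendation.
train_loader = DataLoader(dataset, batch_size=1, shuffle=True)

for (batch,) in train_loader:
    print(batch.shape)  # torch.Size([1, 1, 65536])
    break
```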
---
Hi there! Before anything else, I would like to apologize and say that I am not a programmer, so many of my questions may be silly. I am running Windows 10 with a 2GB GPU (Nvidia GTX 940M) and an Intel i7-7800U.
Like others in this forum, if I set num_workers > 0 I get a "can't pickle Environment objects" error during training. This happens even when setting num_workers to 4, as the warning below suggests (I made these changes in the train_rave.py and train_prior.py files, not in the DataLoader.py file). When num_workers is set to 0, I get the error about GPU memory shown below.
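The usual cause of that pickling error is that Windows starts DataLoader workers with the spawn method, so the whole dataset object is pickled into each worker process, and an open lmdb Environment cannot be pickled. A common workaround is to open the handle lazily inside each worker; a minimal sketch follows, where LazyLMDBDataset, its env_path argument, and the key format are illustrative and not RAVE's actual dataset code:

```python
import lmdb
from torch.utils.data import Dataset

class LazyLMDBDataset(Dataset):
    """Illustrative only: opens its lmdb handle lazily so the dataset
    object can be pickled into spawned DataLoader workers on Windows."""

    def __init__(self, env_path):
        self.env_path = env_path
        self.env = None  # opened per process, never pickled

    def _ensure_env(self):
        if self.env is None:
            self.env = lmdb.open(self.env_path, readonly=True, lock=False)
        return self.env

    def __len__(self):
        with self._ensure_env().begin() as txn:
            return txn.stat()["entries"]

    def __getitem__(self, index):
        with self._ensure_env().begin() as txn:
            return txn.get(f"{index:08d}".encode())  # raw bytes, for the sketch
```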
I've tried decreasing the batch size (from 8 to 4) and clearing the cache [torch.cuda.empty_cache()] as recommended in other threads about the same problem, but I still get the same issue. I find it strange, since the RAVE article said the model could run on low-performance CPUs and laptops.
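One reason torch.cuda.empty_cache() rarely helps in this situation: it only returns cached, currently unused blocks to the driver, and it cannot free tensors the model still references. A small illustration, assuming a CUDA device is available:

```python
import torch

x = torch.randn(1024, 1024, device="cuda")  # ~4 MiB, still referenced
torch.cuda.empty_cache()                    # frees only unused cached blocks
print(torch.cuda.memory_allocated())        # x is still counted as allocated

del x                                       # drop the last reference first...
torch.cuda.empty_cache()                    # ...then the block can be returned
print(torch.cuda.memory_allocated())        # now zero (or close to it)
```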
Any help will be much appreciated!
```
C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:240: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:   0%|          | 0/82 [00:00<?, ?it/s]Traceback (most recent call last):
  File "C:\Users\agustin\RAVE\train_rave.py", line 155, in <module>
    trainer.fit(model, train, val)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1234, in _run
    results = self._run_stage()
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 269, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 208, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 90, in advance
    outputs = self.manual_loop.run(split_batch, batch_idx)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\loops\optimization\manual_loop.py", line 115, in advance
    training_step_output = self.trainer._call_strategy_hook("training_step", *step_kwargs.values())
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1763, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "C:\Users\agustin\anaconda3\envs\rave\lib\site-packages\pytorch_lightning\strategies\strategy.py", line 333, in training_step
    return self.model.training_step(*args, **kwargs)
  File "C:\Users\agustin\RAVE\rave\model.py", line 585, in training_step
    distance = distance + self.distance(x, y)
  File "C:\Users\agustin\RAVE\rave\model.py", line 511, in distance
    lin = sum(list(map(self.lin_distance, x, y)))
  File "C:\Users\agustin\RAVE\rave\model.py", line 501, in lin_distance
    return torch.norm(x - y) / torch.norm(x)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 2.00 GiB total capacity; 1.04 GiB already allocated; 5.93 MiB free; 1.19 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
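The last line of that error points at one knob worth trying before giving up: the max_split_size_mb option of PYTORCH_CUDA_ALLOC_CONF, which reduces allocator fragmentation when reserved memory far exceeds allocated memory. A minimal sketch of setting it; the 128 value is an arbitrary guess, and the variable must be set before CUDA is first initialized:

```python
import os

# Must be set before torch initializes CUDA, e.g. at the very top of
# train_rave.py or in the shell before launching; 128 is a guess to tune.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported only after the variable is set, to be safe
```

That said, fragmentation tuning cannot conjure memory that is not there; on a 2GB card, batch size 1 (as suggested above) is the more realistic test.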