Memory leak while transferring tensor to cpu #634
Comments
I have the same problem :/
I am having a similar issue. It has proven to be very hard to track down because it appears inconsistently and only affects some computers whereas others are not affected at all. The situation in which the memory leak occurs is always the same: O1 mixed precision training. During the training loop everything is fine, but in the validation loop the RAM usage goes up. In every epoch. Disabling mixed precision training makes this problem go away. Here are my observations so far:
My code looks similar to what @neeraj-j has posted:

    with torch.no_grad():
        self.network.eval()
        val_losses = []
        for b in range(self.num_val_batches_per_epoch):
            l = self.run_iteration(self.val_gen, False)  # l is a simple scalar that has been detached and converted to numpy
            val_losses.append(l)
        self.all_val_losses.append(np.mean(val_losses))

    def run_iteration(self, data_generator, do_backprop=True):
        data_dict = next(data_generator)
        data = data_dict['data']
        target = data_dict['target']
        data = maybe_to_torch(data)
        target = maybe_to_torch(target)
        if torch.cuda.is_available():
            data = to_cuda(data)
            target = to_cuda(target)
        self.optimizer.zero_grad()
        output = self.network(data)
        del data
        loss = self.loss(output, target)
        del target
        if do_backprop:
            if not self.fp16 or amp is None or not torch.cuda.is_available():
                loss.backward()
            else:
                with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                    scaled_loss.backward()
            _ = clip_grad_norm_(self.network.parameters(), 12)
            self.optimizer.step()
        return loss.detach().cpu().numpy()

Maybe someone has an idea of what could be going on? @mcarilli perhaps? :-) All my code is on GitHub. If you are interested, please contact me and I can give you step-by-step instructions on how to reproduce this issue.
Best,
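For anyone trying to confirm the pattern described above (RAM climbing epoch after epoch, only in the validation loop), recording the process's resident memory after every epoch can help. What follows is only a minimal sketch using the Python standard library, assuming a Unix system. The names `peak_rss_mib`, `track_peak_rss` and `fake_validation_epoch` are hypothetical and not part of apex, PyTorch or the code quoted above. Since `ru_maxrss` reports the peak rather than the current resident set size, the numbers can show growth but never a decrease.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_peak_rss(run_epoch, num_epochs):
    """Call run_epoch(epoch) num_epochs times; return the peak RSS after each call."""
    history = []
    for epoch in range(num_epochs):
        run_epoch(epoch)
        history.append(peak_rss_mib())
    return history


if __name__ == "__main__":
    # Stand-in for a real validation epoch (assumption for illustration only):
    # it holds on to a new 8 MiB buffer on every call.
    held = []

    def fake_validation_epoch(epoch):
        held.append(b"\x01" * (8 * 1024 * 1024))

    for epoch, mib in enumerate(track_peak_rss(fake_validation_epoch, 5)):
        print(f"epoch {epoch}: peak RSS {mib:.1f} MiB")
```

In a real run, `run_epoch` would wrap the validation loop. Comparing the history with O1 and with mixed precision disabled would show whether the growth is tied to the opt_level.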
Hey there, this problem still persists and it would be fantastic to get a response. Is this a known issue to you?
Yes, this problem is still happening to me on Ubuntu 20.04. It took me a whole day to track the memory leak down to one line:

    t.to(cpuDevice).to(torch::kFloat);

Here t is a half-precision tensor on the GPU.
Update: I fixed the above problem by upgrading to CUDA 11.0 and PyTorch 1.7.
I have the same problem.
Hi,
I have the same problem. How can I fix the bug on an older PyTorch version, e.g. CUDA 10.1 + PyTorch 1.4?
@NiHaoUCAS, I think you might have to update...
As @FabianIsensee said, could compiling PyTorch with cuDNN 8.02 help (PyTorch 1.4 + cuDNN 8.02)? Updating PyTorch is a big challenge for us because of engine issues.
Hi,
I am observing a memory leak while transferring a tensor from GPU to CPU in PyTorch. The following code summarizes the issue. Here data_loader is feeding images. The leak is observed with opt_level 'O1'. With opt_level 'O0' there is no leak. I started seeing this issue after updating apex to the current version.
I am using:
apex ver: 0.1 "https://github.com/NVIDIA/apex.git" master branch dated 11-25-2019.
PyTorch ver: 1.3.0
Ubuntu: 18.04
CUDA: 10.1
I tried casting 'output' to float() on the GPU before transferring it to the CPU, and converting the numpy array to float16. Neither helped.
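Since the leak seems to depend on the exact combination of versions, it may help to attach a generated version list to each report. This is a minimal sketch using only the Python standard library. The helper names `collect_versions` and `format_report` are hypothetical, and the distribution names `torch`, `apex` and `numpy` are assumptions about how the packages are registered in a given environment. It does not detect the CUDA or cuDNN version, which would still need to be added by hand.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Map each distribution name to its installed version, or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def format_report(packages):
    """Build a plain-text environment summary suitable for pasting into an issue."""
    lines = [
        f"Python: {platform.python_version()}",
        f"OS: {platform.platform()}",
    ]
    for name, version in collect_versions(packages).items():
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to the local install.
    print(format_report(["torch", "apex", "numpy"]))
```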