Memory leak while transferring tensor to cpu #634

Open
neeraj-j opened this issue Nov 27, 2019 · 9 comments

Comments

@neeraj-j

Hi,

I am observing a host memory leak while transferring a tensor from GPU to CPU in PyTorch. The following code summarizes the issue. Here data_loader is feeding images. The leak appears with opt_level 'O1'; with opt_level 'O0' there is no leak. I started seeing this after updating apex to the current version.

model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
model.eval()
for epoch in range(10):
    for i, input in enumerate(data_loader):
        # compute output
        output = model(input)
        output = output.cpu().numpy()

I am using:
apex: 0.1, master branch of https://github.com/NVIDIA/apex.git dated 11-25-2019
PyTorch: 1.3.0
Ubuntu: 18.04
CUDA: 10.1
I tried casting 'output' to float() on the GPU before transferring it to the CPU, and I tried converting the NumPy array to float16. Neither helps.
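
One way to confirm that host RAM really grows across iterations is to sample the resident set size (RSS) of the process inside the loop. The following is only a minimal sketch, not part of the original report: `rss_mib`, `track_rss`, and `leaky_step` are hypothetical names, and the workload is a stand-in that deliberately hoards memory. It uses only the Python standard library and assumes a Unix-like system.

```python
import os
import resource
import sys


def rss_mib():
    """Return the resident set size of this process in MiB."""
    try:
        # Linux: the second field of /proc/self/statm is the resident page count.
        with open("/proc/self/statm") as f:
            pages = int(f.read().split()[1])
        return pages * os.sysconf("SC_PAGE_SIZE") / 2**20
    except OSError:
        # Fallback: peak RSS (reported in KiB on Linux, bytes on macOS).
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return peak / (2**20 if sys.platform == "darwin" else 2**10)


def track_rss(step, n_steps, report_every=10):
    """Call step(i) n_steps times; return (i, rss_mib) samples every report_every steps."""
    samples = []
    for i in range(n_steps):
        step(i)
        if i % report_every == 0:
            samples.append((i, rss_mib()))
    return samples


if __name__ == "__main__":
    hoard = []  # stand-in workload (assumption): keeps 1 MB alive per step

    def leaky_step(i):
        hoard.append(bytearray(1_000_000))

    for i, mib in track_rss(leaky_step, n_steps=50):
        print(f"iter {i:3d}: RSS = {mib:8.1f} MiB")
```

To apply this to the loop above, `step` would wrap one inference iteration, for example a function that runs `model(input).cpu().numpy()` on the next batch. A steadily rising RSS with 'O1' but not with 'O0' would match the behavior described here.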

@matejpavlovic-maistra

I have the same problem :/

@FabianIsensee

FabianIsensee commented May 14, 2020

I am having a similar issue. It has proven very hard to track down because it appears inconsistently and affects only some computers, while others are not affected at all.

The situation in which the memory leak occurs is always the same: O1 mixed precision training. During the training loop everything is fine, but in the validation loop the RAM usage goes up, in every epoch. Disabling mixed precision training makes the problem go away.
Over the course of a training run this easily amounts to 100 GB or more of RAM usage, which is enough to break the training script.

Here are my observations so far:

  • I have only been able to reproduce this behavior on Ubuntu 18.04. It does not occur on CentOS.
  • CUDA 10.1 on both systems
  • I have tested the most recent apex master. The problem has been there for a while, though (I just could not pinpoint it, which is why I have not posted before)
  • I have tested both Python 3.6.8 and 3.8.2
  • My code does semantic segmentation with a U-Net. The memory leak only occurs with 2D convs, not with 3D convs
  • Tested on an RTX 2080 Ti
  • Strangely, the issue does not always appear right at the start of the training. Sometimes the first couple of epochs are fine and then the issue appears after five epochs or so. When it appears is quite inconsistent, which is another reason why it took me so long to figure out that it is related to mixed precision training

My code looks similar to what @neeraj-j has posted:

        with torch.no_grad():
            self.network.eval()
            val_losses = []
            for b in range(self.num_val_batches_per_epoch):
                l = self.run_iteration(self.val_gen, False) # l is a simple scalar that has been detached and converted to numpy
                val_losses.append(l)
            self.all_val_losses.append(np.mean(val_losses))

    def run_iteration(self, data_generator, do_backprop=True):
        data_dict = next(data_generator)
        data = data_dict['data']
        target = data_dict['target']

        data = maybe_to_torch(data)
        target = maybe_to_torch(target)

        if torch.cuda.is_available():
            data = to_cuda(data)
            target = to_cuda(target)

        self.optimizer.zero_grad()

        output = self.network(data)
        del data

        loss = self.loss(output, target)
        del target

        if do_backprop:
            if not self.fp16 or amp is None or not torch.cuda.is_available():
                loss.backward()
            else:
                with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                    scaled_loss.backward()
            _ = clip_grad_norm_(self.network.parameters(), 12)
            self.optimizer.step()
        return loss.detach().cpu().numpy()

Maybe someone has an idea of what could be going on? @mcarilli perhaps? :-)

All my code is on GitHub. If you are interested, please contact me and I can give you step-by-step instructions on how to reproduce this issue.

Best,
Fabian

@FabianIsensee

Hey there, this problem still persists and it would be fantastic to get a response. Is this a known issue to you?

@linjiapro

linjiapro commented Oct 31, 2020

Yes, this problem is still happening to me on Ubuntu 20.04. It took me a whole day to trace the memory leak down to one line:

t.to(cpuDevice).to(torch::kFloat);

Here t is a half-precision tensor on the GPU.

Update: I fixed the above problem by upgrading to CUDA 11.0 and PyTorch 1.7.

@s91005tw

I have the same problem.

@FabianIsensee

Hi,
The problem goes away if you compile PyTorch yourself with a more recent version of cuDNN. I have no problems whatsoever with 8.0.2.
Best,
Fabian
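
Since the reports in this thread differ mainly in OS, CUDA, and cuDNN versions, it may help to print the versions a given PyTorch build actually uses before comparing notes. The following is a minimal sketch; `describe_environment` is a hypothetical helper name, and the torch fields stay `None` if torch is not installed.

```python
import platform
import sys


def describe_environment():
    """Collect version info relevant to this thread; torch fields are None without torch."""
    info = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "torch": None,
        "cuda": None,
        "cudnn": None,
    }
    try:
        import torch
    except ImportError:
        return info
    info["torch"] = torch.__version__
    # CUDA version this torch build was compiled against (None for CPU-only builds).
    info["cuda"] = torch.version.cuda
    # Integer such as 8002 for cuDNN 8.0.2; None if cuDNN is unavailable.
    info["cudnn"] = torch.backends.cudnn.version()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key:>7}: {value}")
```

Note that the cuDNN version reported here is the one PyTorch loads at runtime, which can differ from a system-wide cuDNN installation.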

@NiHaoUCAS

> Yes, this problem is still happening to me on Ubuntu 20.04. It took a whole day for me to trace down the memory leak to one line:
>
> t.to(cpuDevice).to(torch::kFloat);
>
> here t is a tensor on GPU and it has half precision.
>
> Update: I fixed the above problem by upgrading to CUDA 11.0, and pytorch 1.7

I have the same problem. How can I fix this on an older PyTorch version, e.g. CUDA 10.1 + PyTorch 1.4?

@linjiapro

@NiHaoUCAS, I think you might have to update...

@NiHaoUCAS

> @NiHaoUCAS, I think you might have to update...

As @FabianIsensee said, would compiling PyTorch with cuDNN 8.0.2 help (PyTorch 1.4 + cuDNN 8.0.2)? Updating PyTorch is a big challenge for us because of engine issues.


6 participants