
CUDA memory leak after batch size finder #6570

Closed
maxjeblick opened this issue Mar 17, 2021 · 3 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

@maxjeblick (Contributor)

🐛 Bug

Using transformers + the AdamW optimizer + the batch size finder results in ~2-3 GB of GPU memory not being freed after
trainer.tune (for xlm-roberta-base). This causes OOM errors on a subsequent call of trainer.fit.
I suspect that the state kept by the AdamW optimizer causes this issue.

Please reproduce using the BoringModel

https://colab.research.google.com/drive/1cugaUmLzNvk-38OyV8zyT9M9xQY4LkfH#scrollTo=j4w0wizx5XxJ
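
A condensed sketch of the setup (not the exact Colab notebook: the dataset and model below are small stand-ins, whereas the report uses xlm-roberta-base from transformers; the point is the AdamW + trainer.tune + trainer.fit sequence):

    import torch
    import pytorch_lightning as pl
    from torch.utils.data import DataLoader, Dataset


    class RandomDataset(Dataset):
        # stand-in data; the original report feeds tokenized text to xlm-roberta-base
        def __len__(self):
            return 64

        def __getitem__(self, idx):
            return torch.randn(32), torch.randn(2)


    class BoringAdamWModel(pl.LightningModule):
        def __init__(self, batch_size=2):
            super().__init__()
            self.batch_size = batch_size  # attribute scaled by the batch size finder
            self.layer = torch.nn.Linear(32, 2)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.mse_loss(self(x), y)

        def train_dataloader(self):
            return DataLoader(RandomDataset(), batch_size=self.batch_size)

        def configure_optimizers(self):
            # AdamW keeps exp_avg / exp_avg_sq buffers on the GPU for every parameter
            return torch.optim.AdamW(self.parameters(), lr=1e-3)


    model = BoringAdamWModel()
    trainer = pl.Trainer(gpus=1, max_epochs=1, auto_scale_batch_size="power")
    trainer.tune(model)  # runs the batch size finder
    # with a large model such as xlm-roberta-base, several GB stay allocated here
    print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB still allocated after tune")
    trainer.fit(model)   # can OOM because the tuner's optimizer state was never released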

Expected behavior

GPU memory should be freed after the batch size finder finishes (apart from the model itself, which may stay on the GPU).

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.5
    • pyTorch_debug: False
    • pyTorch_version: 1.8.0+cu101
    • pytorch-lightning: 1.2.4
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.7.10
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020
@maxjeblick added the bug and help wanted labels on Mar 17, 2021
@maxjeblick (Contributor, Author)

Using

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.model.parameters(), lr=0.1)
        return optimizer

results in no GPU memory leak.
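
For contrast, an AdamW counterpart in the same shape does leave memory behind (a sketch; the report pairs transformers with AdamW, and the exact optimizer class and hyperparameters used in the Colab may differ):

    def configure_optimizers(self):
        # AdamW allocates exp_avg / exp_avg_sq buffers for every parameter;
        # after trainer.tune these buffers remain referenced and stay on the GPU
        optimizer = torch.optim.AdamW(self.model.parameters(), lr=2e-5)
        return optimizer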

@maxjeblick (Contributor, Author) commented Mar 17, 2021

trainer._lightning_optimizers still contains the optimizer that was used for finding the batch size (including its exp_avg stats on CUDA).

I also noticed that, in some cases, calling trainer.fit() resulted in incorrect fitting behavior when used together with trainer.tune() and tuning (batch size only) was done with a random target.
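
A possible workaround sketch until a fix lands, based on the _lightning_optimizers observation above and reusing the trainer and model from the earlier sketch (trainer._lightning_optimizers is a private Lightning 1.2.x internal, not public API, so this may not carry over to other versions):

    import gc

    import torch

    trainer.tune(model)

    # Drop the reference to the tuner's optimizer so its CUDA state
    # (the exp_avg / exp_avg_sq buffers) can be garbage-collected.
    # Other internal references may also need clearing depending on the version.
    trainer._lightning_optimizers = None

    gc.collect()
    torch.cuda.empty_cache()

    trainer.fit(model)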

@maxjeblick
Copy link
Contributor Author

This seems to be fixed already by PR #6372 :D
(thanks @awaelchli)
