Sharded Training not saving memory #6047

Closed
amogkam opened this issue Feb 18, 2021 · 8 comments
Assignees: SeanNaren
Labels: waiting on author, working as intended

Comments

@amogkam
Contributor

amogkam commented Feb 18, 2021

🐛 Bug

I am training an ImageGPT model, but I am not seeing any reduction in GPU memory usage when training with ddp sharded vs. without.

Environment info:
PTL: v1.1.8
PTL Bolts: v0.3.0
Pytorch: v1.7.1
Python: v3.7.7
Cuda: 10.2

Single AWS p3.8xlarge instance (4 Tesla V100 GPUs).

The code is very simple:

import pytorch_lightning as pl
from pytorch_lightning.plugins.sharded_plugin import DDPShardedPlugin
from pl_bolts.models import ImageGPT
from pl_bolts.datamodules import MNISTDataModule

dm = MNISTDataModule('.', batch_size=32)
# 230K parameters.
model = ImageGPT(embed_dim=32, layers=16, heads=4, vocab_size=32, num_pixels=28)
trainer = pl.Trainer(accelerator="ddp", gpus=4)  # plugins=[DDPShardedPlugin()]
trainer.fit(model, dm)

When I remove the DDPShardedPlugin I am seeing GPU memory usage of ~13MiB:
(screenshot)

When I include the plugin, I am still seeing exactly the same GPU memory usage:
(screenshot)

I would expect the per-device memory usage with the plugin to be less.

@SeanNaren any idea on what's going on here? Thanks a lot for the help.

@amogkam added the bug and help wanted labels Feb 18, 2021
@SeanNaren self-assigned this Feb 18, 2021
@SeanNaren
Contributor

Hey @amogkam

Thanks for the issue! The sample also made me realize there was a bug in master with passing plugins that needed to be resolved as well.

I've run your sample and confirmed that I can replicate your results. They are not too surprising, considering the model is really small! At around 230K parameters (the number looks a bit fishy, but it's alright for now), Sharded won't really help: the optimizer and gradient states are tiny, so partitioning them gives a negligible improvement. Sharded really takes effect when the model is large (roughly 100M+ parameters).
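As a very rough back-of-the-envelope illustration (assuming fp32 training and an Adam-style optimizer with two state tensors per parameter; optimizer_state_mib below is a made-up helper for this estimate, not a Lightning or fairscale API):

def optimizer_state_mib(num_params, states_per_param=2, bytes_per_elem=4, world_size=1):
    # Adam keeps two fp32 state tensors (exp_avg, exp_avg_sq) per parameter;
    # sharding splits this state across world_size GPUs.
    return num_params * states_per_param * bytes_per_elem / world_size / 2 ** 20

# ~230K parameters: ~1.8 MiB unsharded vs. ~0.4 MiB sharded over 4 GPUs -> negligible.
print(optimizer_state_mib(230_000), optimizer_state_mib(230_000, world_size=4))

# ~202M parameters: ~1540 MiB unsharded vs. ~385 MiB sharded over 4 GPUs -> noticeable.
print(optimizer_state_mib(202_000_000), optimizer_state_mib(202_000_000, world_size=4))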

I ran tests bumping the embed dim up to around 1024. That gives around 202M parameters, which is still smallish.

On 4 A100 GPUs (40GB VRAM) here are the numbers I get:

DDP:
Average Epoch time: 221.00 seconds
Average Peak memory 21535.00MiB

Sharded:
Average Epoch time: 230.00 seconds
Average Peak memory 19052.25MiB

And if I really reduce the batch size (to 4, which makes the benefits of sharded even more obvious) and increase the model size, upping the embed dim to 2048 (807M parameters):

DDP
Average Epoch time: 855.00 seconds
Average Peak memory 23071.00MiB


Sharded:
Average Epoch time: 777.00 seconds
Average Peak memory 13126.00MiB

Sharded can take a larger batch size, but I wanted to compare to DDP. I'd suggest increasing the size of iGPT if possible!

In the meantime I'll update the docs to make it clearer that these improvements require models on the order of millions of parameters at the very least, and that the benefits scale as the model size increases (to an extent).
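To check which regime a model is in, a plain PyTorch parameter count is enough (assuming model is your LightningModule instance):

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")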

Here is my code:

import time

import torch
from pl_bolts.datamodules import MNISTDataModule
from pl_bolts.models import ImageGPT

import pytorch_lightning as pl
from pytorch_lightning import Callback
from pytorch_lightning.plugins import DDPPlugin
from pytorch_lightning.plugins import DDPShardedPlugin


class CUDACallback(Callback):

    def on_train_epoch_start(self, trainer, pl_module):
        # Reset the memory use counter
        torch.cuda.reset_peak_memory_stats(trainer.root_gpu)
        torch.cuda.synchronize(trainer.root_gpu)
        self.start_time = time.time()

    def on_train_epoch_end(self, trainer, pl_module, outputs):
        torch.cuda.synchronize(trainer.root_gpu)
        max_memory = torch.cuda.max_memory_allocated(trainer.root_gpu) / 2 ** 20
        epoch_time = time.time() - self.start_time

        max_memory = torch.tensor(max_memory, dtype=torch.int, device=trainer.root_gpu)
        epoch_time = torch.tensor(epoch_time, dtype=torch.int, device=trainer.root_gpu)

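        # Sum across ranks, then divide by world size below to report per-GPU averages.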
        torch.distributed.all_reduce(max_memory, op=torch.distributed.ReduceOp.SUM)
        torch.distributed.all_reduce(epoch_time, op=torch.distributed.ReduceOp.SUM)

        world_size = torch.distributed.get_world_size()

        print(f"Average Epoch time: {epoch_time.item() / float(world_size):.2f} seconds")
        print(f"Average Peak memory {max_memory.item() / float(world_size):.2f}MiB")


dm = MNISTDataModule('.', batch_size=4)

model = ImageGPT(embed_dim=2048, layers=16, heads=4, vocab_size=32, num_pixels=28)

ddp_plugin = DDPPlugin(find_unused_parameters=True)
sharded_plugin = DDPShardedPlugin()

trainer = pl.Trainer(
    max_epochs=1, gpus=4, accelerator='ddp', precision=16,
    callbacks=[CUDACallback()], plugins=ddp_plugin # Swap between the plugins
)
trainer.fit(model, dm)
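To run the sharded variant of the same script, only the plugins argument changes; on these Lightning versions the string shortcut used later in this thread should be equivalent to passing the plugin object (untested sketch):

trainer = pl.Trainer(
    max_epochs=1, gpus=4, accelerator='ddp', precision=16,
    callbacks=[CUDACallback()], plugins=sharded_plugin,  # or plugins='ddp_sharded'
)
trainer.fit(model, dm)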

@SeanNaren added the waiting on author and working as intended labels and removed the bug and help wanted labels Feb 18, 2021
@miraodasilva

Hi, big fan of lightning, big props to all the collaborators.

Unfortunately, I'm getting the same issue: virtually the same memory usage when I add plugins="ddp_sharded" to the trainer. Here is the comparison (GPUs 1 and 3 are the ones being used by me, rs2517):
without sharded:
(screenshot)
with sharded:
(screenshot)

The model I am training has, as described by lightning:
70.8 M Trainable params
66.6 M Non-trainable params
137 M Total params
So it's not that small. I am using SGD rather than Adam, btw; I don't know if that plays a big role (see the sketch at the end of this comment). Unfortunately I can't offer many more details as the code is still private.
My specs are:
python 3.9
pytorch 1.7.1
lightning 1.1
fairscale 0.3.0

Any suggestions on this? Thanks a lot in advance.
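On the SGD vs. Adam point: sharded training, as typically implemented (fairscale's OSS optimizer under the hood), mainly partitions optimizer state and gradients across ranks, so the optimizer choice matters a lot: plain SGD keeps no per-parameter state to shard, SGD with momentum keeps one buffer, and Adam keeps two. A minimal sketch with a dummy 1M-element parameter (illustrative only, not from this thread):

import torch

params = [torch.nn.Parameter(torch.randn(1000, 1000))]

for opt in (torch.optim.SGD(params, lr=0.1),
            torch.optim.SGD(params, lr=0.1, momentum=0.9),
            torch.optim.Adam(params, lr=1e-3)):
    opt.zero_grad()
    params[0].sum().backward()  # populate grads so the optimizer builds its state
    opt.step()
    state_elems = sum(t.numel() for s in opt.state.values()
                      for t in s.values() if torch.is_tensor(t))
    print(type(opt).__name__, "state elements:", state_elems)
# Expected: plain SGD ~0, SGD+momentum ~1e6, Adam ~2e6 extra elements that sharding can split.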

@SeanNaren
Contributor

hey @miraodasilva, I see you're using lightning 1.1, could you try using lightning master?

pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git

@miraodasilva

I downgraded to 1.1 since mixed precision is broken on 1.2, as reported here: #6077

I tried lightning master but it also seems to break my code (which works on 1.1):
  File "/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 210, in print
    progress_bar.print(*args, **kwargs)
  File "lib/python3.9/site-packages/pytorch_lightning/callbacks/progress.py", line 471, in print
    elif not self.test_progress_bar.disable:
AttributeError: 'NoneType' object has no attribute 'disable'

@amogkam
Contributor Author

amogkam commented Feb 25, 2021

@SeanNaren what cloud provider are you using for GPUs? AWS doesn't have any instances with 40 GB GPU memory, does it?

@miraodasilva

@SeanNaren any updates on this?

I'm getting the same behaviour (no memory saving) on 1.2.4. Thanks in advance.

@huyvnphan

I'm training a ResNet50 on ImageNet here. No memory saving compared to regular ddp training.

@SeanNaren
Contributor

Circling back to this issue: if people could share their setup + model size, that would help drastically.
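For anyone following up, a quick way to collect the version side of that (assuming the standard __version__ attributes) is below; add the parameter count shown earlier in the thread for model size.

import fairscale
import pytorch_lightning as pl
import torch

print("pytorch_lightning:", pl.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("fairscale:", fairscale.__version__)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0), "x", torch.cuda.device_count())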

@carmocca closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 19, 2023