
[feat] 3/n pp #5036

Merged: 19 commits, Dec 9, 2020
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -32,6 +32,6 @@ repos:
types: [python]

- repo: https://github.com/pre-commit/mirrors-mypy
rev: master
rev: v0.790
hooks:
- id: mypy
87 changes: 81 additions & 6 deletions docs/source/multi_gpu.rst
@@ -612,6 +612,7 @@ This is useful when dealing with large Transformer based models, or in environme
Lightning currently offers the following methods to leverage model parallelism:

- Sharded Training (partitioning your gradients and optimizer state across multiple GPUs, for reduced memory overhead with **no performance loss**)
- Sequential Model Parallelism with Checkpointing (partition your :class:`nn.Sequential <torch.nn.Sequential>` module across multiple GPUs, leverage checkpointing and microbatching for further memory improvements and device utilization)

Sharded Training
^^^^^^^^^^^^^^^^
@@ -666,7 +667,7 @@ To use Sharded Training, you need to first install FairScale using the command b

.. code-block:: bash

pip install https://github.com/facebookresearch/fairscale/archive/bb468670838b98dc8f8d67be4eabf195042a7994.zip
pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.1.0.zip


.. code-block:: python
@@ -678,6 +679,80 @@ Sharded Training can work across all DDP variants by adding the additional ``--p

Internally we re-initialize your optimizers and shard them across your machines and processes. We handle all communication using PyTorch distributed, so no code changes are required.
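
As a minimal sketch (assuming the ``ddp_sharded`` string alias implied by the flag above, and a hypothetical ``LitModel``), the same thing can be enabled directly from code:

.. code-block:: python

    # Hedged sketch: the ``ddp_sharded`` alias mirrors the ``--plugins ddp_sharded`` flag above;
    # ``LitModel`` is a hypothetical LightningModule.
    from pytorch_lightning import Trainer

    model = LitModel()
    trainer = Trainer(gpus=2, accelerator='ddp', plugins='ddp_sharded')
    trainer.fit(model)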

----------

.. _sequential-parallelism:

Sequential Model Parallelism with Checkpointing
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
PyTorch Lightning integration for Sequential Model Parallelism using `FairScale <https://github.com/facebookresearch/fairscale>`_.
Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.
We also provide auto-balancing techniques through FairScale, to find optimal balances for the model across GPUs.
In addition, we use Gradient Checkpointing to reduce GPU memory requirements further, and micro-batches to minimize device under-utilization automatically.

Reference: https://arxiv.org/abs/1811.06965

.. note:: DDPSequentialPlugin is currently supported only for PyTorch 1.6.

To get started, install FairScale through extras with ``pip install pytorch-lightning["extra"]``

or directly using

.. code-block:: bash

pip install https://github.com/PyTorchLightning/fairscale/archive/pl_1.1.0.zip

To use Sequential Model Parallelism, you must define a :class:`nn.Sequential <torch.nn.Sequential>` module containing the layers you wish to parallelize across GPUs.
This should be assigned to the ``sequential_module`` attribute of your ``LightningModule``, as shown below.

.. code-block:: python

    import torch
    from pytorch_lightning import LightningModule, Trainer
    from pytorch_lightning.plugins.ddp_sequential_plugin import DDPSequentialPlugin

    class MyModel(LightningModule):
        def __init__(self):
            ...
            self.sequential_module = torch.nn.Sequential(my_layers)

    # Split my module across 4 GPUs, one layer each
    model = MyModel()
    plugin = DDPSequentialPlugin(balance=[1, 1, 1, 1])
    trainer = Trainer(accelerator='ddp', gpus=4, plugins=[plugin])
    trainer.fit(model)
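
If you would rather not hand-tune ``balance``, a rough sketch of computing it automatically is shown below. It assumes FairScale exposes the torchgpipe-style ``balance_by_time`` helper under ``fairscale.nn.pipe.balance`` and that ``sample`` is a representative input batch; treat both as assumptions rather than a documented API.

.. code-block:: python

    # Hedged sketch: derive the partition balance from a timed forward pass.
    # The import path and the sample shape are assumptions for illustration.
    import torch
    from fairscale.nn.pipe.balance import balance_by_time

    partitions = 4                       # number of GPUs to split across
    sample = torch.rand(32, 3, 32, 32)   # hypothetical input batch
    balance = balance_by_time(partitions, model.sequential_module, sample)
    plugin = DDPSequentialPlugin(balance=balance)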


We provide a minimal example of Sequential Model Parallelism using a convolutional model trained on CIFAR-10, split onto GPUs `here <https://github.com/PyTorchLightning/pytorch-lightning/tree/master/pl_examples/basic_examples/conv_sequential_example.py>`_.
To run the example, you will need to install `Bolts <https://github.com/PyTorchLightning/pytorch-lightning-bolts>`_. Install with ``pip install pytorch-lightning-bolts``.

When running the Sequential Model Parallelism example on 2 GPUs, we achieve these memory savings:

.. list-table:: GPU Memory Utilization
   :widths: 25 25 50
   :header-rows: 1

   * - GPU
     - Without Balancing
     - With Balancing
   * - GPU 0
     - 4436 MB
     - 1554 MB
   * - GPU 1
     - ~0
     - 994 MB
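
As a rough sketch of how per-GPU peak-memory figures like these can be collected after a run (plain PyTorch, no Lightning-specific API assumed):

.. code-block:: python

    # Hedged sketch: print the peak CUDA memory allocated on each visible GPU.
    import torch

    for device_id in range(torch.cuda.device_count()):
        peak_mb = torch.cuda.max_memory_allocated(device_id) / 1024 ** 2
        print(f"GPU {device_id}: {peak_mb:.0f} MB peak allocated")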

To run the example with Sequential Model Parallelism:

.. code-block:: bash

    python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 2 --accelerator ddp --use_ddp_sequential

To run the same example without Sequential Model Parallelism:

.. code-block:: bash

    python pl_examples/basic_examples/conv_sequential_example.py --batch_size 1024 --gpus 1


Batch size
----------
@@ -728,17 +803,17 @@ Lightning supports the use of TorchElastic to enable fault-tolerant and elastic
.. code-block:: python

Trainer(gpus=8, accelerator='ddp')


Following the `TorchElastic Quickstart documentation <https://pytorch.org/elastic/latest/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:

.. code-block:: bash

etcd --enable-v2
--listen-client-urls http://0.0.0.0:2379,http://127.0.0.1:4001
--advertise-client-urls PUBLIC_HOSTNAME:2379


And then launch the elastic job with:

.. code-block:: bash
@@ -750,7 +825,7 @@ And then launch the elastic job with:
--rdzv_backend=etcd
--rdzv_endpoint=ETCD_HOST:ETCD_PORT
YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)


See the official `TorchElastic documentation <https://pytorch.org/elastic>`_ for details
on installation and more use cases.
27 changes: 26 additions & 1 deletion docs/source/performance.rst
@@ -131,4 +131,29 @@ To use Optimizer Sharded Training, refer to :ref:`model-parallelism`.

Sharded DDP can work across all DDP variants by adding the additional ``--plugins ddp_sharded`` flag.

Refer to the :ref:`distributed computing guide for more details <multi_gpu>`.
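
As a minimal sketch of how the ``--plugins ddp_sharded`` flag could be wired into your own script (assuming a hypothetical ``train.py`` and ``LitModel``, and that the Trainer exposes its arguments through ``add_argparse_args``):

.. code-block:: python

    # Hedged sketch: forward CLI flags such as ``--plugins ddp_sharded`` to the Trainer.
    # ``LitModel`` and the script name ``train.py`` are hypothetical.
    from argparse import ArgumentParser
    from pytorch_lightning import Trainer

    parser = ArgumentParser()
    parser = Trainer.add_argparse_args(parser)
    args = parser.parse_args()

    trainer = Trainer.from_argparse_args(args)
    trainer.fit(LitModel())

Invocation would then look like ``python train.py --gpus 2 --accelerator ddp --plugins ddp_sharded``.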


Sequential Model Parallelism with Checkpointing
---------------------------------------------------------------------
PyTorch Lightning integration for Sequential Model Parallelism using `FairScale <https://github.com/facebookresearch/fairscale>`_.
Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.

.. code-block:: python

    import torch
    from pytorch_lightning import LightningModule, Trainer
    from pytorch_lightning.plugins.ddp_sequential_plugin import DDPSequentialPlugin

    class MyModel(LightningModule):
        def __init__(self):
            ...
            self.sequential_module = torch.nn.Sequential(my_layers)

    # Split my module across 4 GPUs, one layer each
    model = MyModel()
    plugin = DDPSequentialPlugin(balance=[1, 1, 1, 1])
    trainer = Trainer(accelerator='ddp', gpus=4, plugins=[plugin])
    trainer.fit(model)


For more information, refer to :ref:`sequential-parallelism`.
25 changes: 25 additions & 0 deletions docs/source/training_tricks.rst
@@ -123,3 +123,28 @@ The algorithm in short works by:
:members: scale_batch_size

.. warning:: Batch size finder is not supported for DDP yet, it is coming soon.
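
For context, a minimal sketch of running the batch size finder on a single device (assuming a hypothetical ``LitModel`` whose ``batch_size`` attribute the tuner can adjust):

.. code-block:: python

    # Hedged sketch: scale the batch size on one GPU, then train with the result.
    # ``LitModel`` is a hypothetical LightningModule exposing ``batch_size``.
    from pytorch_lightning import Trainer

    model = LitModel(batch_size=32)
    trainer = Trainer(gpus=1, auto_scale_batch_size='power')
    trainer.tune(model)   # updates model.batch_size in place
    trainer.fit(model)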


Sequential Model Parallelism with Checkpointing
---------------------------------------------------------------------
PyTorch Lightning integration for Sequential Model Parallelism using `FairScale <https://github.com/facebookresearch/fairscale>`_.
Sequential Model Parallelism splits a sequential module onto multiple GPUs, reducing peak GPU memory requirements substantially.

.. code-block:: python

    import torch
    from pytorch_lightning import LightningModule, Trainer
    from pytorch_lightning.plugins.ddp_sequential_plugin import DDPSequentialPlugin

    class MyModel(LightningModule):
        def __init__(self):
            ...
            self.sequential_module = torch.nn.Sequential(my_layers)

    # Split my module across 4 GPUs, one layer each
    model = MyModel()
    plugin = DDPSequentialPlugin(balance=[1, 1, 1, 1])
    trainer = Trainer(accelerator='ddp', gpus=4, plugins=[plugin])
    trainer.fit(model)


For more information, refer to :ref:`sequential-parallelism`.