[docs] distributed_backend -> accelerator #4429

Merged — 5 commits, merged Oct 29, 2020
Changes from 2 commits
2 changes: 1 addition & 1 deletion docs/source/introduction_guide.rst
@@ -543,7 +543,7 @@ Or multiple nodes

# (32 GPUs)
model = LitMNIST()
trainer = Trainer(gpus=8, num_nodes=4, distributed_backend='ddp')
trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')
trainer.fit(model, train_loader)

Refer to the :ref:`distributed computing guide for more details <multi_gpu>`.
4 changes: 2 additions & 2 deletions docs/source/lightning_module.rst
@@ -256,7 +256,7 @@ The matching pseudocode is:

Training with DataParallel
~~~~~~~~~~~~~~~~~~~~~~~~~~
When training using a `distributed_backend` that splits data from each batch across GPUs, sometimes you might
When training using an `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the master GPU for processing (dp, or ddp2).

In this case, implement the `training_step_end` method
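
A minimal sketch of what this can look like (the ``loss`` key, the mean reduction, and ``from torch.nn import functional as F`` are assumptions for this example, not part of this change):

.. code-block:: python

    def training_step(self, batch, batch_idx):
        # each GPU runs this on its own slice of the batch
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return {'loss': loss}

    def training_step_end(self, outputs):
        # runs on the master GPU; `outputs` gathers the per-GPU results
        return {'loss': outputs['loss'].mean()}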
@@ -360,7 +360,7 @@ If you need to do something with all the outputs of each `validation_step`, over

Validating with DataParallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When training using a `distributed_backend` that splits data from each batch across GPUs, sometimes you might
When training using an `accelerator` that splits data from each batch across GPUs, sometimes you might
need to aggregate them on the master GPU for processing (dp, or ddp2).

In this case, implement the `validation_step_end` method
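
The validation case follows the same pattern; a minimal sketch (the ``val_loss`` key is illustrative, not prescribed here):

.. code-block:: python

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {'val_loss': F.cross_entropy(self(x), y)}

    def validation_step_end(self, outputs):
        # `outputs` holds the value returned for each GPU's sub-batch
        return {'val_loss': outputs['val_loss'].mean()}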
42 changes: 21 additions & 21 deletions docs/source/multi_gpu.rst
@@ -231,11 +231,11 @@ Distributed modes
-----------------
Lightning allows multiple ways of training

- Data Parallel (`distributed_backend='dp'`) (multiple-gpus, 1 machine)
- DistributedDataParallel (`distributed_backend='ddp'`) (multiple-gpus across many machines (python script based)).
- DistributedDataParallel (`distributed_backend='ddp_spawn'`) (multiple-gpus across many machines (spawn based)).
- DistributedDataParallel 2 (`distributed_backend='ddp2'`) (DP in a machine, DDP across machines).
- Horovod (`distributed_backend='horovod'`) (multi-machine, multi-gpu, configured at runtime)
- Data Parallel (`accelerator='dp'`) (multiple-gpus, 1 machine)
- DistributedDataParallel (`accelerator='ddp'`) (multiple-gpus across many machines (python script based)).
- DistributedDataParallel (`accelerator='ddp_spawn'`) (multiple-gpus across many machines (spawn based)).
- DistributedDataParallel 2 (`accelerator='ddp2'`) (DP in a machine, DDP across machines).
- Horovod (`accelerator='horovod'`) (multi-machine, multi-gpu, configured at runtime)
- TPUs (`tpu_cores=8|x`) (tpu or TPU pod)

.. note::
@@ -258,7 +258,7 @@ after which the root node will aggregate the results.
:skipif: torch.cuda.device_count() < 2

# train on 2 GPUs (using DP mode)
trainer = Trainer(gpus=2, distributed_backend='dp')
trainer = Trainer(gpus=2, accelerator='dp')

Distributed Data Parallel
^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -281,10 +281,10 @@ Distributed Data Parallel
.. code-block:: python

# train on 8 GPUs (same machine (ie: node))
trainer = Trainer(gpus=8, distributed_backend='ddp')
trainer = Trainer(gpus=8, accelerator='ddp')

# train on 32 GPUs (4 nodes)
trainer = Trainer(gpus=8, distributed_backend='ddp', num_nodes=4)
trainer = Trainer(gpus=8, accelerator='ddp', num_nodes=4)

This Lightning implementation of DDP calls your script under the hood multiple times with the correct environment
variables:
@@ -330,7 +330,7 @@ In this case, we can use DDP2 which behaves like DP in a machine and DDP across
.. code-block:: python

# train on 32 GPUs (4 nodes)
trainer = Trainer(gpus=8, distributed_backend='ddp2', num_nodes=4)
trainer = Trainer(gpus=8, accelerator='ddp2', num_nodes=4)

Distributed Data Parallel Spawn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -348,7 +348,7 @@ project module) you can use the following method:
.. code-block:: python

# train on 8 GPUs (same machine (ie: node))
trainer = Trainer(gpus=8, distributed_backend='ddp')
trainer = Trainer(gpus=8, accelerator='ddp')

We STRONGLY discourage this use because it has limitations (due to Python and PyTorch):

@@ -400,7 +400,7 @@ You can then call your scripts anywhere
.. code-block:: bash

cd /project/src
python some_file.py --distributed_backend 'ddp' --gpus 8
python some_file.py --accelerator 'ddp' --gpus 8


Horovod
@@ -421,10 +421,10 @@ Horovod can be configured in the training script to run with any number of GPUs
.. code-block:: python

# train Horovod on GPU (number of GPUs / machines provided on command-line)
trainer = Trainer(distributed_backend='horovod', gpus=1)
trainer = Trainer(accelerator='horovod', gpus=1)

# train Horovod on CPU (number of processes / machines provided on command-line)
trainer = Trainer(distributed_backend='horovod')
trainer = Trainer(accelerator='horovod')

When starting the training job, the driver application will then be used to specify the total
number of worker processes:
@@ -556,11 +556,11 @@ Below are the possible configurations we support.
+-------+---------+----+-----+---------+------------------------------------------------------------+
| Y | | | | Y | `Trainer(gpus=1, use_amp=True)` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
| | Y | Y | | | `Trainer(gpus=k, distributed_backend='dp')` |
| | Y | Y | | | `Trainer(gpus=k, accelerator='dp')` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
| | Y | | Y | | `Trainer(gpus=k, distributed_backend='ddp')` |
| | Y | | Y | | `Trainer(gpus=k, accelerator='ddp')` |
+-------+---------+----+-----+---------+------------------------------------------------------------+
| | Y | | Y | Y | `Trainer(gpus=k, distributed_backend='ddp', use_amp=True)` |
| | Y | | Y | Y | `Trainer(gpus=k, accelerator='ddp', use_amp=True)` |
+-------+---------+----+-----+---------+------------------------------------------------------------+


@@ -590,10 +590,10 @@ In (DDP, Horovod) your effective batch size will be 7 * gpus * num_nodes.
.. code-block:: python

# effective batch size = 7 * 8
Trainer(gpus=8, distributed_backend='ddp|horovod')
Trainer(gpus=8, accelerator='ddp|horovod')

# effective batch size = 7 * 8 * 10
Trainer(gpus=8, num_nodes=10, distributed_backend='ddp|horovod')
Trainer(gpus=8, num_nodes=10, accelerator='ddp|horovod')


In DDP2, your effective batch size will be 7 * num_nodes.
@@ -602,10 +602,10 @@ The reason is that the full batch is visible to all GPUs on the node when using
.. code-block:: python

# effective batch size = 7
Trainer(gpus=8, distributed_backend='ddp2')
Trainer(gpus=8, accelerator='ddp2')

# effective batch size = 7 * 10
Trainer(gpus=8, num_nodes=10, distributed_backend='ddp2')
Trainer(gpus=8, num_nodes=10, accelerator='ddp2')


.. note:: Huge batch sizes are actually really bad for convergence. Check out:
@@ -619,7 +619,7 @@ Lightning supports the use of PytorchElastic to enable fault-tolerent and elasti

.. code-block:: python

Trainer(gpus=8, distributed_backend='ddp')
Trainer(gpus=8, accelerator='ddp')


Following the `PytorchElastic Quickstart documentation <https://pytorch.org/elastic/latest/quickstart.html>`_, you then need to start a single-node etcd server on one of the hosts:
4 changes: 2 additions & 2 deletions docs/source/performance.rst
@@ -33,9 +33,9 @@ The best thing to do is to increase the ``num_workers`` slowly and stop once you

Spawn
^^^^^
When using ``distributed_backend=ddp_spawn`` (the ddp default) or TPU training, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood.
When using ``accelerator=ddp_spawn`` (the ddp default) or TPU training, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood.
The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason we recommend you
use ``distributed_backend=ddp`` so you can increase the ``num_workers``, however your script has to be callable like so:
use ``accelerator=ddp`` so you can increase the ``num_workers``, however your script has to be callable like so:

.. code-block:: bash

4 changes: 2 additions & 2 deletions docs/source/slurm.rst
@@ -24,7 +24,7 @@ To train a model using multiple nodes, do the following:
.. code-block:: python

# train on 32 GPUs across 4 nodes
trainer = Trainer(gpus=8, num_nodes=4, distributed_backend='ddp')
trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')

3. It's a good idea to structure your training script like this:

@@ -37,7 +37,7 @@ To train a model using multiple nodes, do the following:
trainer = pl.Trainer(
gpus=8,
num_nodes=4,
distributed_backend='ddp'
accelerator='ddp'
)

trainer.fit(model)
2 changes: 1 addition & 1 deletion docs/source/tpu.rst
@@ -140,7 +140,7 @@ Lightning supports training on a single TPU core. Just pass the TPU core ID [1-8

Distributed Backend with TPU
----------------------------
The ```distributed_backend``` option used for GPUs does not apply to TPUs.
The ``accelerator`` option used for GPUs does not apply to TPUs.
TPUs work in DDP mode by default (distributing over each core).
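
For example, using the ``tpu_cores`` flag shown earlier (a minimal sketch; no ``accelerator`` argument is passed):

.. code-block:: python

    # TPU training distributes across the cores on its own
    trainer = Trainer(tpu_cores=8)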

----------------
14 changes: 7 additions & 7 deletions pytorch_lightning/trainer/__init__.py
@@ -203,18 +203,18 @@ def forward(self, x):
.. testcode::

# default used by the Trainer
trainer = Trainer(distributed_backend=None)
trainer = Trainer(accelerator=None)

Example::

# dp = DataParallel
trainer = Trainer(gpus=2, distributed_backend='dp')
trainer = Trainer(gpus=2, accelerator='dp')

# ddp = DistributedDataParallel
trainer = Trainer(gpus=2, num_nodes=2, distributed_backend='ddp')
trainer = Trainer(gpus=2, num_nodes=2, accelerator='ddp')

# ddp2 = DistributedDataParallel + dp
trainer = Trainer(gpus=2, num_nodes=2, distributed_backend='ddp2')
trainer = Trainer(gpus=2, num_nodes=2, accelerator='ddp2')

.. note:: this option does not apply to TPU. TPUs use ``ddp`` by default (over each core)

@@ -948,16 +948,16 @@ def on_train_end(self, trainer, pl_module):
|

Number of processes to train with. Automatically set to the number of GPUs
when using ``distrbuted_backend="ddp"``. Set to a number greater than 1 when
using ``distributed_backend="ddp_cpu"`` to mimic distributed training on a
when using ``accelerator="ddp"``. Set to a number greater than 1 when
using ``accelerator="ddp_cpu"`` to mimic distributed training on a
machine without GPUs. This is useful for debugging, but **will not** provide
any speedup, since single-process Torch already makes efficient use of multiple
CPUs.

.. testcode::

# Simulate DDP for debugging on your GPU-less laptop
trainer = Trainer(distributed_backend="ddp_cpu", num_processes=2)
trainer = Trainer(accelerator="ddp_cpu", num_processes=2)

num_sanity_val_steps
^^^^^^^^^^^^^^^^^^^^