From 5647087f03214b208e31ae8c749d120d9c15d2df Mon Sep 17 00:00:00 2001
From: edenlightning <66261195+edenlightning@users.noreply.github.com>
Date: Wed, 16 Jun 2021 17:28:51 -0400
Subject: [PATCH] New speed documentation (#7665)
* amp
* amp
* docs
* add guides
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* amp
* amp
* docs
* add guides
* speed guides
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Delete ds.txt
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update conf.py
* Update docs.txt
* remove 16 bit
* remove finetune from speed guide
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* speed
* remove early stopping from speed guide
* fix label
* fix sync
* reviews
* Update trainer.rst
* Update trainer.rst
* Update speed.rst
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
---
docs/source/advanced/amp.rst | 94 -----
docs/source/benchmarking/performance.rst | 183 ---------
docs/source/common/fast_training.rst | 82 ----
docs/source/common/optimizers.rst | 82 ----
docs/source/common/trainer.rst | 65 ++-
docs/source/guides/speed.rst | 482 +++++++++++++++++++++++
docs/source/index.rst | 4 +-
docs/source/starter/new-project.rst | 2 +-
8 files changed, 540 insertions(+), 454 deletions(-)
delete mode 100644 docs/source/advanced/amp.rst
delete mode 100644 docs/source/benchmarking/performance.rst
delete mode 100644 docs/source/common/fast_training.rst
create mode 100644 docs/source/guides/speed.rst
diff --git a/docs/source/advanced/amp.rst b/docs/source/advanced/amp.rst
deleted file mode 100644
index 2c25f9e7f918f..0000000000000
--- a/docs/source/advanced/amp.rst
+++ /dev/null
@@ -1,94 +0,0 @@
-.. testsetup:: *
-
- from pytorch_lightning.trainer.trainer import Trainer
-
-.. _amp:
-
-16-bit training
-=================
-Lightning offers 16-bit training for CPUs, GPUs, and TPUs.
-
-.. raw:: html
-
-
-
-|
-
-
-----------
-
-GPU 16-bit
-----------
-16-bit precision can cut your memory footprint by half.
-If using volta architecture GPUs it can give a dramatic training speed-up as well.
-
-.. note:: PyTorch 1.6+ is recommended for 16-bit
-
-Native torch
-^^^^^^^^^^^^
-When using PyTorch 1.6+ Lightning uses the native amp implementation to support 16-bit.
-
-.. testcode::
- :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
-
- # turn on 16-bit
- trainer = Trainer(precision=16, gpus=1)
-
-Apex 16-bit
-^^^^^^^^^^^
-If you are using an earlier version of PyTorch Lightning uses Apex to support 16-bit.
-
-Follow these instructions to install Apex.
-To use 16-bit precision, do two things:
-
-1. Install Apex
-2. Set the "precision" trainer flag.
-
-.. code-block:: bash
-
- # ------------------------
- # OPTIONAL: on your cluster you might need to load CUDA 10 or 9
- # depending on how you installed PyTorch
-
- # see available modules
- module avail
-
- # load correct CUDA before install
- module load cuda-10.0
- # ------------------------
-
- # make sure you've loaded a cuda version > 4.0 and < 7.0
- module load gcc-6.1.0
-
- $ pip install --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" https://github.com/NVIDIA/apex
-
-.. warning:: NVIDIA Apex and DDP have instability problems. We recommend native 16-bit in PyTorch 1.6+
-
-Enable 16-bit
-^^^^^^^^^^^^^
-
-.. testcode::
- :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
-
- # turn on 16-bit
- trainer = Trainer(amp_level='O2', precision=16)
-
-If you need to configure the apex init for your particular use case or want to use a different way of doing
-16-bit training, override :meth:`pytorch_lightning.core.LightningModule.configure_apex`.
-
-----------
-
-TPU 16-bit
-----------
-16-bit on TPUs is much simpler. To use 16-bit with TPUs set precision to 16 when using the TPU flag
-
-.. testcode::
- :skipif: not _TPU_AVAILABLE
-
- # DEFAULT
- trainer = Trainer(tpu_cores=8, precision=32)
-
- # turn on 16-bit
- trainer = Trainer(tpu_cores=8, precision=16)
diff --git a/docs/source/benchmarking/performance.rst b/docs/source/benchmarking/performance.rst
deleted file mode 100644
index 6e2b546fb275f..0000000000000
--- a/docs/source/benchmarking/performance.rst
+++ /dev/null
@@ -1,183 +0,0 @@
-.. _performance:
-
-Fast performance tips
-=====================
-Lightning builds in all the micro-optimizations we can find to increase your performance.
-But we can only automate so much.
-
-Here are some additional things you can do to increase your performance.
-
-----------
-
-Dataloaders
------------
-When building your DataLoader set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs).
-
-.. code-block:: python
-
- Dataloader(dataset, num_workers=8, pin_memory=True)
-
-num_workers
-^^^^^^^^^^^
-The question of how many ``num_workers`` is tricky. Here's a summary of
-some references, [`1 `_], and our suggestions.
-
-1. ``num_workers=0`` means ONLY the main process will load batches (that can be a bottleneck).
-2. ``num_workers=1`` means ONLY one worker (just not the main process) will load data but it will still be slow.
-3. The ``num_workers`` depends on the batch size and your machine.
-4. A general place to start is to set ``num_workers`` equal to the number of CPUs on that machine.
-
-.. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption.
-
-The best thing to do is to increase the ``num_workers`` slowly and stop once you see no more improvement in your training speed.
-
-Spawn
-^^^^^
-When using ``accelerator=ddp_spawn`` (the ddp default) or TPU training, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood.
-The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason we recommend you
-use ``accelerator=ddp`` so you can increase the ``num_workers``, however your script has to be callable like so:
-
-.. code-block:: bash
-
- python my_program.py --gpus X
-
-----------
-
-.item(), .numpy(), .cpu()
--------------------------
-Don't call ``.item()`` anywhere in your code. Use ``.detach()`` instead to remove the connected graph calls. Lightning
-takes a great deal of care to be optimized for this.
-
-----------
-
-empty_cache()
--------------
-Don't call this unnecessarily! Every time you call this ALL your GPUs have to wait to sync.
-
-----------
-
-Construct tensors directly on the device
-----------------------------------------
-LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer.
-
-.. code-block:: python
-
- # bad
- t = torch.rand(2, 2).cuda()
-
- # good (self is LightningModule)
- t = torch.rand(2, 2, device=self.device)
-
-
-For tensors that need to be model attributes, it is best practice to register them as buffers in the modules's
-``__init__`` method:
-
-.. code-block:: python
-
- # bad
- self.t = torch.rand(2, 2, device=self.device)
-
- # good
- self.register_buffer("t", torch.rand(2, 2))
-
-----------
-
-Use DDP not DP
---------------
-DP performs three GPU transfers for EVERY batch:
-
-1. Copy model to device.
-2. Copy data to device.
-3. Copy outputs of each device back to master.
-
-|
-
-Whereas DDP only performs 1 transfer to sync gradients. Because of this, DDP is MUCH faster than DP.
-
-When using DDP set find_unused_parameters=False
------------------------------------------------
-
-By default we have enabled find unused parameters to True. This is for compatibility issues that have arisen in the past (see the `discussion `_ for more information).
-This by default comes with a performance hit, and can be disabled in most cases.
-
-.. code-block:: python
-
- from pytorch_lightning.plugins import DDPPlugin
-
- trainer = pl.Trainer(
- gpus=2,
- plugins=DDPPlugin(find_unused_parameters=False),
- )
-
-----------
-
-16-bit precision
-----------------
-Use 16-bit to decrease the memory consumption (and thus increase your batch size). On certain GPUs (V100s, 2080tis), 16-bit calculations are also faster.
-However, know that 16-bit and multi-processing (any DDP) can have issues. Here are some common problems.
-
-1. `CUDA error: an illegal memory access was encountered `_.
- The solution is likely setting a specific CUDA, CUDNN, PyTorch version combination.
-2. ``CUDA error: device-side assert triggered``. This is a general catch-all error. To see the actual error run your script like so:
-
-.. code-block:: bash
-
- # won't see what the error is
- python main.py
-
- # will see what the error is
- CUDA_LAUNCH_BLOCKING=1 python main.py
-
-.. tip:: We also recommend using 16-bit native found in PyTorch 1.6. Just install this version and Lightning will automatically use it.
-
-----------
-
-Advanced GPU Optimizations
---------------------------
-
-When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling.
-Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`.
-
-----------
-
-Preload Data Into RAM
----------------------
-
-When your training or preprocessing requires many operations to be performed on entire dataset(s) it can
-sometimes be beneficial to store all data in RAM given there is enough space.
-However, loading all data at the beginning of the training script has the disadvantage that it can take a long
-time and hence it slows down the development process. Another downside is that in multiprocessing (e.g. DDP)
-the data would get copied in each process.
-One can overcome these problems by copying the data into RAM in advance.
-Most UNIX-based operating systems provide direct access to tmpfs through a mount point typically named ``/dev/shm``.
-
-0. Increase shared memory if necessary. Refer to the documentation of your OS how to do this.
-
-1. Copy training data to shared memory:
-
- .. code-block:: bash
-
- cp -r /path/to/data/on/disk /dev/shm/
-
-2. Refer to the new data root in your script or command line arguments:
-
- .. code-block:: python
-
- datamodule = MyDataModule(data_root="/dev/shm/my_data")
-
-----------
-
-Zero Grad ``set_to_none=True``
-------------------------------
-
-In order to modestly improve performance, you can override :meth:`~pytorch_lightning.core.lightning.LightningModule.optimizer_zero_grad`.
-
-For a more detailed explanation of pros / cons of this technique,
-read `this `_ documentation by the PyTorch team.
-
-.. testcode::
-
- class Model(LightningModule):
-
- def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx):
- optimizer.zero_grad(set_to_none=True)
diff --git a/docs/source/common/fast_training.rst b/docs/source/common/fast_training.rst
deleted file mode 100644
index 2216d234836f2..0000000000000
--- a/docs/source/common/fast_training.rst
+++ /dev/null
@@ -1,82 +0,0 @@
-.. testsetup:: *
-
- from pytorch_lightning.trainer.trainer import Trainer
-
-.. _fast_training:
-
-Fast Training
-=============
-There are multiple options to speed up different parts of the training by choosing to train
-on a subset of data. This could be done for speed or debugging purposes.
-
-----------------
-
-Check validation every n epochs
--------------------------------
-If you have a small dataset you might want to check validation every n epochs
-
-.. testcode::
-
- # DEFAULT
- trainer = Trainer(check_val_every_n_epoch=1)
-
-----------------
-
-Force training for min or max epochs
-------------------------------------
-It can be useful to force training for a minimum number of epochs or limit to a max number.
-
-.. seealso::
- :class:`~pytorch_lightning.trainer.trainer.Trainer`
-
-.. testcode::
-
- # DEFAULT
- trainer = Trainer(min_epochs=1, max_epochs=1000)
-
-----------------
-
-Set validation check frequency within 1 training epoch
-------------------------------------------------------
-For large datasets it's often desirable to check validation multiple times within a training loop.
-Pass in a float to check that often within 1 training epoch. Pass in an int `k` to check every `k` training batches.
-Must use an `int` if using an `IterableDataset`.
-
-.. testcode::
-
- # DEFAULT
- trainer = Trainer(val_check_interval=0.95)
-
- # check every .25 of an epoch
- trainer = Trainer(val_check_interval=0.25)
-
- # check every 100 train batches (ie: for `IterableDatasets` or fixed frequency)
- trainer = Trainer(val_check_interval=100)
-
-----------------
-
-Use data subset for training, validation, and test
---------------------------------------------------
-If you don't want to check 100% of the training/validation/test set (for debugging or if it's huge), set these flags.
-
-.. testcode::
-
- # DEFAULT
- trainer = Trainer(
- limit_train_batches=1.0,
- limit_val_batches=1.0,
- limit_test_batches=1.0
- )
-
- # check 10%, 20%, 30% only, respectively for training, validation and test set
- trainer = Trainer(
- limit_train_batches=0.1,
- limit_val_batches=0.2,
- limit_test_batches=0.3
- )
-
-If you also pass ``shuffle=True`` to the dataloader, a different random subset of your dataset will be used for each epoch; otherwise the same subset will be used for all epochs.
-
-.. note:: ``limit_train_batches``, ``limit_val_batches`` and ``limit_test_batches`` will be overwritten by ``overfit_batches`` if ``overfit_batches`` > 0. ``limit_val_batches`` will be ignored if ``fast_dev_run=True``.
-
-.. note:: If you set ``limit_val_batches=0``, validation will be disabled.
diff --git a/docs/source/common/optimizers.rst b/docs/source/common/optimizers.rst
index 12e9c6925e7fd..cde203fdd193e 100644
--- a/docs/source/common/optimizers.rst
+++ b/docs/source/common/optimizers.rst
@@ -232,88 +232,6 @@ If you want to call ``lr_scheduler.step()`` every ``n`` steps/epochs, do the fol
-----
-Improve training speed with model toggling
-------------------------------------------
-Toggling models can improve your training speed when performing gradient accumulation with multiple optimizers in a
-distributed setting.
-
-Here is an explanation of what it does:
-
-* Considering the current optimizer as A and all other optimizers as B.
-* Toggling means that all parameters from B exclusive to A will have their ``requires_grad`` attribute set to ``False``.
-* Their original state will be restored when exiting the context manager.
-
-When performing gradient accumulation, there is no need to perform grad synchronization during the accumulation phase.
-Setting ``sync_grad`` to ``False`` will block this synchronization and improve your training speed.
-
-:class:`~pytorch_lightning.core.optimizer.LightningOptimizer` provides a
-:meth:`~pytorch_lightning.core.optimizer.LightningOptimizer.toggle_model` function as a
-:func:`contextlib.contextmanager` for advanced users.
-
-Here is an example for advanced use-case.
-
-.. testcode:: python
-
- # Scenario for a GAN with gradient accumulation every 2 batches and optimized for multiple gpus.
- class SimpleGAN(LightningModule):
-
- def __init__(self):
- super().__init__()
- self.automatic_optimization = False
-
- def training_step(self, batch, batch_idx):
- # Implementation follows the PyTorch tutorial:
- # https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html
- g_opt, d_opt = self.optimizers()
-
- X, _ = batch
- X.requires_grad = True
- batch_size = X.shape[0]
-
- real_label = torch.ones((batch_size, 1), device=self.device)
- fake_label = torch.zeros((batch_size, 1), device=self.device)
-
- # Sync and clear gradients
- # at the end of accumulation or
- # at the end of an epoch.
- is_last_batch_to_accumulate = \
- (batch_idx + 1) % 2 == 0 or self.trainer.is_last_batch
-
- g_X = self.sample_G(batch_size)
-
- ##########################
- # Optimize Discriminator #
- ##########################
- with d_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
- d_x = self.D(X)
- errD_real = self.criterion(d_x, real_label)
-
- d_z = self.D(g_X.detach())
- errD_fake = self.criterion(d_z, fake_label)
-
- errD = (errD_real + errD_fake)
-
- self.manual_backward(errD)
- if is_last_batch_to_accumulate:
- d_opt.step()
- d_opt.zero_grad()
-
- ######################
- # Optimize Generator #
- ######################
- with g_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
- d_z = self.D(g_X)
- errG = self.criterion(d_z, real_label)
-
- self.manual_backward(errG)
- if is_last_batch_to_accumulate:
- g_opt.step()
- g_opt.zero_grad()
-
- self.log_dict({'g_loss': errG, 'd_loss': errD}, prog_bar=True)
-
------
-
Use closure for LBFGS-like optimizers
-------------------------------------
It is a good practice to provide the optimizer with a closure function that performs a ``forward``, ``zero_grad`` and
diff --git a/docs/source/common/trainer.rst b/docs/source/common/trainer.rst
index ea32ea3dd55dc..0983f0acb9eec 100644
--- a/docs/source/common/trainer.rst
+++ b/docs/source/common/trainer.rst
@@ -196,6 +196,8 @@ unique seeds across all dataloader workers and processes for :mod:`torch`, :mod:
-------
+.. _trainer_flags:
+
Trainer flags
-------------
@@ -658,6 +660,8 @@ Writes logs to disk this often.
See Also:
- :doc:`logging <../extensions/logging>`
+.. _gpus:
+
gpus
^^^^
@@ -1155,28 +1159,69 @@ precision
|
-Double precision (64), full precision (32) or half precision (16).
-Can all be used on GPU or TPUs. Only double (64) and full precision (32) available on CPU.
+Lightning supports either double precision (64), full precision (32), or half precision (16) training.
-If used on TPU will use torch.bfloat16 but tensor printing
-will still show torch.float32.
+Half precision, or mixed precision, combines 32-bit and 16-bit floating point operations to reduce the memory footprint during model training. This can improve performance, achieving upwards of 3x speedups on modern GPUs.
.. testcode::
:skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
# default used by the Trainer
- trainer = Trainer(precision=32)
+ trainer = Trainer(precision=32, gpus=1)
# 16-bit precision
trainer = Trainer(precision=16, gpus=1)
# 64-bit precision
- trainer = Trainer(precision=64)
+ trainer = Trainer(precision=64, gpus=1)
+
+
+.. note:: When running on TPUs, torch.bfloat16 will be used under the hood, but tensor printing will still show torch.float32.
+
+.. note:: 16-bit precision is not supported on CPUs.
+
+
+.. admonition:: When using PyTorch 1.6+, Lightning uses the native AMP implementation to support 16-bit precision. 16-bit precision with PyTorch < 1.6 is supported by NVIDIA Apex library.
+ :class: dropdown, warning
+
+ NVIDIA Apex and DDP have instability problems. We recommend upgrading to PyTorch 1.6+ in order to use the native AMP 16-bit precision with multiple GPUs.
+
+    If you are using an earlier version of PyTorch (before 1.6), Lightning uses `Apex <https://github.com/NVIDIA/apex>`_ to support 16-bit training.
+
+ To use Apex 16-bit training:
+
+ 1. Install Apex
+
+ .. code-block:: bash
+
+ # ------------------------
+ # OPTIONAL: on your cluster you might need to load CUDA 10 or 9
+ # depending on how you installed PyTorch
+
+ # see available modules
+ module avail
+
+ # load correct CUDA before install
+ module load cuda-10.0
+ # ------------------------
+
+ # make sure you've loaded a GCC version > 4.0 and < 7.0
+ module load gcc-6.1.0
+
+ pip install --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" https://github.com/NVIDIA/apex
+
+ 2. Set the `precision` trainer flag to 16. You can customize the `Apex optimization level `_ by setting the `amp_level` flag.
+
+ .. testcode::
+ :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
+
+ # turn on 16-bit
+ trainer = Trainer(amp_backend="apex", amp_level='O2', precision=16)
+
+ If you need to configure the apex init for your particular use case, or want to customize the
+ 16-bit training behaviour, override :meth:`pytorch_lightning.core.LightningModule.configure_apex`.
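+
+    A minimal sketch of such an override (the ``amp.initialize`` arguments shown here are illustrative, not required):
+
+    .. code-block:: python
+
+        class MyApexModel(LightningModule):
+            def configure_apex(self, amp, model, optimizers, amp_level):
+                # run the Apex initialization yourself to pass custom arguments,
+                # e.g. a fixed loss scale instead of the dynamic default
+                model, optimizers = amp.initialize(model, optimizers, opt_level=amp_level, loss_scale=128.0)
+                return model, optimizers
+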
-Example::
- # one day
- trainer = Trainer(precision=8|4|2)
process_position
^^^^^^^^^^^^^^^^
@@ -1378,6 +1423,8 @@ track_grad_norm
# track the 2-norm
trainer = Trainer(track_grad_norm=2)
+.. _tpu_cores:
+
tpu_cores
^^^^^^^^^
diff --git a/docs/source/guides/speed.rst b/docs/source/guides/speed.rst
new file mode 100644
index 0000000000000..ece806558c76c
--- /dev/null
+++ b/docs/source/guides/speed.rst
@@ -0,0 +1,482 @@
+.. testsetup:: *
+
+ from pytorch_lightning.trainer.trainer import Trainer
+ from pytorch_lightning.callbacks.early_stopping import EarlyStopping
+ from pytorch_lightning.core.lightning import LightningModule
+
+.. _speed:
+
+#######################
+Speed up model training
+#######################
+
+There are multiple ways you can speed up your model's time to convergence:
+
+* `GPU/TPU training <#gpu-tpu-training>`_
+
+* `Mixed precision (16-bit) training <#mixed-precision-16-bit-training>`_
+
+* `Control Training Epochs <#control-training-epochs>`_
+
+* `Control Validation Frequency <#control-validation-frequency>`_
+
+* `Limit Dataset Size <#limit-dataset-size>`_
+
+* `Preload Data Into RAM <#preload-data-into-ram>`_
+
+* `Model Toggling <#model-toggling>`_
+
+* `Set Grads to None <#set-grads-to-none>`_
+
+* `Things to avoid <#things-to-avoid>`_
+
+****************
+GPU/TPU training
+****************
+
+**Use when:** Whenever possible!
+
+With Lightning, running on GPUs, TPUs, or multiple nodes is a simple switch of a flag.
+
+GPU training
+============
+
+Lightning supports a variety of plugins to further speed up distributed GPU training. Most notably:
+
+* :class:`~pytorch_lightning.plugins.training_type.DDPPlugin`
+* :class:`~pytorch_lightning.plugins.training_type.DDPShardedPlugin`
+* :class:`~pytorch_lightning.plugins.training_type.DeepSpeedPlugin`
+
+.. code-block:: python
+
+ # run on 1 gpu
+ trainer = Trainer(gpus=1)
+
+ # train on 8 gpus, using DDP plugin
+ trainer = Trainer(gpus=8, accelerator="ddp")
+
+ # train on multiple GPUs across nodes (uses 8 gpus in total)
+ trainer = Trainer(gpus=2, num_nodes=4)
+
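+A minimal sketch of enabling one of these plugins by its string alias (this assumes the optional dependencies, e.g. fairscale for sharded training or deepspeed, are installed):
+
+.. code-block:: python
+
+    # shard optimizer state and gradients across GPUs (fairscale)
+    trainer = Trainer(gpus=8, plugins="ddp_sharded")
+
+    # use DeepSpeed ZeRO optimizations for very large models
+    trainer = Trainer(gpus=8, plugins="deepspeed", precision=16)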
+
+GPU Training Speedup Tips
+-------------------------
+
+When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling.
+Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`.
+
+Prefer DDP over DP
+^^^^^^^^^^^^^^^^^^
+:class:`~pytorch_lightning.plugins.training_type.DataParallelPlugin` performs three GPU transfers for EVERY batch:
+
+1. Copy model to device.
+2. Copy data to device.
+3. Copy outputs of each device back to master.
+
+Whereas :class:`~pytorch_lightning.plugins.training_type.DDPPlugin` only performs 1 transfer to sync gradients, making DDP MUCH faster than DP.
+
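+A minimal sketch of the two modes side by side (2 GPUs assumed):
+
+.. code-block:: python
+
+    # DataParallel: splits every batch across GPUs and gathers the outputs on the master device
+    trainer = Trainer(gpus=2, accelerator="dp")
+
+    # DistributedDataParallel: one process per GPU, only gradients are synchronized
+    trainer = Trainer(gpus=2, accelerator="ddp")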
+
+When using DDP set find_unused_parameters=False
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+By default we have set ``find_unused_parameters`` to True for compatibility issues that have arisen in the past (see the `discussion `_ for more information).
+This by default comes with a performance hit, and can be disabled in most cases.
+
+.. code-block:: python
+
+ from pytorch_lightning.plugins import DDPPlugin
+
+ trainer = pl.Trainer(
+ gpus=2,
+ plugins=DDPPlugin(find_unused_parameters=False),
+ )
+
+Dataloaders
+^^^^^^^^^^^
+When building your DataLoader set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs).
+
+.. code-block:: python
+
+ Dataloader(dataset, num_workers=8, pin_memory=True)
+
+num_workers
+"""""""""""
+
+The question of how many workers to specify in ``num_workers`` is tricky. Here's a summary of
+some references, [`1 `_], and our suggestions:
+
+1. ``num_workers=0`` means ONLY the main process will load batches (that can be a bottleneck).
+2. ``num_workers=1`` means ONLY one worker (just not the main process) will load data but it will still be slow.
+3. The ``num_workers`` depends on the batch size and your machine.
+4. A general place to start is to set ``num_workers`` equal to the number of CPU cores on that machine. You can get the number of CPU cores in Python with ``os.cpu_count()``, but note that, depending on your batch size, you may run out of RAM (see the sketch below).
+
+.. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption.
+
+The best thing to do is to increase the ``num_workers`` slowly and stop once you see no more improvement in your training speed.
+
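+For example, a minimal sketch of a DataLoader configured along these lines (``dataset`` stands in for your own ``Dataset`` instance):
+
+.. code-block:: python
+
+    import os
+
+    from torch.utils.data import DataLoader
+
+    # start with one worker per CPU core and tune from there
+    train_loader = DataLoader(dataset, batch_size=64, num_workers=os.cpu_count(), pin_memory=True)
+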
+Spawn
+"""""
+When using ``accelerator=ddp_spawn`` or training on TPUs, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood.
+The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason we recommend you
+use ``accelerator=ddp`` so you can increase the ``num_workers``; however, your script has to be callable like so:
+
+.. code-block:: bash
+
+ python my_program.py
+
+
+TPU training
+============
+
+You can set the ``tpu_cores`` trainer flag to 1 or 8 cores.
+
+.. code-block:: python
+
+ # train on 1 TPU core
+ trainer = Trainer(tpu_cores=1)
+
+ # train on 8 TPU cores
+ trainer = Trainer(tpu_cores=8)
+
+To train on more than 8 cores (i.e. a TPU pod),
+submit this script using the xla_dist script.
+
+Example::
+
+    python -m torch_xla.distributed.xla_dist \
+    --tpu=$TPU_POD_NAME \
+    --conda-env=torch-xla-nightly \
+    --env=XLA_USE_BF16=1 \
+    -- python your_trainer_file.py
+
+
+Read more in our :ref:`accelerators` and :ref:`plugins` guides.
+
+
+-----------
+
+.. _amp:
+
+*********************************
+Mixed precision (16-bit) training
+*********************************
+
+**Use when:**
+
+* You want to optimize for memory usage on a GPU.
+* You have a GPU that supports 16-bit precision (NVIDIA Pascal architecture or newer).
+* Your optimization algorithm (training_step) is numerically stable.
+* You want to be the cool person in the lab :p
+
+
+
+Mixed precision combines 32-bit and 16-bit floating point operations to reduce the memory footprint during model training, which can improve performance and achieve upwards of 3x speedups on modern GPUs.
+
+Lightning offers mixed precision or 16-bit training for GPUs and TPUs.
+
+
+.. testcode::
+ :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
+
+ # 16-bit precision
+ trainer = Trainer(precision=16, gpus=4)
+
+
+----------------
+
+
+***********************
+Control Training Epochs
+***********************
+
+**Use when:** You run a hyperparameter search to find good initial parameters and want to save time, cost (money), or power (environment).
+Limiting training length makes you more cost efficient and lets you run more experiments at the same time.
+
+Use the ``min_epochs`` and ``max_epochs`` Trainer flags to force training for a minimum number of epochs or to stop it after a maximum number of epochs.
+
+.. testcode::
+
+ # DEFAULT
+ trainer = Trainer(min_epochs=1, max_epochs=1000)
+
+
+If you run iteration-based training, i.e. with an infinite or iterable dataloader, you can also control the number of steps with the ``min_steps`` and ``max_steps`` flags:
+
+.. testcode::
+
+ trainer = Trainer(max_steps=1000)
+
+ trainer = Trainer(min_steps=100)
+
+You can also interrupt training based on elapsed training time:
+
+.. testcode::
+
+ # Stop after 12 hours of training or when reaching 10 epochs (string)
+ trainer = Trainer(max_time="00:12:00:00", max_epochs=10)
+
+ # Stop after 1 day and 5 hours (dict)
+ trainer = Trainer(max_time={"days": 1, "hours": 5})
+
+Learn more in our :ref:`trainer_flags` guide.
+
+
+----------------
+
+****************************
+Control Validation Frequency
+****************************
+
+Check validation every n epochs
+===============================
+
+**Use when:** You have a small dataset, and want to run fewer validation checks.
+
+You can limit validation checks to run only every n epochs using the ``check_val_every_n_epoch`` Trainer flag.
+
+.. testcode::
+
+ # DEFAULT
+ trainer = Trainer(check_val_every_n_epoch=1)
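+
+    # run validation only every 10 epochs
+    trainer = Trainer(check_val_every_n_epoch=10)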
+
+
+Set validation check frequency within 1 training epoch
+======================================================
+
+**Use when:** You have a large training dataset, and want to run mid-epoch validation checks.
+
+For large datasets, it's often desirable to check validation multiple times within a training loop.
+Pass in a float to check that often within 1 training epoch. Pass in an int `k` to check every `k` training batches.
+Must use an `int` if using an `IterableDataset`.
+
+.. testcode::
+
+ # DEFAULT
+ trainer = Trainer(val_check_interval=0.95)
+
+ # check every .25 of an epoch
+ trainer = Trainer(val_check_interval=0.25)
+
+ # check every 100 train batches (ie: for `IterableDatasets` or fixed frequency)
+ trainer = Trainer(val_check_interval=100)
+
+Learn more in our :ref:`trainer_flags` guide.
+
+----------------
+
+******************
+Limit Dataset Size
+******************
+
+Use data subset for training, validation, and test
+==================================================
+
+**Use when:** Debugging or running huge datasets.
+
+If you don't want to check 100% of the training/validation/test set, set these flags:
+
+.. testcode::
+
+ # DEFAULT
+ trainer = Trainer(
+ limit_train_batches=1.0,
+ limit_val_batches=1.0,
+ limit_test_batches=1.0
+ )
+
+ # check 10%, 20%, 30% only, respectively for training, validation and test set
+ trainer = Trainer(
+ limit_train_batches=0.1,
+ limit_val_batches=0.2,
+ limit_test_batches=0.3
+ )
+
+If you also pass ``shuffle=True`` to the dataloader, a different random subset of your dataset will be used for each epoch; otherwise the same subset will be used for all epochs.
+
+.. note:: ``limit_train_batches``, ``limit_val_batches`` and ``limit_test_batches`` will be overwritten by ``overfit_batches`` if ``overfit_batches`` > 0. ``limit_val_batches`` will be ignored if ``fast_dev_run=True``.
+
+.. note:: If you set ``limit_val_batches=0``, validation will be disabled.
+
+Learn more in our :ref:`trainer_flags` guide.
+
+-----
+
+*********************
+Preload Data Into RAM
+*********************
+
+**Use when:** You need access to all samples in a dataset at once.
+
+When your training or preprocessing requires many operations to be performed on entire dataset(s), it can
+sometimes be beneficial to store all data in RAM given there is enough space.
+However, loading all data at the beginning of the training script has the disadvantage that it can take a long
+time and hence it slows down the development process. Another downside is that in multiprocessing (e.g. DDP)
+the data would get copied in each process.
+One can overcome these problems by copying the data into RAM in advance.
+Most UNIX-based operating systems provide direct access to tmpfs through a mount point typically named ``/dev/shm``.
+
+0. Increase shared memory if necessary. Refer to the documentation of your OS for how to do this.
+
+1. Copy training data to shared memory:
+
+ .. code-block:: bash
+
+ cp -r /path/to/data/on/disk /dev/shm/
+
+2. Refer to the new data root in your script or command line arguments:
+
+ .. code-block:: python
+
+ datamodule = MyDataModule(data_root="/dev/shm/my_data")
+
+---------
+
+**************
+Model Toggling
+**************
+
+**Use when:** Performing gradient accumulation with multiple optimizers in a
+distributed setting.
+
+Here is an explanation of what it does:
+
+* Consider the current optimizer as A and all other optimizers as B.
+* Toggling means that all parameters from B exclusive to A will have their ``requires_grad`` attribute set to ``False``.
+* Their original state will be restored when exiting the context manager.
+
+When performing gradient accumulation, there is no need to perform grad synchronization during the accumulation phase.
+Setting ``sync_grad`` to ``False`` will block this synchronization and improve your training speed.
+
+:class:`~pytorch_lightning.core.optimizer.LightningOptimizer` provides a
+:meth:`~pytorch_lightning.core.optimizer.LightningOptimizer.toggle_model` function as a
+:func:`contextlib.contextmanager` for advanced users.
+
+Here is an example of an advanced use case:
+
+.. testcode::
+
+    # Scenario for a GAN with gradient accumulation every 2 batches and optimized for multiple gpus.
+    class SimpleGAN(LightningModule):
+
+        def __init__(self):
+            super().__init__()
+            self.automatic_optimization = False
+
+        def training_step(self, batch, batch_idx):
+            # Implementation follows the PyTorch tutorial:
+            # https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html
+            g_opt, d_opt = self.optimizers()
+
+            X, _ = batch
+            X.requires_grad = True
+            batch_size = X.shape[0]
+
+            real_label = torch.ones((batch_size, 1), device=self.device)
+            fake_label = torch.zeros((batch_size, 1), device=self.device)
+
+            # Sync and clear gradients
+            # at the end of accumulation or
+            # at the end of an epoch.
+            is_last_batch_to_accumulate = \
+                (batch_idx + 1) % 2 == 0 or self.trainer.is_last_batch
+
+            g_X = self.sample_G(batch_size)
+
+            ##########################
+            # Optimize Discriminator #
+            ##########################
+            with d_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
+                d_x = self.D(X)
+                errD_real = self.criterion(d_x, real_label)
+
+                d_z = self.D(g_X.detach())
+                errD_fake = self.criterion(d_z, fake_label)
+
+                errD = (errD_real + errD_fake)
+
+                self.manual_backward(errD)
+                if is_last_batch_to_accumulate:
+                    d_opt.step()
+                    d_opt.zero_grad()
+
+            ######################
+            # Optimize Generator #
+            ######################
+            with g_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
+                d_z = self.D(g_X)
+                errG = self.criterion(d_z, real_label)
+
+                self.manual_backward(errG)
+                if is_last_batch_to_accumulate:
+                    g_opt.step()
+                    g_opt.zero_grad()
+
+            self.log_dict({'g_loss': errG, 'd_loss': errD}, prog_bar=True)
+
+-----
+
+*****************
+Set Grads to None
+*****************
+
+In order to modestly improve performance, you can override :meth:`~pytorch_lightning.core.lightning.LightningModule.optimizer_zero_grad`.
+
+For a more detailed explanation of pros / cons of this technique,
+read `this `_ documentation by the PyTorch team.
+
+.. testcode::
+
+    class Model(LightningModule):
+
+        def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx):
+            optimizer.zero_grad(set_to_none=True)
+
+
+-----
+
+***************
+Things to avoid
+***************
+
+.item(), .numpy(), .cpu()
+=========================
+Don't call ``.item()`` anywhere in your code. Use ``.detach()`` instead to remove the connected graph calls. Lightning
+takes a great deal of care to be optimized for this.
+
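+A minimal sketch of the pattern (``loss`` and ``running_loss`` are illustrative names for a tensor produced inside ``training_step``):
+
+.. code-block:: python
+
+    # bad: .item() forces a device -> CPU transfer and a synchronization point
+    running_loss += loss.item()
+
+    # good: keep the value on the device, detached from the autograd graph
+    running_loss += loss.detach()
+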
+----------
+
+empty_cache()
+=============
+Don't call this unnecessarily! Every time you call this ALL your GPUs have to wait to sync.
+
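+In other words, avoid sprinkling calls like this through your training loop:
+
+.. code-block:: python
+
+    # bad: forces all GPUs to synchronize on every call and rarely frees useful memory
+    torch.cuda.empty_cache()
+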
+----------
+
+Transferring tensors to device
+==============================
+LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer.
+
+.. code-block:: python
+
+ # bad
+ t = torch.rand(2, 2).cuda()
+
+ # good (self is LightningModule)
+ t = torch.rand(2, 2, device=self.device)
+
+
+For tensors that need to be model attributes, it is best practice to register them as buffers in the module's
+``__init__`` method:
+
+.. code-block:: python
+
+ # bad
+ self.t = torch.rand(2, 2, device=self.device)
+
+ # good
+ self.register_buffer("t", torch.rand(2, 2))
diff --git a/docs/source/index.rst b/docs/source/index.rst
index e7d0030e0e6f6..61abc2b010834 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -21,8 +21,8 @@ PyTorch Lightning Documentation
:name: guides
:caption: Best practices
+ guides/speed
starter/style_guide
- benchmarking/performance
Lightning project template
benchmarking/benchmarks
@@ -98,12 +98,10 @@ PyTorch Lightning Documentation
clouds/cloud_training
clouds/cluster
- advanced/amp
common/child_modules
common/debugging
common/loggers
common/early_stopping
- common/fast_training
common/hyperparameters
common/lightning_cli
advanced/lr_finder
diff --git a/docs/source/starter/new-project.rst b/docs/source/starter/new-project.rst
index 74ad30102b4f8..07bf3624560a0 100644
--- a/docs/source/starter/new-project.rst
+++ b/docs/source/starter/new-project.rst
@@ -219,7 +219,7 @@ The :class:`~pytorch_lightning.trainer.Trainer` automates:
* Tensorboard (see :doc:`loggers <../common/loggers>` options)
* :doc:`Multi-GPU <../advanced/multi_gpu>` support
* :doc:`TPU <../advanced/tpu>`
-* :doc:`AMP <../advanced/amp>` support
+* :ref:`16-bit precision AMP <amp>` support
.. tip:: If you prefer to manually manage optimizers you can use the :ref:`manual_opt` mode (ie: RL, GANs, etc...).