From 5647087f03214b208e31ae8c749d120d9c15d2df Mon Sep 17 00:00:00 2001
From: edenlightning <66261195+edenlightning@users.noreply.github.com>
Date: Wed, 16 Jun 2021 17:28:51 -0400
Subject: [PATCH] New speed documentation (#7665)
* amp
* amp
* docs
* add guides
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* amp
* amp
* docs
* add guides
* speed guides
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Delete ds.txt
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update conf.py
* Update docs.txt
* remove 16 bit
* remove finetune from speed guide
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* speed
* remove early stopping from speed guide
* fix label
* fix sync
* reviews
* Update trainer.rst
* Update trainer.rst
* Update speed.rst
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
---
docs/source/advanced/amp.rst | 94 -----
docs/source/benchmarking/performance.rst | 183 ---------
docs/source/common/fast_training.rst | 82 ----
docs/source/common/optimizers.rst | 82 ----
docs/source/common/trainer.rst | 65 ++-
docs/source/guides/speed.rst | 482 +++++++++++++++++++++++
docs/source/index.rst | 4 +-
docs/source/starter/new-project.rst | 2 +-
8 files changed, 540 insertions(+), 454 deletions(-)
delete mode 100644 docs/source/advanced/amp.rst
delete mode 100644 docs/source/benchmarking/performance.rst
delete mode 100644 docs/source/common/fast_training.rst
create mode 100644 docs/source/guides/speed.rst
diff --git a/docs/source/advanced/amp.rst b/docs/source/advanced/amp.rst
deleted file mode 100644
index 2c25f9e7f918f..0000000000000
--- a/docs/source/advanced/amp.rst
+++ /dev/null
@@ -1,94 +0,0 @@
-.. testsetup:: *
-
- from pytorch_lightning.trainer.trainer import Trainer
-
-.. _amp:
-
-16-bit training
-=================
-Lightning offers 16-bit training for CPUs, GPUs, and TPUs.
-
-.. raw:: html
-
-
-
-|
-
-
-----------
-
-GPU 16-bit
-----------
-16-bit precision can cut your memory footprint by half.
-If using volta architecture GPUs it can give a dramatic training speed-up as well.
-
-.. note:: PyTorch 1.6+ is recommended for 16-bit
-
-Native torch
-^^^^^^^^^^^^
-When using PyTorch 1.6+ Lightning uses the native amp implementation to support 16-bit.
-
-.. testcode::
- :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
-
- # turn on 16-bit
- trainer = Trainer(precision=16, gpus=1)
-
-Apex 16-bit
-^^^^^^^^^^^
-If you are using an earlier version of PyTorch Lightning uses Apex to support 16-bit.
-
-Follow these instructions to install Apex.
-To use 16-bit precision, do two things:
-
-1. Install Apex
-2. Set the "precision" trainer flag.
-
-.. code-block:: bash
-
- # ------------------------
- # OPTIONAL: on your cluster you might need to load CUDA 10 or 9
- # depending on how you installed PyTorch
-
- # see available modules
- module avail
-
- # load correct CUDA before install
- module load cuda-10.0
- # ------------------------
-
- # make sure you've loaded a cuda version > 4.0 and < 7.0
- module load gcc-6.1.0
-
- $ pip install --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" https://github.com/NVIDIA/apex
-
-.. warning:: NVIDIA Apex and DDP have instability problems. We recommend native 16-bit in PyTorch 1.6+
-
-Enable 16-bit
-^^^^^^^^^^^^^
-
-.. testcode::
- :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
-
- # turn on 16-bit
- trainer = Trainer(amp_level='O2', precision=16)
-
-If you need to configure the apex init for your particular use case or want to use a different way of doing
-16-bit training, override :meth:`pytorch_lightning.core.LightningModule.configure_apex`.
-
-----------
-
-TPU 16-bit
-----------
-16-bit on TPUs is much simpler. To use 16-bit with TPUs set precision to 16 when using the TPU flag
-
-.. testcode::
- :skipif: not _TPU_AVAILABLE
-
- # DEFAULT
- trainer = Trainer(tpu_cores=8, precision=32)
-
- # turn on 16-bit
- trainer = Trainer(tpu_cores=8, precision=16)
diff --git a/docs/source/benchmarking/performance.rst b/docs/source/benchmarking/performance.rst
deleted file mode 100644
index 6e2b546fb275f..0000000000000
--- a/docs/source/benchmarking/performance.rst
+++ /dev/null
@@ -1,183 +0,0 @@
-.. _performance:
-
-Fast performance tips
-=====================
-Lightning builds in all the micro-optimizations we can find to increase your performance.
-But we can only automate so much.
-
-Here are some additional things you can do to increase your performance.
-
-----------
-
-Dataloaders
------------
-When building your DataLoader set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs).
-
-.. code-block:: python
-
- Dataloader(dataset, num_workers=8, pin_memory=True)
-
-num_workers
-^^^^^^^^^^^
-The question of how many ``num_workers`` is tricky. Here's a summary of
-some references, [`1 `_], and our suggestions.
-
-1. ``num_workers=0`` means ONLY the main process will load batches (that can be a bottleneck).
-2. ``num_workers=1`` means ONLY one worker (just not the main process) will load data but it will still be slow.
-3. The ``num_workers`` depends on the batch size and your machine.
-4. A general place to start is to set ``num_workers`` equal to the number of CPUs on that machine.
-
-.. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption.
-
-The best thing to do is to increase the ``num_workers`` slowly and stop once you see no more improvement in your training speed.
-
-Spawn
-^^^^^
-When using ``accelerator=ddp_spawn`` (the ddp default) or TPU training, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood.
-The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason we recommend you
-use ``accelerator=ddp`` so you can increase the ``num_workers``, however your script has to be callable like so:
-
-.. code-block:: bash
-
- python my_program.py --gpus X
-
-----------
-
-.item(), .numpy(), .cpu()
--------------------------
-Don't call ``.item()`` anywhere in your code. Use ``.detach()`` instead to remove the connected graph calls. Lightning
-takes a great deal of care to be optimized for this.
-
-----------
-
-empty_cache()
--------------
-Don't call this unnecessarily! Every time you call this ALL your GPUs have to wait to sync.
-
-----------
-
-Construct tensors directly on the device
-----------------------------------------
-LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer.
-
-.. code-block:: python
-
- # bad
- t = torch.rand(2, 2).cuda()
-
- # good (self is LightningModule)
- t = torch.rand(2, 2, device=self.device)
-
-
-For tensors that need to be model attributes, it is best practice to register them as buffers in the modules's
-``__init__`` method:
-
-.. code-block:: python
-
- # bad
- self.t = torch.rand(2, 2, device=self.device)
-
- # good
- self.register_buffer("t", torch.rand(2, 2))
-
-----------
-
-Use DDP not DP
---------------
-DP performs three GPU transfers for EVERY batch:
-
-1. Copy model to device.
-2. Copy data to device.
-3. Copy outputs of each device back to master.
-
-|
-
-Whereas DDP only performs 1 transfer to sync gradients. Because of this, DDP is MUCH faster than DP.
-
-When using DDP set find_unused_parameters=False
------------------------------------------------
-
-By default we have enabled find unused parameters to True. This is for compatibility issues that have arisen in the past (see the `discussion `_ for more information).
-This by default comes with a performance hit, and can be disabled in most cases.
-
-.. code-block:: python
-
- from pytorch_lightning.plugins import DDPPlugin
-
- trainer = pl.Trainer(
- gpus=2,
- plugins=DDPPlugin(find_unused_parameters=False),
- )
-
-----------
-
-16-bit precision
-----------------
-Use 16-bit to decrease the memory consumption (and thus increase your batch size). On certain GPUs (V100s, 2080tis), 16-bit calculations are also faster.
-However, know that 16-bit and multi-processing (any DDP) can have issues. Here are some common problems.
-
-1. `CUDA error: an illegal memory access was encountered `_.
- The solution is likely setting a specific CUDA, CUDNN, PyTorch version combination.
-2. ``CUDA error: device-side assert triggered``. This is a general catch-all error. To see the actual error run your script like so:
-
-.. code-block:: bash
-
- # won't see what the error is
- python main.py
-
- # will see what the error is
- CUDA_LAUNCH_BLOCKING=1 python main.py
-
-.. tip:: We also recommend using 16-bit native found in PyTorch 1.6. Just install this version and Lightning will automatically use it.
-
-----------
-
-Advanced GPU Optimizations
---------------------------
-
-When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling.
-Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`.
-
-----------
-
-Preload Data Into RAM
----------------------
-
-When your training or preprocessing requires many operations to be performed on entire dataset(s) it can
-sometimes be beneficial to store all data in RAM given there is enough space.
-However, loading all data at the beginning of the training script has the disadvantage that it can take a long
-time and hence it slows down the development process. Another downside is that in multiprocessing (e.g. DDP)
-the data would get copied in each process.
-One can overcome these problems by copying the data into RAM in advance.
-Most UNIX-based operating systems provide direct access to tmpfs through a mount point typically named ``/dev/shm``.
-
-0. Increase shared memory if necessary. Refer to the documentation of your OS how to do this.
-
-1. Copy training data to shared memory:
-
- .. code-block:: bash
-
- cp -r /path/to/data/on/disk /dev/shm/
-
-2. Refer to the new data root in your script or command line arguments:
-
- .. code-block:: python
-
- datamodule = MyDataModule(data_root="/dev/shm/my_data")
-
-----------
-
-Zero Grad ``set_to_none=True``
-------------------------------
-
-In order to modestly improve performance, you can override :meth:`~pytorch_lightning.core.lightning.LightningModule.optimizer_zero_grad`.
-
-For a more detailed explanation of pros / cons of this technique,
-read `this `_ documentation by the PyTorch team.
-
-.. testcode::
-
- class Model(LightningModule):
-
- def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx):
- optimizer.zero_grad(set_to_none=True)
diff --git a/docs/source/common/fast_training.rst b/docs/source/common/fast_training.rst
deleted file mode 100644
index 2216d234836f2..0000000000000
--- a/docs/source/common/fast_training.rst
+++ /dev/null
@@ -1,82 +0,0 @@
-.. testsetup:: *
-
- from pytorch_lightning.trainer.trainer import Trainer
-
-.. _fast_training:
-
-Fast Training
-=============
-There are multiple options to speed up different parts of the training by choosing to train
-on a subset of data. This could be done for speed or debugging purposes.
-
-----------------
-
-Check validation every n epochs
--------------------------------
-If you have a small dataset you might want to check validation every n epochs
-
-.. testcode::
-
- # DEFAULT
- trainer = Trainer(check_val_every_n_epoch=1)
-
-----------------
-
-Force training for min or max epochs
-------------------------------------
-It can be useful to force training for a minimum number of epochs or limit to a max number.
-
-.. seealso::
- :class:`~pytorch_lightning.trainer.trainer.Trainer`
-
-.. testcode::
-
- # DEFAULT
- trainer = Trainer(min_epochs=1, max_epochs=1000)
-
-----------------
-
-Set validation check frequency within 1 training epoch
-------------------------------------------------------
-For large datasets it's often desirable to check validation multiple times within a training loop.
-Pass in a float to check that often within 1 training epoch. Pass in an int `k` to check every `k` training batches.
-Must use an `int` if using an `IterableDataset`.
-
-.. testcode::
-
- # DEFAULT
- trainer = Trainer(val_check_interval=0.95)
-
- # check every .25 of an epoch
- trainer = Trainer(val_check_interval=0.25)
-
- # check every 100 train batches (ie: for `IterableDatasets` or fixed frequency)
- trainer = Trainer(val_check_interval=100)
-
-----------------
-
-Use data subset for training, validation, and test
---------------------------------------------------
-If you don't want to check 100% of the training/validation/test set (for debugging or if it's huge), set these flags.
-
-.. testcode::
-
- # DEFAULT
- trainer = Trainer(
- limit_train_batches=1.0,
- limit_val_batches=1.0,
- limit_test_batches=1.0
- )
-
- # check 10%, 20%, 30% only, respectively for training, validation and test set
- trainer = Trainer(
- limit_train_batches=0.1,
- limit_val_batches=0.2,
- limit_test_batches=0.3
- )
-
-If you also pass ``shuffle=True`` to the dataloader, a different random subset of your dataset will be used for each epoch; otherwise the same subset will be used for all epochs.
-
-.. note:: ``limit_train_batches``, ``limit_val_batches`` and ``limit_test_batches`` will be overwritten by ``overfit_batches`` if ``overfit_batches`` > 0. ``limit_val_batches`` will be ignored if ``fast_dev_run=True``.
-
-.. note:: If you set ``limit_val_batches=0``, validation will be disabled.
diff --git a/docs/source/common/optimizers.rst b/docs/source/common/optimizers.rst
index 12e9c6925e7fd..cde203fdd193e 100644
--- a/docs/source/common/optimizers.rst
+++ b/docs/source/common/optimizers.rst
@@ -232,88 +232,6 @@ If you want to call ``lr_scheduler.step()`` every ``n`` steps/epochs, do the fol
-----
-Improve training speed with model toggling
-------------------------------------------
-Toggling models can improve your training speed when performing gradient accumulation with multiple optimizers in a
-distributed setting.
-
-Here is an explanation of what it does:
-
-* Considering the current optimizer as A and all other optimizers as B.
-* Toggling means that all parameters from B exclusive to A will have their ``requires_grad`` attribute set to ``False``.
-* Their original state will be restored when exiting the context manager.
-
-When performing gradient accumulation, there is no need to perform grad synchronization during the accumulation phase.
-Setting ``sync_grad`` to ``False`` will block this synchronization and improve your training speed.
-
-:class:`~pytorch_lightning.core.optimizer.LightningOptimizer` provides a
-:meth:`~pytorch_lightning.core.optimizer.LightningOptimizer.toggle_model` function as a
-:func:`contextlib.contextmanager` for advanced users.
-
-Here is an example for advanced use-case.
-
-.. testcode:: python
-
- # Scenario for a GAN with gradient accumulation every 2 batches and optimized for multiple gpus.
- class SimpleGAN(LightningModule):
-
- def __init__(self):
- super().__init__()
- self.automatic_optimization = False
-
- def training_step(self, batch, batch_idx):
- # Implementation follows the PyTorch tutorial:
- # https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html
- g_opt, d_opt = self.optimizers()
-
- X, _ = batch
- X.requires_grad = True
- batch_size = X.shape[0]
-
- real_label = torch.ones((batch_size, 1), device=self.device)
- fake_label = torch.zeros((batch_size, 1), device=self.device)
-
- # Sync and clear gradients
- # at the end of accumulation or
- # at the end of an epoch.
- is_last_batch_to_accumulate = \
- (batch_idx + 1) % 2 == 0 or self.trainer.is_last_batch
-
- g_X = self.sample_G(batch_size)
-
- ##########################
- # Optimize Discriminator #
- ##########################
- with d_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
- d_x = self.D(X)
- errD_real = self.criterion(d_x, real_label)
-
- d_z = self.D(g_X.detach())
- errD_fake = self.criterion(d_z, fake_label)
-
- errD = (errD_real + errD_fake)
-
- self.manual_backward(errD)
- if is_last_batch_to_accumulate:
- d_opt.step()
- d_opt.zero_grad()
-
- ######################
- # Optimize Generator #
- ######################
- with g_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
- d_z = self.D(g_X)
- errG = self.criterion(d_z, real_label)
-
- self.manual_backward(errG)
- if is_last_batch_to_accumulate:
- g_opt.step()
- g_opt.zero_grad()
-
- self.log_dict({'g_loss': errG, 'd_loss': errD}, prog_bar=True)
-
------
-
Use closure for LBFGS-like optimizers
-------------------------------------
It is a good practice to provide the optimizer with a closure function that performs a ``forward``, ``zero_grad`` and
diff --git a/docs/source/common/trainer.rst b/docs/source/common/trainer.rst
index ea32ea3dd55dc..0983f0acb9eec 100644
--- a/docs/source/common/trainer.rst
+++ b/docs/source/common/trainer.rst
@@ -196,6 +196,8 @@ unique seeds across all dataloader workers and processes for :mod:`torch`, :mod:
-------
+.. _trainer_flags:
+
Trainer flags
-------------
@@ -658,6 +660,8 @@ Writes logs to disk this often.
See Also:
- :doc:`logging <../extensions/logging>`
+.. _gpus:
+
gpus
^^^^
@@ -1155,28 +1159,69 @@ precision
|
-Double precision (64), full precision (32) or half precision (16).
-Can all be used on GPU or TPUs. Only double (64) and full precision (32) available on CPU.
+Lightning supports either double precision (64), full precision (32), or half precision (16) training.
-If used on TPU will use torch.bfloat16 but tensor printing
-will still show torch.float32.
+Half precision, or mixed precision, combines 32-bit and 16-bit floating point operations to reduce the memory footprint during model training. This can improve performance, achieving upwards of 3x speedups on modern GPUs.
.. testcode::
:skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
# default used by the Trainer
- trainer = Trainer(precision=32)
+ trainer = Trainer(precision=32, gpus=1)
# 16-bit precision
trainer = Trainer(precision=16, gpus=1)
# 64-bit precision
- trainer = Trainer(precision=64)
+ trainer = Trainer(precision=64, gpus=1)
+
+
+.. note:: When running on TPUs, torch.bfloat16 will be used under the hood, but tensor printing will still show torch.float32.
+
+.. note:: 16-bit precision is not supported on CPUs.
+
+
+.. admonition:: When using PyTorch 1.6+, Lightning uses the native AMP implementation to support 16-bit precision. 16-bit precision with PyTorch < 1.6 is supported by NVIDIA Apex library.
+ :class: dropdown, warning
+
+ NVIDIA Apex and DDP have instability problems. We recommend upgrading to PyTorch 1.6+ in order to use the native AMP 16-bit precision with multiple GPUs.
+
+    If you are using an earlier version of PyTorch (before 1.6), Lightning uses `Apex <https://github.com/NVIDIA/apex>`_ to support 16-bit training.
+
+ To use Apex 16-bit training:
+
+ 1. Install Apex
+
+ .. code-block:: bash
+
+ # ------------------------
+ # OPTIONAL: on your cluster you might need to load CUDA 10 or 9
+ # depending on how you installed PyTorch
+
+ # see available modules
+ module avail
+
+ # load correct CUDA before install
+ module load cuda-10.0
+ # ------------------------
+
+ # make sure you've loaded a GCC version > 4.0 and < 7.0
+ module load gcc-6.1.0
+
+ pip install --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" https://github.com/NVIDIA/apex
+
+ 2. Set the `precision` trainer flag to 16. You can customize the `Apex optimization level `_ by setting the `amp_level` flag.
+
+ .. testcode::
+ :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
+
+ # turn on 16-bit
+ trainer = Trainer(amp_backend="apex", amp_level='O2', precision=16)
+
+ If you need to configure the apex init for your particular use case, or want to customize the
+ 16-bit training behaviour, override :meth:`pytorch_lightning.core.LightningModule.configure_apex`.
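+
+    A minimal sketch of such an override (the ``amp.initialize`` arguments shown here are illustrative, not required):
+
+    .. code-block:: python
+
+        class MyApexModel(LightningModule):
+            def configure_apex(self, amp, model, optimizers, amp_level):
+                # run the Apex initialization yourself to pass custom arguments,
+                # e.g. a fixed loss scale instead of the dynamic default
+                model, optimizers = amp.initialize(model, optimizers, opt_level=amp_level, loss_scale=128.0)
+                return model, optimizers
+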
-Example::
- # one day
- trainer = Trainer(precision=8|4|2)
process_position
^^^^^^^^^^^^^^^^
@@ -1378,6 +1423,8 @@ track_grad_norm
# track the 2-norm
trainer = Trainer(track_grad_norm=2)
+.. _tpu_cores:
+
tpu_cores
^^^^^^^^^
diff --git a/docs/source/guides/speed.rst b/docs/source/guides/speed.rst
new file mode 100644
index 0000000000000..ece806558c76c
--- /dev/null
+++ b/docs/source/guides/speed.rst
@@ -0,0 +1,482 @@
+.. testsetup:: *
+
+ from pytorch_lightning.trainer.trainer import Trainer
+ from pytorch_lightning.callbacks.early_stopping import EarlyStopping
+ from pytorch_lightning.core.lightning import LightningModule
+
+.. _speed:
+
+#######################
+Speed up model training
+#######################
+
+There are multiple ways you can speed up your model's time to convergence:
+
+* `GPU/TPU training <#gpu-tpu-training>`_
+
+* `Mixed precision (16-bit) training <#mixed-precision-16-bit-training>`_
+
+* `Control Training Epochs <#control-training-epochs>`_
+
+* `Control Validation Frequency <#control-validation-frequency>`_
+
+* `Limit Dataset Size <#limit-dataset-size>`_
+
+* `Preload Data Into RAM <#preload-data-into-ram>`_
+
+* `Model Toggling <#model-toggling>`_
+
+* `Set Grads to None <#set-grads-to-none>`_
+
+* `Things to avoid <#things-to-avoid>`_
+
+****************
+GPU/TPU training
+****************
+
+**Use when:** Whenever possible!
+
+With Lightning, running on GPUs, TPUs, or multiple nodes is a simple switch of a flag.
+
+GPU training
+============
+
+Lightning supports a variety of plugins to further speed up distributed GPU training. Most notably:
+
+* :class:`~pytorch_lightning.plugins.training_type.DDPPlugin`
+* :class:`~pytorch_lightning.plugins.training_type.DDPShardedPlugin`
+* :class:`~pytorch_lightning.plugins.training_type.DeepSpeedPlugin`
+
+.. code-block:: python
+
+ # run on 1 gpu
+ trainer = Trainer(gpus=1)
+
+ # train on 8 gpus, using DDP plugin
+ trainer = Trainer(gpus=8, accelerator="ddp")
+
+ # train on multiple GPUs across nodes (uses 8 gpus in total)
+ trainer = Trainer(gpus=2, num_nodes=4)
+
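+A minimal sketch of enabling one of these plugins by its string alias (this assumes the optional dependencies, e.g. fairscale for sharded training or deepspeed, are installed):
+
+.. code-block:: python
+
+    # shard optimizer state and gradients across GPUs (fairscale)
+    trainer = Trainer(gpus=8, plugins="ddp_sharded")
+
+    # use DeepSpeed ZeRO optimizations for very large models
+    trainer = Trainer(gpus=8, plugins="deepspeed", precision=16)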
+
+GPU Training Speedup Tips
+-------------------------
+
+When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling.
+Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`.
+
+Prefer DDP over DP
+^^^^^^^^^^^^^^^^^^
+:class:`~pytorch_lightning.plugins.training_type.DataParallelPlugin` performs three GPU transfers for EVERY batch:
+
+1. Copy model to device.
+2. Copy data to device.
+3. Copy outputs of each device back to master.
+
+Whereas :class:`~pytorch_lightning.plugins.training_type.DDPPlugin` only performs 1 transfer to sync gradients, making DDP MUCH faster than DP.
+
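+A minimal sketch of the two modes side by side (2 GPUs assumed):
+
+.. code-block:: python
+
+    # DataParallel: splits every batch across GPUs and gathers the outputs on the master device
+    trainer = Trainer(gpus=2, accelerator="dp")
+
+    # DistributedDataParallel: one process per GPU, only gradients are synchronized
+    trainer = Trainer(gpus=2, accelerator="ddp")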
+
+When using DDP set find_unused_parameters=False
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+By default we have set ``find_unused_parameters`` to True for compatibility issues that have arisen in the past (see the `discussion `_ for more information).
+This by default comes with a performance hit, and can be disabled in most cases.
+
+.. code-block:: python
+
+ from pytorch_lightning.plugins import DDPPlugin
+
+ trainer = pl.Trainer(
+ gpus=2,
+ plugins=DDPPlugin(find_unused_parameters=False),
+ )
+
+Dataloaders
+^^^^^^^^^^^
+When building your DataLoader set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs).
+
+.. code-block:: python
+
+ Dataloader(dataset, num_workers=8, pin_memory=True)
+
+num_workers
+"""""""""""
+
+The question of how many workers to specify in ``num_workers`` is tricky. Here's a summary of
+some references, [`1 `_], and our suggestions:
+
+1. ``num_workers=0`` means ONLY the main process will load batches (that can be a bottleneck).
+2. ``num_workers=1`` means ONLY one worker (just not the main process) will load data but it will still be slow.
+3. The ``num_workers`` depends on the batch size and your machine.
+4. A general place to start is to set ``num_workers`` equal to the number of CPU cores on that machine. You can get the number of CPU cores in Python with ``os.cpu_count()``, but note that, depending on your batch size, you may run out of RAM (see the sketch below).
+
+.. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption.
+
+The best thing to do is to increase the ``num_workers`` slowly and stop once you see no more improvement in your training speed.
+
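+For example, a minimal sketch of a DataLoader configured along these lines (``dataset`` stands in for your own ``Dataset`` instance):
+
+.. code-block:: python
+
+    import os
+
+    from torch.utils.data import DataLoader
+
+    # start with one worker per CPU core and tune from there
+    train_loader = DataLoader(dataset, batch_size=64, num_workers=os.cpu_count(), pin_memory=True)
+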
+Spawn
+"""""
+When using ``accelerator=ddp_spawn`` or training on TPUs, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood.
+The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason we recommend you
+use ``accelerator=ddp`` so you can increase the ``num_workers``; however, your script has to be callable like so:
+
+.. code-block:: bash
+
+ python my_program.py
+
+
+TPU training
+============
+
+You can set the ``tpu_cores`` trainer flag to 1 or 8 cores.
+
+.. code-block:: python
+
+ # train on 1 TPU core
+ trainer = Trainer(tpu_cores=1)
+
+ # train on 8 TPU cores
+ trainer = Trainer(tpu_cores=8)
+
+To train on more than 8 cores (i.e. a TPU pod),
+submit this script using the xla_dist script.
+
+Example::
+
+    python -m torch_xla.distributed.xla_dist \
+    --tpu=$TPU_POD_NAME \
+    --conda-env=torch-xla-nightly \
+    --env=XLA_USE_BF16=1 \
+    -- python your_trainer_file.py
+
+
+Read more in our :ref:`accelerators` and :ref:`plugins` guides.
+
+
+-----------
+
+.. _amp:
+
+*********************************
+Mixed precision (16-bit) training
+*********************************
+
+**Use when:**
+
+* You want to optimize for memory usage on a GPU.
+* You have a GPU that supports 16-bit precision (NVIDIA Pascal architecture or newer).
+* Your optimization algorithm (training_step) is numerically stable.
+* You want to be the cool person in the lab :p
+
+
+
+Mixed precision combines 32-bit and 16-bit floating point operations to reduce the memory footprint during model training, which can improve performance and achieve upwards of 3x speedups on modern GPUs.
+
+Lightning offers mixed precision or 16-bit training for GPUs and TPUs.
+
+
+.. testcode::
+ :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
+
+ # 16-bit precision
+ trainer = Trainer(precision=16, gpus=4)
+
+
+----------------
+
+
+***********************
+Control Training Epochs
+***********************
+
+**Use when:** You run a hyperparameter search to find good initial parameters and want to save time, cost (money), or power (environment).
+Limiting training length makes you more cost efficient and lets you run more experiments at the same time.
+
+Use the ``min_epochs`` and ``max_epochs`` Trainer flags to force training for a minimum number of epochs or to stop it after a maximum number of epochs.
+
+.. testcode::
+
+ # DEFAULT
+ trainer = Trainer(min_epochs=1, max_epochs=1000)
+
+
+If you run iteration-based training, i.e. with an infinite or iterable dataloader, you can also control the number of steps with the ``min_steps`` and ``max_steps`` flags:
+
+.. testcode::
+
+ trainer = Trainer(max_steps=1000)
+
+ trainer = Trainer(min_steps=100)
+
+You can also interrupt training based on elapsed training time:
+
+.. testcode::
+
+ # Stop after 12 hours of training or when reaching 10 epochs (string)
+ trainer = Trainer(max_time="00:12:00:00", max_epochs=10)
+
+ # Stop after 1 day and 5 hours (dict)
+ trainer = Trainer(max_time={"days": 1, "hours": 5})
+
+Learn more in our :ref:`trainer_flags` guide.
+
+
+----------------
+
+****************************
+Control Validation Frequency
+****************************
+
+Check validation every n epochs
+===============================
+
+**Use when:** You have a small dataset, and want to run fewer validation checks.
+
+You can limit validation checks to run only every n epochs using the ``check_val_every_n_epoch`` Trainer flag.
+
+.. testcode::
+
+ # DEFAULT
+ trainer = Trainer(check_val_every_n_epoch=1)
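+
+    # run validation only every 10 epochs
+    trainer = Trainer(check_val_every_n_epoch=10)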
+
+
+Set validation check frequency within 1 training epoch
+======================================================
+
+**Use when:** You have a large training dataset, and want to run mid-epoch validation checks.
+
+For large datasets, it's often desirable to check validation multiple times within a training loop.
+Pass in a float to check that often within 1 training epoch. Pass in an int `k` to check every `k` training batches.
+Must use an `int` if using an `IterableDataset`.
+
+.. testcode::
+
+ # DEFAULT
+ trainer = Trainer(val_check_interval=0.95)
+
+ # check every .25 of an epoch
+ trainer = Trainer(val_check_interval=0.25)
+
+ # check every 100 train batches (ie: for `IterableDatasets` or fixed frequency)
+ trainer = Trainer(val_check_interval=100)
+
+Learn more in our :ref:`trainer_flags` guide.
+
+----------------
+
+******************
+Limit Dataset Size
+******************
+
+Use data subset for training, validation, and test
+==================================================
+
+**Use when:** Debugging or running huge datasets.
+
+If you don't want to check 100% of the training/validation/test set, set these flags:
+
+.. testcode::
+
+ # DEFAULT
+ trainer = Trainer(
+ limit_train_batches=1.0,
+ limit_val_batches=1.0,
+ limit_test_batches=1.0
+ )
+
+ # check 10%, 20%, 30% only, respectively for training, validation and test set
+ trainer = Trainer(
+ limit_train_batches=0.1,
+ limit_val_batches=0.2,
+ limit_test_batches=0.3
+ )
+
+If you also pass ``shuffle=True`` to the dataloader, a different random subset of your dataset will be used for each epoch; otherwise the same subset will be used for all epochs.
+
+.. note:: ``limit_train_batches``, ``limit_val_batches`` and ``limit_test_batches`` will be overwritten by ``overfit_batches`` if ``overfit_batches`` > 0. ``limit_val_batches`` will be ignored if ``fast_dev_run=True``.
+
+.. note:: If you set ``limit_val_batches=0``, validation will be disabled.
+
+Learn more in our :ref:`trainer_flags` guide.
+
+-----
+
+*********************
+Preload Data Into RAM
+*********************
+
+**Use when:** You need access to all samples in a dataset at once.
+
+When your training or preprocessing requires many operations to be performed on entire dataset(s), it can
+sometimes be beneficial to store all data in RAM given there is enough space.
+However, loading all data at the beginning of the training script has the disadvantage that it can take a long
+time and hence it slows down the development process. Another downside is that in multiprocessing (e.g. DDP)
+the data would get copied in each process.
+One can overcome these problems by copying the data into RAM in advance.
+Most UNIX-based operating systems provide direct access to tmpfs through a mount point typically named ``/dev/shm``.
+
+0. Increase shared memory if necessary. Refer to the documentation of your OS for how to do this.
+
+1. Copy training data to shared memory:
+
+ .. code-block:: bash
+
+ cp -r /path/to/data/on/disk /dev/shm/
+
+2. Refer to the new data root in your script or command line arguments:
+
+ .. code-block:: python
+
+ datamodule = MyDataModule(data_root="/dev/shm/my_data")
+
+---------
+
+**************
+Model Toggling
+**************
+
+**Use when:** Performing gradient accumulation with multiple optimizers in a
+distributed setting.
+
+Here is an explanation of what it does:
+
+* Consider the current optimizer as A and all other optimizers as B.
+* Toggling means that all parameters from B exclusive to A will have their ``requires_grad`` attribute set to ``False``.
+* Their original state will be restored when exiting the context manager.
+
+When performing gradient accumulation, there is no need to perform grad synchronization during the accumulation phase.
+Setting ``sync_grad`` to ``False`` will block this synchronization and improve your training speed.
+
+:class:`~pytorch_lightning.core.optimizer.LightningOptimizer` provides a
+:meth:`~pytorch_lightning.core.optimizer.LightningOptimizer.toggle_model` function as a
+:func:`contextlib.contextmanager` for advanced users.
+
+Here is an example of an advanced use case:
+
+.. testcode::
+
+    # Scenario for a GAN with gradient accumulation every 2 batches and optimized for multiple gpus.
+    class SimpleGAN(LightningModule):
+
+        def __init__(self):
+            super().__init__()
+            self.automatic_optimization = False
+
+        def training_step(self, batch, batch_idx):
+            # Implementation follows the PyTorch tutorial:
+            # https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html
+            g_opt, d_opt = self.optimizers()
+
+            X, _ = batch
+            X.requires_grad = True
+            batch_size = X.shape[0]
+
+            real_label = torch.ones((batch_size, 1), device=self.device)
+            fake_label = torch.zeros((batch_size, 1), device=self.device)
+
+            # Sync and clear gradients
+            # at the end of accumulation or
+            # at the end of an epoch.
+            is_last_batch_to_accumulate = \
+                (batch_idx + 1) % 2 == 0 or self.trainer.is_last_batch
+
+            g_X = self.sample_G(batch_size)
+
+            ##########################
+            # Optimize Discriminator #
+            ##########################
+            with d_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
+                d_x = self.D(X)
+                errD_real = self.criterion(d_x, real_label)
+
+                d_z = self.D(g_X.detach())
+                errD_fake = self.criterion(d_z, fake_label)
+
+                errD = (errD_real + errD_fake)
+
+                self.manual_backward(errD)
+                if is_last_batch_to_accumulate:
+                    d_opt.step()
+                    d_opt.zero_grad()
+
+            ######################
+            # Optimize Generator #
+            ######################
+            with g_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
+                d_z = self.D(g_X)
+                errG = self.criterion(d_z, real_label)
+
+                self.manual_backward(errG)
+                if is_last_batch_to_accumulate:
+                    g_opt.step()
+                    g_opt.zero_grad()
+
+            self.log_dict({'g_loss': errG, 'd_loss': errD}, prog_bar=True)
+
+-----
+
+*****************
+Set Grads to None
+*****************
+
+In order to modestly improve performance, you can override :meth:`~pytorch_lightning.core.lightning.LightningModule.optimizer_zero_grad`.
+
+For a more detailed explanation of pros / cons of this technique,
+read `this `_ documentation by the PyTorch team.
+
+.. testcode::
+
+    class Model(LightningModule):
+
+        def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx):
+            optimizer.zero_grad(set_to_none=True)
+
+
+-----
+
+***************
+Things to avoid
+***************
+
+.item(), .numpy(), .cpu()
+=========================
+Don't call ``.item()`` anywhere in your code. Use ``.detach()`` instead to remove the connected graph calls. Lightning
+takes a great deal of care to be optimized for this.
+
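+A minimal sketch of the pattern (``loss`` and ``running_loss`` are illustrative names for a tensor produced inside ``training_step``):
+
+.. code-block:: python
+
+    # bad: .item() forces a device -> CPU transfer and a synchronization point
+    running_loss += loss.item()
+
+    # good: keep the value on the device, detached from the autograd graph
+    running_loss += loss.detach()
+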
+----------
+
+empty_cache()
+=============
+Don't call this unnecessarily! Every time you call this ALL your GPUs have to wait to sync.
+
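+In other words, avoid sprinkling calls like this through your training loop:
+
+.. code-block:: python
+
+    # bad: forces all GPUs to synchronize on every call and rarely frees useful memory
+    torch.cuda.empty_cache()
+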
+----------
+
+Transferring tensors to device
+==============================
+LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer.
+
+.. code-block:: python
+
+ # bad
+ t = torch.rand(2, 2).cuda()
+
+ # good (self is LightningModule)
+ t = torch.rand(2, 2, device=self.device)
+
+
+For tensors that need to be model attributes, it is best practice to register them as buffers in the module's
+``__init__`` method:
+
+.. code-block:: python
+
+ # bad
+ self.t = torch.rand(2, 2, device=self.device)
+
+ # good
+ self.register_buffer("t", torch.rand(2, 2))
diff --git a/docs/source/index.rst b/docs/source/index.rst
index e7d0030e0e6f6..61abc2b010834 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -21,8 +21,8 @@ PyTorch Lightning Documentation
:name: guides
:caption: Best practices
+ guides/speed
starter/style_guide
- benchmarking/performance
Lightning project template
benchmarking/benchmarks
@@ -98,12 +98,10 @@ PyTorch Lightning Documentation
clouds/cloud_training
clouds/cluster
- advanced/amp
common/child_modules
common/debugging
common/loggers
common/early_stopping
- common/fast_training
common/hyperparameters
common/lightning_cli
advanced/lr_finder
diff --git a/docs/source/starter/new-project.rst b/docs/source/starter/new-project.rst
index 74ad30102b4f8..07bf3624560a0 100644
--- a/docs/source/starter/new-project.rst
+++ b/docs/source/starter/new-project.rst
@@ -219,7 +219,7 @@ The :class:`~pytorch_lightning.trainer.Trainer` automates:
* Tensorboard (see :doc:`loggers <../common/loggers>` options)
* :doc:`Multi-GPU <../advanced/multi_gpu>` support
* :doc:`TPU <../advanced/tpu>`
-* :doc:`AMP <../advanced/amp>` support
+* :ref:`16-bit precision AMP <amp>` support
.. tip:: If you prefer to manually manage optimizers you can use the :ref:`manual_opt` mode (ie: RL, GANs, etc...).