From 5647087f03214b208e31ae8c749d120d9c15d2df Mon Sep 17 00:00:00 2001 From: edenlightning <66261195+edenlightning@users.noreply.github.com> Date: Wed, 16 Jun 2021 17:28:51 -0400 Subject: [PATCH] New speed documentation (#7665) * amp * amp * docs * add guides * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * amp * amp * docs * add guides * speed guides * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Delete ds.txt * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update conf.py * Update docs.txt * remove 16 bit * remove finetune from speed guide * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * speed * speed * speed * speed * speed * speed * speed * speed * speed * speed * speed * speed * remove early stopping from speed guide * remove early stopping from speed guide * remove early stopping from speed guide * fix label * fix sync * reviews * Update trainer.rst * Update trainer.rst * Update speed.rst Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> --- docs/source/advanced/amp.rst | 94 ----- docs/source/benchmarking/performance.rst | 183 --------- docs/source/common/fast_training.rst | 82 ---- docs/source/common/optimizers.rst | 82 ---- docs/source/common/trainer.rst | 65 ++- docs/source/guides/speed.rst | 482 +++++++++++++++++++++++ docs/source/index.rst | 4 +- docs/source/starter/new-project.rst | 2 +- 8 files changed, 540 insertions(+), 454 deletions(-) delete mode 100644 docs/source/advanced/amp.rst delete mode 100644 docs/source/benchmarking/performance.rst delete mode 100644 docs/source/common/fast_training.rst create mode 100644 docs/source/guides/speed.rst diff --git a/docs/source/advanced/amp.rst b/docs/source/advanced/amp.rst deleted file mode 100644 index 2c25f9e7f918f..0000000000000 --- a/docs/source/advanced/amp.rst +++ /dev/null @@ -1,94 +0,0 @@ -.. testsetup:: * - - from pytorch_lightning.trainer.trainer import Trainer - -.. _amp: - -16-bit training -================= -Lightning offers 16-bit training for CPUs, GPUs, and TPUs. - -.. raw:: html - - - -| - - ----------- - -GPU 16-bit ----------- -16-bit precision can cut your memory footprint by half. -If using volta architecture GPUs it can give a dramatic training speed-up as well. - -.. note:: PyTorch 1.6+ is recommended for 16-bit - -Native torch -^^^^^^^^^^^^ -When using PyTorch 1.6+ Lightning uses the native amp implementation to support 16-bit. - -.. testcode:: - :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available() - - # turn on 16-bit - trainer = Trainer(precision=16, gpus=1) - -Apex 16-bit -^^^^^^^^^^^ -If you are using an earlier version of PyTorch Lightning uses Apex to support 16-bit. - -Follow these instructions to install Apex. -To use 16-bit precision, do two things: - -1. Install Apex -2. Set the "precision" trainer flag. - -.. 
code-block:: bash - - # ------------------------ - # OPTIONAL: on your cluster you might need to load CUDA 10 or 9 - # depending on how you installed PyTorch - - # see available modules - module avail - - # load correct CUDA before install - module load cuda-10.0 - # ------------------------ - - # make sure you've loaded a cuda version > 4.0 and < 7.0 - module load gcc-6.1.0 - - $ pip install --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" https://github.com/NVIDIA/apex - -.. warning:: NVIDIA Apex and DDP have instability problems. We recommend native 16-bit in PyTorch 1.6+ - -Enable 16-bit -^^^^^^^^^^^^^ - -.. testcode:: - :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available() - - # turn on 16-bit - trainer = Trainer(amp_level='O2', precision=16) - -If you need to configure the apex init for your particular use case or want to use a different way of doing -16-bit training, override :meth:`pytorch_lightning.core.LightningModule.configure_apex`. - ----------- - -TPU 16-bit ----------- -16-bit on TPUs is much simpler. To use 16-bit with TPUs set precision to 16 when using the TPU flag - -.. testcode:: - :skipif: not _TPU_AVAILABLE - - # DEFAULT - trainer = Trainer(tpu_cores=8, precision=32) - - # turn on 16-bit - trainer = Trainer(tpu_cores=8, precision=16) diff --git a/docs/source/benchmarking/performance.rst b/docs/source/benchmarking/performance.rst deleted file mode 100644 index 6e2b546fb275f..0000000000000 --- a/docs/source/benchmarking/performance.rst +++ /dev/null @@ -1,183 +0,0 @@ -.. _performance: - -Fast performance tips -===================== -Lightning builds in all the micro-optimizations we can find to increase your performance. -But we can only automate so much. - -Here are some additional things you can do to increase your performance. - ----------- - -Dataloaders ------------ -When building your DataLoader set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs). - -.. code-block:: python - - Dataloader(dataset, num_workers=8, pin_memory=True) - -num_workers -^^^^^^^^^^^ -The question of how many ``num_workers`` is tricky. Here's a summary of -some references, [`1 `_], and our suggestions. - -1. ``num_workers=0`` means ONLY the main process will load batches (that can be a bottleneck). -2. ``num_workers=1`` means ONLY one worker (just not the main process) will load data but it will still be slow. -3. The ``num_workers`` depends on the batch size and your machine. -4. A general place to start is to set ``num_workers`` equal to the number of CPUs on that machine. - -.. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption. - -The best thing to do is to increase the ``num_workers`` slowly and stop once you see no more improvement in your training speed. - -Spawn -^^^^^ -When using ``accelerator=ddp_spawn`` (the ddp default) or TPU training, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood. -The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason we recommend you -use ``accelerator=ddp`` so you can increase the ``num_workers``, however your script has to be callable like so: - -.. code-block:: bash - - python my_program.py --gpus X - ----------- - -.item(), .numpy(), .cpu() -------------------------- -Don't call ``.item()`` anywhere in your code. Use ``.detach()`` instead to remove the connected graph calls. Lightning -takes a great deal of care to be optimized for this. 
- ----------- - -empty_cache() -------------- -Don't call this unnecessarily! Every time you call this ALL your GPUs have to wait to sync. - ----------- - -Construct tensors directly on the device ----------------------------------------- -LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer. - -.. code-block:: python - - # bad - t = torch.rand(2, 2).cuda() - - # good (self is LightningModule) - t = torch.rand(2, 2, device=self.device) - - -For tensors that need to be model attributes, it is best practice to register them as buffers in the modules's -``__init__`` method: - -.. code-block:: python - - # bad - self.t = torch.rand(2, 2, device=self.device) - - # good - self.register_buffer("t", torch.rand(2, 2)) - ----------- - -Use DDP not DP --------------- -DP performs three GPU transfers for EVERY batch: - -1. Copy model to device. -2. Copy data to device. -3. Copy outputs of each device back to master. - -| - -Whereas DDP only performs 1 transfer to sync gradients. Because of this, DDP is MUCH faster than DP. - -When using DDP set find_unused_parameters=False ------------------------------------------------ - -By default we have enabled find unused parameters to True. This is for compatibility issues that have arisen in the past (see the `discussion `_ for more information). -This by default comes with a performance hit, and can be disabled in most cases. - -.. code-block:: python - - from pytorch_lightning.plugins import DDPPlugin - - trainer = pl.Trainer( - gpus=2, - plugins=DDPPlugin(find_unused_parameters=False), - ) - ----------- - -16-bit precision ----------------- -Use 16-bit to decrease the memory consumption (and thus increase your batch size). On certain GPUs (V100s, 2080tis), 16-bit calculations are also faster. -However, know that 16-bit and multi-processing (any DDP) can have issues. Here are some common problems. - -1. `CUDA error: an illegal memory access was encountered `_. - The solution is likely setting a specific CUDA, CUDNN, PyTorch version combination. -2. ``CUDA error: device-side assert triggered``. This is a general catch-all error. To see the actual error run your script like so: - -.. code-block:: bash - - # won't see what the error is - python main.py - - # will see what the error is - CUDA_LAUNCH_BLOCKING=1 python main.py - -.. tip:: We also recommend using 16-bit native found in PyTorch 1.6. Just install this version and Lightning will automatically use it. - ----------- - -Advanced GPU Optimizations --------------------------- - -When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling. -Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`. - ----------- - -Preload Data Into RAM ---------------------- - -When your training or preprocessing requires many operations to be performed on entire dataset(s) it can -sometimes be beneficial to store all data in RAM given there is enough space. -However, loading all data at the beginning of the training script has the disadvantage that it can take a long -time and hence it slows down the development process. Another downside is that in multiprocessing (e.g. DDP) -the data would get copied in each process. -One can overcome these problems by copying the data into RAM in advance. -Most UNIX-based operating systems provide direct access to tmpfs through a mount point typically named ``/dev/shm``. - -0. 
Increase shared memory if necessary. Refer to the documentation of your OS how to do this. - -1. Copy training data to shared memory: - - .. code-block:: bash - - cp -r /path/to/data/on/disk /dev/shm/ - -2. Refer to the new data root in your script or command line arguments: - - .. code-block:: python - - datamodule = MyDataModule(data_root="/dev/shm/my_data") - ----------- - -Zero Grad ``set_to_none=True`` ------------------------------- - -In order to modestly improve performance, you can override :meth:`~pytorch_lightning.core.lightning.LightningModule.optimizer_zero_grad`. - -For a more detailed explanation of pros / cons of this technique, -read `this `_ documentation by the PyTorch team. - -.. testcode:: - - class Model(LightningModule): - - def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx): - optimizer.zero_grad(set_to_none=True) diff --git a/docs/source/common/fast_training.rst b/docs/source/common/fast_training.rst deleted file mode 100644 index 2216d234836f2..0000000000000 --- a/docs/source/common/fast_training.rst +++ /dev/null @@ -1,82 +0,0 @@ -.. testsetup:: * - - from pytorch_lightning.trainer.trainer import Trainer - -.. _fast_training: - -Fast Training -============= -There are multiple options to speed up different parts of the training by choosing to train -on a subset of data. This could be done for speed or debugging purposes. - ----------------- - -Check validation every n epochs -------------------------------- -If you have a small dataset you might want to check validation every n epochs - -.. testcode:: - - # DEFAULT - trainer = Trainer(check_val_every_n_epoch=1) - ----------------- - -Force training for min or max epochs ------------------------------------- -It can be useful to force training for a minimum number of epochs or limit to a max number. - -.. seealso:: - :class:`~pytorch_lightning.trainer.trainer.Trainer` - -.. testcode:: - - # DEFAULT - trainer = Trainer(min_epochs=1, max_epochs=1000) - ----------------- - -Set validation check frequency within 1 training epoch ------------------------------------------------------- -For large datasets it's often desirable to check validation multiple times within a training loop. -Pass in a float to check that often within 1 training epoch. Pass in an int `k` to check every `k` training batches. -Must use an `int` if using an `IterableDataset`. - -.. testcode:: - - # DEFAULT - trainer = Trainer(val_check_interval=0.95) - - # check every .25 of an epoch - trainer = Trainer(val_check_interval=0.25) - - # check every 100 train batches (ie: for `IterableDatasets` or fixed frequency) - trainer = Trainer(val_check_interval=100) - ----------------- - -Use data subset for training, validation, and test --------------------------------------------------- -If you don't want to check 100% of the training/validation/test set (for debugging or if it's huge), set these flags. - -.. testcode:: - - # DEFAULT - trainer = Trainer( - limit_train_batches=1.0, - limit_val_batches=1.0, - limit_test_batches=1.0 - ) - - # check 10%, 20%, 30% only, respectively for training, validation and test set - trainer = Trainer( - limit_train_batches=0.1, - limit_val_batches=0.2, - limit_test_batches=0.3 - ) - -If you also pass ``shuffle=True`` to the dataloader, a different random subset of your dataset will be used for each epoch; otherwise the same subset will be used for all epochs. - -.. 
note:: ``limit_train_batches``, ``limit_val_batches`` and ``limit_test_batches`` will be overwritten by ``overfit_batches`` if ``overfit_batches`` > 0. ``limit_val_batches`` will be ignored if ``fast_dev_run=True``. - -.. note:: If you set ``limit_val_batches=0``, validation will be disabled. diff --git a/docs/source/common/optimizers.rst b/docs/source/common/optimizers.rst index 12e9c6925e7fd..cde203fdd193e 100644 --- a/docs/source/common/optimizers.rst +++ b/docs/source/common/optimizers.rst @@ -232,88 +232,6 @@ If you want to call ``lr_scheduler.step()`` every ``n`` steps/epochs, do the fol ----- -Improve training speed with model toggling ------------------------------------------- -Toggling models can improve your training speed when performing gradient accumulation with multiple optimizers in a -distributed setting. - -Here is an explanation of what it does: - -* Considering the current optimizer as A and all other optimizers as B. -* Toggling means that all parameters from B exclusive to A will have their ``requires_grad`` attribute set to ``False``. -* Their original state will be restored when exiting the context manager. - -When performing gradient accumulation, there is no need to perform grad synchronization during the accumulation phase. -Setting ``sync_grad`` to ``False`` will block this synchronization and improve your training speed. - -:class:`~pytorch_lightning.core.optimizer.LightningOptimizer` provides a -:meth:`~pytorch_lightning.core.optimizer.LightningOptimizer.toggle_model` function as a -:func:`contextlib.contextmanager` for advanced users. - -Here is an example for advanced use-case. - -.. testcode:: python - - # Scenario for a GAN with gradient accumulation every 2 batches and optimized for multiple gpus. - class SimpleGAN(LightningModule): - - def __init__(self): - super().__init__() - self.automatic_optimization = False - - def training_step(self, batch, batch_idx): - # Implementation follows the PyTorch tutorial: - # https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html - g_opt, d_opt = self.optimizers() - - X, _ = batch - X.requires_grad = True - batch_size = X.shape[0] - - real_label = torch.ones((batch_size, 1), device=self.device) - fake_label = torch.zeros((batch_size, 1), device=self.device) - - # Sync and clear gradients - # at the end of accumulation or - # at the end of an epoch. 
- is_last_batch_to_accumulate = \ - (batch_idx + 1) % 2 == 0 or self.trainer.is_last_batch - - g_X = self.sample_G(batch_size) - - ########################## - # Optimize Discriminator # - ########################## - with d_opt.toggle_model(sync_grad=is_last_batch_to_accumulate): - d_x = self.D(X) - errD_real = self.criterion(d_x, real_label) - - d_z = self.D(g_X.detach()) - errD_fake = self.criterion(d_z, fake_label) - - errD = (errD_real + errD_fake) - - self.manual_backward(errD) - if is_last_batch_to_accumulate: - d_opt.step() - d_opt.zero_grad() - - ###################### - # Optimize Generator # - ###################### - with g_opt.toggle_model(sync_grad=is_last_batch_to_accumulate): - d_z = self.D(g_X) - errG = self.criterion(d_z, real_label) - - self.manual_backward(errG) - if is_last_batch_to_accumulate: - g_opt.step() - g_opt.zero_grad() - - self.log_dict({'g_loss': errG, 'd_loss': errD}, prog_bar=True) - ------ - Use closure for LBFGS-like optimizers ------------------------------------- It is a good practice to provide the optimizer with a closure function that performs a ``forward``, ``zero_grad`` and diff --git a/docs/source/common/trainer.rst b/docs/source/common/trainer.rst index ea32ea3dd55dc..0983f0acb9eec 100644 --- a/docs/source/common/trainer.rst +++ b/docs/source/common/trainer.rst @@ -196,6 +196,8 @@ unique seeds across all dataloader workers and processes for :mod:`torch`, :mod: ------- +.. _trainer_flags: + Trainer flags ------------- @@ -658,6 +660,8 @@ Writes logs to disk this often. See Also: - :doc:`logging <../extensions/logging>` +.. _gpus: + gpus ^^^^ @@ -1155,28 +1159,69 @@ precision | -Double precision (64), full precision (32) or half precision (16). -Can all be used on GPU or TPUs. Only double (64) and full precision (32) available on CPU. +Lightning supports either double precision (64), full precision (32), or half precision (16) training. -If used on TPU will use torch.bfloat16 but tensor printing -will still show torch.float32. +Half precision, or mixed precision, is the combined use of 32 and 16 bit floating points to reduce memory footprint during model training. This can result in improved performance, achieving +3X speedups on modern GPUs. .. testcode:: :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available() # default used by the Trainer - trainer = Trainer(precision=32) + trainer = Trainer(precision=32, gpus=1) # 16-bit precision trainer = Trainer(precision=16, gpus=1) # 64-bit precision - trainer = Trainer(precision=64) + trainer = Trainer(precision=64, gpus=1) + + +.. note:: When running on TPUs, torch.float16 will be used but tensor printing will still show torch.float32. + +.. note:: 16-bit precision is not supported on CPUs. + + +.. admonition:: When using PyTorch 1.6+, Lightning uses the native AMP implementation to support 16-bit precision. 16-bit precision with PyTorch < 1.6 is supported by NVIDIA Apex library. + :class: dropdown, warning + + NVIDIA Apex and DDP have instability problems. We recommend upgrading to PyTorch 1.6+ in order to use the native AMP 16-bit precision with multiple GPUs. + + If you are using an earlier version of PyTorch (before 1.6), Lightning uses `Apex `_ to support 16-bit training. + + To use Apex 16-bit training: + + 1. Install Apex + + .. 
code-block:: bash
+
+        # ------------------------
+        # OPTIONAL: on your cluster you might need to load CUDA 10 or 9
+        # depending on how you installed PyTorch
+
+        # see available modules
+        module avail
+
+        # load correct CUDA before install
+        module load cuda-10.0
+        # ------------------------
+
+        # make sure you've loaded a GCC version > 4.0 and < 7.0
+        module load gcc-6.1.0
+
+        pip install --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" https://github.com/NVIDIA/apex
+
+    2. Set the `precision` trainer flag to 16. You can customize the `Apex optimization level `_ by setting the `amp_level` flag.
+
+    .. testcode::
+        :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
+
+        # turn on 16-bit
+        trainer = Trainer(amp_backend="apex", amp_level='O2', precision=16)
+
+    If you need to configure the apex init for your particular use case, or want to customize the
+    16-bit training behaviour, override :meth:`pytorch_lightning.core.LightningModule.configure_apex`.

-Example::
-
-    # one day
-    trainer = Trainer(precision=8|4|2)

 process_position
 ^^^^^^^^^^^^^^^^
@@ -1378,6 +1423,8 @@ track_grad_norm
     # track the 2-norm
     trainer = Trainer(track_grad_norm=2)

+.. _tpu_cores:
+
 tpu_cores
 ^^^^^^^^^
diff --git a/docs/source/guides/speed.rst b/docs/source/guides/speed.rst
new file mode 100644
index 0000000000000..ece806558c76c
--- /dev/null
+++ b/docs/source/guides/speed.rst
@@ -0,0 +1,482 @@
+.. testsetup:: *
+
+    from pytorch_lightning.trainer.trainer import Trainer
+    from pytorch_lightning.callbacks.early_stopping import EarlyStopping
+    from pytorch_lightning.core.lightning import LightningModule
+
+.. _speed:
+
+#######################
+Speed up model training
+#######################
+
+There are multiple ways you can speed up your model's time to convergence:
+
+* GPU/TPU training
+
+* Mixed precision (16-bit) training
+
+* Control Training Epochs
+
+* Control Validation Frequency
+
+* Limit Dataset Size
+
+* Preload Data Into RAM
+
+* Model Toggling
+
+* Set Grads to None
+
+* Things to avoid
+
+****************
+GPU/TPU training
+****************
+
+**Use when:** Whenever possible!
+
+With Lightning, running on GPUs, TPUs or multiple nodes is a simple switch of a flag.
+
+GPU training
+============
+
+Lightning supports a variety of plugins to further speed up distributed GPU training. Most notably:
+
+* :class:`~pytorch_lightning.plugins.training_type.DDPPlugin`
+* :class:`~pytorch_lightning.plugins.training_type.DDPShardedPlugin`
+* :class:`~pytorch_lightning.plugins.training_type.DeepSpeedPlugin`
+
+.. code-block:: python
+
+    # run on 1 gpu
+    trainer = Trainer(gpus=1)
+
+    # train on 8 gpus, using the DDP plugin
+    trainer = Trainer(gpus=8, accelerator="ddp")
+
+    # train on multiple GPUs across nodes (uses 8 gpus in total)
+    trainer = Trainer(gpus=2, num_nodes=4)
+
+
+GPU Training Speedup Tips
+-------------------------
+
+When training on single or multiple GPU machines, Lightning offers a host of advanced optimizations to improve throughput, memory efficiency, and model scaling.
+Refer to :doc:`Advanced GPU Optimized Training for more details <../advanced/advanced_gpu>`.
+
+Prefer DDP over DP
+^^^^^^^^^^^^^^^^^^
+:class:`~pytorch_lightning.plugins.training_type.DataParallelPlugin` performs three GPU transfers for EVERY batch:
+
+1. Copy model to device.
+2. Copy data to device.
+3. Copy outputs of each device back to master.
+
+Whereas :class:`~pytorch_lightning.plugins.training_type.DDPPlugin` only performs 1 transfer to sync gradients, making DDP MUCH faster than DP.
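+
+As a minimal sketch (assuming a single machine with 4 GPUs, and that the ``dp`` accelerator is available in your Lightning version), switching between the two is just a change of the ``accelerator`` flag:
+
+.. code-block:: python
+
+    # DP: the model and the data are copied to every device on every batch
+    trainer = Trainer(gpus=4, accelerator="dp")
+
+    # DDP: one process per GPU, only gradients are synced between devices
+    trainer = Trainer(gpus=4, accelerator="ddp")
+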
+ + +When using DDP set find_unused_parameters=False +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +By default we have set ``find_unused_parameters`` to True for compatibility issues that have arisen in the past (see the `discussion `_ for more information). +This by default comes with a performance hit, and can be disabled in most cases. + +.. code-block:: python + + from pytorch_lightning.plugins import DDPPlugin + + trainer = pl.Trainer( + gpus=2, + plugins=DDPPlugin(find_unused_parameters=False), + ) + +Dataloaders +^^^^^^^^^^^ +When building your DataLoader set ``num_workers > 0`` and ``pin_memory=True`` (only for GPUs). + +.. code-block:: python + + Dataloader(dataset, num_workers=8, pin_memory=True) + +num_workers +""""""""""" + +The question of how many workers to specify in ``num_workers`` is tricky. Here's a summary of +some references, [`1 `_], and our suggestions: + +1. ``num_workers=0`` means ONLY the main process will load batches (that can be a bottleneck). +2. ``num_workers=1`` means ONLY one worker (just not the main process) will load data but it will still be slow. +3. The ``num_workers`` depends on the batch size and your machine. +4. A general place to start is to set ``num_workers`` equal to the number of CPU cores on that machine. You can get the number of CPU cores in python using `os.cpu_count()`, but note that depending on your batch size, you may overflow RAM memory. + +.. warning:: Increasing ``num_workers`` will ALSO increase your CPU memory consumption. + +The best thing to do is to increase the ``num_workers`` slowly and stop once you see no more improvement in your training speed. + +Spawn +""""" +When using ``accelerator=ddp_spawn`` or training on TPUs, the way multiple GPUs/TPU cores are used is by calling ``.spawn()`` under the hood. +The problem is that PyTorch has issues with ``num_workers > 0`` when using ``.spawn()``. For this reason we recommend you +use ``accelerator=ddp`` so you can increase the ``num_workers``, however your script has to be callable like so: + +.. code-block:: bash + + python my_program.py + + +TPU training +============ + +You can set the ``tpu_cores`` trainer flag to 1 or 8 cores. + +.. code-block:: python + + # train on 1 TPU core + trainer = Trainer(tpu_cores=1) + + # train on 8 TPU cores + trainer = Trainer(tpu_cores=8) + +To train on more than 8 cores (ie: a POD), +submit this script using the xla_dist script. + +Example:: + + python -m torch_xla.distributed.xla_dist + --tpu=$TPU_POD_NAME + --conda-env=torch-xla-nightly + --env=XLA_USE_BF16=1 + -- python your_trainer_file.py + + +Read more in our :ref:`accelerators` and :ref:`plugins` guides. + + +----------- + +.. _amp: + +********************************* +Mixed precision (16-bit) training +********************************* + +**Use when:** + +* You want to optimize for memory usage on a GPU. +* You have a GPU that supports 16 bit precision (NVIDIA pascal architecture or newer). +* Your optimization algorithm (training_step) is numerically stable. +* You want to be the cool person in the lab :p + +.. raw:: html + + + +| + + +Mixed precision combines the use of both 32 and 16 bit floating points to reduce memory footprint during model training, resulting in improved performance, achieving +3X speedups on modern GPUs. + +Lightning offers mixed precision or 16-bit training for GPUs and TPUs. + + +.. 
testcode::
+    :skipif: not _APEX_AVAILABLE and not _NATIVE_AMP_AVAILABLE or not torch.cuda.is_available()
+
+    # 16-bit precision
+    trainer = Trainer(precision=16, gpus=4)
+
+
+----------------
+
+
+***********************
+Control Training Epochs
+***********************
+
+**Use when:** You run a hyperparameter search to find good initial parameters and want to save time, cost (money), or power (environment).
+It can allow you to be more cost-efficient and also run more experiments at the same time.
+
+You can use Trainer flags to force training for a minimum number of epochs or limit it to a maximum number of epochs. Use the `min_epochs` and `max_epochs` Trainer flags to set the number of epochs to run.
+
+.. testcode::
+
+    # DEFAULT
+    trainer = Trainer(min_epochs=1, max_epochs=1000)
+
+
+If you are running iteration-based training, i.e. an infinite / iterable dataloader, you can also control the number of steps with the `min_steps` and `max_steps` flags:
+
+.. testcode::
+
+    trainer = Trainer(max_steps=1000)
+
+    trainer = Trainer(min_steps=100)
+
+You can also interrupt training based on training time:
+
+.. testcode::
+
+    # Stop after 12 hours of training or when reaching 10 epochs (string)
+    trainer = Trainer(max_time="00:12:00:00", max_epochs=10)
+
+    # Stop after 1 day and 5 hours (dict)
+    trainer = Trainer(max_time={"days": 1, "hours": 5})
+
+Learn more in our :ref:`trainer_flags` guide.
+
+
+----------------
+
+****************************
+Control Validation Frequency
+****************************
+
+Check validation every n epochs
+===============================
+
+**Use when:** You have a small dataset and want to run fewer validation checks.
+
+You can limit validation checks to only run every n epochs using the `check_val_every_n_epoch` Trainer flag.
+
+.. testcode::
+
+    # DEFAULT
+    trainer = Trainer(check_val_every_n_epoch=1)
+
+
+Set validation check frequency within 1 training epoch
+======================================================
+
+**Use when:** You have a large training dataset and want to run mid-epoch validation checks.
+
+For large datasets, it's often desirable to check validation multiple times within a training loop.
+Pass in a float to check that often within 1 training epoch. Pass in an int `k` to check every `k` training batches.
+Must use an `int` if using an `IterableDataset`.
+
+.. testcode::
+
+    # DEFAULT
+    trainer = Trainer(val_check_interval=0.95)
+
+    # check every .25 of an epoch
+    trainer = Trainer(val_check_interval=0.25)
+
+    # check every 100 train batches (ie: for `IterableDatasets` or fixed frequency)
+    trainer = Trainer(val_check_interval=100)
+
+Learn more in our :ref:`trainer_flags` guide.
+
+----------------
+
+******************
+Limit Dataset Size
+******************
+
+Use data subset for training, validation, and test
+==================================================
+
+**Use when:** Debugging or running huge datasets.
+
+If you don't want to check 100% of the training/validation/test set, set these flags:
+
+.. testcode::
+
+    # DEFAULT
+    trainer = Trainer(
+        limit_train_batches=1.0,
+        limit_val_batches=1.0,
+        limit_test_batches=1.0
+    )
+
+    # check 10%, 20%, 30% only, respectively for training, validation and test set
+    trainer = Trainer(
+        limit_train_batches=0.1,
+        limit_val_batches=0.2,
+        limit_test_batches=0.3
+    )
+
+If you also pass ``shuffle=True`` to the dataloader, a different random subset of your dataset will be used for each epoch; otherwise the same subset will be used for all epochs.
+
+..
note:: ``limit_train_batches``, ``limit_val_batches`` and ``limit_test_batches`` will be overwritten by ``overfit_batches`` if ``overfit_batches`` > 0. ``limit_val_batches`` will be ignored if ``fast_dev_run=True``. + +.. note:: If you set ``limit_val_batches=0``, validation will be disabled. + +Learn more in our :ref:`trainer_flags` guide. + +----- + +********************* +Preload Data Into RAM +********************* + +**Use when:** You need access to all samples in a dataset at once. + +When your training or preprocessing requires many operations to be performed on entire dataset(s), it can +sometimes be beneficial to store all data in RAM given there is enough space. +However, loading all data at the beginning of the training script has the disadvantage that it can take a long +time and hence it slows down the development process. Another downside is that in multiprocessing (e.g. DDP) +the data would get copied in each process. +One can overcome these problems by copying the data into RAM in advance. +Most UNIX-based operating systems provide direct access to tmpfs through a mount point typically named ``/dev/shm``. + +0. Increase shared memory if necessary. Refer to the documentation of your OS how to do this. + +1. Copy training data to shared memory: + + .. code-block:: bash + + cp -r /path/to/data/on/disk /dev/shm/ + +2. Refer to the new data root in your script or command line arguments: + + .. code-block:: python + + datamodule = MyDataModule(data_root="/dev/shm/my_data") + +--------- + +************** +Model Toggling +************** + +**Use when:** Performing gradient accumulation with multiple optimizers in a +distributed setting. + +Here is an explanation of what it does: + +* Considering the current optimizer as A and all other optimizers as B. +* Toggling means that all parameters from B exclusive to A will have their ``requires_grad`` attribute set to ``False``. +* Their original state will be restored when exiting the context manager. + +When performing gradient accumulation, there is no need to perform grad synchronization during the accumulation phase. +Setting ``sync_grad`` to ``False`` will block this synchronization and improve your training speed. + +:class:`~pytorch_lightning.core.optimizer.LightningOptimizer` provides a +:meth:`~pytorch_lightning.core.optimizer.LightningOptimizer.toggle_model` function as a +:func:`contextlib.contextmanager` for advanced users. + +Here is an example for advanced use-case: + +.. testcode:: + + # Scenario for a GAN with gradient accumulation every 2 batches and optimized for multiple gpus. + class SimpleGAN(LightningModule): + + def __init__(self): + super().__init__() + self.automatic_optimization = False + + def training_step(self, batch, batch_idx): + # Implementation follows the PyTorch tutorial: + # https://pytorch.org/tutorials/beginner/dcgan_faces_tutorial.html + g_opt, d_opt = self.optimizers() + + X, _ = batch + X.requires_grad = True + batch_size = X.shape[0] + + real_label = torch.ones((batch_size, 1), device=self.device) + fake_label = torch.zeros((batch_size, 1), device=self.device) + + # Sync and clear gradients + # at the end of accumulation or + # at the end of an epoch. 
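+            # (sync_grad is True only on every 2nd batch, or on the epoch's
+            # last batch, so DDP skips the gradient all-reduce in between)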
+            is_last_batch_to_accumulate = \
+                (batch_idx + 1) % 2 == 0 or self.trainer.is_last_batch
+
+            g_X = self.sample_G(batch_size)
+
+            ##########################
+            # Optimize Discriminator #
+            ##########################
+            with d_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
+                d_x = self.D(X)
+                errD_real = self.criterion(d_x, real_label)
+
+                d_z = self.D(g_X.detach())
+                errD_fake = self.criterion(d_z, fake_label)
+
+                errD = (errD_real + errD_fake)
+
+                self.manual_backward(errD)
+                if is_last_batch_to_accumulate:
+                    d_opt.step()
+                    d_opt.zero_grad()
+
+            ######################
+            # Optimize Generator #
+            ######################
+            with g_opt.toggle_model(sync_grad=is_last_batch_to_accumulate):
+                d_z = self.D(g_X)
+                errG = self.criterion(d_z, real_label)
+
+                self.manual_backward(errG)
+                if is_last_batch_to_accumulate:
+                    g_opt.step()
+                    g_opt.zero_grad()
+
+            self.log_dict({'g_loss': errG, 'd_loss': errD}, prog_bar=True)
+
+-----
+
+*****************
+Set Grads to None
+*****************
+
+In order to modestly improve performance, you can override :meth:`~pytorch_lightning.core.lightning.LightningModule.optimizer_zero_grad`.
+
+For a more detailed explanation of the pros / cons of this technique,
+read `this `_ documentation by the PyTorch team.
+
+.. testcode::
+
+    class Model(LightningModule):
+
+        def optimizer_zero_grad(self, epoch, batch_idx, optimizer, optimizer_idx):
+            optimizer.zero_grad(set_to_none=True)
+
+
+-----
+
+***************
+Things to avoid
+***************
+
+.item(), .numpy(), .cpu()
+=========================
+Don't call ``.item()`` anywhere in your code. Use ``.detach()`` instead to remove the connected graph calls. Lightning
+takes a great deal of care to be optimized for this.
+
+----------
+
+empty_cache()
+=============
+Don't call this unnecessarily! Every time you call this, ALL your GPUs have to wait to sync.
+
+----------
+
+Transferring tensors to device
+==============================
+LightningModules know what device they are on! Construct tensors on the device directly to avoid CPU->Device transfer.
+
+.. code-block:: python
+
+    # bad
+    t = torch.rand(2, 2).cuda()
+
+    # good (self is LightningModule)
+    t = torch.rand(2, 2, device=self.device)
+
+
+For tensors that need to be model attributes, it is best practice to register them as buffers in the module's
+``__init__`` method:
+
+.. code-block:: python
+
+    # bad
+    self.t = torch.rand(2, 2, device=self.device)
+
+    # good
+    self.register_buffer("t", torch.rand(2, 2))
diff --git a/docs/source/index.rst b/docs/source/index.rst
index e7d0030e0e6f6..61abc2b010834 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -21,8 +21,8 @@ PyTorch Lightning Documentation
    :name: guides
    :caption: Best practices

+   guides/speed
    starter/style_guide
-   benchmarking/performance
    Lightning project template
    benchmarking/benchmarks

@@ -98,12 +98,10 @@ PyTorch Lightning Documentation
    clouds/cloud_training
    clouds/cluster
-   advanced/amp
    common/child_modules
    common/debugging
    common/loggers
    common/early_stopping
-   common/fast_training
    common/hyperparameters
    common/lightning_cli
    advanced/lr_finder
diff --git a/docs/source/starter/new-project.rst b/docs/source/starter/new-project.rst
index 74ad30102b4f8..07bf3624560a0 100644
--- a/docs/source/starter/new-project.rst
+++ b/docs/source/starter/new-project.rst
@@ -219,7 +219,7 @@ The :class:`~pytorch_lightning.trainer.Trainer` automates:
 * Tensorboard (see :doc:`loggers <../common/loggers>` options)
 * :doc:`Multi-GPU <../advanced/multi_gpu>` support
 * :doc:`TPU <../advanced/tpu>`
-* :doc:`AMP <../advanced/amp>` support
+* :ref:`16-bit precision AMP <amp>` support

 .. tip:: If you prefer to manually manage optimizers you can use the :ref:`manual_opt` mode (ie: RL, GANs, etc...).