diff --git a/tutorials/pytorch/tut3_mixed_precision/README.md b/tutorials/pytorch/tut3_mixed_precision/README.md index 6096846..87771f5 100644 --- a/tutorials/pytorch/tut3_mixed_precision/README.md +++ b/tutorials/pytorch/tut3_mixed_precision/README.md @@ -1,91 +1,90 @@ -Half and mixed precision in PopTorch -==================================== - -This tutorial shows how to use half and mixed precision in PopTorch with the example task of training a simple CNN model on a single Graphcore IPU (Mk1 or Mk2). - -If you are not familiar with PopTorch, you may need to go through this [introduction to PopTorch tutorial](../tut1_basics) first. +# Half and mixed precision in PopTorch +This tutorial shows how to use half and mixed precision in PopTorch with the +example task of training a simple CNN model on a single +Graphcore IPU (Mk1 or Mk2). Requirements: - - an installed Poplar SDK. See the Getting Started guide for your IPU system for details of how to install the SDK; - - Other Python modules: `pip install -r requirements.txt` - -Table of Contents -================= -* [General](#general) - + [Motives for half precision](#motives-for-half-precision) - + [Numerical stability](#numerical-stability) - - [Loss scaling](#loss-scaling) - - [Stochastic rounding](#stochastic-rounding) -* [Train a model in half precision](#train-a-model-in-half-precision) - + [Import the packages](#import-the-packages) - + [Build the model](#build-the-model) - - [Casting a model's parameters](#casting-a-model-s-parameters) - - [Casting a single layer's parameters](#casting-a-single-layer-s-parameters) - + [Prepare the data](#prepare-the-data) - + [Optimizers and loss scaling](#optimizers-and-loss-scaling) - + [Set PopTorch's options](#set-poptorch-s-options) - - [Stochastic rounding](#stochastic-rounding) - - [Partials data type](#partials-data-type) - + [Train the model](#train-the-model) - + [Evaluate the model](#evaluate-the-model) -* [Visualise the memory footprint](#visualise-the-memory-footprint) -* [Debug floating-point exceptions](#debug-floating-point-exceptions) -* [PopTorch tracing](#poptorch-tracing) -* [Summary](#summary) +- an installed Poplar SDK. See the Getting Started guide for your IPU +hardware for details of how to install the SDK; +- Other Python modules: `pip install -r requirements.txt` # General ## Motives for half precision -Data is stored in memory, and some formats to store that data require less memory than others. In a device's memory, when it comes to numerical data, we use either integers or real numbers. Real numbers are represented by one of several floating point formats, which vary in how many bits they use to represent each number. Using more bits allows for greater precision and a wider range of representable numbers, whereas using fewer bits allows for faster calculations and reduces memory and power usage. In deep learning applications, where less precise calculations are acceptable and throughput is critical, using a lower precision format can provide substantial gains in performance. +Data is stored in memory, and some formats to store that data require less memory than others. In a device's memory, +when it comes to numerical data, we use either integers or real numbers. Real numbers are represented by one of several +floating point formats, which vary in how many bits they use to represent each number. 
Using more bits allows for +greater precision and a wider range of representable numbers, whereas using fewer bits allows for faster calculations +and reduces memory and power usage. In deep learning applications, where less precise calculations are acceptable and +throughput is critical, using a lower precision format can provide substantial gains in performance. The Graphcore IPU provides native support for two floating-point formats: - IEEE single-precision, which uses 32 bits for each number (FP32) - IEEE half-precision, which uses 16 bits for each number (FP16) -Some applications which use FP16 do all calculations in FP16, whereas others use a mix of FP16 and FP32. The latter approach is known as *mixed precision*. +Some applications which use FP16 do all calculations in FP16, whereas others use a mix of FP16 and FP32. The latter +approach is known as *mixed precision*. -In this tutorial, we are going to talk about real numbers represented in FP32 and FP16, and how to use these data types (dtypes) in PopTorch in order to reduce the memory requirements of a model. +In this tutorial, we are going to talk about real numbers represented in FP32 and FP16, and how to use these data types +(dtypes) in PopTorch in order to reduce the memory requirements of a model. ## Numerical stability -Numeric stability refers to how a model's performance is affected by the use of a lower-precision dtype. We say an operation is "numerically unstable" in FP16 if running it in this dtype causes the model to have worse accuracy compared to running the operation in FP32. Two techniques that can be used to increase the numerical stability of a model are loss scaling and stochastic rounding. +Numeric stability refers to how a model's performance is affected by the use of a lower-precision dtype. We say an +operation is "numerically unstable" in FP16 if running it in this dtype causes the model to have worse accuracy compared + to running the operation in FP32. Two techniques that can be used to increase the numerical stability of a model are + loss scaling and stochastic rounding. ### Loss scaling -A numerical issue that can occur when training a model in half-precision is that the gradients can underflow. This can be difficult to debug because the model will simply appear to not be training, and can be especially damaging because any gradients which underflow will propagate a value of 0 backwards to other gradient calculations. +A numerical issue that can occur when training a model in half-precision is that the gradients can underflow. This can +be difficult to debug because the model will simply appear to not be training, and can be especially damaging because +any gradients which underflow will propagate a value of 0 backwards to other gradient calculations. -The standard solution to this is known as *loss scaling*, which consists of scaling up the loss value right before the start of backpropagation to prevent numerical underflow of the gradients. Instructions on how to use loss scaling will be discussed later in this tutorial. +The standard solution to this is known as *loss scaling*, which consists of scaling up the loss value right before the +start of backpropagation to prevent numerical underflow of the gradients. Instructions on how to use loss scaling will +be discussed later in this tutorial. ### Stochastic rounding -When training in half or mixed precision, numbers multiplied by each other will need to be rounded in order to fit into the floating point format used. 
Stochastic rounding is the process of using a probabilistic equation for the rounding. Instead of always rounding to the nearest representable number, we round up or down with a probability such that the expected value after rounding is equal to the value before rounding. Since the expected value of an addition after rounding is equal to the exact result of the addition, the expected value of a sum is also its exact value. - -This means that on average, the values of the parameters of a network will be close to the values they would have had if a higher-precision format had been used. The added bonus of using stochastic rounding is that the parameters can be stored in FP16, which means the parameters can be stored using half as much memory. This can be especially helpful when training with small batch sizes, where the memory used to store the parameters is proportionally greater than the memory used to store parameters when training with large batch sizes. +When training in half or mixed precision, numbers multiplied by each other will need to be rounded in order to fit into +the floating point format used. Stochastic rounding is the process of using a probabilistic equation for the rounding. +Instead of always rounding to the nearest representable number, we round up or down with a probability such that the +expected value after rounding is equal to the value before rounding. Since the expected value of an addition after +rounding is equal to the exact result of the addition, the expected value of a sum is also its exact value. -It is highly recommended that you enable this feature when training neural networks with FP16 weights. The instructions to enable it in PopTorch are presented later in this tutorial. +This means that on average, the values of the parameters of a network will be close to the values they would have had if +a higher-precision format had been used. The added bonus of using stochastic rounding is that the parameters can be +stored in FP16, which means the parameters can be stored using half as much memory. This can be especially helpful when +training with small batch sizes, where the memory used to store the parameters is proportionally greater than the memory +used to store parameters when training with large batch sizes. -# Train a model in half precision +It is highly recommended that you enable this feature when training neural networks with FP16 weights. The instructions +to enable it in PopTorch are presented later in this tutorial. -## Import the packages +Import the packages -Among the packages we will use, there is `torchvision` from which we will download a dataset and construct a simple model, and `tqdm` which is a simple package to create progress bars so that we can visually monitor the progress of our training job. ```python import torch -import poptorch +import torch.nn as nn import torchvision -from torchvision import transforms -from tqdm import tqdm +import torchvision.transforms as transforms +import poptorch +from tqdm.auto import tqdm ``` ## Build the model -We use the same model as in [the previous tutorials on PopTorch](../). Just like in the [previous tutorial](../tut2_efficient_data_loading), we are using larger images (128x128) to simulate a heavier data load. This will make the difference in memory between FP32 and FP16 meaningful enough to showcase in this tutorial. +We use the same model as in [the previous tutorials on PopTorch](../). 
+Just like in the [previous tutorial](../tut2_efficient_data_loading), we are using larger images (128x128) to simulate
+a heavier data load. This will make the difference in memory between FP32 and FP16 meaningful enough to showcase
+in this tutorial.
+
 ```python
-# Build the model
 class CustomModel(nn.Module):
     def __init__(self):
         super().__init__()
@@ -110,43 +109,81 @@ class CustomModel(nn.Module):
         if self.training:
            return x, self.loss(x, labels)
         return x
-
-model = CustomModel()
 ```
->**NOTE:** The model inherits `self.training` from `torch.nn.Module` which initialises its value to True. Use `model.eval()` to set it to False and `model.train()` to switch it back to True.
+>**NOTE:** The model inherits `self.training` from `torch.nn.Module` which initialises its value to True.
+>Use `model.eval()` to set it to False and `model.train()` to switch it back to True.
+
+Choose parameters.
+
+>**NOTE** If you wish to modify these parameters for educational purposes, make sure you re-run this cell and all
+>the cells below it:
+
+
+```python
+# Cast the model parameters to FP16
+model_half = True
+
+# Cast the data to FP16
+data_half = True
+
+# Cast the optimiser's gradient accumulation type to FP16
+optimizer_half = True
+
+# Use stochastic rounding
+stochastic_rounding = True
+
+# Set partials data type to FP16
+partials_half = True
+```

 ### Casting a model's parameters

-The default data type of the parameters of a PyTorch module is FP32 (`torch.float32`). To convert all the parameters of a model to be represented in FP16 (`torch.float16`), an operation we will call _downcasting_, we simply do:
+The default data type of the parameters of a PyTorch module is FP32 (`torch.float32`). To convert all the parameters
+of a model to be represented in FP16 (`torch.float16`), an operation we will call _downcasting_, we simply do:
+
 ```python
-model = model.half()
+model = CustomModel()
+
+if model_half:
+    model = model.half()
 ```

 For this tutorial, we will cast all the model's parameters to FP16.

 ### Casting a single layer's parameters

-For bigger or more complex models, downcasting all the layers may generate numerical instabilities and cause underflows. While the PopTorch and the IPU offer features to alleviate those issues, it is still sensible for those models to cast only the parameters of certain layers and observe how it affects the overall training job. To downcast the parameters of a single layer, we select the layer by its _name_ and use `half()`:
+For bigger or more complex models, downcasting all the layers may generate numerical instabilities and cause underflows.
+While PopTorch and the IPU offer features to alleviate those issues, it is still sensible for those models to cast
+only the parameters of certain layers and observe how it affects the overall training job. To downcast the parameters
+of a single layer, we select the layer by its _name_ and use `half()`:
+
 ```python
 model.conv1 = model.conv1.half()
 ```

 If you would like to upcast a layer instead, you can use `model.conv1.float()`.
-
 >**NOTE**: One can print out a list of the components of a PyTorch model, with their names, by doing `print(model)`.

 ## Prepare the data

-We will use the FashionMNIST dataset that we download from `torchvision`. The last stage of the pipeline will have to convert the data type of the tensors representing the images to `torch.half` (equivalent to `torch.float16`) so that our input data is also in FP16.
This has the advantage of reducing the bandwidth needed between the host and the IPU. +We will use the FashionMNIST dataset that we download from `torchvision`. The last stage of the pipeline will have to +convert the data type of the tensors representing the images to `torch.half` (equivalent to `torch.float16`) so that +our input data is also in FP16. This has the advantage of reducing the bandwidth needed between the host and the IPU. + ```python -transform = transforms.Compose([transforms.Resize(128), - transforms.ToTensor(), - transforms.Normalize((0.5,), (0.5,)), - transforms.ConvertImageDtype(torch.half)]) +if data_half: + transform = transforms.Compose([transforms.Resize(128), + transforms.ToTensor(), + transforms.Normalize((0.5,), (0.5,)), + transforms.ConvertImageDtype(torch.half)]) +else: + transform = transforms.Compose([transforms.Resize(128), + transforms.ToTensor(), + transforms.Normalize((0.5,), (0.5,))]) train_dataset = torchvision.datasets.FashionMNIST("./datasets/", transform=transform, @@ -158,56 +195,93 @@ test_dataset = torchvision.datasets.FashionMNIST("./datasets/", train=False) ``` -If the model has not been converted to half precision, but the input data has, then some layers of the model may be converted to use FP16. Conversely, if the input data has not been converted, but the model has, then the input tensors will be converted to FP16 on the IPU. This behaviour is the opposite of PyTorch's default behaviour. +If the model has not been converted to half precision, but the input data has, then some layers of the model may be +converted to use FP16. Conversely, if the input data has not been converted, but the model has, then the input tensors +will be converted to FP16 on the IPU. This behaviour is the opposite of PyTorch's default behaviour. ->**NOTE**: To stop PopTorch automatically downcasting tensors and parameters, so that it preserves PyTorch's default behaviour (upcasting), use the option `opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)`. +>**NOTE**: To stop PopTorch automatically downcasting tensors and parameters, so that it preserves PyTorch's default +>behaviour (upcasting), use the option: +>`opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)`. ## Optimizers and loss scaling -The value of the loss scaling factor can be passed as a parameter to the optimisers in `poptorch.optim`. In this tutorial, we will set it to 1024 for an AdamW optimizer. For all optimisers (except `poptorch.optim.SGD`), using a model in FP16 requires the argument `accum_type` to be set to `torch.float16` as well: +The value of the loss scaling factor can be passed as a parameter to the optimisers in `poptorch.optim`. In this +tutorial, we will set it to 1024 for an AdamW optimizer. For all optimisers (except `poptorch.optim.SGD`), using +a model in FP16 requires the argument `accum_type` to be set to `torch.float16` as well: + + ```python -optimizer = poptorch.optim.AdamW(model.parameters(), - lr=0.001, - loss_scaling=1024, - accum_type=torch.float16) +if optimizer_half: + optimizer = poptorch.optim.AdamW(model.parameters(), + lr=0.001, + loss_scaling=1024, + accum_type=torch.float16) +else: + optimizer = poptorch.optim.AdamW(model.parameters(), + lr=0.001, + accum_type=torch.float32) ``` -While higher values of `loss_scaling` minimize underflows, values that are too high can also generate overflows as well as hurt convergence of the loss. The optimal value depends on the model and the training job. 
This is therefore a hyperparameter for you to tune. +While higher values of `loss_scaling` minimize underflows, values that are too high can also generate overflows as well +as hurt convergence of the loss. The optimal value depends on the model and the training job. +This is therefore a hyperparameter for you to tune. ## Set PopTorch's options -To configure some features of the IPU and to be able to use PopTorch's classes in the next sections, we will need to create an instance of `poptorch.Options` which stores the options we will be using. We covered some of the available options in the [introductory tutorial for PopTorch](https://github.com/graphcore/examples/tree/master/tutorials/pytorch/tut1_basics). +To configure some features of the IPU and to be able to use PopTorch's classes in the next sections, we will need to +create an instance of `poptorch.Options` which stores the options we will be using. +We covered some of the available options in: +[introductory tutorial for PopTorch](https://github.com/graphcore/examples/tree/master/tutorials/pytorch/tut1_basics). Let's initialise our options object before we talk about the options we will use: + ```python opts = poptorch.Options() ``` ->**NOTE**: This tutorial has been designed to be run on a single IPU. If you do not have access to an IPU, you can use the option [`useIpuModel`](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/overview.html#poptorch.Options.useIpuModel) to run a simulation on CPU instead. You can read more on the IPU Model and its limitations [here](https://docs.graphcore.ai/projects/poplar-user-guide/en/latest/poplar_programs.html#programming-with-poplar). +>**NOTE**: This tutorial has been designed to be run on a single IPU. If you do not have access to an IPU, you can use +>the option [`useIpuModel`](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/overview.html#poptorch.Options.useIpuModel) to run a simulation on CPU instead. +>You can read more on the IPU Model and its limitations [here](https://docs.graphcore.ai/projects/poplar-user-guide/en/latest/poplar_programs.html#programming-with-poplar). ### Stochastic rounding -With the IPU, stochastic rounding is implemented directly in the hardware and only requires you to enable it. To do so, there is the option `enableStochasticRounding` in the `Precision` namespace of `poptorch.Options`. This namespace holds other options for using mixed precision that we will talk about. To enable stochastic rounding, we do: +With the IPU, stochastic rounding is implemented directly in the hardware and only requires you to enable it. +To do so, there is the option `enableStochasticRounding` in the `Precision` namespace of `poptorch.Options`. +This namespace holds other options for using mixed precision that we will talk about. +To enable stochastic rounding, we do: + ```python -opts.Precision.enableStochasticRounding(True) +if stochastic_rounding: + opts.Precision.enableStochasticRounding(True) ``` With the IPU Model, this option won't change anything since stochastic rounding is implemented on the IPU. ### Partials data type -Matrix multiplications and convolutions have intermediate states we call _partials_. Those partials can be stored in FP32 or FP16. There is a memory benefit to using FP16 partials but the main benefit is that it can increase the throughput for some models without affecting accuracy. However there is a risk of increasing numerical instability if the values being multiplied are small, due to underflows. 
The default data type of partials is the input's data type(FP16). For this tutorial, we set partials to FP32 just to showcase how it can be done. We use the option `setPartialsType` to do it:
+Matrix multiplications and convolutions have intermediate states we call _partials_. Those partials can be stored
+in FP32 or FP16. There is a memory benefit to using FP16 partials but the main benefit is that it can increase
+the throughput for some models without affecting accuracy. However there is a risk of increasing numerical instability
+if the values being multiplied are small, due to underflows. The default data type of partials is the input's
+data type (FP16). In this tutorial, the `partials_half` flag selects the partials type: FP16 partials when it is
+`True`, FP32 otherwise. We use the option `setPartialsType` to do it:
+
 ```python
-opts.Precision.setPartialsType(torch.float)
+if partials_half:
+    opts.Precision.setPartialsType(torch.half)
+else:
+    opts.Precision.setPartialsType(torch.float)
 ```

 ## Train the model

-We can now train the model. After we have set all our options, we reuse our `poptorch.Options` instance for the training `poptorch.DataLoader` that we will be using:
+We can now train the model. After we have set all our options, we reuse our `poptorch.Options` instance for
+the training `poptorch.DataLoader` that we will be using:
+
 ```python
 train_dataloader = poptorch.DataLoader(opts,
@@ -219,8 +293,8 @@ train_dataloader = poptorch.DataLoader(opts,

 We first make sure our model is in training mode, and then wrap it with `poptorch.trainingModel`.

+
 ```python
-model.train()
 poptorch_model = poptorch.trainingModel(model,
                                         options=opts,
                                         optimizer=optimizer)
@@ -228,6 +302,7 @@ poptorch_model = poptorch.trainingModel(model,

 Let's run the training loop for 10 epochs.

+
 ```python
 epochs = 10
 for epoch in tqdm(range(epochs), desc="epochs"):
@@ -235,46 +310,69 @@ for epoch in tqdm(range(epochs), desc="epochs"):
     total_loss = 0.0
     for data, labels in tqdm(train_dataloader, desc="batches", leave=False):
         output, loss = poptorch_model(data, labels)
         total_loss += loss
+poptorch_model.detachFromDevice()
 ```

 Our new model is now trained and we can start its evaluation.

 ## Evaluate the model

-Some PyTorch's operations, such as CNNs, are not supported in FP16 on the CPU, so we will evaluate our fine-tuned model in mixed precision on an IPU using `poptorch.inferenceModel`.
+Some PyTorch operations, such as CNNs, are not supported in FP16 on the CPU, so we will evaluate our trained model
+in mixed precision on an IPU using `poptorch.inferenceModel`.
+
 ```python
 model.eval()
 poptorch_model_inf = poptorch.inferenceModel(model, options=opts)
-
 test_dataloader = poptorch.DataLoader(opts,
                                       test_dataset,
                                       batch_size=32,
                                       num_workers=40)
+```
+
+Run inference on the labelled data:
+
+```python
 predictions, labels = [], []
 for data, label in test_dataloader:
-    predictions += poptorch_model_inf(data).data.max(dim=1).indices
+    predictions += poptorch_model_inf(data).data.float().max(dim=1).indices
     labels += label
-
-print(f"Eval accuracy on IPU: {100 * (1 - torch.count_nonzero(torch.sub(torch.tensor(labels), torch.tensor(predictions))) / len(labels)):.2f}%")
+poptorch_model_inf.detachFromDevice()
 ```

 We obtained an accuracy of approximately 84% on the test dataset.
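+
+The print statement below computes this by counting the non-zero differences between the predicted and true labels.
+An equivalent, more direct formulation is the following sketch (a hypothetical alternative, assuming the
+`predictions` and `labels` lists built in the inference loop above):
+
+```python
+# Fraction of predictions that match the labels, expressed as a percentage
+accuracy = torch.eq(torch.tensor(predictions), torch.tensor(labels)).float().mean().item()
+print(f"Eval accuracy (sketch): {100 * accuracy:.2f}%")
+```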
+
+```python
+print(f"""Eval accuracy on IPU: {100 *
+      (1 - torch.count_nonzero(torch.sub(torch.tensor(labels),
+      torch.tensor(predictions))) / len(labels)):.2f}%""")
+```
+
+    Eval accuracy on IPU: 83.13%
+
+
 # Visualise the memory footprint

-We can visually compare the memory footprint on the IPU of the model trained in FP16 and FP32, thanks to Graphcore's [PopVision Graph Analyser](https://docs.graphcore.ai/projects/graphcore-popvision-user-guide/en/latest/graph/graph.html).
+We can visually compare the memory footprint on the IPU of the model trained in FP16 and FP32,
+thanks to Graphcore's [PopVision Graph Analyser](https://docs.graphcore.ai/projects/graphcore-popvision-user-guide/en/latest/graph/graph.html).

-We generated memory reports of the same training session as covered in this tutorial for both cases: with and without downcasting the model with `model.half()`. Here is the figure of both memory footprints, where "source" and "target" represent the model trained in FP16 and FP32 respectively:
+We generated memory reports of the same training session as covered in this tutorial for both cases: with and without
+downcasting the model with `model.half()`. Here is the figure of both memory footprints, where "source" and "target"
+represent the model trained in FP16 and FP32 respectively:

 ![Comparison of memory footprints](static/MemoryDiffReport.png)

-We observed a ~26% reduction in memory usage with the settings of this tutorial, including from peak to peak. The impact on the accuracy was also small, with less than 1% lost!
+We observed a ~26% reduction in memory usage with the settings of this tutorial, including from peak to peak.
+The impact on the accuracy was also small, with less than 1% lost!

 # Debug floating-point exceptions

-Floating-point issues can be difficult to debug because the model will simply appear to not be training without specific information about what went wrong. For more detailed information on the issue we set `debug.floatPointOpException` to true in the environment variable `POPLAR_ENGINE_OPTIONS`. To set this, you can add the folowing before the command you use to run your model:
+Floating-point issues can be difficult to debug because the model will simply appear to not be training without specific
+information about what went wrong. For more detailed information on the issue, we set `debug.floatPointOpException`
+to true in the environment variable `POPLAR_ENGINE_OPTIONS`. To set this, you can add the following before the command
+you use to run your model:

 ```python
 POPLAR_ENGINE_OPTIONS='{"debug.floatPointOpException": "true"}'
@@ -282,9 +380,18 @@ POPLAR_ENGINE_OPTIONS='{"debug.floatPointOpException": "true"}'

 # PopTorch tracing and casting

-Because PopTorch relies on the `torch.jit.trace` API, it is limited to tracing operations which run on the CPU. Many of these operations do not support FP16 inputs due to numerical stability issues. To allow the full range of operations, PopTorch converts all FP16 inputs to FP32 before tracing and then restores them to FP16. This is because the model must always be traced with FP16 inputs converted to FP32.
+Because PopTorch relies on the `torch.jit.trace` API, it is limited to tracing operations which run on the CPU.
+Many of these operations do not support FP16 inputs due to numerical stability issues.
+To allow the full range of operations, PopTorch converts all FP16 inputs to FP32 before tracing and then restores
+them to FP16.
This is because the model must always be traced with FP16 inputs converted to FP32. + +PopTorch’s default casting functionality is to output in FP16 if any input of the operation is FP16. +This is opposite to PyTorch, which outputs in FP32 if any input of the operations is in FP32. +To achieve the same behaviour in PopTorch, one can use: +`opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)`. +Below you can see the difference between native PyTorch and PopTorch (with and without the option mentioned above): + -PopTorch’s default casting functionality is to output in FP16 if any input of the operation is FP16. This is opposite to PyTorch, which outputs in FP32 if any input of the operations is in FP32. To achieve the same behaviour in PopTorch, one can use `opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)`. Below you can see the difference between native PyTorch and PopTorch (with and without the option mentioned above): ```python class Model(torch.nn.Module): @@ -311,6 +418,7 @@ opts.Precision.halfFloatCasting( # The option above makes the same PopTorch example result in an FP32 tensor poptorch_model = poptorch.inferenceModel(native_model, opts) assert poptorch_model(float32_tensor, float16_tensor).dtype == torch.float32 +poptorch_model.detachFromDevice() ``` # Summary diff --git a/tutorials/pytorch/tut3_mixed_precision/walkthrough.ipynb b/tutorials/pytorch/tut3_mixed_precision/walkthrough.ipynb new file mode 100644 index 0000000..c8c3e50 --- /dev/null +++ b/tutorials/pytorch/tut3_mixed_precision/walkthrough.ipynb @@ -0,0 +1,757 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6c00937c", + "metadata": {}, + "source": [ + "Copyright (c) 2021 Graphcore Ltd. All rights reserved." + ] + }, + { + "cell_type": "markdown", + "id": "0a1701dd", + "metadata": {}, + "source": [ + "# Half and mixed precision in PopTorch\n", + "This tutorial shows how to use half and mixed precision in PopTorch with the\n", + "example task of training a simple CNN model on a single\n", + "Graphcore IPU (Mk1 or Mk2)." + ] + }, + { + "cell_type": "markdown", + "id": "d5a15c2c", + "metadata": {}, + "source": [ + "Requirements:\n", + "- an installed Poplar SDK. See the Getting Started guide for your IPU\n", + "hardware for details of how to install the SDK;\n", + "- Other Python modules: `pip install -r requirements.txt`" + ] + }, + { + "cell_type": "markdown", + "id": "7786c1dc", + "metadata": {}, + "source": [ + "# General\n", + "\n", + "## Motives for half precision\n", + "\n", + "Data is stored in memory, and some formats to store that data require less memory than others. In a device's memory, \n", + "when it comes to numerical data, we use either integers or real numbers. Real numbers are represented by one of several \n", + "floating point formats, which vary in how many bits they use to represent each number. Using more bits allows for \n", + "greater precision and a wider range of representable numbers, whereas using fewer bits allows for faster calculations \n", + "and reduces memory and power usage. 
In deep learning applications, where less precise calculations are acceptable and \n", + "throughput is critical, using a lower precision format can provide substantial gains in performance.\n", + "\n", + "The Graphcore IPU provides native support for two floating-point formats:\n", + "\n", + "- IEEE single-precision, which uses 32 bits for each number (FP32)\n", + "- IEEE half-precision, which uses 16 bits for each number (FP16)\n", + "\n", + "Some applications which use FP16 do all calculations in FP16, whereas others use a mix of FP16 and FP32. The latter \n", + "approach is known as *mixed precision*.\n", + "\n", + "In this tutorial, we are going to talk about real numbers represented in FP32 and FP16, and how to use these data types \n", + "(dtypes) in PopTorch in order to reduce the memory requirements of a model.\n", + "\n", + "## Numerical stability\n", + "\n", + "Numeric stability refers to how a model's performance is affected by the use of a lower-precision dtype. We say an \n", + "operation is \"numerically unstable\" in FP16 if running it in this dtype causes the model to have worse accuracy compared\n", + " to running the operation in FP32. Two techniques that can be used to increase the numerical stability of a model are \n", + " loss scaling and stochastic rounding.\n", + "\n", + "### Loss scaling\n", + "\n", + "A numerical issue that can occur when training a model in half-precision is that the gradients can underflow. This can \n", + "be difficult to debug because the model will simply appear to not be training, and can be especially damaging because \n", + "any gradients which underflow will propagate a value of 0 backwards to other gradient calculations.\n", + "\n", + "The standard solution to this is known as *loss scaling*, which consists of scaling up the loss value right before the \n", + "start of backpropagation to prevent numerical underflow of the gradients. Instructions on how to use loss scaling will \n", + "be discussed later in this tutorial.\n", + "\n", + "### Stochastic rounding\n", + "\n", + "When training in half or mixed precision, numbers multiplied by each other will need to be rounded in order to fit into \n", + "the floating point format used. Stochastic rounding is the process of using a probabilistic equation for the rounding. \n", + "Instead of always rounding to the nearest representable number, we round up or down with a probability such that the \n", + "expected value after rounding is equal to the value before rounding. Since the expected value of an addition after \n", + "rounding is equal to the exact result of the addition, the expected value of a sum is also its exact value.\n", + "\n", + "This means that on average, the values of the parameters of a network will be close to the values they would have had if \n", + "a higher-precision format had been used. The added bonus of using stochastic rounding is that the parameters can be \n", + "stored in FP16, which means the parameters can be stored using half as much memory. This can be especially helpful when \n", + "training with small batch sizes, where the memory used to store the parameters is proportionally greater than the memory \n", + "used to store parameters when training with large batch sizes.\n", + "\n", + "It is highly recommended that you enable this feature when training neural networks with FP16 weights. The instructions \n", + "to enable it in PopTorch are presented later in this tutorial." 
+ ] + }, + { + "cell_type": "markdown", + "id": "9e30b953", + "metadata": {}, + "source": [ + "Import the packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7153eb67", + "metadata": {}, + "outputs": [], + "source": [ + "import torch\n", + "import torch.nn as nn\n", + "import torchvision\n", + "import torchvision.transforms as transforms\n", + "import poptorch\n", + "from tqdm.auto import tqdm" + ] + }, + { + "cell_type": "markdown", + "id": "6269ba64", + "metadata": {}, + "source": [ + "## Build the model\n", + "\n", + "We use the same model as in [the previous tutorials on PopTorch](../). \n", + "Just like in the [previous tutorial](../tut2_efficient_data_loading), we are using larger images (128x128) to simulate \n", + "a heavier data load. This will make the difference in memory between FP32 and FP16 meaningful enough to showcase \n", + "in this tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c9bf2ed", + "metadata": {}, + "outputs": [], + "source": [ + "class CustomModel(nn.Module):\n", + " def __init__(self):\n", + " super().__init__()\n", + " self.conv1 = nn.Conv2d(1, 5, 3)\n", + " self.pool = nn.MaxPool2d(2, 2)\n", + " self.conv2 = nn.Conv2d(5, 12, 5)\n", + " self.norm = nn.GroupNorm(3, 12)\n", + " self.fc1 = nn.Linear(41772, 100)\n", + " self.relu = nn.ReLU()\n", + " self.fc2 = nn.Linear(100, 10)\n", + " self.log_softmax = nn.LogSoftmax(dim=0)\n", + " self.loss = nn.NLLLoss()\n", + "\n", + " def forward(self, x, labels=None):\n", + " x = self.pool(self.relu(self.conv1(x)))\n", + " x = self.norm(self.relu(self.conv2(x)))\n", + " x = torch.flatten(x, start_dim=1)\n", + " x = self.relu(self.fc1(x))\n", + " x = self.log_softmax(self.fc2(x))\n", + " # The model is responsible for the calculation\n", + " # of the loss when using an IPU. We do it this way:\n", + " if self.training:\n", + " return x, self.loss(x, labels)\n", + " return x" + ] + }, + { + "cell_type": "markdown", + "id": "c15dab7a", + "metadata": {}, + "source": [ + ">**NOTE:** The model inherits `self.training` from `torch.nn.Module` which initialises its value to True. \n", + ">Use `model.eval()` to set it to False and `model.train()` to switch it back to True." + ] + }, + { + "cell_type": "markdown", + "id": "aea94554", + "metadata": {}, + "source": [ + "Choose parameters. " + ] + }, + { + "cell_type": "markdown", + "id": "7aad03e3", + "metadata": {}, + "source": [ + ">**NOTE** If you wish to modify these parameters for educational purposes, make sure you re-run all the cells below\n", + ">this one, including this entire cell as well:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f8093d45", + "metadata": {}, + "outputs": [], + "source": [ + "# Cast the model parameters to FP16\n", + "model_half = True\n", + "\n", + "# Cast the data to FP16\n", + "data_half = True\n", + "\n", + "# Cast the accumulation of gradients values types of the optimiser to FP16\n", + "optimizer_half = True\n", + "\n", + "# Use stochasting rounding\n", + "stochastic_rounding = True\n", + "\n", + "# Set partials data type to FP16\n", + "partials_half = True" + ] + }, + { + "cell_type": "markdown", + "id": "7d2a245e", + "metadata": {}, + "source": [ + "### Casting a model's parameters\n", + "\n", + "The default data type of the parameters of a PyTorch module is FP32 (`torch.float32`). 
To convert all the parameters \n", + "of a model to be represented in FP16 (`torch.float16`), an operation we will call _downcasting_, we simply do:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "37cc94e0", + "metadata": {}, + "outputs": [], + "source": [ + "model = CustomModel()\n", + "\n", + "if model_half:\n", + " model = model.half()" + ] + }, + { + "cell_type": "markdown", + "id": "88804694", + "metadata": {}, + "source": [ + "For this tutorial, we will cast all the model's parameters to FP16." + ] + }, + { + "cell_type": "markdown", + "id": "1b64f1f0", + "metadata": {}, + "source": [ + "### Casting a single layer's parameters\n", + "\n", + "For bigger or more complex models, downcasting all the layers may generate numerical instabilities and cause underflows. \n", + "While the PopTorch and the IPU offer features to alleviate those issues, it is still sensible for those models to cast \n", + "only the parameters of certain layers and observe how it affects the overall training job. To downcast the parameters of\n", + " a single layer, we select the layer by its _name_ and use `half()`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "616aacde", + "metadata": {}, + "outputs": [], + "source": [ + "model.conv1 = model.conv1.half()" + ] + }, + { + "cell_type": "markdown", + "id": "d4b4e50d", + "metadata": {}, + "source": [ + "If you would like to upcast a layer instead, you can use `model.conv1.float()`.\n", + ">**NOTE**: One can print out a list of the components of a PyTorch model, with their names, by doing `print(model)`." + ] + }, + { + "cell_type": "markdown", + "id": "86fed196", + "metadata": {}, + "source": [ + "## Prepare the data\n", + "\n", + "We will use the FashionMNIST dataset that we download from `torchvision`. The last stage of the pipeline will have to \n", + "convert the data type of the tensors representing the images to `torch.half` (equivalent to `torch.float16`) so that \n", + "our input data is also in FP16. This has the advantage of reducing the bandwidth needed between the host and the IPU." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "496e9294", + "metadata": { + "tags": [ + "sst_hide_output" + ] + }, + "outputs": [], + "source": [ + "if data_half:\n", + " transform = transforms.Compose([transforms.Resize(128),\n", + " transforms.ToTensor(),\n", + " transforms.Normalize((0.5,), (0.5,)),\n", + " transforms.ConvertImageDtype(torch.half)])\n", + "else:\n", + " transform = transforms.Compose([transforms.Resize(128),\n", + " transforms.ToTensor(),\n", + " transforms.Normalize((0.5,), (0.5,))])\n", + "\n", + "train_dataset = torchvision.datasets.FashionMNIST(\"./datasets/\",\n", + " transform=transform,\n", + " download=True,\n", + " train=True)\n", + "test_dataset = torchvision.datasets.FashionMNIST(\"./datasets/\",\n", + " transform=transform,\n", + " download=True,\n", + " train=False)" + ] + }, + { + "cell_type": "markdown", + "id": "42cb87a5", + "metadata": {}, + "source": [ + "If the model has not been converted to half precision, but the input data has, then some layers of the model may be \n", + "converted to use FP16. Conversely, if the input data has not been converted, but the model has, then the input tensors \n", + "will be converted to FP16 on the IPU. 
This behaviour is the opposite of PyTorch's default behaviour.\n", + "\n", + ">**NOTE**: To stop PopTorch automatically downcasting tensors and parameters, so that it preserves PyTorch's default \n", + ">behaviour (upcasting), use the option:\n", + ">`opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)`." + ] + }, + { + "cell_type": "markdown", + "id": "04790544", + "metadata": {}, + "source": [ + "## Optimizers and loss scaling\n", + "\n", + "The value of the loss scaling factor can be passed as a parameter to the optimisers in `poptorch.optim`. In this \n", + "tutorial, we will set it to 1024 for an AdamW optimizer. For all optimisers (except `poptorch.optim.SGD`), using \n", + "a model in FP16 requires the argument `accum_type` to be set to `torch.float16` as well:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8d89ec61", + "metadata": {}, + "outputs": [], + "source": [ + "if optimizer_half:\n", + " optimizer = poptorch.optim.AdamW(model.parameters(),\n", + " lr=0.001,\n", + " loss_scaling=1024,\n", + " accum_type=torch.float16)\n", + "else:\n", + " optimizer = poptorch.optim.AdamW(model.parameters(),\n", + " lr=0.001,\n", + " accum_type=torch.float32)" + ] + }, + { + "cell_type": "markdown", + "id": "6e9a2139", + "metadata": {}, + "source": [ + "While higher values of `loss_scaling` minimize underflows, values that are too high can also generate overflows as well \n", + "as hurt convergence of the loss. The optimal value depends on the model and the training job. \n", + "This is therefore a hyperparameter for you to tune." + ] + }, + { + "cell_type": "markdown", + "id": "9b80a3c9", + "metadata": {}, + "source": [ + "## Set PopTorch's options\n", + "\n", + "To configure some features of the IPU and to be able to use PopTorch's classes in the next sections, we will need to \n", + "create an instance of `poptorch.Options` which stores the options we will be using. \n", + "We covered some of the available options in:\n", + "[introductory tutorial for PopTorch](https://github.com/graphcore/examples/tree/master/tutorials/pytorch/tut1_basics).\n", + "\n", + "Let's initialise our options object before we talk about the options we will use:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "009a6c69", + "metadata": {}, + "outputs": [], + "source": [ + "opts = poptorch.Options()" + ] + }, + { + "cell_type": "markdown", + "id": "72d4fe24", + "metadata": {}, + "source": [ + ">**NOTE**: This tutorial has been designed to be run on a single IPU. If you do not have access to an IPU, you can use \n", + ">the option [`useIpuModel`](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/overview.html#poptorch.Options.useIpuModel) to run a simulation on CPU instead. \n", + ">You can read more on the IPU Model and its limitations [here](https://docs.graphcore.ai/projects/poplar-user-guide/en/latest/poplar_programs.html#programming-with-poplar)." + ] + }, + { + "cell_type": "markdown", + "id": "4c3e0d47", + "metadata": {}, + "source": [ + "### Stochastic rounding\n", + "\n", + "With the IPU, stochastic rounding is implemented directly in the hardware and only requires you to enable it. \n", + "To do so, there is the option `enableStochasticRounding` in the `Precision` namespace of `poptorch.Options`. \n", + "This namespace holds other options for using mixed precision that we will talk about. 
\n", + "To enable stochastic rounding, we do:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a3762a5b", + "metadata": {}, + "outputs": [], + "source": [ + "if stochastic_rounding:\n", + " opts.Precision.enableStochasticRounding(True)" + ] + }, + { + "cell_type": "markdown", + "id": "f3443bfb", + "metadata": {}, + "source": [ + "With the IPU Model, this option won't change anything since stochastic rounding is implemented on the IPU." + ] + }, + { + "cell_type": "markdown", + "id": "104f4b2c", + "metadata": {}, + "source": [ + "### Partials data type\n", + "\n", + "Matrix multiplications and convolutions have intermediate states we call _partials_. Those partials can be stored \n", + "in FP32 or FP16. There is a memory benefit to using FP16 partials but the main benefit is that it can increase \n", + "the throughput for some models without affecting accuracy. However there is a risk of increasing numerical instability \n", + "if the values being multiplied are small, due to underflows. The default data type of partials is the input's \n", + "data type(FP16). For this tutorial, we set partials to FP32 just to showcase how it can be done. \n", + "We use the option `setPartialsType` to do it:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "57081e06", + "metadata": {}, + "outputs": [], + "source": [ + "if partials_half:\n", + " opts.Precision.setPartialsType(torch.half)\n", + "else:\n", + " opts.Precision.setPartialsType(torch.float)" + ] + }, + { + "cell_type": "markdown", + "id": "e3dbb24d", + "metadata": {}, + "source": [ + "## Train the model\n", + "\n", + "We can now train the model. After we have set all our options, we reuse our `poptorch.Options` instance for \n", + "the training `poptorch.DataLoader` that we will be using:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d565f93f", + "metadata": {}, + "outputs": [], + "source": [ + "train_dataloader = poptorch.DataLoader(opts,\n", + " train_dataset,\n", + " batch_size=12,\n", + " shuffle=True,\n", + " num_workers=40)" + ] + }, + { + "cell_type": "markdown", + "id": "195ced87", + "metadata": {}, + "source": [ + "We first make sure our model is in training mode, and then wrap it with `poptorch.trainingModel`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2e5a4c6d", + "metadata": {}, + "outputs": [], + "source": [ + "poptorch_model = poptorch.trainingModel(model,\n", + " options=opts,\n", + " optimizer=optimizer)" + ] + }, + { + "cell_type": "markdown", + "id": "4adc8e04", + "metadata": {}, + "source": [ + "Let's run the training loop for 10 epochs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1e0c0caf", + "metadata": { + "tags": [ + "sst_hide_output" + ] + }, + "outputs": [], + "source": [ + "epochs = 10\n", + "for epoch in tqdm(range(epochs), desc=\"epochs\"):\n", + " total_loss = 0.0\n", + " for data, labels in tqdm(train_dataloader, desc=\"batches\", leave=False):\n", + " output, loss = poptorch_model(data, labels)\n", + " total_loss += loss\n", + "poptorch_model.detachFromDevice()" + ] + }, + { + "cell_type": "markdown", + "id": "e0ce81d4", + "metadata": {}, + "source": [ + "Our new model is now trained and we can start its evaluation." 
+ ] + }, + { + "cell_type": "markdown", + "id": "314f2371", + "metadata": {}, + "source": [ + "## Evaluate the model\n", + "\n", + "Some PyTorch's operations, such as CNNs, are not supported in FP16 on the CPU, so we will evaluate our fine-tuned model \n", + "in mixed precision on an IPU using `poptorch.inferenceModel`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "884e7fe1", + "metadata": {}, + "outputs": [], + "source": [ + "model.eval()\n", + "poptorch_model_inf = poptorch.inferenceModel(model, options=opts)\n", + "test_dataloader = poptorch.DataLoader(opts,\n", + " test_dataset,\n", + " batch_size=32,\n", + " num_workers=40)" + ] + }, + { + "cell_type": "markdown", + "id": "14597b63", + "metadata": {}, + "source": [ + "Run inference on the labelled data" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "04544e24", + "metadata": { + "tags": [ + "sst_hide_output" + ] + }, + "outputs": [], + "source": [ + "predictions, labels = [], []\n", + "for data, label in test_dataloader:\n", + " predictions += poptorch_model_inf(data).data.float().max(dim=1).indices\n", + " labels += label\n", + "poptorch_model_inf.detachFromDevice()" + ] + }, + { + "cell_type": "markdown", + "id": "ee136484", + "metadata": {}, + "source": [ + "We obtained an accuracy of approximately 84% on the test dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b8abe154", + "metadata": {}, + "outputs": [], + "source": [ + "print(f\"\"\"Eval accuracy on IPU: {100 *\n", + " (1 - torch.count_nonzero(torch.sub(torch.tensor(labels),\n", + " torch.tensor(predictions))) / len(labels)):.2f}%\"\"\")" + ] + }, + { + "cell_type": "markdown", + "id": "9e30f196", + "metadata": {}, + "source": [ + "# Visualise the memory footprint\n", + "\n", + "We can visually compare the memory footprint on the IPU of the model trained in FP16 and FP32, \n", + "thanks to Graphcore's [PopVision Graph Analyser](https://docs.graphcore.ai/projects/graphcore-popvision-user-guide/en/latest/graph/graph.html).\n", + "\n", + "We generated memory reports of the same training session as covered in this tutorial for both cases: with and without \n", + "downcasting the model with `model.half()`. Here is the figure of both memory footprints, where \"source\" and \"target\" \n", + "represent the model trained in FP16 and FP32 respectively:\n", + "\n", + "![Comparison of memory footprints](static/MemoryDiffReport.png)\n", + "\n", + "We observed a ~26% reduction in memory usage with the settings of this tutorial, including from peak to peak. \n", + "The impact on the accuracy was also small, with less than 1% lost!\n", + "\n", + "# Debug floating-point exceptions\n", + "\n", + "Floating-point issues can be difficult to debug because the model will simply appear to not be training without specific\n", + " information about what went wrong. For more detailed information on the issue we set `debug.floatPointOpException` \n", + " to true in the environment variable `POPLAR_ENGINE_OPTIONS`. To set this, you can add the folowing before the command \n", + " you use to run your model:\n", + "\n", + "```python\n", + "POPLAR_ENGINE_OPTIONS='{\"debug.floatPointOpException\": \"true\"}'\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "d9c401cf", + "metadata": {}, + "source": [ + "# PopTorch tracing and casting\n", + "\n", + "Because PopTorch relies on the `torch.jit.trace` API, it is limited to tracing operations which run on the CPU. 
\n", + "Many of these operations do not support FP16 inputs due to numerical stability issues. \n", + "To allow the full range of operations, PopTorch converts all FP16 inputs to FP32 before tracing and then restores \n", + "them to FP16. This is because the model must always be traced with FP16 inputs converted to FP32.\n", + "\n", + "PopTorch’s default casting functionality is to output in FP16 if any input of the operation is FP16. \n", + "This is opposite to PyTorch, which outputs in FP32 if any input of the operations is in FP32. \n", + "To achieve the same behaviour in PopTorch, one can use:\n", + "`opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)`.\n", + "Below you can see the difference between native PyTorch and PopTorch (with and without the option mentioned above):\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "913350ab", + "metadata": { + "tags": [ + "sst_hide_output" + ] + }, + "outputs": [], + "source": [ + "class Model(torch.nn.Module):\n", + " def forward(self, x, y):\n", + " return x + y\n", + "\n", + "native_model = Model()\n", + "\n", + "float16_tensor = torch.tensor([1.0], dtype=torch.float16)\n", + "float32_tensor = torch.tensor([1.0], dtype=torch.float32)\n", + "\n", + "# Native PyTorch results in a FP32 tensor\n", + "assert native_model(float32_tensor, float16_tensor).dtype == torch.float32\n", + "\n", + "opts = poptorch.Options()\n", + "\n", + "# PopTorch results in a FP16 tensor\n", + "poptorch_model = poptorch.inferenceModel(native_model, opts)\n", + "assert poptorch_model(float32_tensor, float16_tensor).dtype == torch.float16\n", + "\n", + "opts.Precision.halfFloatCasting(\n", + " poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)\n", + "\n", + "# The option above makes the same PopTorch example result in an FP32 tensor\n", + "poptorch_model = poptorch.inferenceModel(native_model, opts)\n", + "assert poptorch_model(float32_tensor, float16_tensor).dtype == torch.float32\n", + "poptorch_model.detachFromDevice()" + ] + }, + { + "cell_type": "markdown", + "id": "f06eeb17", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, + "source": [ + "# Summary\n", + "- Use half and mixed precision when you need to save memory on the IPU.\n", + "- You can cast a PyTorch model or a specific layer to FP16 using:\n", + " ```python\n", + " # Model\n", + " model.half()\n", + " # Layer\n", + " model.layer.half()\n", + " ```\n", + "- Several features are available in PopTorch to improve the numerical stability of a model in FP16:\n", + " - Loss scaling: `poptorch.optim.SGD(..., loss_scaling=1000)`\n", + " - Stochastic rounding: `opts.Precision.enableStochasticRounding(True)`\n", + " - Upcast partials data types: `opts.Precision.setPartialsType(torch.float)`\n", + "- The [PopVision Graph Analyser](https://docs.graphcore.ai/projects/graphcore-popvision-user-guide/en/latest/graph/graph.html) can be used to inspect the memory usage of a model and to help debug issues." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/tutorials/pytorch/tut3_mixed_precision/walkthrough.py b/tutorials/pytorch/tut3_mixed_precision/walkthrough.py index f8b8c64..c666c49 100644 --- a/tutorials/pytorch/tut3_mixed_precision/walkthrough.py +++ b/tutorials/pytorch/tut3_mixed_precision/walkthrough.py @@ -1,30 +1,93 @@ #!/usr/bin/env python3 -# Copyright (c) 2021 Graphcore Ltd. All rights reserved. - +""" +Copyright (c) 2021 Graphcore Ltd. All rights reserved. +""" +""" # Half and mixed precision in PopTorch +This tutorial shows how to use half and mixed precision in PopTorch with the +example task of training a simple CNN model on a single +Graphcore IPU (Mk1 or Mk2). +""" +""" +Requirements: +- an installed Poplar SDK. See the Getting Started guide for your IPU +hardware for details of how to install the SDK; +- Other Python modules: `pip install -r requirements.txt` +""" +""" +# General + +## Motives for half precision + +Data is stored in memory, and some formats to store that data require less memory than others. In a device's memory, +when it comes to numerical data, we use either integers or real numbers. Real numbers are represented by one of several +floating point formats, which vary in how many bits they use to represent each number. Using more bits allows for +greater precision and a wider range of representable numbers, whereas using fewer bits allows for faster calculations +and reduces memory and power usage. In deep learning applications, where less precise calculations are acceptable and +throughput is critical, using a lower precision format can provide substantial gains in performance. + +The Graphcore IPU provides native support for two floating-point formats: + +- IEEE single-precision, which uses 32 bits for each number (FP32) +- IEEE half-precision, which uses 16 bits for each number (FP16) + +Some applications which use FP16 do all calculations in FP16, whereas others use a mix of FP16 and FP32. The latter +approach is known as *mixed precision*. + +In this tutorial, we are going to talk about real numbers represented in FP32 and FP16, and how to use these data types +(dtypes) in PopTorch in order to reduce the memory requirements of a model. + +## Numerical stability + +Numeric stability refers to how a model's performance is affected by the use of a lower-precision dtype. We say an +operation is "numerically unstable" in FP16 if running it in this dtype causes the model to have worse accuracy compared + to running the operation in FP32. Two techniques that can be used to increase the numerical stability of a model are + loss scaling and stochastic rounding. + +### Loss scaling + +A numerical issue that can occur when training a model in half-precision is that the gradients can underflow. This can +be difficult to debug because the model will simply appear to not be training, and can be especially damaging because +any gradients which underflow will propagate a value of 0 backwards to other gradient calculations. 
+
+The standard solution to this is known as *loss scaling*, which consists of scaling up the loss value right before the
+start of backpropagation to prevent numerical underflow of the gradients. Instructions on how to use loss scaling will
+be discussed later in this tutorial.
-# This tutorial shows how to use half and mixed precision in PopTorch with the
-# example task of training a simple CNN model on a single
-# Graphcore IPU (Mk1 or Mk2).
+### Stochastic rounding
-# Requirements:
-# - an installed Poplar SDK. See the Getting Started guide for your IPU
-# hardware for details of how to install the SDK;
-# - Other Python modules: `pip install -r requirements.txt`
+When training in half or mixed precision, numbers multiplied by each other will need to be rounded in order to fit into
+the floating point format used. Stochastic rounding is the process of using a probabilistic rule for the rounding.
+Instead of always rounding to the nearest representable number, we round up or down with a probability such that the
+expected value after rounding is equal to the value before rounding. Since the expected value of an addition after
+rounding is equal to the exact result of the addition, the expected value of a sum is also its exact value.
-# Import the packages
+This means that, on average, the values of the parameters of a network will be close to the values they would have had
+if a higher-precision format had been used. An added bonus of using stochastic rounding is that the parameters can then
+safely be stored in FP16, which halves the memory needed to store them. This is especially helpful when training with
+small batch sizes, where the parameters account for a proportionally larger share of the total memory than they do with
+large batch sizes.
+
+It is highly recommended that you enable this feature when training neural networks with FP16 weights. The instructions
+to enable it in PopTorch are presented later in this tutorial.
+"""
+"""
+## Import the packages
+"""
 import torch
 import torch.nn as nn
-
 import torchvision
 import torchvision.transforms as transforms
-
 import poptorch
-import argparse
-from tqdm import tqdm
+from tqdm.auto import tqdm
+"""
+## Build the model
-
-# Build the model
+We use the same model as in [the previous tutorials on PopTorch](../).
+Just like in the [previous tutorial](../tut2_efficient_data_loading), we are using larger images (128x128) to simulate
+a heavier data load. This will make the difference in memory between FP32 and FP16 meaningful enough to showcase
+in this tutorial.
+"""
 class CustomModel(nn.Module):
     def __init__(self):
         super().__init__()
@@ -49,23 +112,65 @@ def forward(self, x, labels=None):
         if self.training:
             return x, self.loss(x, labels)
         return x
+"""
+>**NOTE**: The model inherits `self.training` from `torch.nn.Module` which initialises its value to True.
+>Use `model.eval()` to set it to False and `model.train()` to switch it back to True.
+"""
+"""
+Choose parameters.
+"""
+"""
+>**NOTE**: If you wish to modify these parameters for educational purposes, make sure you re-run all the cells below
+>this one, including this entire cell as well:
+"""
+# Cast the model parameters to FP16
+model_half = True
-model = CustomModel()
+# Cast the data to FP16
+data_half = True
+
+# Cast the accumulation type of the optimiser to FP16
+optimizer_half = True
-parser = argparse.ArgumentParser()
-parser.add_argument('--model-half', dest='model_half', action='store_true', help='Cast the model parameters to FP16')
-parser.add_argument('--data-half', dest='data_half', action='store_true', help='Cast the data to FP16')
-parser.add_argument('--optimizer-half', dest='optimizer_half', action='store_true', help='Cast the accumulation type of the optimiser to FP16')
-parser.add_argument('--stochastic-rounding', dest='stochastic_rounding', action='store_true', help='Use stochasting rounding')
-parser.add_argument('--partials-half', dest='partials_half', action='store_true', help='Set partials data type to FP16')
-args = parser.parse_args()
+# Use stochastic rounding
+stochastic_rounding = True
-# Casting a model's parameters
-if args.model_half:
+# Set partials data type to FP16
+partials_half = True
+"""
+### Casting a model's parameters
+
+The default data type of the parameters of a PyTorch module is FP32 (`torch.float32`). To convert all the parameters
+of a model to be represented in FP16 (`torch.float16`), an operation we will call _downcasting_, we simply do:
+"""
+model = CustomModel()
+
+if model_half:
     model = model.half()
+"""
+For this tutorial, we will cast all the model's parameters to FP16.
+"""
+"""
+### Casting a single layer's parameters
-# Prepare the data
-if args.data_half:
+For bigger or more complex models, downcasting all the layers may generate numerical instabilities and cause underflows.
+While PopTorch and the IPU offer features to alleviate those issues, it is still sensible for those models to cast
+only the parameters of certain layers and observe how it affects the overall training job. To downcast the parameters
+of a single layer, we select the layer by its _name_ and use `half()`:
+"""
+model.conv1 = model.conv1.half()
+"""
+If you would like to upcast a layer instead, you can use `model.conv1.float()`.
+>**NOTE**: One can print out a list of the components of a PyTorch model, with their names, by doing `print(model)`.
+"""
+"""
+## Prepare the data
+
+We will use the FashionMNIST dataset that we download from `torchvision`. The last stage of the pipeline will have to
+convert the data type of the tensors representing the images to `torch.half` (equivalent to `torch.float16`) so that
+our input data is also in FP16. This has the advantage of reducing the bandwidth needed between the host and the IPU.
+"""
+if data_half:
     transform = transforms.Compose([transforms.Resize(128),
                                     transforms.ToTensor(),
                                     transforms.Normalize((0.5,), (0.5,)),
@@ -83,9 +188,25 @@ def forward(self, x, labels=None):
                                                  transform=transform,
                                                  download=True,
                                                  train=False)
+# sst_hide_output
+"""
+If the model has not been converted to half precision, but the input data has, then some layers of the model may be
+converted to use FP16. Conversely, if the input data has not been converted, but the model has, then the input tensors
+will be converted to FP16 on the IPU. This behaviour is the opposite of PyTorch's default behaviour.
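+
+As a quick sanity check, you can inspect which dtypes you actually end up with on the host before anything is handed
+to PopTorch. This is only an illustrative sketch, not part of the walkthrough script itself:
+
+```python
+# Parameters: conv1 was cast explicitly, the rest were cast by model.half()
+for name, parameter in model.named_parameters():
+    print(name, parameter.dtype)
+
+# Input data: torch.float16 when data_half is True
+image, _ = train_dataset[0]
+print(image.dtype)
+```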
+
+>**NOTE**: To stop PopTorch automatically downcasting tensors and parameters, so that it preserves PyTorch's default
+>behaviour (upcasting), use the option:
+>`opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)`.
+"""
+"""
+## Optimizers and loss scaling
-# Optimizer and loss scaling
-if args.optimizer_half:
+The value of the loss scaling factor can be passed as a parameter to the optimisers in `poptorch.optim`. In this
+tutorial, we will set it to 1024 for an AdamW optimizer. For all optimisers (except `poptorch.optim.SGD`), the
+argument `accum_type` selects the data type used to accumulate the gradients; to keep the accumulation in FP16,
+we set it to `torch.float16` as well:
+
+"""
+if optimizer_half:
     optimizer = poptorch.optim.AdamW(model.parameters(),
                                      lr=0.001,
                                      loss_scaling=1024,
@@ -94,50 +215,194 @@ def forward(self, x, labels=None):
     optimizer = poptorch.optim.AdamW(model.parameters(),
                                      lr=0.001,
                                      accum_type=torch.float32)
+"""
+While higher values of `loss_scaling` minimize underflows, values that are too high can also generate overflows and
+hurt convergence of the loss. The optimal value depends on the model and the training job.
+This is therefore a hyperparameter for you to tune.
+"""
+"""
+## Set PopTorch's options
+To configure some features of the IPU and to be able to use PopTorch's classes in the next sections, we will need to
+create an instance of `poptorch.Options` which stores the options we will be using.
+We covered some of the available options in the
+[introductory tutorial for PopTorch](https://github.com/graphcore/examples/tree/master/tutorials/pytorch/tut1_basics).
-# Set PopTorch's options
+Let's initialise our options object before we talk about the options we will use:
+"""
 opts = poptorch.Options()
+"""
+>**NOTE**: This tutorial has been designed to be run on a single IPU. If you do not have access to an IPU, you can use
+>the option [`useIpuModel`](https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/overview.html#poptorch.Options.useIpuModel) to run a simulation on CPU instead.
+>You can read more on the IPU Model and its limitations [here](https://docs.graphcore.ai/projects/poplar-user-guide/en/latest/poplar_programs.html#programming-with-poplar).
+"""
+"""
+### Stochastic rounding
-# Stochastic rounding
-if args.stochastic_rounding:
+With the IPU, stochastic rounding is implemented directly in the hardware and only requires you to enable it.
+To do so, there is the option `enableStochasticRounding` in the `Precision` namespace of `poptorch.Options`.
+This namespace holds the other mixed-precision options that we will talk about.
+To enable stochastic rounding, we do:
+"""
+if stochastic_rounding:
     opts.Precision.enableStochasticRounding(True)
-# Partials data type
-if args.partials_half:
+"""
+With the IPU Model, this option won't change anything since stochastic rounding is only implemented in the IPU
+hardware.
+"""
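+"""
+To see why the rounding mode matters, here is a small host-side illustration (a sketch, not part of the original
+walkthrough). With the default round-to-nearest behaviour, FP16 cannot accumulate an increment that is smaller than
+half a unit in the last place of the running sum, so the sum below never moves. Stochastic rounding avoids this
+stagnation on average; since it is implemented in the IPU hardware, it cannot be demonstrated with CPU tensors here.
+
+```python
+acc = torch.tensor([1.0], dtype=torch.float16)
+for _ in range(1000):
+    # 1e-4 is below half an ulp of 1.0 in FP16, so every addition rounds back to 1.0
+    acc = acc + torch.tensor([1e-4], dtype=torch.float16)
+print(acc)  # tensor([1.], dtype=torch.float16)
+```
+"""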
+"""
+### Partials data type
+
+Matrix multiplications and convolutions have intermediate states we call _partials_. Those partials can be stored
+in FP32 or FP16. There is a memory benefit to using FP16 partials, but the main benefit is that it can increase
+the throughput for some models without affecting accuracy. However, there is a risk of increasing numerical instability
+if the values being multiplied are small, due to underflows. The default data type of partials is the input's
+data type (FP16 here). For this tutorial, we set the partials type explicitly just to showcase how it can be done.
+We use the option `setPartialsType` to do it:
+"""
+if partials_half:
     opts.Precision.setPartialsType(torch.half)
 else:
     opts.Precision.setPartialsType(torch.float)
+"""
+## Train the model
-# Train the model
+We can now train the model. After we have set all our options, we reuse our `poptorch.Options` instance for
+the training `poptorch.DataLoader` that we will be using:
+"""
 train_dataloader = poptorch.DataLoader(opts,
                                        train_dataset,
                                        batch_size=12,
                                        shuffle=True,
                                        num_workers=40)
+"""
+The model is in training mode by default, so we can directly wrap it with `poptorch.trainingModel`.
+"""
 poptorch_model = poptorch.trainingModel(model,
                                         options=opts,
                                         optimizer=optimizer)
-
+"""
+Let's run the training loop for 10 epochs.
+"""
 epochs = 10
 for epoch in tqdm(range(epochs), desc="epochs"):
     total_loss = 0.0
     for data, labels in tqdm(train_dataloader, desc="batches", leave=False):
        output, loss = poptorch_model(data, labels)
        total_loss += loss
+poptorch_model.detachFromDevice()
+# sst_hide_output
+"""
+Our model is now trained and we can start its evaluation.
+"""
+"""
+## Evaluate the model
-# Evaluate the model
+Some of PyTorch's operations, such as convolutions, are not supported in FP16 on the CPU, so we will evaluate our
+trained model in mixed precision on an IPU using `poptorch.inferenceModel`.
+"""
 model.eval()
 poptorch_model_inf = poptorch.inferenceModel(model, options=opts)
 test_dataloader = poptorch.DataLoader(opts,
                                       test_dataset,
                                       batch_size=32,
                                       num_workers=40)
-
+"""
+Run inference on the labelled data:
+"""
 predictions, labels = [], []
 for data, label in test_dataloader:
     predictions += poptorch_model_inf(data).data.float().max(dim=1).indices
     labels += label
+poptorch_model_inf.detachFromDevice()
+# sst_hide_output
+"""
+We obtained an accuracy of approximately 84% on the test dataset.
+"""
 print(f"""Eval accuracy on IPU: {100 *
       (1 - torch.count_nonzero(torch.sub(torch.tensor(labels),
       torch.tensor(predictions))) / len(labels)):.2f}%""")
+"""
+# Visualise the memory footprint
+
+We can visually compare the memory footprint on the IPU of the model trained in FP16 and FP32,
+thanks to Graphcore's [PopVision Graph Analyser](https://docs.graphcore.ai/projects/graphcore-popvision-user-guide/en/latest/graph/graph.html).
+
+We generated memory reports of the same training session as covered in this tutorial for both cases: with and without
+downcasting the model with `model.half()`. Here is the figure of both memory footprints, where "source" and "target"
+represent the model trained in FP16 and FP32 respectively:
+
+![Comparison of memory footprints](static/MemoryDiffReport.png)
+
+We observed a ~26% reduction in memory usage with the settings of this tutorial, including when comparing the
+memory peaks. The impact on the accuracy was also small, with less than 1% lost!
+
+# Debug floating-point exceptions
+
+Floating-point issues can be difficult to debug because the model will simply appear to not be training, without
+specific information about what went wrong. To get more detailed information about the issue, we set
+`debug.floatPointOpException` to true in the environment variable `POPLAR_ENGINE_OPTIONS`. To set this, you can add
+the following before the command you use to run your model:
+
+```bash
+POPLAR_ENGINE_OPTIONS='{"debug.floatPointOpException": "true"}'
+```
+"""
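+"""
+A complementary host-side check (a sketch that assumes you run it right after the training loop above; it is not part
+of the original walkthrough) is to look for `inf` or `nan` values in the tensors returned by the model, which is how
+FP16 overflows and invalid operations show up:
+
+```python
+for name, tensor in [("output", output), ("loss", loss)]:
+    if torch.isinf(tensor).any() or torch.isnan(tensor).any():
+        print(f"{name} contains inf/nan values - consider adjusting loss_scaling")
+```
+"""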
+"""
+# PopTorch tracing and casting
+
+Because PopTorch relies on the `torch.jit.trace` API, it is limited to tracing operations which run on the CPU.
+Many of these operations do not support FP16 inputs due to numerical stability issues.
+To allow the full range of operations, PopTorch converts all FP16 inputs to FP32 before tracing and then restores
+them to FP16 afterwards. As a result, the model is always traced with FP32 inputs.
+
+PopTorch’s default casting functionality is to output in FP16 if any input of the operation is FP16.
+This is the opposite of PyTorch, which outputs in FP32 if any input of the operation is in FP32.
+To achieve the same behaviour in PopTorch, one can use:
+`opts.Precision.halfFloatCasting(poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)`.
+Below you can see the difference between native PyTorch and PopTorch (with and without the option mentioned above):
+
+"""
+class Model(torch.nn.Module):
+    def forward(self, x, y):
+        return x + y
+
+native_model = Model()
+
+float16_tensor = torch.tensor([1.0], dtype=torch.float16)
+float32_tensor = torch.tensor([1.0], dtype=torch.float32)
+
+# Native PyTorch results in an FP32 tensor
+assert native_model(float32_tensor, float16_tensor).dtype == torch.float32
+
+opts = poptorch.Options()
+
+# PopTorch results in an FP16 tensor
+poptorch_model = poptorch.inferenceModel(native_model, opts)
+assert poptorch_model(float32_tensor, float16_tensor).dtype == torch.float16
+
+opts.Precision.halfFloatCasting(
+    poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)
+
+# The option above makes the same PopTorch example result in an FP32 tensor
+poptorch_model = poptorch.inferenceModel(native_model, opts)
+assert poptorch_model(float32_tensor, float16_tensor).dtype == torch.float32
+poptorch_model.detachFromDevice()
+# sst_hide_output
+"""
+# Summary
+- Use half and mixed precision when you need to save memory on the IPU.
+- You can cast a PyTorch model or a specific layer to FP16 using:
+  ```python
+  # Model
+  model.half()
+  # Layer
+  model.layer.half()
+  ```
+- Several features are available in PopTorch to improve the numerical stability of a model in FP16:
+    - Loss scaling: `poptorch.optim.SGD(..., loss_scaling=1000)`
+    - Stochastic rounding: `opts.Precision.enableStochasticRounding(True)`
+    - Upcast the partials data type: `opts.Precision.setPartialsType(torch.float)`
+- The [PopVision Graph Analyser](https://docs.graphcore.ai/projects/graphcore-popvision-user-guide/en/latest/graph/graph.html) can be used to inspect the memory usage of a model and to help debug issues.
+
+"""
diff --git a/tutorials/pytorch/tut3_mixed_precision/walkthrough_code_only.py b/tutorials/pytorch/tut3_mixed_precision/walkthrough_code_only.py
new file mode 100644
index 0000000..e7520d9
--- /dev/null
+++ b/tutorials/pytorch/tut3_mixed_precision/walkthrough_code_only.py
@@ -0,0 +1,154 @@
+# Copyright (c) 2021 Graphcore Ltd. All rights reserved.
+import torch
+import torch.nn as nn
+import torchvision
+import torchvision.transforms as transforms
+import poptorch
+from tqdm.auto import tqdm
+
+class CustomModel(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.conv1 = nn.Conv2d(1, 5, 3)
+        self.pool = nn.MaxPool2d(2, 2)
+        self.conv2 = nn.Conv2d(5, 12, 5)
+        self.norm = nn.GroupNorm(3, 12)
+        self.fc1 = nn.Linear(41772, 100)
+        self.relu = nn.ReLU()
+        self.fc2 = nn.Linear(100, 10)
+        self.log_softmax = nn.LogSoftmax(dim=0)
+        self.loss = nn.NLLLoss()
+
+    def forward(self, x, labels=None):
+        x = self.pool(self.relu(self.conv1(x)))
+        x = self.norm(self.relu(self.conv2(x)))
+        x = torch.flatten(x, start_dim=1)
+        x = self.relu(self.fc1(x))
+        x = self.log_softmax(self.fc2(x))
+        # The model is responsible for the calculation
+        # of the loss when using an IPU. We do it this way:
+        if self.training:
+            return x, self.loss(x, labels)
+        return x
+
+# Cast the model parameters to FP16
+model_half = True
+
+# Cast the data to FP16
+data_half = True
+
+# Cast the accumulation type of the optimiser to FP16
+optimizer_half = True
+
+# Use stochastic rounding
+stochastic_rounding = True
+
+# Set partials data type to FP16
+partials_half = True
+
+model = CustomModel()
+
+if model_half:
+    model = model.half()
+
+model.conv1 = model.conv1.half()
+
+if data_half:
+    transform = transforms.Compose([transforms.Resize(128),
+                                    transforms.ToTensor(),
+                                    transforms.Normalize((0.5,), (0.5,)),
+                                    transforms.ConvertImageDtype(torch.half)])
+else:
+    transform = transforms.Compose([transforms.Resize(128),
+                                    transforms.ToTensor(),
+                                    transforms.Normalize((0.5,), (0.5,))])
+
+train_dataset = torchvision.datasets.FashionMNIST("./datasets/",
+                                                  transform=transform,
+                                                  download=True,
+                                                  train=True)
+test_dataset = torchvision.datasets.FashionMNIST("./datasets/",
+                                                 transform=transform,
+                                                 download=True,
+                                                 train=False)
+
+if optimizer_half:
+    optimizer = poptorch.optim.AdamW(model.parameters(),
+                                     lr=0.001,
+                                     loss_scaling=1024,
+                                     accum_type=torch.float16)
+else:
+    optimizer = poptorch.optim.AdamW(model.parameters(),
+                                     lr=0.001,
+                                     accum_type=torch.float32)
+
+opts = poptorch.Options()
+
+if stochastic_rounding:
+    opts.Precision.enableStochasticRounding(True)
+
+if partials_half:
+    opts.Precision.setPartialsType(torch.half)
+else:
+    opts.Precision.setPartialsType(torch.float)
+
+train_dataloader = poptorch.DataLoader(opts,
+                                       train_dataset,
+                                       batch_size=12,
+                                       shuffle=True,
+                                       num_workers=40)
+
+poptorch_model = poptorch.trainingModel(model,
+                                        options=opts,
+                                        optimizer=optimizer)
+
+epochs = 10
+for epoch in tqdm(range(epochs), desc="epochs"):
+    total_loss = 0.0
+    for data, labels in tqdm(train_dataloader, desc="batches", leave=False):
+        output, loss = poptorch_model(data, labels)
+        total_loss += loss
+poptorch_model.detachFromDevice()
+
+model.eval()
+poptorch_model_inf = poptorch.inferenceModel(model, options=opts)
+test_dataloader = poptorch.DataLoader(opts,
+                                      test_dataset,
+                                      batch_size=32,
+                                      num_workers=40)
+
+predictions, labels = [], []
+for data, label in test_dataloader:
+    predictions += poptorch_model_inf(data).data.float().max(dim=1).indices
+    labels += label
+poptorch_model_inf.detachFromDevice()
+
+print(f"""Eval accuracy on IPU: {100 *
+      (1 - torch.count_nonzero(torch.sub(torch.tensor(labels),
+      torch.tensor(predictions))) / len(labels)):.2f}%""")
+
+class Model(torch.nn.Module):
+    def forward(self, x, y):
+        return x + y
+
+native_model = Model()
+
+float16_tensor = torch.tensor([1.0], dtype=torch.float16)
+float32_tensor = torch.tensor([1.0], dtype=torch.float32)
+
+# Native PyTorch results in an FP32 tensor
+assert native_model(float32_tensor, float16_tensor).dtype == torch.float32
+
+opts = poptorch.Options()
+
+# PopTorch results in an FP16 tensor
+poptorch_model = poptorch.inferenceModel(native_model, opts)
+assert poptorch_model(float32_tensor, float16_tensor).dtype == torch.float16
+
+opts.Precision.halfFloatCasting(
+    poptorch.HalfFloatCastingBehavior.HalfUpcastToFloat)
+
+# The option above makes the same PopTorch example result in an FP32 tensor
+poptorch_model = poptorch.inferenceModel(native_model, opts)
+assert poptorch_model(float32_tensor, float16_tensor).dtype == torch.float32
+poptorch_model.detachFromDevice()