diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 70d2455d0aeaaa..989d005aaab645 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -1,4 +1,4 @@
-- sections: 
+- sections:
   - local: index
     title: 🤗 Transformers
   - local: quicktour
@@ -63,6 +63,20 @@
     title: 'Performance and Scalability: How To Fit a Bigger Model and Train It Faster'
   - local: parallelism
     title: Model Parallelism
+  - local: perf_infer
+    title: Performance - Inference
+  - local: perf_infer_gpu_one
+    title: Performance - Inference on one GPU
+  - local: perf_infer_gpu_many
+    title: Performance - Inference on many GPUs
+  - local: perf_infer_cpu
+    title: Performance - Inference on CPU
+  - local: perf_train
+    title: Performance - Training
+  - local: perf_train_gpu_one
+    title: Performance - Training on one GPU
+  - local: perf_train_gpu_many
+    title: Performance - Training on many GPUs
   - local: testing
     title: Testing
   - local: debugging
diff --git a/docs/source/perf_infer.mdx b/docs/source/perf_infer.mdx
new file mode 100644
index 00000000000000..0d5f4f5556c780
--- /dev/null
+++ b/docs/source/perf_infer.mdx
@@ -0,0 +1,22 @@
+
+
+# Efficient Inference
+
+## Memory Needs During Inference
+
+4-6x params
+
+## Choose Your Scale
+
+- [One GPU](perf_infer_gpu_one)
+- [Many GPUs](perf_infer_gpu_many)
+- [CPU](perf_infer_cpu)
diff --git a/docs/source/perf_infer_cpu.mdx b/docs/source/perf_infer_cpu.mdx
new file mode 100644
index 00000000000000..c0cf0956e16491
--- /dev/null
+++ b/docs/source/perf_infer_cpu.mdx
@@ -0,0 +1,30 @@
+
+
+# Efficient Inference on CPU
+
+
+## Less Memory
+
+
+
+## Faster Speed
+
+
+
+
+## Scalability Strategy
+
+* Deepspeed-ZeRO Stage 3 + CPU/NVMe Offload
+
+* Sagemaker
+
+* Deepspeed-Inference
diff --git a/docs/source/perf_infer_gpu_many.mdx b/docs/source/perf_infer_gpu_many.mdx
new file mode 100644
index 00000000000000..d631020edfcd42
--- /dev/null
+++ b/docs/source/perf_infer_gpu_many.mdx
@@ -0,0 +1,47 @@
+
+
+# Efficient Inference on Multiple GPUs
+
+
+## Less Memory
+
+### fp16
+
+### bf16
+
+### Quantization
+
+
+
+## Faster Speed
+
+### DP vs DDP
+
+### ONNX
+
+### Infinity, Inference API
+
+
+
+
+
+## Scalability Strategy
+
+* Deepspeed-ZeRO Stage 3 + CPU/NVMe Offload
+
+* Sagemaker
+
+* Deepspeed-Inference
+
+
+
+## Hardware
diff --git a/docs/source/perf_infer_gpu_one.mdx b/docs/source/perf_infer_gpu_one.mdx
new file mode 100644
index 00000000000000..4676dfeb011d77
--- /dev/null
+++ b/docs/source/perf_infer_gpu_one.mdx
@@ -0,0 +1,44 @@
+
+
+# Efficient Inference on a Single GPU
+
+
+
+## Less Memory
+
+### fp16
+
+### bf16
+
+### Quantization
+
+
+
+
+
+## Faster Speed
+
+### Batch sizes
+
+### ONNX
+
+### Infinity, Inference API
+
+
+
+## Scalability Strategy
+
+* Deepspeed-ZeRO Stage 3 + CPU/NVMe Offload
+
+* Sagemaker
+
+* Deepspeed-Inference
diff --git a/docs/source/perf_train.mdx b/docs/source/perf_train.mdx
new file mode 100644
index 00000000000000..70f6454ac59717
--- /dev/null
+++ b/docs/source/perf_train.mdx
@@ -0,0 +1,23 @@
+
+
+# Efficient Training
+
+
+
+## Memory Needs During Training
+
+16-18x number of model params
+
+## Choose Your Scale
+
+- [One GPU](perf_train_gpu_one)
+- [Many GPUs](perf_train_gpu_many)
diff --git a/docs/source/perf_train_gpu_many.mdx b/docs/source/perf_train_gpu_many.mdx
new file mode 100644
index 00000000000000..595d37fe75e8e9
--- /dev/null
+++ b/docs/source/perf_train_gpu_many.mdx
@@ -0,0 +1,80 @@
+
+
+# Efficient Training on Multiple GPUs
+
+
+
+## Less Memory
+
+
+### fp16
+
+### bf16
+
+### Gradient Accumulation
+
+### Gradient Checkpointing
+
+### Optimizer
+
+
+## Faster Speed
+
+### DP vs DDP
+
+### Gradient Accumulation
+
+### Batch sizes
+
+
+
+## Scalability Strategy
+
+**⇨ Single Node / Multi-GPU**
+
+* Model fits onto a single GPU:
+
+   1. DDP - Distributed DP
+   2. ZeRO - may or may not be faster depending on the situation and configuration used
+
+* Model doesn't fit onto a single GPU:
+
+   1. PP
+   2. ZeRO
+   3. TP
+
+   With very fast intra-node connectivity of NVLINK or NVSwitch all three should be mostly on par, without these PP will be faster than TP or ZeRO. The degree of TP may also make a difference. Best to experiment to find the winner on your particular setup.
+
+   TP is almost always used within a single node. That is TP size <= GPUs per node.
+
+* Largest Layer not fitting into a single GPU:
+
+   1. If not using ZeRO - must use TP, as PP alone won't be able to fit.
+   2. With ZeRO see the same entry for "Single GPU" above
+
+
+**⇨ Multi-Node / Multi-GPU**
+
+* When you have fast inter-node connectivity:
+
+   1. ZeRO - as it requires close to no modifications to the model
+   2. PP+TP+DP - less communications, but requires massive changes to the model
+
+* When you have slow inter-node connectivity and are still low on GPU memory:
+
+   1. DP+PP+TP+ZeRO-1
+
+
+
+
+
+## Hardware
diff --git a/docs/source/perf_train_gpu_one.mdx b/docs/source/perf_train_gpu_one.mdx
new file mode 100644
index 00000000000000..c8673a3515ecb2
--- /dev/null
+++ b/docs/source/perf_train_gpu_one.mdx
@@ -0,0 +1,184 @@
+
+
+# Efficient Training on a Single GPU
+
+
+
+## Less Memory
+
+The following techniques will help you reduce memory usage.
+
+### fp16 / bf16
+
+Enabling mixed precision will make your training both faster and use less memory. The science of it is explained [here](performance#fp16) and [here](performance#bf16).
+
+bf16 can only be used with Ampere-based NVIDIA GPUs (or newer). bf16 makes the training more stable than fp16 since it has almost the same numerical range as fp32.
+
+Activation:
+
+- HF Trainer-based examples: add `--fp16` or `--bf16` to the command line arguments.
+- Custom HF Trainer-based program, pass one of:
+
+   ```python
+   TrainingArguments(fp16=True)
+   # TrainingArguments(bf16=True)
+   ```
+- `accelerate`: use: ... (XXX: Sylvain/Leandro?)
+
+- Custom training loop: enable [`torch.cuda.amp`](https://pytorch.org/docs/stable/amp.html). For example, for bf16:
+
+   ```python
+   import torch
+   from torch.cuda.amp import autocast
+
+   with autocast(dtype=torch.bfloat16):
+       loss, outputs = ...
+   ```
+
+
+### Gradient Accumulation
+
+Gradient accumulation allows you to train with a much larger effective batch than can fit into your GPU's memory. It also speeds up the training because it updates the weights less frequently.
+
+The science of gradient accumulation is explained [here](performance#gradient-accumulation).
+
+You will need to experiment to find the best number of steps to accumulate for. In the following examples we will use 4 steps.
+
+Activation:
+
+- HF Trainer-based examples: add `--gradient_accumulation_steps 4` to the command line arguments.
+- Custom HF Trainer-based program, pass:
+
+   ```python
+   TrainingArguments(gradient_accumulation_steps=4)
+   ```
+- `accelerate`: use: ... (XXX: Sylvain/Leandro?)
+
+- Custom training loop: This is somewhat complex but you can study how this is implemented in [HF Trainer](
+https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) - simply search for `gradient_accumulation_steps` in the code. A minimal sketch is shown below.
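+
+For illustration only, here is a minimal sketch of gradient accumulation in a plain PyTorch loop. The tiny model, the random data and the value of 4 accumulation steps are placeholders, not the Trainer's actual implementation:
+
+```python
+import torch
+from torch import nn
+from torch.utils.data import DataLoader, TensorDataset
+
+# toy model and data so the sketch runs end to end - swap in your real ones
+model = nn.Linear(10, 2)
+optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
+dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))), batch_size=4)
+loss_fn = nn.CrossEntropyLoss()
+
+gradient_accumulation_steps = 4
+
+model.train()
+optimizer.zero_grad()
+for step, (inputs, labels) in enumerate(dataloader):
+    loss = loss_fn(model(inputs), labels)
+    # scale the loss so the accumulated gradients match one large-batch update
+    (loss / gradient_accumulation_steps).backward()
+    if (step + 1) % gradient_accumulation_steps == 0:
+        optimizer.step()
+        optimizer.zero_grad()
+```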
+
+
+### Gradient Checkpointing
+
+Gradient checkpointing saves memory while trading off some 20-30% of throughput.
+
+The science of gradient checkpointing is explained [here](performance#gradient-checkpointing).
+
+Activation:
+
+- HF Trainer-based examples: add `--gradient_checkpointing` to the command line arguments.
+- Custom HF Trainer-based program, pass:
+
+   ```python
+   TrainingArguments(gradient_checkpointing=True)
+   ```
+- `accelerate`: use: ... (XXX: Sylvain/Leandro?)
+
+- Custom training loop: To implement see [torch.utils.checkpoint.checkpoint](https://pytorch.org/docs/stable/checkpoint.html).
+
+
+
+### Optimizer
+
+Some optimizers require a lot more memory than others.
+
+The science of optimizer memory usage and which optimizer to choose is explained [here](performance#optimizer).
+
+
+Activation:
+
+- HF Trainer-based examples: add `--optim` to the command line arguments, followed by the desired optimizer, e.g. one of `adamw_hf`, `adamw_torch`, `adamw_torch_xla`, `adamw_apex_fused`, `adafactor`
+- Custom HF Trainer-based program, pass one of the optimizers listed above like so:
+
+   ```python
+   TrainingArguments(optim="adamw_torch")
+   ```
+- `accelerate`: use: ... (XXX: Sylvain/Leandro?)
+
+- Custom training loop: This is somewhat complex but you can study how this is implemented in [HF Trainer](
+https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) - simply search for `optimizer` in the code.
+
+
+### DeepSpeed ZeRO
+
+The in-depth details on how to use DeepSpeed can be found [here](main_classes/deepspeed).
+
+First, a quick decision tree:
+
+1. Model fits onto a single GPU and you have enough space to fit a small batch size - you don't need to use DeepSpeed as it'll only slow things down in this use case.
+2. Model doesn't fit onto a single GPU or you can't fit a small batch - use DeepSpeed ZeRO + CPU Offload and, for much larger models, NVMe Offload.
+
+Now, if the decision tree suggested you use DeepSpeed, first you need to [install it](main_classes/deepspeed#installation), then follow one of the following guides to create a configuration file and launch DeepSpeed.
+
+Activation:
+
+- HF Trainer-based examples: see this [guide](main_classes/deepspeed#deployment-with-one-gpu).
+- Custom HF Trainer-based program: Same as above, but pass:
+
+   ```python
+   TrainingArguments(deepspeed="/path/to/ds_config.json")
+   ```
+- Deployment in Notebooks: see this [guide](main_classes/deepspeed#deployment-in-notebooks).
+
+- `accelerate`: use: ... (XXX: Sylvain/Leandro?)
+
+- Custom training loop: This is somewhat complex but you can study how this is implemented in [HF Trainer](
+https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) - simply search for `deepspeed` in the code.
+
+
+
+
+
+
+## Faster Speed
+
+There are times when you have plenty of GPU memory and you want the training to go faster, either because you're in a rush or because you're paying per hour for the hardware usage.
+
+The following techniques will make the training faster.
+
+
+### Gradient Accumulation
+
+Gradient accumulation allows you to speed up training because it updates the weights less frequently.
+
+The science of gradient accumulation is explained [here](performance#gradient-accumulation).
+
+To activate please see the "Gradient Accumulation" section in the "Less Memory" section of this document (XXX: how to link?)
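+
+As a quick sanity check (the numbers below are just an example), the effective batch size seen by each weight update is the product of the per-device batch size and the number of accumulation steps:
+
+```python
+per_device_train_batch_size = 8  # example value
+gradient_accumulation_steps = 4  # example value
+effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
+print(effective_batch_size)  # 32 samples contribute to each weight update
+```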
+
+
+### Batch sizes
+
+A large batch size allows for much better GPU utilization and thus often leads to significant speed-ups.
+
+The science of batch sizes is explained [here](performance#batch-sizes).
+
+Activation:
+
+- HF Trainer-based examples: set `--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE` for training and `--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE` for evaluation.
+
+- Custom HF Trainer-based program, pass the batch sizes like so:
+
+   ```python
+   TrainingArguments(per_device_train_batch_size=4, per_device_eval_batch_size=4)
+   ```
+- `accelerate`: use: ... (XXX: Sylvain/Leandro?)
+
+- Custom training loop: This is somewhat complex but you can study how this is implemented in [HF Trainer](
+https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) - simply search for `batch_size` in the code.
+
+
+
+### Optimizer
+
+A choice of optimizer can impact the throughput. For example, using the fused AdamW optimizer from [NVIDIA/apex](https://github.com/NVIDIA/apex) will be faster than the same optimizer from `torch`.
+
+The science of optimizer speed and which optimizer to choose is explained [here](performance#optimizer).
+
+To activate please see the "Optimizer" section in the "Less Memory" section of this document (XXX: how to link?)
diff --git a/docs/source/performance.mdx b/docs/source/performance.mdx
index 25d78ee326a3d1..0f93de71d9b9a7 100644
--- a/docs/source/performance.mdx
+++ b/docs/source/performance.mdx
@@ -106,7 +106,7 @@ nvidia-smi
 ```
 
 ```bash
-Tue Jan 11 08:58:05 2022       
+Tue Jan 11 08:58:05 2022
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
 |-------------------------------+----------------------+----------------------+
@@ -118,7 +118,7 @@ Tue Jan 11 08:58:05 2022
 | N/A   37C    P0    39W / 300W |   2631MiB / 16160MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
-                                                                               
+
 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
@@ -198,7 +198,7 @@ Next we have a look at another trick to save a little bit more GPU memory called
 
 Even when we set the batch size to 1 and use gradient accumulation we can still run out of memory when working with large models. In order to compute the gradients during the backward pass all activations from the forward pass are normally saved. This can create a big memory overhead. Alternatively, one could forget all activations during the forward pass and recompute them on demand during the backward pass. This would however add a significant computational overhead and slow down training.
 
-Gradient checkpointing strikes a compromise between the two approaches and saves strategically selected activations throughout the computational graph so only a fraction of the activations need to be re-computed for the gradients. See [this great article](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9) explaining the ideas behind gradient checkpointing.
+Gradient checkpointing strikes a compromise between the two approaches and saves strategically selected activations throughout the computational graph so only a fraction of the activations need to be re-computed for the gradients. See [this great article](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9) explaining the ideas behind gradient checkpointing.
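+
+For intuition, here is a minimal sketch of checkpointing a block with [`torch.utils.checkpoint`](https://pytorch.org/docs/stable/checkpoint.html) - a toy example, not the implementation used inside the [`Trainer`]:
+
+```python
+import torch
+from torch import nn
+from torch.utils.checkpoint import checkpoint
+
+# toy two-block model: activations inside block1 are not stored during the
+# forward pass and are recomputed when backward reaches them
+block1 = nn.Sequential(nn.Linear(10, 10), nn.ReLU())
+block2 = nn.Sequential(nn.Linear(10, 10), nn.ReLU())
+
+x = torch.randn(4, 10, requires_grad=True)
+hidden = checkpoint(block1, x)
+loss = block2(hidden).sum()
+loss.backward()  # block1's forward is re-run here to compute its gradients
+```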
 
 To enable gradient checkpointing in the [`Trainer`] we only need to pass it as a flag to the [`TrainingArguments`]. Everything else is handled under the hood:
@@ -265,7 +265,27 @@ We can see that with these tweaks we use about half the GPU memory as at the beg
 
 ## Optimizer
 
-The most common optimizer used to train transformer model is Adam or AdamW (Adam with weight decay). Adam achieves good convergence by storing the rolling average of the previous gradients which, however, adds an additional memory footprint of the order of the number of model parameters. One remedy to this is to use an alternative optimizer such as Adafactor.
+The most common optimizer used to train transformer models is Adam or AdamW (Adam with weight decay). Adam achieves good convergence by storing the rolling average of the previous gradients which, however, adds an additional memory footprint of the order of the number of model parameters. One remedy to this is to use an alternative optimizer such as Adafactor, which works well for some models but it often has instability issues.
+
+The HF Trainer integrates a variety of optimizers that can be used out of the box. To activate the desired optimizer simply pass the `--optim` flag on the command line.
+
+To see which optimizers are currently supported:
+
+```bash
+$ python examples/pytorch/translation/run_translation.py -h | grep "\-optim"
+  [--optim {adamw_hf,adamw_torch,adamw_torch_xla,adamw_apex_fused,adafactor}]
+```
+
+For example, if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed, `--optim adamw_apex_fused` will give you the fastest training experience among all supported AdamW optimizers.
+
+On the other hand the [8bit BNB optimizer](https://github.com/facebookresearch/bitsandbytes) can save 3/4 of the memory normally used by a typical AdamW optimizer if it is configured to quantize all optimizer states, but in some situations only some optimizer states are quantized and then more memory is used. XXX: update once https://github.com/huggingface/transformers/pull/15622 is merged.
+
+Let's get a feel for the numbers and use, for example, a 3B-parameter model, like `t5-3b`:
+
+- A standard AdamW uses 8 bytes for each parameter, so here the optimizer will need (`8*3`) 24GB of GPU memory.
+- Adafactor uses slightly more than 4 bytes, so (`4*3`) 12GB and then some extra.
+- 8bit BNB quantized optimizer will use only (`2*3`) 6GB if all optimizer states are quantized.
+
 
 ### Adafactor
 
@@ -314,7 +334,7 @@ We went from 15 GB memory usage to 5 GB - a 3x improvement while maintaining the
 
 ### 8-bit Adam
 
-Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means that it stores the state with lower precision and dequantizes it only for the optimization. This is similar to the idea behind FP16 training where using variables with lower precision saves memory.
+Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means that it stores the state with lower precision and dequantizes it only for the optimization. This is similar to the idea behind FP16 training where using variables with lower precision saves memory. In contrast to the previous approaches, this one is not integrated into the [`Trainer`] as a simple flag.
 We need to install the 8-bit optimizer and then pass it as a custom optimizer to the [`Trainer`]. Follow the installation guide in the Github [repo](https://github.com/facebookresearch/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer.
@@ -394,7 +414,7 @@ GPU memory occupied: 5363 MB.
 ```
 
 Again, we get about a 3x memory improvement and even slightly higher throughput as using Adafactor. So we have seen how we can optimize the memory footprint of large models. The following plot summarizes all our experiments:
-                                                                               
+
 ![png](https://huggingface.co/datasets/lvwerra/repo-images/raw/main/gpu-memory-savings.png)
 
 ## Using 🤗 Accelerate
@@ -412,7 +432,7 @@ training_args = TrainingArguments(
 )
 ```
 
-The full example training loop with 🤗 Accelerate is only a handful of lines of code long: 
+The full example training loop with 🤗 Accelerate is only a handful of lines of code long:
 
 ```py
@@ -459,7 +479,7 @@ When we train models there are a two aspects we want to optimize at the same tim
 
 - Data throughput/training time
 - Model performance
 
-We have seen that each method changes the memory usage and throughput. In general we want to maximize the throughput (samples/second) to minimize the training cost. This is generally achieved by utilizing the GPU as much as possible and thus filling GPU memory to its limit. For example, as mentioned earlier, we only employ gradient accumulation when we want to use a batch size beyond the size of the GPU memory. If the desired batch size fits into memory then there is no reason to apply gradient accumulation which will only slow down training. 
+We have seen that each method changes the memory usage and throughput. In general we want to maximize the throughput (samples/second) to minimize the training cost. This is generally achieved by utilizing the GPU as much as possible and thus filling GPU memory to its limit. For example, as mentioned earlier, we only employ gradient accumulation when we want to use a batch size beyond the size of the GPU memory. If the desired batch size fits into memory then there is no reason to apply gradient accumulation which will only slow down training.
 
 The second objective is model performance. Just because we can does not mean we should use a large batch size. As part of hyperparameter tuning you should determine which batch size yields the best result and then optimize the throughput accordingly.
@@ -467,7 +487,7 @@ Sometimes, even when applying all the above tweaks the throughput on a given GPU
 
 ## Multi-GPU Training
 
-If your model fits on a single GPU scaling to many GPUs can be achieved fairly easily with data parallelism. The idea is very similar to gradient accumulation with the distinction that instead of running the forward and backward passes during the accumulation in sequence on a single machine they are performed in parallel on multiple machines. So each GPU gets a small batch, runs the forward and backward passes and then the gradients from all machines are aggregated and the model is optimized. You can combine this with all the methods we described before. For example, if you have 4 GPUs and use `per_device_train_batch_size=12` and `gradient_accumulation_steps=3` you will have an effective batch size of `4*12*3=144`. 
+If your model fits on a single GPU scaling to many GPUs can be achieved fairly easily with data parallelism. The idea is very similar to gradient accumulation with the distinction that instead of running the forward and backward passes during the accumulation in sequence on a single machine they are performed in parallel on multiple machines. So each GPU gets a small batch, runs the forward and backward passes and then the gradients from all machines are aggregated and the model is optimized. You can combine this with all the methods we described before. For example, if you have 4 GPUs and use `per_device_train_batch_size=12` and `gradient_accumulation_steps=3` you will have an effective batch size of `4*12*3=144`.
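+
+As an illustration - the script name `run_clm.py` is only a placeholder for your own Trainer-based script - such a run could be launched with PyTorch's distributed launcher:
+
+```bash
+torchrun --nproc_per_node 4 run_clm.py --per_device_train_batch_size 12 --gradient_accumulation_steps 3
+```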
 
 The [`Trainer`] allows for distributed training and if you execute your [`Trainer`] training script on a machine with multiple GPUs it will automatically utilize all of them, hence the name `per_device_train_batch_size`. In 🤗 Accelerate you can configure the infrastructure setup with the following command:
@@ -479,7 +499,7 @@ Until now we have opperated under the assumption that we can fit the model onto
 
 ## What if my model still does not fit?
 
-If the model does not fit on a single GPU with all the mentioned tricks there are still more methods we can apply although life starts to get a bit more complicated. This usually involves some form of pipeline or tensor parallelism where the model itself is distributed across several GPUs. One can also make use of DeepSpeed which implements some of these parallelism strategies along with some more optimization to reduce the memory footprint such as partitioning the optimizer states. You can read more about this in the ["Model Parallelism" section](parallelism). 
+If the model does not fit on a single GPU with all the mentioned tricks there are still more methods we can apply although life starts to get a bit more complicated. This usually involves some form of pipeline or tensor parallelism where the model itself is distributed across several GPUs. One can also make use of DeepSpeed which implements some of these parallelism strategies along with some more optimization to reduce the memory footprint such as partitioning the optimizer states. You can read more about this in the ["Model Parallelism" section](parallelism).
 
 This concludes the practical part of this guide for scaling the training of large models. The following section goes into more details on some of the aspects discussed above.
@@ -915,7 +935,9 @@ It's important to remember that using gradient accumulation you may end up with
 
 ### Gradient Checkpointing
 
-One way to use significantly less GPU memory is to enabled "Gradient Checkpointing" (also known as "activation checkpointing"). When enabled, a lot of memory can be freed at the cost of small decrease in the training speed due to recomputing parts of the graph during back-propagation. The slowdown will depend on the model but quite often it is around 20-30%.
+Normally each layer of the model has to keep a copy of its activations for the backward pass, which for a deep model can require a huge amount of memory. Gradient checkpointing trades speed for memory by recalculating the forward path instead of storing the results. This can save a lot of memory but will make the training about 20-30% slower, with the actual slowdown depending on the model.
+
+Gradient checkpointing is also known as activation checkpointing. This technique was first shared in the paper [Training Deep Nets with Sublinear Memory Cost](https://arxiv.org/abs/1604.06174). The paper will also give you the exact details on the savings, but it's in the ballpark of `O(sqrt(n))`, where `n` is the number of feed-forward layers.
@@ -943,6 +965,12 @@ for [RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecom
 and [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1005033957).
 
+To select the batch size for the [`Trainer`], use `--per_device_train_batch_size PER_DEVICE_TRAIN_BATCH_SIZE` for training and `--per_device_eval_batch_size PER_DEVICE_EVAL_BATCH_SIZE` for evaluation.
+
+
+
+
+
 ### DP vs DDP
 
 `DistributedDataParallel` (DDP) is typically faster than `DataParallel` (DP), but it is not always the case: