[WIP] [doc] performance/scalability revamp (#15723)

* [doc] performance/scalability revamp
* link the new docs
* no :
* mixed precision
* work on the first doc
* expand the main doc
* Trigger CI
* style
* revamp single GPU training section
* work on training performance
* remove files not used anymore or will be added later
* final touches
* fix rebase
* Add hardware section to toctree
* fix toctree again
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* remove `fast_tokenizers` entry that was copied in rebase
* add warning about DP vs DDP
* remove todo
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* fix missing closure of codeblock
* Update docs/source/en/perf_train_gpu_many.mdx
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* sync with #16860
* update toc

Co-authored-by: leandro <leandro.vonwerra@spoud.io>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

1 parent d3d87b4 · commit 71abd3a · 5 changed files with 1,043 additions and 1,071 deletions.

<!---
Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Custom hardware for training

The hardware you use to run model training and inference can have a big effect on performance. For a deep dive into GPUs, make sure to check out Tim Dettmers' excellent [blog post](https://timdettmers.com/2020/09/07/which-gpu-for-deep-learning/).

Let's have a look at some practical advice for GPU setups.

## GPU

When you train bigger models you have essentially three options:

- bigger GPUs
- more GPUs
- more CPU and NVMe memory to offload to (via [DeepSpeed-Infinity](main_classes/deepspeed#nvme-support)); a rough launch sketch follows below
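
If you go the offloading route, a launch can look roughly like the sketch below. The config file name is just a placeholder; the actual ZeRO-3/NVMe offload settings it needs (ZeRO stage 3 with `offload_param`/`offload_optimizer` pointing at a fast local NVMe path) are described in the DeepSpeed docs linked above:

```bash
# Hypothetical sketch: train with DeepSpeed ZeRO-3, offloading parameters and optimizer
# states to CPU/NVMe. `ds_config_zero3_nvme.json` is a placeholder for your own config file.
deepspeed --num_gpus=1 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200 \
--deepspeed ds_config_zero3_nvme.json
```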

Let's start with the case where you have a single GPU.

### Power and Cooling

If you bought an expensive high-end GPU, make sure you give it the correct power and sufficient cooling.

**Power**:

Some high-end consumer GPU cards have 2 and sometimes 3 PCI-E 8-pin power sockets. Make sure you have as many independent 12V PCI-E 8-pin cables plugged into the card as there are sockets. Do not use the 2 splits at one end of the same cable (also known as a pigtail cable). That is, if you have 2 sockets on the GPU, you want 2 PCI-E 8-pin cables going from your PSU to the card, and not one cable that has 2 PCI-E 8-pin connectors at the end! You won't get the full performance out of your card otherwise.

Each PCI-E 8-pin power cable needs to be plugged into a 12V rail on the PSU side and can supply up to 150W of power.

Some other cards may use PCI-E 12-pin connectors, and these can deliver up to 500-600W of power.

Low-end cards may use 6-pin connectors, which supply up to 75W of power.

Additionally, you want a high-end PSU that has stable voltage. Some lower-quality ones may not give the card the stable voltage it needs to function at its peak.

And of course the PSU needs to have enough unused wattage to power the card.
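
To sanity-check that the card is actually allowed to draw its rated power, you can ask the driver for the current draw and the enforced limit. A minimal sketch using `nvidia-smi` (the exact set of supported query fields can vary with driver version):

```bash
# Current power draw and enforced power limit for each GPU
nvidia-smi --query-gpu=index,name,power.draw,power.limit --format=csv

# More detailed power report (default/min/max limits) per GPU
nvidia-smi -q -d POWER
```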

**Cooling**:

When a GPU gets overheated it will start throttling and will not deliver full performance, and it can even shut down if it gets too hot.

It's hard to tell the exact best temperature to strive for when a GPU is heavily loaded, but probably anything under +80C is good, and lower is better - perhaps 70-75C is an excellent range to be in. Throttling is likely to start at around 84-90C. Besides throttling performance, a prolonged very high temperature is also likely to reduce the lifespan of a GPU.
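
To keep an eye on the temperature while training, something like the sketch below works on most setups (the exact thresholds reported depend on the driver and the GPU):

```bash
# Print temperature, utilization and SM clock for each GPU every 5 seconds
nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu,clocks.sm --format=csv -l 5

# Show the thermal slowdown/shutdown thresholds the driver enforces
nvidia-smi -q -d TEMPERATURE
```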

Next let's have a look at one of the most important aspects when having multiple GPUs: connectivity.

### Multi-GPU Connectivity

If you use multiple GPUs, the way the cards are interconnected can have a huge impact on the total training time. If the GPUs are on the same physical node, you can run:

```
nvidia-smi topo -m
```

and it will tell you how the GPUs are interconnected. On a machine with two GPUs connected with NVLink, you will most likely see something like:

```
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV2     0-23            N/A
GPU1    NV2      X      0-23            N/A
```

On a different machine without NVLink we may see:

```
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      PHB     0-11            N/A
GPU1    PHB      X      0-11            N/A
```

The report includes this legend:

```
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

So the first report (`NV2`) tells us the GPUs are interconnected with 2 NVLinks, while the second report (`PHB`) shows a typical consumer-level PCIe + host bridge setup.

Check what type of connectivity you have on your setup. Some of these will make the communication between cards faster (e.g. NVLink), others slower (e.g. PHB).

Depending on the type of scalability solution used, the connectivity speed can have a major or a minor impact. If the GPUs need to sync rarely, as in DDP, the impact of a slower connection will be less significant. If the GPUs need to send messages to each other often, as in ZeRO-DP, then faster connectivity becomes very important for achieving faster training.
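
If you want to verify which transport NCCL actually selects between your GPUs (for example peer-to-peer over NVLink/PCIe vs. shared memory), you can raise NCCL's log level on any distributed launch. This is just a sketch reusing the benchmark command shown further below; `NCCL_DEBUG` is a standard NCCL environment variable:

```bash
# NCCL logs the channels and transports it sets up (e.g. P2P or SHM) at startup
NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 10
```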

#### NVlink

[NVLink](https://en.wikipedia.org/wiki/NVLink) is a wire-based serial multi-lane near-range communications link developed by Nvidia.

Each new generation provides faster bandwidth, e.g. here is a quote from [Nvidia Ampere GA102 GPU Architecture](https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf):

> Third-Generation NVLink®
> GA102 GPUs utilize NVIDIA’s third-generation NVLink interface, which includes four x4 links,
> with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four
> links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth
> between two GPUs. Two RTX 3090 GPUs can be connected together for SLI using NVLink.
> (Note that 3-Way and 4-Way SLI configurations are not supported.)

So the higher `X` you get in the `NVX` entry of the `nvidia-smi topo -m` output, the better. The generation will depend on your GPU architecture.
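
To check whether NVLink is actually up on your machine and what the links report, `nvidia-smi` has a dedicated subcommand (the exact output format depends on the driver version):

```bash
# Per-link NVLink status for every GPU on the node
nvidia-smi nvlink --status
```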

Let's compare the execution of gpt2 language model training over a small sample of wikitext.

The results are:

| NVlink | Time |
| ------ | ---: |
| Y      | 101s |
| N      | 131s |

You can see that NVLink completes the training ~23% faster. In the second benchmark we use `NCCL_P2P_DISABLE=1` to tell the GPUs not to use NVLink.

Here is the full benchmark code and outputs:

```bash
# DDP w/ NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}

# DDP w/o NVLink

rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200

{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}
```

Hardware: 2x TITAN RTX 24GB each + NVlink with 2 NVLinks (`NV2` in `nvidia-smi topo -m`)
Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0`