[WIP] [doc] performance/scalability revamp #15213

Closed · wants to merge 9 commits
16 changes: 15 additions & 1 deletion docs/source/_toctree.yml
@@ -1,4 +1,4 @@
- sections:
- sections:
- local: index
title: 🤗 Transformers
- local: quicktour
@@ -63,6 +63,20 @@
title: 'Performance and Scalability: How To Fit a Bigger Model and Train It Faster'
- local: parallelism
title: Model Parallelism
- local: perf_infer
Collaborator: Let's make a subfolder instead of having so many document names prefixed with perf.

Contributor (author): great idea

title: Performance - Inference
- local: perf_infer_gpu_one
title: Performance - Inference on one GPU
- local: perf_infer_gpu_many
title: Performance - Inference on many GPUs
- local: perf_infer_cpu
title: Performance - Inference on CPU
- local: perf_train
title: Performance - Training
- local: perf_train_gpu_one
title: Performance - Training on one GPU
- local: perf_train_gpu_many
title: Performance - Training on many GPUs
- local: testing
title: Testing
- local: debugging
22 changes: 22 additions & 0 deletions docs/source/perf_infer.mdx
@@ -0,0 +1,22 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-->

# Efficient Inference

## Memory Needs During Inference

Roughly 4-6 bytes per model parameter: fp32 weights alone take 4 bytes per parameter, with activations and temporary buffers accounting for the rest.
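
As a back-of-the-envelope check, here is a minimal sketch of what that budget means in practice (the 1B-parameter model size is a hypothetical example):

```python
# Rough inference memory estimate for the "4-6 bytes per parameter" rule.
params = 1_000_000_000  # hypothetical 1B-parameter model

weights_gib = params * 4 / 2**30  # fp32 weights: 4 bytes per parameter
low, high = params * 4 / 2**30, params * 6 / 2**30

print(f"fp32 weights alone: ~{weights_gib:.1f} GiB")
print(f"total budget (4-6 bytes/param): ~{low:.1f}-{high:.1f} GiB")
```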

## Choose Your Scale

- [One GPU](perf_infer_gpu_one)
- [Many GPUs](perf_infer_gpu_many)
- [CPU](perf_infer_cpu)
30 changes: 30 additions & 0 deletions docs/source/perf_infer_cpu.mdx
@@ -0,0 +1,30 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-->

# Efficient Inference on CPU


## Less Memory



## Faster Speed




## Scalability Strategy

* DeepSpeed ZeRO Stage 3 + CPU/NVMe Offload (see the config sketch below)

* SageMaker

* DeepSpeed-Inference
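
A minimal sketch of what the first option's configuration could look like, expressed as a Python dict (the key names follow the DeepSpeed config schema; the batch size and offload values here are placeholder assumptions, not values from this PR):

```python
# Sketch of a DeepSpeed ZeRO Stage 3 config with CPU parameter offload.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Offload parameters to CPU RAM; switch "device" to "nvme" and add
        # "nvme_path" to offload to local NVMe storage instead.
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
}
```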
47 changes: 47 additions & 0 deletions docs/source/perf_infer_gpu_many.mdx
@@ -0,0 +1,47 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-->

# Efficient Inference on Multiple GPUs


## Less Memory

### fp16

### bf16

### Quantization



## Faster Speed

### DP vs DDP

### ONNX

### Infinity, Inference API





## Scalability Strategy

* DeepSpeed ZeRO Stage 3 + CPU/NVMe Offload

* SageMaker

* DeepSpeed-Inference



## Hardware
44 changes: 44 additions & 0 deletions docs/source/perf_infer_gpu_one.mdx
@@ -0,0 +1,44 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-->

# Efficient Inference on a Single GPU



## Less Memory

### fp16

### bf16

### Quantization





## Faster Speed

### Batch sizes

### ONNX

### Infinity, Inference API



## Scalability Strategy

* DeepSpeed ZeRO Stage 3 + CPU/NVMe Offload

* SageMaker

* DeepSpeed-Inference
23 changes: 23 additions & 0 deletions docs/source/perf_train.mdx
@@ -0,0 +1,23 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-->

# Efficient Training



## Memory Needs During Training

Roughly 16-18 bytes per model parameter with an Adam-style optimizer: 16x for fp32 training (4-byte weights, 4-byte gradients, 8 bytes of optimizer state per parameter), and about 18x for mixed precision, which adds a 2-byte half-precision copy of the weights. Activations come on top of that.
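
A hedged back-of-the-envelope sketch of that accounting (the 1B-parameter size is a hypothetical example; activations and temporary buffers are excluded):

```python
# Per-parameter memory for Adam-style training, excluding activations.
params = 1_000_000_000  # hypothetical 1B-parameter model

fp32_bytes = 4 + 4 + 8        # weights + gradients + optimizer states = 16
mixed_bytes = fp32_bytes + 2  # + half-precision weight copy = 18

print(f"fp32 training:   ~{params * fp32_bytes / 2**30:.0f} GiB")
print(f"mixed precision: ~{params * mixed_bytes / 2**30:.0f} GiB")
```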

## Choose Your Scale

- [One GPU](perf_train_gpu_one)
- [Many GPUs](perf_train_gpu_many)
80 changes: 80 additions & 0 deletions docs/source/perf_train_gpu_many.mdx
@@ -0,0 +1,80 @@
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
-->

# Efficient Training on Multiple GPUs



## Less Memory


### fp16

### bf16

### Gradient Accumulation

### Gradient Checkpointing

### Optimizer


## Faster Speed

### DP vs DDP

### Gradient Accumulation

### Batch sizes



## Scalability Strategy

**⇨ Single Node / Multi-GPU**

* Model fits onto a single GPU:

1. DDP - Distributed DP
2. ZeRO - may or may not be faster depending on the situation and configuration used

* Model doesn't fit onto a single GPU:

1. PP
2. ZeRO
3. TP

With very fast intra-node connectivity such as NVLINK or NVSwitch, all three should be mostly on par; without these, PP will be faster than TP or ZeRO. The degree of TP may also make a difference. It's best to experiment to find the winner on your particular setup.

TP is almost always used within a single node, that is, TP size <= GPUs per node.

* Largest Layer not fitting into a single GPU:

1. If not using ZeRO, one must use TP, as PP alone won't be able to fit the layer.
2. With ZeRO, see the same entry for "Single GPU" above


**⇨ Multi-Node / Multi-GPU**

* When you have fast inter-node connectivity:

1. ZeRO - as it requires close to no modifications to the model (see the launch sketch after this list)
2. PP+TP+DP - less communication, but requires massive changes to the model

* When you have slow inter-node connectivity and are still low on GPU memory:

1. DP+PP+TP+ZeRO-1
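
As a rough illustration of the "close to no modifications" point, here is a minimal sketch of enabling ZeRO through the `Trainer` (the config path, script name, and dataset are hypothetical; the JSON config file must exist on disk and DeepSpeed must be installed):

```python
from transformers import Trainer, TrainingArguments

# Point the Trainer at a DeepSpeed config; the model code itself is unchanged.
args = TrainingArguments(
    output_dir="output",
    deepspeed="ds_config_zero3.json",  # hypothetical ZeRO Stage 3 config file
    per_device_train_batch_size=1,     # placeholder value
)

# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
# Launch with e.g.: deepspeed --num_gpus 8 train.py
```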





## Hardware