From 5fb8e1519f9e90c27e3a32ef85e22d6599ba5fa5 Mon Sep 17 00:00:00 2001
From: leandro
Date: Wed, 13 Apr 2022 10:57:18 +0200
Subject: [PATCH] final touches

---
 docs/source/perf_train_gpu_one.mdx | 14 +++++++-------
 docs/source/performance.mdx        | 18 ++++++++++++------
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/docs/source/perf_train_gpu_one.mdx b/docs/source/perf_train_gpu_one.mdx
index e8d009007d4c6c..89ccf5ad5b0298 100644
--- a/docs/source/perf_train_gpu_one.mdx
+++ b/docs/source/perf_train_gpu_one.mdx
@@ -17,11 +17,13 @@ In this section we have a look at a few tricks to reduce the memory footprint an
 
 |Method|Speed|Memory|
 |:-----|:----|:-----|
-|Gradient accumulation| No | Yes |
-|Gradient checkpointing| No| Yes |
-|Mixed precision training| Yes | (No) |
-|Batch size| Yes | Yes |
-|Optimizer choice| (No) | Yes |
+| Gradient accumulation | No | Yes |
+| Gradient checkpointing | No | Yes |
+| Mixed precision training | Yes | (No) |
+| Batch size | Yes | Yes |
+| Optimizer choice | Yes | Yes |
+| DataLoader | Yes | No |
+| DeepSpeed Zero | No | Yes |
 
 A bracket means that it might not be strictly the case but is usually either not a main concern or negligible.
 Before we start make sure you have installed the following libraries:
@@ -648,8 +650,6 @@ Activation:
 ```
 - Deployment in Notebooks: see this [guide](main_classes/deepspeed#deployment-in-notebooks).
-- `accelerate`: use: ... (XXX: Sylvain/Leandro?) _CUSTOM CONFIG NOT SUPPORTED, YET_
-
 - Custom training loop: This is somewhat complex but you can study how this is implemented in [HF Trainer](
 https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py) - simply search for `deepspeed` in the code.
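The methods table updated above marks gradient accumulation as saving memory without improving speed: micro-batch gradients are summed locally and the optimizer only steps every few micro-batches, so the effective batch size grows while peak activation memory stays that of a single micro-batch. A minimal sketch of that accumulation logic in plain Python — the function name and numeric "gradients" are illustrative, not the `transformers` API:

```python
# Sketch of gradient accumulation: buffer scaled micro-batch gradients and
# only apply an optimizer step every `accumulation_steps` micro-batches.
def train_with_accumulation(micro_batch_grads, accumulation_steps):
    """Return the list of gradients actually applied by the optimizer."""
    applied = []
    buffer = 0.0
    for i, grad in enumerate(micro_batch_grads, start=1):
        # Scale each contribution so the buffered sum is an average,
        # matching what one large batch would have produced.
        buffer += grad / accumulation_steps
        if i % accumulation_steps == 0:
            applied.append(buffer)  # optimizer.step() would happen here
            buffer = 0.0            # optimizer.zero_grad() equivalent
    return applied
```

With `accumulation_steps=2`, four micro-batches trigger only two optimizer steps, each using the mean of two micro-batch gradients.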
diff --git a/docs/source/performance.mdx b/docs/source/performance.mdx
index db6cfb9d1e3789..336d4d53f8713a 100644
--- a/docs/source/performance.mdx
+++ b/docs/source/performance.mdx
@@ -14,11 +14,13 @@ See the License for the specific language governing permissions and
 limitations under the License.
 
 -->
 
-# Performance
+# Performance and Scalability
 
 Training larger and larger transformer models and deploying them to production comes with a range of challenges. During training your model can require more GPU memory than is available or be very slow to train and when you deploy it for inference it can be overwhelmed with the throughput that is required in the production environment. This documentation is designed to help you navigate these challenges and find the best setting for your use-case. We split the guides into training and inference as they come with different challenges and solutions. Then within each of them we have separate guides for different kinds of hardware settings (e.g. single vs. multi-GPU for training or CPU vs. GPU for inference).
 
-This document serves as an overview entry point for the methods that could be useful for your scenario.
+![perf_overview](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perf_overview.png)
+
+This document serves as an overview and entry point for the methods that could be useful for your scenario.
 
 ## Training
 
@@ -48,18 +50,17 @@
 
 Efficient inference with large models in a production environment can be as challenging as training them. In the following sections we go through the steps to run inference on CPU and single/multi-GPU setups.
-
 ### CPU
 
-_TODO_
+_Coming soon_
 
 ### Single GPU
 
-_TODO_
+_Coming soon_
 
 ### Multi-GPU
 
-_TODO_
+_Coming soon_
 
 ### Specialized Hardware
 
@@ -67,6 +68,11 @@ _Coming soon_
 
 ## Hardware
 
+In the hardware section you can find tips and tricks for building your own deep learning rig.
+
+[Go to hardware section](perf_hardware)
+
+
 ## Contribute
 
 This document is far from being complete and a lot more needs to be added, so if you have additions or corrections to make please don't hesitate to open a PR or if you aren't sure start an Issue and we can discuss the details there.
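The methods table in the first file lists DeepSpeed ZeRO as a pure memory saver. A rough back-of-the-envelope sketch of why ZeRO's optimizer-state sharding helps: mixed-precision Adam keeps roughly 12 bytes of optimizer state per parameter (fp32 master weights plus two fp32 moment estimates, the figure used in the ZeRO paper), and sharding partitions that state across data-parallel ranks instead of replicating it. The function name and byte counts below are illustrative assumptions, not a DeepSpeed API:

```python
# Rough per-GPU optimizer-state memory when state is sharded across ranks,
# as ZeRO stage 1/2 does. Assumes mixed-precision Adam: 4 bytes fp32 master
# weights + 4 + 4 bytes for the two moment estimates = 12 bytes/parameter.
def optimizer_state_bytes_per_gpu(num_params: int, num_gpus: int,
                                  bytes_per_param: int = 12) -> float:
    """Optimizer state is partitioned across ranks instead of replicated."""
    return num_params * bytes_per_param / num_gpus

# A 1B-parameter model: ~12 GB of optimizer state replicated on every GPU
# without sharding, vs. ~1.5 GB per GPU when sharded across 8 GPUs.
unsharded = optimizer_state_bytes_per_gpu(1_000_000_000, 1)
sharded = optimizer_state_bytes_per_gpu(1_000_000_000, 8)
```

This only accounts for optimizer state; parameters, gradients, and activations add their own budgets, which later ZeRO stages and the other table rows address.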