Fix italics in blog post
Signed-off-by: Chris Abraham <cjyabraham@gmail.com>
cjyabraham committed Dec 18, 2023
1 parent f08d45b commit b9d2680
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions _posts/2023-12-18-training-production-ai-models.md
author: CK Luk, Daohang Shi, Yuzhen Huang, Jackie (Jiaqi) Xu, Jade Nie, Zhou Wan

## 1. Introduction

[PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) (abbreviated as PT2) can significantly improve the training and inference performance of an AI model using a compiler called _torch.compile_ while being 100% backward compatible with PyTorch 1.x. There have been reports on how PT2 improves the performance of common _benchmarks_ (e.g., [huggingface’s diffusers](https://huggingface.co/docs/diffusers/optimization/torch2.0)). In this blog, we discuss our experiences in applying PT2 to _production_ AI models at Meta.
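
For readers new to PT2, here is a minimal sketch of what adopting _torch.compile_ in a training loop looks like; the toy model, optimizer, and shapes are illustrative placeholders, not the production models discussed in this post.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# A small stand-in model; the production models discussed in this post are far larger.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Wrapping the model with torch.compile is the main code change; the first
# forward/backward pass triggers graph capture and code generation.
compiled_model = torch.compile(model)

x = torch.randn(64, 1024, device=device)
target = torch.randn(64, 1, device=device)

for _ in range(3):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(compiled_model(x), target)
    loss.backward()  # the backward graph is compiled as well
    optimizer.step()
```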


## 2. Background


### 2.1 Why is automatic performance optimization important for production?

Performance is particularly important for production: even a 5% reduction in the training time of a heavily used model can translate to substantial savings in GPU cost and data-center _power_. Another important metric is _development efficiency_, which measures how many engineer-months are required to bring a model to production. Typically, a significant part of this bring-up effort is spent on _manual_ performance tuning, such as rewriting GPU kernels to improve the training speed. By providing _automatic_ performance optimization, PT2 can improve _both_ cost and development efficiency.


### 2.2 How PT2 improves performance

As a compiler, PT2 can view _multiple_ operations in the training graph captured from a model (unlike in PT1.x, where only one operation is executed at a time). Consequently, PT2 can exploit a number of performance optimization opportunities, including:



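As a rough illustration of this difference, the sketch below compiles a small function made up for this example: eager execution runs each pointwise op as a separate kernel, while _torch.compile_ captures the whole sequence so the Inductor backend can fuse it.

```python
import torch

def scaled_gelu_residual(x, y):
    # Three pointwise ops in a row. In eager (PT1.x-style) execution each op
    # launches its own kernel; under torch.compile the whole sequence is captured
    # as one graph, so the Inductor backend can fuse it into fewer kernels.
    t = x * 0.5
    t = torch.nn.functional.gelu(t)
    return t + y

compiled_fn = torch.compile(scaled_gelu_residual)

x = torch.randn(1024, 1024)
y = torch.randn(1024, 1024)
out = compiled_fn(x, y)

# torch._dynamo.explain and the TORCH_COMPILE_DEBUG=1 environment variable
# are two ways to inspect the captured graph and the generated code.
```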

### 3.2 Autotuning in production

By default, the autotuning in torch.inductor is done _online_ while the model is executed. For some production models, we find that the autotuning can take several hours, which is not acceptable for production. Therefore, we add _offline autotuning_, which works as depicted in Figure 4. The very first time a model is run, the details (e.g., input tensor shapes, data types, etc.) of all ops that require tuning are logged to a database. Then, a tuning process for these ops is run overnight to search for the most performant implementation of each op; the search result is written to a persistent cache (implemented as a source file of torch.inductor). The next time the model is run, the tuned implementation of each op will be found in the cache and chosen for execution.


![Fig.4 The offline autotuning used in production.](/assets/images/training-production-ai-models/blog-fig4.jpg){:style="width:100%;"}
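
The offline flow above is described as part of Meta's production setup; in stock PyTorch, a related user-facing knob is the `max-autotune` compile mode. The sketch below (with a placeholder model) shows how that online autotuning is requested.

```python
import torch
import torch.nn as nn

# Placeholder model; real production models contain many more tunable ops.
model = nn.Linear(4096, 4096)

# mode="max-autotune" asks the Inductor backend to benchmark candidate
# implementations for tunable ops (e.g., matmul templates) and keep the fastest,
# i.e., the online autotuning described above.
compiled = torch.compile(model, mode="max-autotune")

x = torch.randn(64, 4096)
out = compiled(x)  # the first call pays the compile + tuning cost; later calls reuse the result
```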
In this section, we use three production models to evaluate PT2.
Figure 7 reports the training-time speedup with PT2. For each model, we show four cases: (i) no-compile with bf16, (ii) compile with fp32, (iii) compile with bf16, and (iv) compile with bf16 and autotuning. The y-axis is the speedup over the baseline, which is no-compile with fp32. Note that no-compile with bf16 is actually slower than no-compile with fp32 due to the type-conversion overhead; in contrast, compiling with bf16 achieves much larger speedups by reducing much of this overhead. Overall, given that these models are already heavily optimized by hand, we are excited to see that torch.compile can still provide a 1.14-1.24x speedup.


![Fig.7 Training-time speedup with torch.compile (note: the baseline, no-compile/fp32, is omitted in this figure).](/assets/images/training-production-ai-models/blog-fig7.jpg){:style="width:100%;"}

<p style="line-height: 1.05"><small><em><strong>Fig. 7</strong>: Training-time speedup with torch.compile (note: the baseline, no-compile/fp32, is omitted in this figure).</em></small></p>
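
As a simplified picture of case (iii) above, the sketch below runs a compiled placeholder model under bf16 autocast; it only shows how the configuration is expressed, not how the production results were obtained.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Sequential(nn.Linear(2048, 2048), nn.ReLU(), nn.Linear(2048, 2048)).to(device)
compiled = torch.compile(model)

x = torch.randn(32, 2048, device=device)

# Case (iii) from Figure 7: compiled execution under bf16 autocast. Compilation
# can fold many of the dtype conversions into fused kernels, which is why
# compile + bf16 avoids the overhead seen in the no-compile bf16 case.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    out = compiled(x)
```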


