Added CPU offloading docs (#9479)
* Added CPU offloading docs

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com>

* Tech writer review

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com>

---------

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Selvaraj Anandaraj <selvaraja@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
3 people authored Jul 10, 2024
1 parent b4821e1 commit 14d42dc
Showing 1 changed file with 21 additions and 0 deletions.
21 changes: 21 additions & 0 deletions docs/source/features/memory_optimizations.rst
@@ -105,3 +105,24 @@ Implement MQA or GQA
NeMo's support for GQA and MQA is enabled through the integration of Megatron Core's Attention mechanism. The underlying implementation details can be explored within the Attention class of Megatron Core, which provides the functional backbone for these advanced attention methods. To understand the specific modifications and implementations of MQA and GQA, refer to the source code in the Attention class:

For implementation details, see the Attention class in the Megatron Core repository: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/attention.py#L49


CPU Offloading
--------------

Overview
^^^^^^^^

CPU offloading in NeMo reduces peak GPU memory usage by moving activations and inactive weights to CPU storage. NeMo supports offloading at the transformer-layer level: users specify how many transformer layers of their language model should be offloaded. During the forward pass, NeMo offloads activations at the optimal time and reloads them as needed during the backward pass.
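
A minimal, NeMo-independent sketch of the same idea is shown below. It uses PyTorch's torch.autograd.graph.save_on_cpu hook, which keeps tensors saved for the backward pass in pinned host memory; it illustrates the concept only and is not NeMo's implementation.

.. code-block:: python

   # Illustration only -- not NeMo's implementation.
   # save_on_cpu moves tensors saved for backward to pinned host memory and
   # copies them back to the GPU when gradients are computed.
   import torch

   layer = torch.nn.Linear(4096, 4096).cuda()
   x = torch.randn(8, 4096, device="cuda", requires_grad=True)

   with torch.autograd.graph.save_on_cpu(pin_memory=True):
       y = layer(x)       # activations saved for backward land in CPU memory
   y.sum().backward()     # saved activations are reloaded onto the GPU as needed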

Features
^^^^^^^^
* Supports training models with long sequence lengths by managing activation memory efficiently.
* Enables large batch sizes per GPU by offloading activation memory.
* Overlaps computation with data transfers (Host2Device and Device2Host) during offloading and reloading; see the sketch below.
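
The compute/transfer overlap can be pictured with the generic PyTorch sketch below: the device-to-host copy is issued on a separate CUDA stream into a pinned host buffer so it can proceed while the default stream keeps computing. This is a simplified sketch of the general technique, not NeMo's offloading code.

.. code-block:: python

   # Generic sketch of overlapping a Device2Host copy with compute -- not NeMo's code.
   import torch

   copy_stream = torch.cuda.Stream()

   def offload_async(gpu_tensor):
       # Pinned host buffer so the D2H copy can run asynchronously.
       cpu_buf = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype, pin_memory=True)
       copy_stream.wait_stream(torch.cuda.current_stream())  # wait for the producer kernel
       with torch.cuda.stream(copy_stream):
           cpu_buf.copy_(gpu_tensor, non_blocking=True)   # overlaps with compute on the default stream
           gpu_tensor.record_stream(copy_stream)          # keep GPU memory alive until the copy finishes
       return cpu_buf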

Usage
^^^^^
* Set cpu_offloading to True to enable CPU offloading.
* Set cpu_offloading_num_layers to a value between 0 and the total number of layers in the model minus one.
* Set cpu_offloading_activations and cpu_offloading_weights according to whether you want to offload activations only, weights only, or both (see the configuration sketch below).
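
As a concrete example, these flags can be set in the model configuration. The snippet below is a sketch using OmegaConf; the flag names come from this section, but the nesting under "model" and the layer count shown are assumptions and may differ in your recipe.

.. code-block:: python

   # Configuration sketch -- the exact nesting under "model" is an assumption.
   from omegaconf import OmegaConf

   overrides = OmegaConf.create({
       "model": {
           "cpu_offloading": True,              # enable CPU offloading
           "cpu_offloading_num_layers": 11,     # e.g. offload 11 of 12 transformer layers (at most total - 1)
           "cpu_offloading_activations": True,  # offload activations
           "cpu_offloading_weights": True,      # offload weights
       }
   })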
