Activation offloading for fullfinetuning + fix tied embedding #1847
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1847
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 7853938 with merge base d3039da.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
# training attributes
self._enable_activation_checkpointing = cfg.enable_activation_checkpointing
self._enable_activation_offloading = cfg.get(
    "enable_activation_offloading", False
)
if self._enable_activation_offloading and self._device.type != "cuda":
    raise RuntimeError(
        "enable_activation_offloading should only be enabled for training on CUDA"
    )
Removed from __init__. This is handled in setup_model.
Wait, I'm confused.. we are still doing this in __init__, no?
recipes/full_finetune_distributed.py (outdated)
opt_state_dict=(
    checkpoint_dict[training.OPT_KEY]
    if self._resume_from_checkpoint
    else None
),
pre-commit hook
recipes/full_finetune_distributed.py (outdated)
collate_fn=(
    partial(
        collate_fn,
        padding_idx=self._tokenizer.pad_id,
        ignore_idx=self._loss_fn.ignore_index,
    )
    if not packed
    else padded_collate_packed
),
pre-commit hook
recipes/full_finetune_distributed.py (outdated)
if enable_activation_offloading:
    if self._device.type != "cuda":
        raise RuntimeError(
            "enable_activation_offloading should only be True for training on CUDA"
        )
    if not enable_activation_checkpointing:
        raise RuntimeError(
            "enable_activation_offloading should only be True when enable_activation_checkpointing is True"
        )
elif enable_activation_checkpointing:
    log.info(
        "Hint: enable_activation_checkpointing is True, but enable_activation_offloading isn't. "
        "Enabling activation offloading should reduce memory further."
    )
The only thing that I don't like about this is that it could fail much faster if we added it to the init. But I like that the checks are near the code where it matters, so I am not sure which one to pick.
I think failing much faster is better
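For concreteness, the "fail fast" variant under discussion could be a small check run from __init__ before any checkpoint loading or model setup. This is just a sketch with a hypothetical helper name, not part of this PR:

import torch

def _validate_activation_offloading_cfg(
    device_type: str,
    enable_activation_checkpointing: bool,
    enable_activation_offloading: bool,
) -> None:
    # Hypothetical helper: called from __init__ so misconfiguration fails
    # before any expensive setup work happens.
    if enable_activation_offloading:
        if device_type != "cuda":
            raise RuntimeError(
                "enable_activation_offloading should only be True for training on CUDA"
            )
        if not enable_activation_checkpointing:
            raise RuntimeError(
                "enable_activation_offloading should only be True when "
                "enable_activation_checkpointing is True"
            )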
self.activations_handling_ctx = training.get_act_offloading_ctx_manager(
    model, enable_activation_offloading
)
Added a function to handle the no-op case / getting the context manager.
recipes/full_finetune_distributed.py (outdated)
if not enable_activation_checkpointing:
    raise RuntimeError(
        "enable_activation_offloading should only be True when enable_activation_checkpointing is True"
    )
No point in running offloading when AC is off. It's extremely slow.
@@ -9,13 +9,33 @@
import torch.nn.functional as F


class Linear(nn.Module):
The docstring explains why I had to add this.
Sorry I'm a bit confused by this change on two fronts:
(1) Does this not change the key names in the state dict?
(2) Now that we have a module again, how do we not wind up right back where we started with to_empty?
(1) No. The Linear nn.Module doesn't have any weights. The weight is passed in the forward only.

class Linear(nn.Module):
    def forward(self, x, weight):
        return F.linear(x, weight)

TiedLinear is still a regular Python class, and the key name is still model.TiedLinear.tok_embedding.weight
(2) I tested it with FSDP (ran the script and added an assertion in the training loop), and confirmed that the memory pointers are the same in model.tok_embedding.weight and model.output.weight. So things are fine. Is that what you were referencing too?
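To make that concrete, here is roughly the shape of the pattern as I understand it; a sketch, not the exact torchtune code, with the names mirroring the PR's modules:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Linear(nn.Module):
    # Stateless: no parameters of its own, so it adds no new state-dict keys,
    # but being an nn.Module means forward hooks (e.g. the NoOpManager ones)
    # can be registered on it.
    def forward(self, x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        return F.linear(x, weight)


class TiedLinear:
    # Still a plain Python class; it just borrows the embedding's weight at call time,
    # so the only weight in the state dict remains the tied embedding's.
    def __init__(self, tied_module: nn.Module):
        self.tied_module = tied_module  # e.g. the token embedding
        self.linear = Linear()

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x, self.tied_module.weight)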
Will add the option to all the configs for full finetuning
@@ -83,7 +83,7 @@ dtype: bf16

# Activations Memory
enable_activation_checkpointing: True
enable_activation_offloading: True
enable_activation_offloading: True # True reduces memory
nit but do we want to say "True reduces memory" even if we already set it to True?
Also I'm curious why we choose to enable it here (other than the fact that it was already enabled). Seems like the general rule of thumb is to enable for low-memory configs? But this one I'm not clear on
Great catch! I intended to only set it to True in low-memory configs. Not sure how this one happened.
With that being said, I think that we should keep the comment, even when it's True already. Do you disagree?
@@ -68,6 +68,7 @@ device: cuda

# Memory management
enable_activation_checkpointing: True
enable_activation_offloading: False # True reduces memory
Can you remind me.. did we test activation offloading on the vision models?
I didn't include it in my tests. Let me do it tomorrow.
recipes/full_finetune_distributed.py (outdated)
back during the backward pass. As always, there is a tradeoff--these savings in memory can
come at the cost of training performance and CPU resources. To recover some runtime cost,
we've added an option to enable offloading on a different stream to permit overlapping with
the computation. This option is currently only available on PyTorch nightly 2.5.0.dev20240907
Minor point but in two days 2.5 will be stable so we may not need this comment about nightlies by the time this lands anyways (fine to keep it in, just pointing it out)
agreed
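For reference, a minimal sketch of what that docstring describes, assuming the stream-based offloading is exposed as a use_streams flag on OffloadActivations (the flag name and the toy model here are my assumptions):

import torch
import torch.nn as nn
from torchtune.training import OffloadActivations

# Toy model and input purely for illustration; requires a CUDA device.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16)).cuda()
x = torch.randn(4, 16, device="cuda", requires_grad=True)

# Offload saved activations to CPU on a separate stream so the copies can
# overlap with compute (use_streams flag name is an assumption on my part).
with OffloadActivations(use_streams=True):
    out = model(x)  # activations are stashed on CPU as they are saved
out.sum().backward()  # activations are brought back during the backward pass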
@@ -34,6 +34,7 @@ def _get_test_config_overrides(self):
"batch_size=4",
"dtype=fp32",
"enable_activation_checkpointing=False",
"enable_activation_offloading=False",
Should we set it to True in at least one of our test cases somewhere?
We should. I will see if I can quickly add it tomorrow.
if enable_activation_offloading:
    activations_handling_ctx = OffloadActivations()

    # Below is our hack to disable offloading the last output Linear in every
    # step, as the cost for offloading the activation and then soon after bringing
    # it back is expensive. Moreover, due to heuristics in our streaming API,
    # we actually use more memory if we offload it as it interferes with chunkedCE.
    output_head_detected = False
    if hasattr(model, "output"):
        noop_ctx = NoOpManager()
        if isinstance(model.output, nn.Module):
            model.output.register_forward_pre_hook(
                lambda *args: noop_ctx.__enter__()
            )
            model.output.register_forward_hook(
                lambda *args: noop_ctx.__exit__(), always_call=True
            )
            output_head_detected = True
        elif isinstance(model.output, TiedLinear):
            model.output.linear.register_forward_pre_hook(
                lambda *args: noop_ctx.__enter__()
            )
            model.output.linear.register_forward_hook(
                lambda *args: noop_ctx.__exit__(), always_call=True
            )
            output_head_detected = True

    elif hasattr(model, "decoder"):
        noop_ctx = NoOpManager()
        if isinstance(model.decoder, nn.Module):
            model.decoder.output.register_forward_pre_hook(
                lambda *args: noop_ctx.__enter__()
            )
            model.decoder.output.register_forward_hook(
                lambda *args: noop_ctx.__exit__(), always_call=True
            )
            output_head_detected = True

    if not output_head_detected:
        log.warning(
            "During activation offloading, no output head was detected. "
            "If your model has an output head, it will be offloaded. "
            "This usually greatly slows training, given the large vocabulary size. "
            "To change this behavior, set your output head as model.output and make it "
            "an nn.Module."
        )

else:
    activations_handling_ctx = contextlib.nullcontext()
This covers all of our cases, but I don't like too much how it looks. In the future, change it to identify tensor size, and if larger than a threshold, make it a no-op.
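A rough sketch of that follow-up idea (the helper name and threshold are hypothetical, not part of this PR), reusing the same NoOpManager hook pattern as above, with per-module parameter count standing in for "tensor size":

import torch.nn as nn
# Assumes NoOpManager is importable from torchtune.training.
from torchtune.training import NoOpManager


def register_noop_for_large_modules(
    model: nn.Module, noop_ctx: NoOpManager, threshold_numel: int = 100_000_000
) -> None:
    # Skip offloading for any module whose own parameters exceed a size
    # threshold, instead of special-casing model.output.
    for module in model.modules():
        numel = sum(p.numel() for p in module.parameters(recurse=False))
        if numel > threshold_numel:
            module.register_forward_pre_hook(lambda *args: noop_ctx.__enter__())
            module.register_forward_hook(
                lambda *args: noop_ctx.__exit__(), always_call=True
            )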
@@ -173,6 +175,7 @@ def test_training_state_on_resume(
resume_from_checkpoint=True \
metric_logger.filename={log_file} \
enable_activation_checkpointing=True \
enable_activation_offloading=True \
I enabled it for some tests in this file only. It tests LoRA and distributed. Ideally, we should have it vs. many other parameters, like compile, vision, and tied embeddings. I won't address those in this PR. This should be part of the testing improvement, IMO.
@@ -569,7 +613,8 @@ def _loss_step(self, batch: Dict[str, torch.Tensor]) -> torch.Tensor:
# Shape [b, s], needed for the loss not the model
labels = batch.pop("labels")

logits = self._model(**batch)
with self.activations_handling_ctx:
Curiosity brings me here. If I gather correctly, every model forward pass for which we want to offload activations needs to sit inside this context manager?
correct!
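For anyone landing here later, a minimal usage sketch of that pattern; get_act_offloading_ctx_manager is the helper added in this PR, while the tiny model and batch below are made up for illustration and need a CUDA device:

import torch
import torch.nn as nn
from torchtune import training

# Toy stand-ins for the recipe's model and batch.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8)).cuda()
batch = torch.randn(2, 8, device="cuda", requires_grad=True)

ctx = training.get_act_offloading_ctx_manager(model, enable_activation_offloading=True)

# Only forward passes run inside the context manager have their activations offloaded.
with ctx:
    out = model(batch)

loss = out.sum()
loss.backward()  # offloaded activations are fetched back on demand during backward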
elif hasattr(model, "decoder"):
    # TODO: it errors out. Needs debugging.
    # assert_size_stride(rsqrt_2, (4, 32, 1601, 1), (52224, 1632, 1, 1))
    # AssertionError: expected size 4==4, stride 51232==52224 at dim=0;
    # expected size 32==32, stride 1601==1632 at dim=1
    raise NotImplementedError(
        "Multimodal model does not support activation offloading yet. Please set enable_activation_offloading=False"
    )
    # if isinstance(model.decoder, nn.Module):
    #     model.decoder.output.register_forward_pre_hook(
    #         lambda *args: noop_ctx.__enter__()
    #     )
    #     model.decoder.output.register_forward_hook(
    #         lambda *args: noop_ctx.__exit__(), always_call=True
    #     )
    #     output_head_detected = True
This needs debugging in a follow-up PR.
haha, can you remove the commented-out code?
We're not savages, remove the commented-out code.
Context
What is the purpose of this PR? Is it to
Changelog
Test plan