[core] Refactor of gradient_checkpointing #27020
```diff
@@ -951,11 +951,6 @@ class TFSwinPreTrainedModel(TFPreTrainedModel):
     config_class = SwinConfig
     base_model_prefix = "swin"
     main_input_name = "pixel_values"
-    supports_gradient_checkpointing = True
```
Here I removed it because it's not relevant to TF models.
Very nice cleanup!
```diff
@@ -1845,7 +1858,7 @@ def gradient_checkpointing_disable(self):
         activations".
         """
         if self.supports_gradient_checkpointing:
-            self.apply(partial(self._set_gradient_checkpointing, value=False))
+            self.apply(partial(self._set_gradient_checkpointing, gradient_checkpointing_func=None))
```
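To make the new calling convention concrete, here is a minimal sketch of how the enable/disable pair fits together after the refactor. It is reconstructed from the hunks shown in this PR, not copied from `modeling_utils.py`, and the merged code may differ in detail.

```python
# Hedged sketch reconstructed from the diff hunks in this PR.
from functools import partial

import torch
import torch.nn as nn
import torch.utils.checkpoint


class TinyEncoderLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)
        # Flag + callable are set by the parent model via _set_gradient_checkpointing.
        self.gradient_checkpointing = False
        self.gradient_checkpointing_func = None

    def forward(self, x):
        if self.gradient_checkpointing and self.training:
            # Checkpoint through __call__ so forward hooks still run.
            return self.gradient_checkpointing_func(self.linear.__call__, x)
        return self.linear(x)


class TinyModel(nn.Module):
    supports_gradient_checkpointing = True

    def __init__(self):
        super().__init__()
        self.layer = TinyEncoderLayer()

    def _set_gradient_checkpointing(self, module, gradient_checkpointing_func=None):
        # Submodules store the checkpointing callable itself; None means "disabled".
        if hasattr(module, "gradient_checkpointing"):
            module.gradient_checkpointing_func = gradient_checkpointing_func
            module.gradient_checkpointing = gradient_checkpointing_func is not None

    def gradient_checkpointing_enable(self):
        if self.supports_gradient_checkpointing:
            self.apply(partial(self._set_gradient_checkpointing,
                               gradient_checkpointing_func=torch.utils.checkpoint.checkpoint))

    def gradient_checkpointing_disable(self):
        if self.supports_gradient_checkpointing:
            self.apply(partial(self._set_gradient_checkpointing, gradient_checkpointing_func=None))

    def forward(self, x):
        return self.layer(x)
```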
When we disable gradient checkpointing, I think `module.gradient_checkpointing` will still be `True`. Let's make `module.gradient_checkpointing` a property so we always check whether the function is `None` or not, WDYT?
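A minimal sketch of the suggested property (hypothetical — not necessarily what was ultimately merged):

```python
import torch.nn as nn


class CheckpointableLayer(nn.Module):
    """Hypothetical layer: the boolean flag is derived from the callable."""

    def __init__(self):
        super().__init__()
        self.gradient_checkpointing_func = None

    @property
    def gradient_checkpointing(self):
        # Disabling (setting the function back to None) can never leave a stale
        # True flag behind, since the flag is recomputed on every access.
        return self.gradient_checkpointing_func is not None
```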
The property could go at the `ModelMixin` level?
Can you add a test to make sure setting and unsetting both work as expected (specifically for the fix we are implementing in TRL)?
+1
```python
# Enable / disable GC for the language model as well
if hasattr(self, "language_model") and hasattr(self.language_model, "_set_gradient_checkpointing"):
    self.language_model._set_gradient_checkpointing(module, gradient_checkpointing_func)
```
BLIP2 never propagated `gradient_checkpointing` to its `language_model`.
```python
# Enable / disable GC for the language model as well
if hasattr(self, "language_model") and hasattr(self.language_model, "_set_gradient_checkpointing"):
    self.language_model._set_gradient_checkpointing(module, gradient_checkpointing_func)
```
Same here
```python
for backbone_module in module.modules():
    if hasattr(backbone_module, "gradient_checkpointing"):
        backbone_module.gradient_checkpointing_func = gradient_checkpointing_func
        backbone_module.gradient_checkpointing = gradient_checkpointing_func is not None
```
Another edge case here: the backbone has some modules that support GC, but that attribute was never propagated to them.
It turns out ~30 architectures were not properly using `gradient_checkpointing`; I left 3 comments to be aware of.
I think we should use `__call__` rather than `forward` so that the hooks run!
Thanks a lot, very nice cleanup! 🔥
```diff
-            layer_outputs = torch.utils.checkpoint.checkpoint(
-                create_custom_forward(layer_module),
+            layer_outputs = self.gradient_checkpointing_func(
+                layer_module.__call__,
```
Let's document this in the gradient checkpointing doc (IMO important to know why `forward` and `__call__` are different).
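For illustration (a toy example, not transformers code): `module.__call__` routes through `nn.Module._call_impl`, which runs registered forward pre-/post-hooks, while calling `module.forward` directly bypasses them — and such hooks are how libraries like accelerate handle things such as device placement.

```python
import torch
import torch.nn as nn
import torch.utils.checkpoint


class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)


toy = Toy()
toy.register_forward_hook(lambda mod, inputs, output: print("forward hook ran"))
x = torch.randn(2, 4, requires_grad=True)

# Hooks fire: __call__ goes through nn.Module._call_impl.
torch.utils.checkpoint.checkpoint(toy.__call__, x, use_reentrant=False)

# Hooks are silently skipped: forward is invoked as a plain method.
torch.utils.checkpoint.checkpoint(toy.forward, x, use_reentrant=False)
```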
Ran some training tests with PEFT + GC using this branch and everything seems to pass! Merging once the CI is green.
* v1
* fix
* remove `create_custom_forward`
* fixup
* fixup
* add test and fix all failing GC tests
* remove all remaining `create_custom_forward` methods
* fix idefics bug
* fixup
* replace with `__call__`
* add comment
* quality
## Describe your changes

The latest version of transformers (>= 4.35.0) is not compatible with the model. PRs huggingface/transformers#27020 and huggingface/transformers#27073 change the expected signature of `_set_gradient_checkpointing`, which now doesn't match the model's: https://huggingface.co/microsoft/phi-1_5/blob/main/modeling_mixformer_sequential.py#L802

## Checklist before requesting a review

- [ ] Add unit tests for this change.
- [ ] Make sure all tests can pass.
- [ ] Update documents if necessary.
- [ ] Format your code by running `pre-commit run --all-files`
- [ ] Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

## (Optional) Issue link
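For reference, a rough sketch of the signature change that breaks such custom-code models. The "before" form matches the `value=False` call removed in this PR's diff; the "after" form is what the refactor passes via `partial(..., gradient_checkpointing_func=...)`. The function names below are illustrative only.

```python
# Before (transformers < 4.35), custom models typically implemented roughly:
def _set_gradient_checkpointing_old(self, module, value=False):
    if hasattr(module, "gradient_checkpointing"):
        module.gradient_checkpointing = value


# After the refactor, the method receives the checkpointing callable instead of
# a boolean, so the old signature no longer matches what the base model passes.
def _set_gradient_checkpointing_new(self, module, gradient_checkpointing_func=None):
    if hasattr(module, "gradient_checkpointing"):
        module.gradient_checkpointing_func = gradient_checkpointing_func
        module.gradient_checkpointing = gradient_checkpointing_func is not None
```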
What is the difference between `enable_gradient_checkpointing` and `gradient_checkpointing_enable`?
@lucasjinreal I can only see
🤯 transformers/src/transformers/modeling_utils.py, line 2195 in af4c026
What does this PR do?

Alternative to #26917.

This way we make `set_gradient_checkpointing` more modular, as requested by some users - e.g. #21381 (comment). Fixes some issues with DDP such as huggingface/trl#835.

Also removed GC support from `TFSwin`, as in theory `gradient_checkpointing` is used only for PT models. Also added a CI test for that.

For users that want to use `gradient_checkpointing` with `use_reentrant=False`:
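The code snippet that originally followed was not captured in this scrape. As a hedged sketch, in transformers >= 4.35 the user-facing call forwards keyword arguments to `torch.utils.checkpoint.checkpoint` via `gradient_checkpointing_kwargs` (the model name below is just an example):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # any PT model that supports GC

# Kwargs are forwarded to torch.utils.checkpoint.checkpoint under the hood,
# enabling non-reentrant activation checkpointing.
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
```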