Multihooks should not keep tensor alive in closure #102859
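For context, the title refers to `register_multi_grad_hook` capturing the hooked tensors in a closure, keeping them alive even after the caller has dropped its last reference. Below is a minimal sketch of the usual remedy, holding a weak reference instead of the tensor itself. It uses a plain per-tensor `Tensor.register_hook` for brevity, so it illustrates the pattern rather than reproducing the actual patch.

```python
import weakref

import torch


def make_grad_hook(tensor):
    # Capturing `tensor` directly in this closure would keep it (and its
    # memory) alive for as long as the hook stays registered. A weak
    # reference lets the tensor be freed normally.
    ref = weakref.ref(tensor)

    def hook(grad):
        t = ref()
        if t is not None:
            print(f"got grad for a tensor of shape {tuple(t.shape)}")

    return hook


x = torch.randn(3, requires_grad=True)
handle = x.register_hook(make_grad_hook(x))
x.sum().backward()  # prints: got grad for a tensor of shape (3,)
handle.remove()
```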
Conversation
[ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102859

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures as of commit 8e8edde.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Oops, nice catch
This PR makes a first attempt at improving FSDP's fine-tuning support by adding hooks to reshard frozen parameters in the backward pass.

- Without this, frozen parameters involved in gradient computation are kept unsharded through the entire backward pass.
- The approach is to register a multi-grad ~~post~~-hook on the _input_ activations to the FSDP module, where the hook performs the resharding after all gradients for the FSDP module must have been computed (meaning that we are safe to reshard); see the sketch after this description.

~~This PR relies on adding a "multi-grad post-hook" that differs from the existing "multi-grad hook" from `register_multi_grad_hook()`. I find that with `register_multi_grad_hook()`, sometimes the unit test counting the number of times `_post_backward_reshard()` is called fails (due to it not being called).~~ This was resolved in #102859.

Pull Request resolved: #101982
Approved by: https://github.com/rohan-varma
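Below is a minimal sketch of the resharding mechanism described above, assuming a simplified setup rather than FSDP's actual internals: `post_backward_reshard` and `install_reshard_hook` are hypothetical stand-ins (FSDP's real counterpart is `_post_backward_reshard()`), and the hook is registered with the stock `torch.autograd.graph.register_multi_grad_hook`.

```python
import torch
from torch.autograd.graph import register_multi_grad_hook


def post_backward_reshard():
    # Hypothetical stand-in for FSDP's _post_backward_reshard(): this is
    # where the unsharded frozen parameters would be freed.
    print("resharding frozen parameters")


def install_reshard_hook(module_inputs):
    # The multi-grad hook fires only once gradients for *all* listed tensors
    # are ready. Gradients w.r.t. the module inputs are produced after every
    # gradient inside the module, so firing here means resharding is safe.
    tensors = [t for t in module_inputs if t.requires_grad]
    return register_multi_grad_hook(tensors, lambda grads: post_backward_reshard())


lin = torch.nn.Linear(8, 8)
for p in lin.parameters():
    p.requires_grad_(False)  # frozen parameters, as in the fine-tuning setup

x = torch.randn(4, 8, requires_grad=True)
handle = install_reshard_hook((x,))
lin(x).sum().backward()  # hook fires once, after all gradients are computed
handle.remove()
```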
Stack from ghstack (oldest at bottom):