Add grad norm logging single device #1451

lindawangg · 2024-08-29T17:58:06Z

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
other (please add here)

Please link to any issues this PR addresses.

Addresses [Feature request] Add grad norm monitoring/logging #1407

Changelog

What are the changes made in this PR?

Adds an optional clip_grad_norm config option
Logs grad norm by default if clip_grad_norm is set
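
The pattern described in this changelog can be sketched roughly as follows. This is a minimal illustration, not the recipe code: the toy linear model, the MSE loss, and the names training_step and log_dict are all assumptions made for the example. Only torch.nn.utils.clip_grad_norm_ is a real PyTorch call, and it returns the total gradient norm measured before clipping, which is the value that would be logged.

```python
import torch
from torch import nn


def training_step(model, optimizer, batch, targets, clip_grad_norm=None):
    """Run one optimizer step (hypothetical helper, not from the recipe).

    Returns (loss, grad_norm); grad_norm is None when clipping is disabled.
    """
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(batch), targets)
    loss.backward()

    grad_norm = None
    if clip_grad_norm is not None:
        # clip_grad_norm_ clips in place and returns the pre-clip total norm.
        grad_norm = torch.nn.utils.clip_grad_norm_(
            model.parameters(), max_norm=float(clip_grad_norm)
        )

    optimizer.step()
    return loss.item(), None if grad_norm is None else grad_norm.item()


torch.manual_seed(0)
model = nn.Linear(4, 1)  # toy stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

loss, grad_norm = training_step(model, optimizer, x, y, clip_grad_norm=1.0)

# Only add grad_norm to the logged metrics when it was computed.
log_dict = {"loss": loss}
if grad_norm is not None:
    log_dict["grad_norm"] = grad_norm
```

Because the norm is a by-product of the clipping call, logging it adds no separate pass over the gradients.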

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.)

run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
add unit tests for any new functionality
update docstrings for any new or updated methods or classes
run unit tests via pytest tests
run recipe tests via pytest tests -m integration_test
manually run any new or modified recipes with sufficient proof of correctness
include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

tune run lora_finetune_single_device --config llama2/7B_lora_single_device max_steps_per_epoch=100 clip_grad_norm=100 metric_logger._component_=torchtune.utils.metric_logging.TensorBoardLogger checkpointer.output_dir=/tmp/llama2_w_grad_norm

Did not observe any changes to overall compute time or loss.

orange is with log_grad_norm=True, blue is baseline

tune run full_finetune_single_device --config llama2/7B_full_low_memory metric_logger._component_=torchtune.utils.metric_logging.TensorBoardLogger checkpointer.output_dir=/tmp/llama2_w_grad_norm_full optimizer_in_bwd=False clip_grad_norm=100

Observed no increase in batch/s. Verified clip_grad_norm clips gradients (in orange)

green is baseline, purple is clip_grad_norm='inf', orange is clip_grad_norm=100
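
To make the difference between clip_grad_norm='inf' and clip_grad_norm=100 concrete, here is a plain-Python sketch of the arithmetic behind total-norm clipping. It is written to mirror the behavior documented for torch.nn.utils.clip_grad_norm_ (L2 norm over all gradients, scale factor capped at 1), but the function clip_by_total_norm and the eps value are assumptions for illustration, not library code.

```python
import math


def clip_by_total_norm(grads, max_norm, eps=1e-6):
    """Return (clipped_grads, total_norm) for a flat list of gradient values.

    total_norm is the L2 norm measured before any clipping is applied.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor is capped at 1.0, so gradients are never scaled up.
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads], total_norm


grads = [300.0, 400.0]  # L2 norm is 500

# A finite threshold rescales the gradients so their norm is about 100.
clipped, norm = clip_by_total_norm(grads, max_norm=100.0)

# An infinite threshold leaves the gradients untouched but still reports the norm.
unclipped, same_norm = clip_by_total_norm(grads, max_norm=float("inf"))
```

In both calls the reported norm is the pre-clip value, so a plot of the logged grad norm alone would look the same for 'inf' and 100. Only the gradients actually applied differ.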

UX

I did not change any public API;
I have added an example to docs or docstrings;

pytorch-bot · 2024-08-29T17:58:09Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1451

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2d46255 with merge base dfc69e2:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

lindawangg · 2024-08-29T20:46:42Z

recipes/lora_finetune_single_device.py

@@ -132,6 +136,10 @@ def __init__(self, cfg: DictConfig) -> None:
        self._resume_from_checkpoint = cfg.resume_from_checkpoint
        self._save_adapter_weights_only = cfg.get("save_adapter_weights_only", False)
        self._gradient_accumulation_steps = cfg.gradient_accumulation_steps
+        self._clip_grad_norm = cfg.get("clip_grad_norm", None)
+        self._log_grad_norm = cfg.get("log_grad_norm", False)


We don't really need log_grad_norm. If clip_grad_norm is set, we can log the grad norm by default. Computing it each step doesn't increase step time on a single device.

Yeah based on your results I agree, seems like no negative perf impact to just logging by default. Can you also make the same changes in full_finetune_single_device.py and confirm that no negative perf impact there? (I think we should enable for both recipes in one go.)

Distributed may be a different story if we have to sync, but we can consider that out of scope for this PR.

If the results look similar for full finetune, I agree it makes sense to go ahead and remove the log_grad_norm config. I would even bias towards just always logging grad norm (that way you don't have to awkwardly set it to 'inf' to turn on logging, imo that's a bit unintuitive).
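
One way to read the "always log" suggestion above is to treat a missing clip_grad_norm as an infinite threshold, so the norm is always computed and clipping becomes a no-op unless a finite value is configured. The sketch below is an assumption about how that could look, not necessarily what was merged: resolve_max_norm is a hypothetical helper, and a plain dict stands in for the recipe's DictConfig.

```python
def resolve_max_norm(cfg):
    """Map the optional clip_grad_norm setting to a float threshold.

    A missing or None value becomes infinity, meaning "measure but do not clip".
    """
    value = cfg.get("clip_grad_norm", None)
    return float("inf") if value is None else float(value)


# Plain dicts used here as stand-ins for the recipe config.
no_clip = resolve_max_norm({})
finite_clip = resolve_max_norm({"clip_grad_norm": 100})
explicit_inf = resolve_max_norm({"clip_grad_norm": "inf"})
```

With this mapping a user never has to set 'inf' by hand just to turn on logging.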

ebsmothers

This looks good! Apart from my comment around enabling for full finetune, we can also run a version with clip_grad_norm set to some finite value to ensure that we see it reflected correctly in the logs.

ebsmothers

This looks great, thank you!

add grad norm logging

0615f43

facebook-github-bot added the CLA Signed label Aug 29, 2024

lindawangg added 2 commits August 29, 2024 12:25

doc for gradient clipping

319d0bc

lint

aaea0ea

lindawangg commented Aug 29, 2024

lindawangg added 2 commits August 29, 2024 13:55

remove log_grad_norm from config

b497b4f

added clip grad norm to tests

61f9869

ebsmothers reviewed Aug 29, 2024

added clip grad norm to full finetune

06c2792

lindawangg marked this pull request as ready for review August 30, 2024 04:40

lindawangg changed the title from "[WIP] Add grad norm logging single device" to "Add grad norm logging single device" Aug 30, 2024

Merge branch 'main' into add-feature-grad-norm

2d46255

ebsmothers approved these changes Aug 31, 2024

ebsmothers merged commit d3df28b into pytorch:main Aug 31, 2024
20 checks passed

lindawangg deleted the add-feature-grad-norm branch September 1, 2024 04:59

RdoubleA mentioned this pull request Sep 6, 2024

Compute grad norm #897

Closed
