Apply rope on k earlier for efficiency #1558

Merged · 2 commits merged into pytorch:main on Sep 16, 2024

Conversation

@jackzhxng (Contributor) commented Sep 12, 2024

Context

By applying rotary positional embeddings (RoPE) to k earlier, we can squeeze out a bit more performance. Comparing finetuning statistics for Llama3.1-Instruct 8B with LoRA on GPU (full stats are in the "Appendix" section at the end of this PR description):

  • Tokens per second increased from 1246.52 to 1316.83 (roughly +70 tokens per second)
  • Loss stayed about the same (1.48158 -> 1.48607)
  • Peak active memory stayed the same (18.14733 -> 18.14733)

Changelog

Applies RoPE to k before it is expanded to match the number of query heads, i.e. while it is still shaped [b, s_y, n_kv, h_d] rather than [b, s_y, n_h, h_d]; see the sketch below.
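For illustration, here is a minimal sketch of this ordering. It is not the actual torchtune attention code; `expand_kv_after_rope` and the `rope` callable are hypothetical stand-ins for the corresponding pieces of the attention module.

```python
import torch

def expand_kv_after_rope(k: torch.Tensor, rope, n_h: int) -> torch.Tensor:
    """Apply RoPE while k still has n_kv heads, then expand for GQA.

    k    -- key tensor shaped [b, s_y, n_kv, h_d]
    rope -- any callable applying rotary embeddings to its input
    n_h  -- number of query heads (a multiple of n_kv)
    """
    b, s_y, n_kv, h_d = k.shape
    q_per_kv = n_h // n_kv

    # Rotate the smaller [b, s_y, n_kv, h_d] tensor (this PR's ordering) ...
    k = rope(k)

    # ... and only then replicate kv heads to match the n_h query heads.
    k = k.unsqueeze(3).expand(b, s_y, n_kv, q_per_kv, h_d)
    return k.reshape(b, s_y, n_h, h_d)
```

Because RoPE now runs over n_kv rather than n_h heads, it touches n_h / n_kv times fewer elements (e.g. 4x fewer when n_h = 32 and n_kv = 8), which is where the extra throughput comes from; the result is numerically identical since the expansion only repeats heads.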

Test plan

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

No public API changes

Appendix

Stats from one epoch of finetuning Llama3.1-Instruct 8B prior to this PR:

1|25|Loss: 1.481581211090088: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 25/25 [12:49<00:00, 30.78s/it]
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:               global_step ▁▁▂▂▂▂▃▃▃▄▄▄▅▅▅▅▆▆▆▇▇▇▇██
wandb:                      loss ▆▆▇▇▆▆▇▇▅▇▅▅█▅▆▆▅▅▄▃▃▃▂▂▁
wandb:                        lr ▁▁▂▂▂▂▃▃▃▄▄▄▄▅▅▅▆▆▆▇▇▇▇██
wandb:        peak_memory_active ▅▆▇█▆█▇▆▄██▆▅▆▁▆▆▇█▄▇▇▃▅▇
wandb:         peak_memory_alloc ▅▆▇█▆█▇▆▄██▆▅▆▁▆▆▇█▄▇▇▃▅▇
wandb:      peak_memory_reserved ▁▁▃██████████████████████
wandb: tokens_per_second_per_gpu ▄▅▅▅▅▃▂▁▆▃▅█▃▆▁▄▂▅▂▃▄▅▃▂▅
wandb: 
wandb: Run summary:
wandb:               global_step 25
wandb:                      loss 1.48158
wandb:                        lr 7e-05
wandb:        peak_memory_active 18.14733
wandb:         peak_memory_alloc 18.14733
wandb:      peak_memory_reserved 19.43555
wandb: tokens_per_second_per_gpu 1246.51852
wandb: 
wandb: 🚀 View run lilac-puddle-14 at: https://wandb.ai/dvorjackz-meta/torchtune/runs/21e7roul
wandb: ⭐️ View project at: https://wandb.ai/dvorjackz-meta/torchtune
wandb: Synced 5 W&B file(s), 0 media file(s), 2 artifact file(s) and 1 other file(s)
wandb: Find logs at: /tmp/wandb/run-20240916_071220-21e7roul/logs

Stats from one epoch of finetuning Llama3.1-Instruct 8B after this PR:

1|25|Loss: 1.48606538772583: 100%|██████████████████████████████| 25/25 [12:34<00:00, 30.17s/it]
wandb:                                                                                
wandb: 
wandb: Run history:
wandb:               global_step ▁▁▂▂▂▂▃▃▃▄▄▄▅▅▅▅▆▆▆▇▇▇▇██
wandb:                      loss ▆▆▇▆▆▆▇▇▅▇▅▅█▆▆▆▅▅▄▃▃▂▂▂▁
wandb:                        lr ▁▁▂▂▂▂▃▃▃▄▄▄▄▅▅▅▆▆▆▇▇▇▇██
wandb:        peak_memory_active ▅▆▇█▆█▇▆▄██▆▅▆▁▆▆▇█▄▇▇▃▅▇
wandb:         peak_memory_alloc ▅▆▇█▆█▇▆▄██▆▅▆▁▆▆▇█▄▇▇▃▅▇
wandb:      peak_memory_reserved ▁▁▃██████████████████████
wandb: tokens_per_second_per_gpu ▅▇▄▆▆▅▅▂▆▄▇█▂▆▂▄▁▄▃▃▄▅▄▄▆
wandb: 
wandb: Run summary:
wandb:               global_step 25
wandb:                      loss 1.48607
wandb:                        lr 7e-05
wandb:        peak_memory_active 18.14733
wandb:         peak_memory_alloc 18.14733
wandb:      peak_memory_reserved 19.43555
wandb: tokens_per_second_per_gpu 1316.82931
wandb: 
wandb: 🚀 View run azure-forest-13 at: https://wandb.ai/dvorjackz-meta/torchtune/runs/uw4eja2j
wandb: ⭐️ View project at: https://wandb.ai/dvorjackz-meta/torchtune
wandb: Synced 4 W&B file(s), 0 media file(s), 3 artifact file(s) and 1 other file(s)
wandb: Find logs at: /tmp/wandb/run-20240916_064527-uw4eja2j/logs

pytorch-bot bot commented Sep 12, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1558

✅ No Failures

As of commit dd3bf44 with merge base d7fae96:
💚 Looks good so far! There are no failures yet. 💚


@facebook-github-bot added the CLA Signed label Sep 12, 2024
@codecov-commenter commented

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.32%. Comparing base (221031a) to head (75f6975).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1558      +/-   ##
==========================================
- Coverage   73.36%   73.32%   -0.04%     
==========================================
  Files         287      287              
  Lines       14142    14123      -19     
==========================================
- Hits        10375    10356      -19     
  Misses       3767     3767              


@felipemello1 (Contributor) commented Sep 13, 2024

Thanks for the PR! Nice catch! Do you think you can run it with vs. without your changes and show the improvement plus no change in loss? That will make it easier to approve the PR.

If you use Weights & Biases, it's probably the easiest way to take a screenshot (you can log in with your Gmail, get the token, pip install wandb, wandb login, insert your token).

Then you can run your config like this:
tune run lora_finetune_single_device --config llama3_1/8B_lora_single_device max_steps_per_epoch=25 epochs=1 metric_logger=torchtune.training.metric_logging.WandBLogger log_peak_memory_stats=True

You can then paste the loss, tokens per second, and active memory.
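As an aside, a minimal sketch of pulling those summary numbers back out of W&B with its public API; the run paths below are the ones linked in the Appendix, and this assumes they are readable with your API key:

```python
import wandb

api = wandb.Api()
# Run IDs taken from the W&B links in the Appendix above.
before = api.run("dvorjackz-meta/torchtune/21e7roul")  # prior to this PR
after = api.run("dvorjackz-meta/torchtune/uw4eja2j")   # with this PR

for key in ("loss", "tokens_per_second_per_gpu", "peak_memory_active"):
    print(f"{key}: {before.summary.get(key)} -> {after.summary.get(key)}")
```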

@kimishpatel left a comment


Looks good to me and I will stamp, but hopefully other folks in tune are ok with this.

@ebsmothers (Contributor) commented

Yeah, +1 to @felipemello1's suggestion. Otherwise this looks good to me; I can stamp once you update the summary with the results.

@kimishpatel commented

@ebsmothers I tried to find your GH handle for tagging. Now I know.

@jackzhxng (Contributor, Author) commented

@felipemello1 @ebsmothers updated the summary with before and after statistics from finetuning.

@felipemello1 (Contributor) commented

Cool, thanks! Let's merge it after the tests are done.

@jackzhxng (Contributor, Author) commented

@felipemello1 need another approval since there have been changes since Kimish's approval 👀

@felipemello1 merged commit bc2c013 into pytorch:main on Sep 16, 2024
17 checks passed
@jackzhxng deleted the jackxz/rewrite-attention branch on September 16, 2024