config changes #1733

felipemello1 · 2024-10-01T20:53:41Z

Context

What is the purpose of this PR? Is it to

add a new feature
fix a bug
update tests and/or documentation
other (please add here)

Changelog

Main fixes:

llama 3.1 KD config was loading the wrong checkpoint
qwen2 config was loading the wrong checkpoint
llama 70B config had a repeated fused argument; also added a warning that it doesn't work on PyTorch stable
llama 405B LoRA config: added a warning not to use PyTorch stable

extra:
added compile flag for llama. I should have done it in another PR, sorry/not sorry.

Test plan

ran the configs and saved ckpt
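
The checkpoint mismatches listed in the changelog are the kind of error a small pre-flight check can catch before a run starts. The following is a minimal sketch, not part of this PR or of torchtune: `missing_checkpoint_files` is a hypothetical helper, and it assumes the `checkpointer` section has already been loaded into a plain dict with the `checkpoint_dir` and `checkpoint_files` keys seen in the configs.

```python
from pathlib import Path
import tempfile


def missing_checkpoint_files(checkpointer: dict) -> list[str]:
    """Return the entries of `checkpoint_files` that are not files under `checkpoint_dir`.

    Hypothetical helper for illustration; `checkpointer` is assumed to be the
    already-parsed `checkpointer` section of a config.
    """
    ckpt_dir = Path(checkpointer["checkpoint_dir"])
    return [
        name
        for name in checkpointer["checkpoint_files"]
        if not (ckpt_dir / name).is_file()
    ]


# Demo with a temporary directory standing in for a downloaded checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "model.safetensors").touch()

    old_cfg = {"checkpoint_dir": tmp, "checkpoint_files": ["hf_model_0001_0.pt"]}
    new_cfg = {"checkpoint_dir": tmp, "checkpoint_files": ["model.safetensors"]}

    missing_before = missing_checkpoint_files(old_cfg)
    missing_after = missing_checkpoint_files(new_cfg)
```

Here `missing_before` names the file the old config asked for, and `missing_after` is empty once the config points at the file that is actually in the directory.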

pytorch-bot · 2024-10-01T20:53:45Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1733

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9a94468 with merge base 55b4814:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

RdoubleA · 2024-10-01T21:00:09Z

recipes/configs/llama3/70B_full.yaml

@@ -14,6 +14,12 @@
 #   tune run --nproc_per_node 8 full_finetune_distributed --config llama3/70B_full checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
 #
 # This config is only tested on an 8xA100 machine.
+#
+# !!!!!!!!!!!!!
+# !!!!!!!!!!!!!


nit: I would take the !!!! out personally

I am afraid that users would ignore it without the "!" and not understand why it crashes.
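
A complementary option to a loud comment is a runtime check that warns when the installed PyTorch looks like a stable release. This is only a sketch under stated assumptions: `is_nightly_build` and `warn_if_stable` are hypothetical helpers, not torchtune API, and the check relies on the heuristic that nightly builds report a version string containing a `.dev` segment (for example `2.6.0.dev20241001+cu121`).

```python
import warnings


def is_nightly_build(version: str) -> bool:
    """Heuristic (assumption): nightly version strings contain a '.dev' segment."""
    return ".dev" in version


def warn_if_stable(version: str) -> bool:
    """Warn if `version` looks like a stable release; return True if a warning was issued."""
    if is_nightly_build(version):
        return False
    warnings.warn(
        f"PyTorch {version} looks like a stable build; "
        "the config comment warns that it does not work on stable.",
        stacklevel=2,
    )
    return True
```

In a recipe this could be called as `warn_if_stable(torch.__version__)` after `import torch`; the helpers take a plain string so they can be tested without PyTorch installed.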

RdoubleA · 2024-10-01T21:00:47Z

recipes/configs/qwen2/knowledge_distillation_single_device.yaml

  checkpoint_files: [
-    hf_model_0001_0.pt
+    model.safetensors
  ]
  recipe_checkpoint: null
  output_dir: /tmp/Qwen2-1.5B-Instruct-lora-finetune


should change output dir here

I thought about it. Indeed, in all of our configs we use the same output dir. I will make the change.

Co-authored-by: Felipe Mello <felipemello@fb.com>

Felipe Mello added 3 commits October 1, 2024 12:31

config changes

1385ef2

config changes

155927c

another one

237c09e

facebook-github-bot added the CLA Signed label Oct 1, 2024

RdoubleA approved these changes Oct 1, 2024

update output dir

9a94468

felipemello1 merged commit 59dc1f4 into pytorch:main Oct 1, 2024
17 checks passed

felipemello1 deleted the fix_kd_config branch October 1, 2024 22:02

RdoubleA pushed a commit that referenced this pull request Oct 2, 2024

config changes (#1733)

e54fb36

Co-authored-by: Felipe Mello <felipemello@fb.com>

ebsmothers mentioned this pull request Oct 8, 2024

v0.3 regression, full_finetune_distributed slower ? #1718

Open
