
Rename configs for consistency #640

Closed
wants to merge 12 commits into from

Conversation

joecummings
Contributor

@joecummings joecummings commented Apr 2, 2024

Context

  • Rename configs for the sake of consistency

Changelog

  • Rename configs
  • Add necessary params

Test plan

  • CI
  • Run each finetune
Distributed Full 13B

```
Step 0 | loss:1.975628137588501 lr:2e-05 gpu_resources:7446235648
Step 1 | loss:1.333866834640503 lr:2e-05 gpu_resources:20484339712
Step 2 | loss:0.8671443462371826 lr:2e-05 gpu_resources:20607488512
Step 3 | loss:0.8254014253616333 lr:2e-05 gpu_resources:20567572480
Step 4 | loss:0.5436609983444214 lr:2e-05 gpu_resources:20525571584
Step 5 | loss:0.9613887071609497 lr:2e-05 gpu_resources:20839943168
Step 6 | loss:0.8866332769393921 lr:2e-05 gpu_resources:20433629184
Step 7 | loss:1.0625394582748413 lr:2e-05 gpu_resources:20950252032
Step 8 | loss:1.1407074928283691 lr:2e-05 gpu_resources:20580147712
Step 9 | loss:0.9907525181770325 lr:2e-05 gpu_resources:20556594688
```
Single Device Low Memory

```
Step 0 | loss:1.7535516023635864 lr:2e-05 gpu_resources:13708965376
Step 1 | loss:1.5688252449035645 lr:2e-05 gpu_resources:13675462144
Step 2 | loss:0.9067410826683044 lr:2e-05 gpu_resources:13757010432
Step 3 | loss:0.9065065383911133 lr:2e-05 gpu_resources:13676431360
Step 4 | loss:0.7428079843521118 lr:2e-05 gpu_resources:13683441664
Step 5 | loss:0.7902029156684875 lr:2e-05 gpu_resources:13656572928
Step 6 | loss:0.9166772961616516 lr:2e-05 gpu_resources:13677983744
Step 7 | loss:0.7905868291854858 lr:2e-05 gpu_resources:13675191808
Step 8 | loss:0.6534633636474609 lr:2e-05 gpu_resources:13778076160
Step 9 | loss:0.9818300604820251 lr:2e-05 gpu_resources:13872566784
Step 10 | loss:0.9756331443786621 lr:2e-05 gpu_resources:13705963520
Step 11 | loss:1.0058780908584595 lr:2e-05 gpu_resources:13724141056
Step 12 | loss:0.9953532814979553 lr:2e-05 gpu_resources:13745547264
Step 13 | loss:0.7009839415550232 lr:2e-05 gpu_resources:13697622016
Step 14 | loss:0.6043163537979126 lr:2e-05 gpu_resources:13630823936
Step 15 | loss:0.8237800002098083 lr:2e-05 gpu_resources:13746426368
Step 16 | loss:0.7657855749130249 lr:2e-05 gpu_resources:13704583168
Step 17 | loss:0.9858065247535706 lr:2e-05 gpu_resources:13672973312
Step 18 | loss:0.6954495906829834 lr:2e-05 gpu_resources:13713632768
Step 19 | loss:0.913540244102478 lr:2e-05 gpu_resources:13754862080
Step 20 | loss:0.8516958951950073 lr:2e-05 gpu_resources:13755193856
Step 21 | loss:1.032060980796814 lr:2e-05 gpu_resources:13792598016
Step 22 | loss:1.0526211261749268 lr:2e-05 gpu_resources:13960436736
Step 23 | loss:0.964411735534668 lr:2e-05 gpu_resources:13730842112
Step 24 | loss:0.8686544895172119 lr:2e-05 gpu_resources:13657148416
Step 25 | loss:0.9268664121627808 lr:2e-05 gpu_resources:13751882240
Step 26 | loss:0.6310229897499084 lr:2e-05 gpu_resources:13630823936
Step 27 | loss:0.7998006939888 lr:2e-05 gpu_resources:13729932800
Step 28 | loss:1.1347177028656006 lr:2e-05 gpu_resources:14043738624
Step 29 | loss:1.1402934789657593 lr:2e-05 gpu_resources:13712254464
Step 30 | loss:0.7057677507400513 lr:2e-05 gpu_resources:13670933504
Step 31 | loss:0.417038232088089 lr:2e-05 gpu_resources:13631925760
Step 32 | loss:1.1646482944488525 lr:2e-05 gpu_resources:13755531776
Step 33 | loss:1.3732010126113892 lr:2e-05 gpu_resources:13793224704
Step 34 | loss:0.5704163908958435 lr:2e-05 gpu_resources:13755902464
Step 35 | loss:0.5280343294143677 lr:2e-05 gpu_resources:13674097664
Step 36 | loss:0.8417209982872009 lr:2e-05 gpu_resources:13629722112
Step 37 | loss:0.7497541904449463 lr:2e-05 gpu_resources:13704154112
Step 38 | loss:0.9346616864204407 lr:2e-05 gpu_resources:13737774592
Step 39 | loss:0.9737548828125 lr:2e-05 gpu_resources:13764952576
Step 40 | loss:1.0299153327941895 lr:2e-05 gpu_resources:13706856448
Step 41 | loss:1.294271469116211 lr:2e-05 gpu_resources:13810620416
Step 42 | loss:1.016019582748413 lr:2e-05 gpu_resources:13714587136
Step 43 | loss:0.6577346324920654 lr:2e-05 gpu_resources:13676910592
Step 44 | loss:1.0132160186767578 lr:2e-05 gpu_resources:13686069248
Step 45 | loss:1.1425808668136597 lr:2e-05 gpu_resources:13776425472
Step 46 | loss:0.5784585475921631 lr:2e-05 gpu_resources:13694982144
Step 47 | loss:0.7227046489715576 lr:2e-05 gpu_resources:13686933504
Step 48 | loss:0.45101112127304077 lr:2e-05 gpu_resources:13638820352
Step 49 | loss:0.9134987592697144 lr:2e-05 gpu_resources:13755593216
Step 50 | loss:0.9993022084236145 lr:2e-05 gpu_resources:13925936640
Step 51 | loss:1.1241962909698486 lr:2e-05 gpu_resources:13849367552
Step 52 | loss:1.1283633708953857 lr:2e-05 gpu_resources:13859287552
Step 53 | loss:0.721566915512085 lr:2e-05 gpu_resources:13704583168
Step 54 | loss:0.9560117721557617 lr:2e-05 gpu_resources:13779046912
Step 55 | loss:0.7151786684989929 lr:2e-05 gpu_resources:13765112320
Step 56 | loss:1.3303841352462769 lr:2e-05 gpu_resources:13804546048
Step 57 | loss:0.9961383938789368 lr:2e-05 gpu_resources:13706594304
Step 58 | loss:0.6961594223976135 lr:2e-05 gpu_resources:13702068224
```
Single Device Full

```
Step 0 | loss:1.7535516023635864 lr:2e-05 gpu_resources:13708965376
Step 1 | loss:1.5729001760482788 lr:2e-05 gpu_resources:40627453440
Step 2 | loss:0.9040272831916809 lr:2e-05 gpu_resources:40708667904
Step 3 | loss:0.905918538570404 lr:2e-05 gpu_resources:40627398656
Step 4 | loss:0.7422177195549011 lr:2e-05 gpu_resources:40635295744
Step 5 | loss:0.794807493686676 lr:2e-05 gpu_resources:40609094656
Step 6 | loss:0.9222156405448914 lr:2e-05 gpu_resources:40629516288
Step 7 | loss:0.7954462170600891 lr:2e-05 gpu_resources:40628090368
Step 8 | loss:0.6583808660507202 lr:2e-05 gpu_resources:40729608704
Step 9 | loss:0.9859774112701416 lr:2e-05 gpu_resources:40824099328
Step 10 | loss:0.9678923487663269 lr:2e-05 gpu_resources:40657102848
Step 11 | loss:0.9990934729576111 lr:2e-05 gpu_resources:40675751424
Step 12 | loss:0.9997169971466064 lr:2e-05 gpu_resources:40697368576
Step 13 | loss:0.7032765746116638 lr:2e-05 gpu_resources:40649154560
Step 14 | loss:0.6132537722587585 lr:2e-05 gpu_resources:40582356480
Step 15 | loss:0.823808491230011 lr:2e-05 gpu_resources:40699920896
Step 16 | loss:0.7733527421951294 lr:2e-05 gpu_resources:40656082944
Step 17 | loss:0.9709498286247253 lr:2e-05 gpu_resources:40625845248
Step 18 | loss:0.7007928490638733 lr:2e-05 gpu_resources:40665202176
Step 19 | loss:0.9106801152229309 lr:2e-05 gpu_resources:40706202112
Step 20 | loss:0.846969485282898 lr:2e-05 gpu_resources:40706202112
Step 21 | loss:1.0329310894012451 lr:2e-05 gpu_resources:40744130560
Step 22 | loss:1.0542248487472534 lr:2e-05 gpu_resources:40911068160
Step 23 | loss:0.9596423506736755 lr:2e-05 gpu_resources:40682374656
Step 24 | loss:0.8755935430526733 lr:2e-05 gpu_resources:40609152000
Step 25 | loss:0.9239263534545898 lr:2e-05 gpu_resources:40704227840
Step 26 | loss:0.6361761093139648 lr:2e-05 gpu_resources:40582356480
Step 27 | loss:0.796138346195221 lr:2e-05 gpu_resources:40686060032
Step 28 | loss:1.1315369606018066 lr:2e-05 gpu_resources:40995549696
Step 29 | loss:1.1409335136413574 lr:2e-05 gpu_resources:40665320960
Step 30 | loss:0.7138496041297913 lr:2e-05 gpu_resources:40621548544
Step 31 | loss:0.41565704345703125 lr:2e-05 gpu_resources:40583458304
Step 32 | loss:1.1632394790649414 lr:2e-05 gpu_resources:40707181056
Step 33 | loss:1.3730943202972412 lr:2e-05 gpu_resources:40744726528
Step 34 | loss:0.5690609216690063 lr:2e-05 gpu_resources:40707107328
Step 35 | loss:0.528704047203064 lr:2e-05 gpu_resources:40626600960
Step 36 | loss:0.8346728086471558 lr:2e-05 gpu_resources:40581254656
Step 37 | loss:0.7477327585220337 lr:2e-05 gpu_resources:40656160768
Step 38 | loss:0.9327971935272217 lr:2e-05 gpu_resources:40691136000
Step 39 | loss:0.974311113357544 lr:2e-05 gpu_resources:40717189632
Step 40 | loss:1.0257996320724487 lr:2e-05 gpu_resources:40658782208
Step 41 | loss:1.2980318069458008 lr:2e-05 gpu_resources:40762152960
Step 42 | loss:1.0182888507843018 lr:2e-05 gpu_resources:40666662400
Step 43 | loss:0.6646116375923157 lr:2e-05 gpu_resources:40627738624
Step 44 | loss:1.0156084299087524 lr:2e-05 gpu_resources:40637110272
```
Distributed LoRA

```
Step 1 | loss:2.5964765548706055 lr:0.0 gpu_resources:5025339904
Step 2 | loss:1.074425458908081 lr:2.9999999999999997e-06 gpu_resources:7873718784
Step 3 | loss:1.4658085107803345 lr:5.999999999999999e-06 gpu_resources:9905593856
Step 4 | loss:1.6493797302246094 lr:8.999999999999999e-06 gpu_resources:6483382784
Step 5 | loss:1.423293948173523 lr:1.1999999999999999e-05 gpu_resources:8328661504
Step 6 | loss:1.389696717262268 lr:1.4999999999999999e-05 gpu_resources:7169611264
Step 7 | loss:1.5408376455307007 lr:1.7999999999999997e-05 gpu_resources:6955935232
Step 8 | loss:1.8456465005874634 lr:2.1e-05 gpu_resources:5635693056
Step 9 | loss:1.3270502090454102 lr:2.3999999999999997e-05 gpu_resources:9908739584
Step 10 | loss:1.8210784196853638 lr:2.6999999999999996e-05 gpu_resources:7394244096
Step 11 | loss:1.313981294631958 lr:2.9999999999999997e-05 gpu_resources:9908995584
Step 12 | loss:1.1491199731826782 lr:3.2999999999999996e-05 gpu_resources:9908739584
Step 13 | loss:1.4932918548583984 lr:3.5999999999999994e-05 gpu_resources:9105335808
Step 14 | loss:1.7311609983444214 lr:3.9e-05 gpu_resources:6235895296
Step 15 | loss:1.2506591081619263 lr:4.2e-05 gpu_resources:9036596736
Step 16 | loss:1.0093636512756348 lr:4.4999999999999996e-05 gpu_resources:9908995584
Step 17 | loss:1.1732327938079834 lr:4.7999999999999994e-05 gpu_resources:9875961344
Step 18 | loss:1.2508728504180908 lr:5.1e-05 gpu_resources:8749933056
Step 19 | loss:1.2209484577178955 lr:5.399999999999999e-05 gpu_resources:8399506944
Step 20 | loss:1.4867607355117798 lr:5.6999999999999996e-05 gpu_resources:8684214784
Step 21 | loss:1.6217559576034546 lr:5.9999999999999995e-05 gpu_resources:6421789184
Step 22 | loss:1.6948394775390625 lr:6.299999999999999e-05 gpu_resources:6262581760
Step 23 | loss:1.1575061082839966 lr:6.599999999999999e-05 gpu_resources:8783717376
Step 24 | loss:1.4123483896255493 lr:6.9e-05 gpu_resources:8773515776
Step 25 | loss:1.2287126779556274 lr:7.199999999999999e-05 gpu_resources:8203270656
Step 26 | loss:1.0684505701065063 lr:7.5e-05 gpu_resources:9848978944
Step 27 | loss:1.0486751794815063 lr:7.8e-05 gpu_resources:9908739584
```
Single Device LoRA Run

```
$ tune run lora_finetune_single_device --config llama2/7B_lora checkpointer.checkpoint_dir=./model tokenizer.path=./tokenizer.model

Step 0 | loss:1.5095254182815552 lr:0.0 gpu_resources:15868745728
Step 1 | loss:1.5094150304794312 lr:2.9999999999999997e-06 gpu_resources:16379835392
Step 2 | loss:2.062058925628662 lr:5.999999999999999e-06 gpu_resources:15640255488
Step 3 | loss:1.2041356563568115 lr:8.999999999999999e-06 gpu_resources:16677767168
Step 4 | loss:1.5056698322296143 lr:1.1999999999999999e-05 gpu_resources:15996428288
Step 5 | loss:1.3346593379974365 lr:1.4999999999999999e-05 gpu_resources:17455480832
Step 6 | loss:0.946652889251709 lr:1.7999999999999997e-05 gpu_resources:17979030528
Step 7 | loss:1.978110909461975 lr:2.1e-05 gpu_resources:15257856000
Step 8 | loss:1.462736964225769 lr:2.3999999999999997e-05 gpu_resources:19525096448
Step 9 | loss:1.2792216539382935 lr:2.6999999999999996e-05 gpu_resources:17797661696
Step 10 | loss:1.5925408601760864 lr:2.9999999999999997e-05 gpu_resources:17827420160
Step 11 | loss:1.0488336086273193 lr:3.2999999999999996e-05 gpu_resources:18925244416
Step 12 | loss:1.3658620119094849 lr:3.5999999999999994e-05 gpu_resources:17565317120
Step 13 | loss:1.2458206415176392 lr:3.9e-05 gpu_resources:18421332992
Step 14 | loss:1.509882926940918 lr:4.2e-05 gpu_resources:16123260928
Step 15 | loss:1.4079318046569824 lr:4.4999999999999996e-05 gpu_resources:18271711232
Step 16 | loss:0.9930741190910339 lr:4.7999999999999994e-05 gpu_resources:18466159616
Step 17 | loss:1.2138861417770386 lr:5.1e-05 gpu_resources:18691720192
Step 18 | loss:1.4630460739135742 lr:5.399999999999999e-05 gpu_resources:16692729856
Step 19 | loss:1.7501559257507324 lr:5.6999999999999996e-05 gpu_resources:15579860992
Step 20 | loss:1.3850044012069702 lr:5.9999999999999995e-05 gpu_resources:16793435136
Step 21 | loss:1.3141872882843018 lr:6.299999999999999e-05 gpu_resources:17527858176
Step 22 | loss:1.7089552879333496 lr:6.599999999999999e-05 gpu_resources:15667201024
Step 23 | loss:1.1992567777633667 lr:6.9e-05 gpu_resources:17613946880
Step 24 | loss:1.4558281898498535 lr:7.199999999999999e-05 gpu_resources:19529290752
Step 25 | loss:1.5379853248596191 lr:7.5e-05 gpu_resources:16708468736
Step 26 | loss:1.506952166557312 lr:7.8e-05 gpu_resources:16581437440
Step 27 | loss:1.8689138889312744 lr:8.1e-05 gpu_resources:14741790208
Step 28 | loss:1.003369688987732 lr:8.4e-05 gpu_resources:18747243520
Step 29 | loss:1.259264349937439 lr:8.699999999999999e-05 gpu_resources:19529290752
Step 30 | loss:1.3746774196624756 lr:8.999999999999999e-05 gpu_resources:15266582528
Step 31 | loss:1.243005394935608 lr:9.3e-05 gpu_resources:17496827904
Step 32 | loss:1.1092965602874756 lr:9.599999999999999e-05 gpu_resources:19529290752
Step 33 | loss:1.6773215532302856 lr:9.9e-05 gpu_resources:14647905280

```


pytorch-bot bot commented Apr 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/640

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 01cfdce with merge base 5f865b0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 2, 2024
@joecummings joecummings marked this pull request as ready for review April 2, 2024 23:47
Contributor

@kartikayk kartikayk left a comment

Generally looks good - thanks for making these changes! The only question I have is around qlora - this only runs on single device. So should we make that explicit in the config?
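
One lightweight way to make that explicit is in the config's header comments, which already carry the launch instructions. A minimal sketch, assuming the existing header-comment convention; the wording and the config-name placeholder are illustrative, not what this PR ships:

```
# QLoRA finetuning of a Llama2 7B model.
#
# This config is only registered for the single-device recipe; launch it with:
#   tune run lora_finetune_single_device --config <qlora config>
#
# It is not intended for the distributed recipes.
```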

#
# To launch on 4 devices, run the following command from root:
# tune run --nproc_per_node 4 full_finetune_distributed \
# --config llama2/13B_full \
# $ tune run --nproc_per_node 4 full_finetune_distributed \
Contributor

Do we want to split out an example single-device command here too? (I guess for 13B this is maybe less likely anyways)

Contributor

Nvm I think these are not registered for the single-device recipes. Then maybe update the comment in 13B_lora.yaml?

Contributor Author

Update what comment? The topline comment says it's just for multi-device

Contributor

Sorry it's in the other yaml, commented at the specific lines

@@ -75,6 +72,7 @@ loss:
# Training
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 1
Contributor

@ebsmothers ebsmothers Apr 3, 2024

Gradient accumulation is not enabled on the LoRA distributed recipe yet (I guess it won't error out, but it's just a possible gotcha)

Contributor

This is why we should just enable it for the recipe, right? (Agree we still need to handle this missing/erroneous field issue more gracefully)

Comment on lines 1 to 2
# Config for multi-device with full_finetune_distributed.py or single-device full finetuning
# with full_finetune_single_device.py using a Llama2 7B model
Contributor

This sentence hurts my brain, can you make it clearer?
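
For reference, one possible clearer phrasing, offered only as a suggestion rather than as what was actually committed:

```
# Config for full finetuning of a Llama2 7B model.
# Works with full_finetune_distributed.py (multi-device)
# and with full_finetune_single_device.py (single device).
```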

@@ -47,7 +47,7 @@ tokenizer:

# Dataset and Sampler
dataset:
_component_: torchtune.datasets.alpaca_cleaned_dataset
_component_: torchtune.datasets.alpaca_dataset
Contributor

Why this change?

Contributor Author

Bad rebase

name="mistral/7B_full",
file_path="mistral/7B_full.yaml",
name="llama2/7B_full_low_memory",
file_path="llama2/7B_full_low_memory.yaml",
Contributor

So mistral recipes are distributed only now?

Contributor

You can run them with --nproc_per_node set to 1 :)
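
For illustration, the launch command this implies, reusing the mistral/7B_full name registered above and the command pattern from the config headers (a sketch, not a command copied from the PR):

```
# Run the distributed full-finetune recipe on a single GPU by setting the
# process count to 1:
#   tune run --nproc_per_node 1 full_finetune_distributed \
#     --config mistral/7B_full
```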

Contributor

I guess, but is this really the intent? Why not just keep them supported for the single-device recipe if it's really all the same?

Contributor Author

Added back support

@joecummings
Contributor Author

> Generally looks good - thanks for making these changes! The only question I have is around qlora - this only runs on single device. So should we make that explicit in the config?

This is called out several times in the config - do you mean updating the file name?
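
Purely as a hypothetical illustration of the file-name option (this name is not something settled in this PR), the constraint could be carried in the registered name itself, mirroring the registration entries quoted above:

```
# Hypothetical, for illustration only - a name that carries the constraint:
#   name="llama2/7B_qlora_single_device",
#   file_path="llama2/7B_qlora_single_device.yaml",
```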

@@ -168,7 +168,7 @@ def setup(self, cfg: DictConfig) -> None:
# checkpoint. Transforming the opt state dict is handled by this method
self._optimizer = self._setup_optimizer(
cfg_optimizer=cfg.optimizer,
optimizer_in_bwd=cfg.optimizer_in_bwd,
optimizer_in_bwd=self._optimizer_in_bwd,
Contributor Author

This was a fun bug waiting to happen lol

# Config for multi-device LoRA in lora_finetune_distributed.py
# using a Llama2 13B model
# Config for multi-device with lora_finetune_distributed.py or single-device LoRA
# finetuning with lora_finetune_single_device.py using a Llama2 13B model
Contributor

Not supported
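
In that case a header along the lines of the original multi-device-only wording would stay accurate; a sketch based on the first two quoted lines above:

```
# Config for multi-device LoRA finetuning in lora_finetune_distributed.py
# using a Llama2 13B model
```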

Comment on lines 13 to 15
# To launch on a single device, run the following command:
# $ tune run lora_finetune_single_device \
# --config llama2/13B_lora
Contributor

Not supported

Member

@rohan-varma rohan-varma left a comment

So I guess this was the case before this PR, but now that we're merging back the single- and multi-device configs, it's a bit confusing to see when FSDP is used versus not. In particular, there's no FSDP configuration in any of the configs; it's just that the distributed recipe wraps with FSDP by default.

This may hurt extensibility a bit, for example when we need to add DDP, pipeline parallelism, or tensor parallelism. As a very basic example, we may add a flag called parallelism and allow it to be DDP or FSDP. That flag would then live in a config that can also be used for single-device finetuning, where it isn't actually used, so it just adds complexity for users who only want to do a single-device finetune.
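
To make that concern concrete, a purely hypothetical sketch of such a flag (not part of this PR or of the current configs); a single-device run would carry the field without ever reading it:

```
# Hypothetical field, shown only to illustrate the extensibility concern.
parallelism: fsdp   # e.g. none | ddp | fsdp; ignored by single-device recipes
```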

@joecummings
Contributor Author

Closing in favor of #670

@joecummings joecummings deleted the rename-configs branch April 11, 2024 15:38