grad accum in LoRA distributed recipe #644

ebsmothers · 2024-04-03T02:20:30Z

Context

All our other finetune recipes already support gradient accumulation, so we might as well add it to this one too.

Implemented following the approach in full_finetune_distributed.py

Run without grad accumulation

CUDA_VISIBLE_DEVICES=5,6 tune run --nproc_per_node 2 --rdzv-backend=c10d --rdzv-endpoint=localhost:20000 lora_finetune_distributed --config llama2/7B_lora checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt'] checkpointer.output_dir=/data/users/ebs/checkpoints/new_tokenizer tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model max_steps_per_epoch=20 metric_logger=torchtune.utils.metric_logging.WandBLogger metric_logger.project=testing
...
1|20|Loss: 1.3451744318008423:   0%|▎

We can see from wandb that 20 iterations are logged.

Run with grad accumulation

CUDA_VISIBLE_DEVICES=5,6 tune run --nproc_per_node 2 --rdzv-backend=c10d --rdzv-endpoint=localhost:20000 lora_finetune_distributed --config llama2/7B_lora checkpointer=torchtune.utils.FullModelTorchTuneCheckpointer checkpointer.checkpoint_dir=/data/users/ebs/checkpoints checkpointer.checkpoint_files=['llama2-7b-torchtune.pt'] checkpointer.output_dir=/data/users/ebs/checkpoints/new_tokenizer tokenizer.path=/data/users/ebs/checkpoints/lora-debug/tokenizer.model max_steps_per_epoch=20 gradient_accumulation_steps=2 metric_logger=torchtune.utils.metric_logging.WandBLogger metric_logger.project=testing

...
1|40|Loss: 1.5071550607681274:   0%|▌

Since tqdm logs iteration number, not step number, we can see that both cases run for the expected number of iterations.
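The relationship between the two counts can be sketched as simple arithmetic. This assumes, as the two runs above suggest, that `max_steps_per_epoch` caps the number of optimizer steps and that each step consumes `gradient_accumulation_steps` batches. The function name is hypothetical and not part of torchtune.

```python
def expected_iterations(max_steps_per_epoch: int, gradient_accumulation_steps: int = 1) -> int:
    """Batches (tqdm iterations) consumed to complete the given number of optimizer steps.

    Assumption: one optimizer step is taken every `gradient_accumulation_steps` batches.
    """
    return max_steps_per_epoch * gradient_accumulation_steps


print(expected_iterations(20))     # run without grad accumulation -> 20
print(expected_iterations(20, 2))  # run with gradient_accumulation_steps=2 -> 40
```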

pytorch-bot · 2024-04-03T02:20:33Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/644

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9d6b4c6 with merge base 32d66df ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

RdoubleA · 2024-04-03T04:09:42Z

recipes/lora_finetune_distributed.py

@@ -468,6 +472,17 @@ def save_checkpoint(
                intermediate_checkpoint=intermediate_checkpoint,
            )

+    def _should_update_weights(self, current_iteration: int) -> bool:


RdoubleA: This should be a utility at this point since it's in every recipe

ebsmothers: Yeah I just don't know where to put it tbh. None of the existing files feel relevant and I don't wanna create a new file just for this. Lmk if you have suggestions here

RdoubleA: maybe torchtune/utils/training.py or torchtune/utils/optim_utils.py? I don't see any harm in making a new file

ebsmothers: Tbh I'm inclined to go the other way and say that we should just do this inline. I definitely do not want to add even more indirection than we already have. And this is literally just doing a mod check

ebsmothers: Also @joecummings @kartikayk or @rohan-varma if you have thoughts here

joecummings: Do it inline.

RdoubleA: hm yeah, that's cleaner than I thought it would be. excellent choice
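For readers unfamiliar with the pattern, here is a minimal sketch of what an inline mod check for gradient accumulation can look like. It is not the recipe's code: it uses a toy one-parameter model with a hand-computed gradient in pure Python, where the recipe would use `loss.backward()` and an optimizer. The names `train`, `batches`, and `lr` are illustrative assumptions.

```python
def train(batches, gradient_accumulation_steps, lr=0.1):
    """Fit y = w * x with squared error, stepping every N-th batch (toy example)."""
    w = 0.0
    accumulated_grad = 0.0
    num_steps = 0
    for idx, (x, y) in enumerate(batches):
        # Scale each batch's gradient so the accumulated value is an average.
        grad = 2 * (w * x - y) * x / gradient_accumulation_steps
        accumulated_grad += grad
        # The inline mod check: update weights only on every N-th iteration.
        if (idx + 1) % gradient_accumulation_steps == 0:
            w -= lr * accumulated_grad
            accumulated_grad = 0.0
            num_steps += 1
    return w, num_steps


w, num_steps = train([(1.0, 2.0)] * 40, gradient_accumulation_steps=2)
print(num_steps)  # 40 iterations with accumulation of 2 -> 20 weight updates
```

Because the check is a single condition on the loop index, inlining it keeps the training loop readable without a helper method.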

kartikayk

Very clean - thanks for adding this!

kartikayk · 2024-04-04T00:22:04Z

Since tqdm logs iteration number, not step number

How motivated are you to fix this? :) I think it should be a two line change tbh:

pbar = tqdm(desc=f"Training Epoch: {epoch+1}", total=len(self._dataloader))
....
# update pbar whenever we take a step
pbar.update(1)

I think this should work and then we'll no longer have the discrepancy between tqdm and our wandb logs. Can happen in a follow-up as well.
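A rough sketch of the suggestion, under stated assumptions: the bar is advanced only when an optimizer step is taken, and its total is the number of steps, not the number of batches. To stay dependency-free, the example uses a hypothetical `CountingBar` stand-in. A real tqdm bar exposes the same `update(n)` method and could be passed in its place. `run_epoch` is likewise illustrative, not the recipe's training loop.

```python
class CountingBar:
    """Stand-in for a tqdm progress bar that only records how far it has advanced."""

    def __init__(self, total):
        self.total = total
        self.n = 0

    def update(self, n=1):
        self.n += n


def run_epoch(num_batches, gradient_accumulation_steps, pbar):
    for idx in range(num_batches):
        # Forward and backward passes would run here on every iteration.
        if (idx + 1) % gradient_accumulation_steps == 0:
            # The optimizer step would run here; advance the bar once per step.
            pbar.update(1)


num_batches, accum = 40, 2
pbar = CountingBar(total=num_batches // accum)
run_epoch(num_batches, accum, pbar)
print(pbar.n, pbar.total)  # both count optimizer steps
```

With this arrangement the progress bar and the logged step number count the same thing, which is the discrepancy the comment above is about.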

ebsmothers · 2024-04-04T02:18:59Z

I will probably just punt to a follow-up if it's all the same

grad accum in LoRA recipe

a22b5fb

facebook-github-bot added the CLA Signed label Apr 3, 2024

ebsmothers requested a review from joecummings April 3, 2024 02:20

ebsmothers changed the title ~~grad accum in LoRA recipe~~ grad accum in LoRA distributed recipe Apr 3, 2024

ebsmothers mentioned this pull request Apr 3, 2024

Rename configs for consistency #640

Closed

RdoubleA reviewed Apr 3, 2024

why write a function for that which you can do inline -- anonymous

9d6b4c6

kartikayk approved these changes Apr 4, 2024

ebsmothers merged commit aacaadd into main Apr 4, 2024
20 checks passed

ebsmothers deleted the lora-distributed-grad-accum branch April 4, 2024 02:19

ebsmothers mentioned this pull request Apr 5, 2024

DPO #645

Merged

tcapelle pushed a commit to tcapelle/torchtune that referenced this pull request Apr 5, 2024

grad accum in LoRA distributed recipe (pytorch#644)

77eb695
