LoRA tutorial #368
Conversation
recipes/README.md (outdated)
@@ -65,5 +65,5 @@ tune --nnodes 1 --nproc_per_node 2 finetune_lora --config alpaca_llama2_lora_fin

To run the generation recipe, run this command from inside the main `/torchtune` directory:
```
- python -m recipes.alpaca_generate --native-checkpoint-path /tmp/finetune-llm/model_0.ckpt --tokenizer-path ~/llama/tokenizer.model --input "What is some cool music from the 1920s?"
+ python -m recipes.alpaca_generate --native-checkpoint-path /tmp/finetune-llm/model_0.ckpt --tokenizer-path ~/llama/tokenizer.model --instruction "What is some cool music from the 1920s?"
I think this is correct? At least based on the samples e.g. here
no this should remain input. instruction specifies the task
yep these examples are the same, the default instruction for the generate script is actually "Answer the question", and then the input is the question to be answered. That's the same as "Convert the given equation" (instruction) and "3x+5y=9" (input).
With your change the instruction becomes "What is some cool music from the 1920s?" with no input much like the first two examples, vs "Answer the question. What is some cool music from the 1920s?". They're both valid so this change is ok actually, but wanted to point out the slight nuance.
Ah thanks for clarifying, I missed the default value of "Answer the question." for the instruction. Sounds like this is not technically wrong then? In that case I will revert the change, but imo this is kinda unintuitive and we should revisit.
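For readers following this thread, here is a rough sketch of how Alpaca-style prompts typically combine the `instruction` and `input` fields. This uses the standard Alpaca template; the exact template and the function name are assumptions for illustration, not the generation recipe's actual code.

```python
def build_alpaca_prompt(instruction: str, input: str = "") -> str:
    """Compose a standard Alpaca-style prompt from an instruction and an optional input."""
    if input:
        return (
            "Below is an instruction that describes a task, paired with an input that "
            "provides further context. Write a response that appropriately completes "
            "the request.\n\n"
            f"### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
        )
    # With no input, the instruction alone carries the full request
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )


# The two variants discussed above:
print(build_alpaca_prompt("Answer the question.", "What is some cool music from the 1920s?"))
print(build_alpaca_prompt("What is some cool music from the 1920s?"))
```

Both prompts ask essentially the same thing, which is the nuance pointed out above.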
the loaded :code:`state_dict` are as expected. TorchTune's LoRA recipe does this by default via
:func:`torchtune.modules.peft.validate_state_dict_for_lora`.

Once we've loaded the base model weights, we also want to set only LoRA parameters to trainable.
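As a minimal sketch of that last step (plain PyTorch, not the recipe's actual code; it assumes LoRA parameters can be identified by "lora" appearing in their names):

```python
import torch.nn as nn


def set_only_lora_trainable(model: nn.Module) -> None:
    """Freeze all parameters except the LoRA ones.

    Assumes LoRA parameters contain "lora" in their names, as in the
    LoRALinear-style module sketched later in this thread.
    """
    for name, param in model.named_parameters():
        param.requires_grad = "lora" in name


# e.g. call set_only_lora_trainable(lora_model) right after loading the base weights
```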
Why wouldn't we just do this automatically when initializing lora_model?
Good question. So while that may save lines of code, my philosophy here is that (a) it is better to be explicit than implicit, and (b) we shouldn't integrate details of training into modeling components any more than is strictly necessary (otherwise our modeling components become hard to extend). So the model builder will return the architecture, but it doesn't do stuff like load weights, freeze base model parameters, wrap in FSDP, or any of that. All of that should be done in the recipe. This way a user who wants things to "just run" can use the recipe and not have to worry about which params are trainable, while a user who wants to customize or extend things can use our modeling components out of the box more easily.
.. code-block:: bash

    tune --nnodes 1 --nproc_per_node 2 lora_finetune --config alpaca_llama2_lora_finetune
Out of curiosity why do our tutorials use 2 GPUs? Is it so that we can show off distributed?
Yeah I think we do not have a clear story here, we should come up with a better philosophy across all our tutorials. My heuristic here is basically "if you're running on reasonable hardware (read: 4090), this won't OOM". However, the full finetune tutorial suggests running on 4 devices which with a 4090 should OOM (I think). I guess there are two types of problems we want to avoid here:
(1) We make it seem like a given finetune won't work when it actually does (e.g. in this case things will run fine on 1x A100, but that may not be obvious from the command)
(2) We give a command that will OOM on certain hardware but don't make that clear enough
Unless we aggressively define supported hardware types or explicitly enumerate a ton of caveats, I feel like the best solution here is to continue beefing up the supported hardware table in our readme (maybe move some version to our tutorials page), and point to that. At the same time that's also one extra bit of indirection we have to do each time we give a CLI command.
Either way, maybe it's worth adding a separate tutorial around distributed and some of our other training utilities so that we explicitly show usage of e.g. single-device, no FSDP runs contrasted with multi-device runs with FSDP enabled.
Previous comment isn't a blocker, I was just curious.
That said, I tried running this command and got an error that wandb wasn't found. We should make sure it's included in our default install.
Thanks for flagging this. Actually I think we decided to not include wandb in our core install, so this is an issue with the YAML. I am changing the default logger to disk in #347
should we detail what type of GPUs are used here (esp VRAM)? also would be nice to see a comparison with the full finetune command to show that indeed you can use LoRA to fine-tune the same model with less resources
should we detail what type of GPUs are used here (esp VRAM)?
Yeah we could say "on two GPUs (each having VRAM >= 23GB)" instead of "on two GPUs", wdyt?
would be nice to see a comparison with the full finetune command to show that indeed you can use LoRA to fine-tune the same model with less resources
I agree. The problem is we don't have any profiling utilities that can be easily integrated/demonstrated in a tutorial (if you can think of a way to do it let me know). But this is absolutely something I want to include in a follow-up.
This is a great point. I think we should explicitly call out the hardware we use for the tutorial i.e. this assumes we use N A100s with 80GB memory. To map this to your setup, please look at this table
# and add to the original model's outputs
return frozen_out + (self.alpha / self.rank) * lora_out

There are some other details around initialization which we omit here, but otherwise that's
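For anyone reading along who wants the full picture before opening the source, here is a minimal, illustrative re-implementation of a LoRA linear layer matching the snippet above. This is not torchtune's `LoRALinear`; dropout and some initialization details are simplified.

```python
import torch
import torch.nn as nn


class LoRALinearSketch(nn.Module):
    """Illustrative LoRA wrapper around a frozen linear projection."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        # The original projection, frozen during finetuning
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        self.linear.weight.requires_grad = False
        # Low-rank decomposition: A maps down to rank, B maps back up to out_dim
        self.lora_a = nn.Linear(in_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)
        # Standard LoRA init: B starts at zero so the initial update is a no-op
        nn.init.zeros_(self.lora_b.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen_out = self.linear(x)
        lora_out = self.lora_b(self.lora_a(x))
        # Scale the low-rank update and add to the original model's outputs
        return frozen_out + (self.alpha / self.rank) * lora_out
```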
Link to lora.py for users interested in the full implementation?
overall excellent tutorial and very pleasant to read and follow along. I just have many nit suggestions for beefing it up a bit
@@ -56,7 +56,7 @@ To run the recipe without any changes on 4 GPUs, launch a training run using Tun

.. code-block:: bash

-   tune --nnodes 1 --nproc_per_node 4 --config alpaca_llama2_full_finetune
+   tune --nnodes 1 --nproc_per_node 4 full_finetune --config alpaca_llama2_full_finetune
good catch
guilty as charged :(
This guide will teach you about `LoRA <https://arxiv.org/abs/2106.09685>`_, a parameter-efficient finetuning technique,
and show you how you can use TorchTune to finetune a Llama2 model with LoRA.
If you already know what LoRA is and want to get straight to running
your own LoRA finetune in TorchTune, you can jump to :ref:`this section<lora_recipe_label>`.
nit:
- your own LoRA finetune in TorchTune, you can jump to :ref:`this section<lora_recipe_label>`.
+ your own LoRA finetune in TorchTune, you can jump to :ref:`the recipe<lora_recipe_label>`.
hmmm this isn't the recipe though, it's the section of the tutorial showing how to run the recipe. I could directly use the section title instead, e.g.
- your own LoRA finetune in TorchTune, you can jump to :ref:`this section<lora_recipe_label>`.
+ your own LoRA finetune in TorchTune, you can jump to :ref:`LoRA finetuning recipe in TorchTune<lora_recipe_label>`.
What is LoRA?
-------------

`LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds a trainable
- `LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds a trainable
+ `LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds trainable
-------------

`LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds a trainable
low-rank decomposition to different layers of a neural network, then freezes
- low-rank decomposition to different layers of a neural network, then freezes
+ low-rank decomposition matrices to different layers of a neural network, then freezes
low-rank decomposition to different layers of a neural network, then freezes
the network's remaining parameters. LoRA is most commonly applied to
transformer models, in which case it is common to add the low-rank matrices
to some of the self-attention projections in each transformer layer.
I'd want to emphasize that it's parallel to linear layers, since many don't know what attention projections are
- to some of the self-attention projections in each transformer layer.
+ to some of the linear projections in each transformer layer's self attention.
and V projections. This means a LoRA decomposition of rank :code:`r=8` will reduce the number of trainable
parameters for a given projection from :math:`4096 * 4096 \approx 16M` to :math:`8 * 8192 \approx 65K`, a
reduction of over 99%.
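A quick back-of-the-envelope check of that count (the dimensions are the Llama2-7B attention sizes discussed above):

```python
embed_dim = 4096
rank = 8

full_params = embed_dim * embed_dim                 # 16,777,216 trainable params in the full projection
lora_params = rank * embed_dim + embed_dim * rank   # 65,536 = 8 * 8192 across the two low-rank matrices
print(f"reduction: {1 - lora_params / full_params:.2%}")  # reduction: 99.61%
```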
consider adding a few sentences about why we shouldn't just always go with LoRA fine-tuning (idk this answer) and when you would want to do full-finetuning vs LoRA. may be a bit out of scope for the tutorial but I think it's important to convey since we have both these recipes that users will have to choose from
Sorry to be a broken record 😅. Again I don't wanna make any general claims or suggest best practices here, tbh I don't think we have trained enough models on enough datasets to give that guidance. I agree this would be useful, but maybe we can have another tutorial that is more focused on a modeling deep-dive (think something like this tutorial from Sebastian Raschka).
(Feel free to verify this for yourself.)

Why does this matter? TorchTune makes it easy to load checkpoints for LoRA directly from our Llama2
model without any wrappers or custom checkpoint conversion logic.
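As a rough sketch of what that looks like in practice (plain PyTorch; the checkpoint path is a placeholder, `lora_model` is the model built with the `lora_llama2_7b` builder shown elsewhere in the tutorial, and the recipe itself additionally runs :func:`torchtune.modules.peft.validate_state_dict_for_lora`):

```python
import torch

# Load the base Llama2 checkpoint (placeholder path)
base_state_dict = torch.load("/tmp/llama2-7b/native_pytorch_model.pt", map_location="cpu")

# Because the LoRA model reuses the base model's parameter names, the base weights load
# directly; only the newly added LoRA parameters should show up as missing keys.
missing, unexpected = lora_model.load_state_dict(base_state_dict, strict=False)
assert not unexpected, "base checkpoint has keys the LoRA model does not recognize"
assert all("lora" in key for key in missing), "only LoRA params should be missing"
```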
love this point
Thanks @rohan-varma 😄
tune --nnodes 1 --nproc_per_node 2 lora_finetune --config alpaca_llama2_lora_finetune \
    --override lora_attn_modules='q_proj,k_proj,v_proj,output_proj' \
    lora_rank=16 output_dir=./lora_experiment_1
non-blocking, but it would be awesome if we could show comparable loss curves or eval metrics compared to full-finetuning. probably in a future followup once the eval story is more clear
Yeah I was actually trying to do eval and/or generation as part of this but as I alluded to in the summary the eval is not ready and the generation UX is clunky. Re loss curves: I would like to do this, but then I introduce a tensorboard/wandb dep in the tutorial. Is this something we're OK with? I guess we can add them with the caveat "you can install this optional dep to reproduce these"
We can show the curves and then comment on how we generated them without explicitly adding the dependency?
Yep this sounds good to me
.. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

    * Be familiar with the :ref:`overview of TorchTune<overview_label>`
- * Be familiar with the :ref:`overview of TorchTune<overview_label>`
+ * Be familiar with :ref:` TorchTune<overview_label>`
transformer models, in which case it is common to add the low-rank matrices
to some of the self-attention projections in each transformer layer.

By finetuning with LoRA (as opposed to finetuning all model parameters),
Maybe link in the full finetuning tutorial here?
By finetuning with LoRA (as opposed to finetuning all model parameters),
you can expect to see memory savings due to a substantial reduction in the
number of gradient parameters. When using an optimizer with momentum,
Hmm is gradient parameters a very common term? Why not "learnable parameters"?
Oh sorry I mean the actual grad values, not the parameters. I am trying to explicitly distinguish between model params (the total # of which increase slightly when using LoRA) and memory used by gradients (which decreases dramatically)
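To illustrate the distinction being made here, a rough sketch of estimating gradient and optimizer-state memory from whichever parameters actually require gradients. The factor of two assumes an Adam-style optimizer keeping two fp32 states per trainable parameter, and the byte counts are approximate; the function name is made up for this sketch.

```python
import torch.nn as nn


def estimate_training_memory(model: nn.Module, bytes_per_param: int = 4) -> dict:
    """Very rough parameter/gradient/optimizer-state memory estimate, in bytes."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return {
        # All parameters stay in memory regardless of whether they are trainable
        "params_bytes": total * bytes_per_param,
        # Gradients are only allocated for trainable parameters
        "grads_bytes": trainable * bytes_per_param,
        # Adam keeps two extra states (exp_avg, exp_avg_sq) per trainable parameter
        "optimizer_bytes": 2 * trainable * bytes_per_param,
    }
```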
How does LoRA work?
-------------------

LoRA replaces weight update matrices with a low-rank approximation. In general, weight updates
Extreme nit: "In general, weight updates for a given linear layer mapping dimension in_dim to dimension out_dim can have rank" reads a bit weirdly. At least I was confused. Maybe something like:
"In general, weight updates for a given linear layer Linear(in_dim, out_dim) can have rank" makes it easier to understand? Of course this assumes that people are aware of the PyTorch syntax. Feel free to disregard.
I think your version is clearer. Can also probably link to nn.Linear docs just in case it's not clear
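For readers skimming the thread, the point under discussion in standard LoRA notation (this is the formulation from the paper, not a quote from the tutorial):

```latex
% A full-rank update \Delta W to a linear layer W \in \mathbb{R}^{\text{out\_dim} \times \text{in\_dim}}
% can have rank up to \min(\text{in\_dim}, \text{out\_dim}). LoRA constrains the update to rank r:
W' = W + \frac{\alpha}{r} BA,
\qquad B \in \mathbb{R}^{\text{out\_dim} \times r},\quad
A \in \mathbb{R}^{r \times \text{in\_dim}},\quad
r \ll \min(\text{in\_dim}, \text{out\_dim})
```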
# The default settings for lora_llama2_7b will match those for llama2_7b
# We just need to define which layers we want LoRA applied to.
# We can choose from ["q_proj", "k_proj", "v_proj", and "output_proj"]
lora_model = lora_llama2_7b(lora_attn_modules=["q_proj", "v_proj"])
Sorry if I misremember, but I thought one of the tutorials claimed that applying lora to q, k and v should be the default. Is that not true?
So yes, this tutorial says that the best performance comes when applied to all layers. At the same time, a lot of references use Q and V as defaults, e.g. lit-gpt, HF PEFT (that 2nd one took some digging). So I am torn on what to do here, but feel it is better to do the intuitive thing than the "best" thing for our default values. Lmk if you disagree here
Thanks for putting this together! Overall this looks great and is one of the higher quality tutorials we have. It would have been awesome to actually show how some of the params impact eval, memory footprint etc in greater detail. But I don't think we have the tooling setup for that. Something to add to the backlog though. I think LoRA and QLoRA are the show stoppers and so the easier we make it for users to understand these, the more we'll have people use them. Left some nits. Accepting the PR, but I'll let you address the ongoing comments.
Changelog
First pass at a tutorial for LoRA fine-tuning. Basically the flow here is:
A couple things that I was hoping to include here but I don't think our repo is quite ready for:
Note: as-is, the LoRA configs given in the tutorial are not strictly in line with what's in our repo. We need to first land #347 to expose rank and alpha, but there is still some ongoing discussion there.
Test plan