LoRA tutorial #368

Merged: 8 commits from lora-tutorial into pytorch:main on Feb 14, 2024

Conversation

ebsmothers (Contributor, Author):

Changelog

First pass at a tutorial for LoRA fine-tuning. Basically the flow here is:

  • What LoRA is and how it works
  • What a PyTorch-native LoRA looks like and which components are available in TorchTune (more for hackers who want to piece everything together)
  • How to run the LoRA finetune recipe and how to experiment with config options like LoRA modules and rank

A couple of things I was hoping to include here that I don't think our repo is quite ready for:

  • Actually compare results from the different LoRA configs properly (i.e. run generations or evaluations on some ckpts). But
    • we don't really have an eval story yet, and
    • imo the generation UX is too inconsistent with the rest of our recipes to include in a tutorial (we should fix this).
  • Do some simple memory profiling to show memory savings
    • If we had something easily configurable I would do this; I started writing a script from scratch but it felt a bit long and would distract from the point of the tutorial.

Note: as is, the LoRA configs given in the tutorial are not strictly in line with what's in our repo. We first need to land #347 to expose rank and alpha, but there is still some ongoing discussion there.

Test plan

[Screenshot: 2024-02-11 at 10:59:17 AM]

netlify bot commented Feb 11, 2024

Deploy Preview for torchtune-preview ready!

🔨 Latest commit: 4062a46
🔍 Latest deploy log: https://app.netlify.com/sites/torchtune-preview/deploys/65cc0984b1453e0008ecda36
😎 Deploy Preview: https://deploy-preview-368--torchtune-preview.netlify.app

@facebook-github-bot added the CLA Signed label on Feb 11, 2024
@@ -65,5 +65,5 @@ tune --nnodes 1 --nproc_per_node 2 finetune_lora --config alpaca_llama2_lora_fin

To run the generation recipe, run this command from inside the main `/torchtune` directory:
```
-python -m recipes.alpaca_generate --native-checkpoint-path /tmp/finetune-llm/model_0.ckpt --tokenizer-path ~/llama/tokenizer.model --input "What is some cool music from the 1920s?"
+python -m recipes.alpaca_generate --native-checkpoint-path /tmp/finetune-llm/model_0.ckpt --tokenizer-path ~/llama/tokenizer.model --instruction "What is some cool music from the 1920s?"
```
ebsmothers (Author):

I think this is correct? At least based on the samples e.g. here

Contributor:

no this should remain input. instruction specifies the task

ebsmothers (Author):

Sorry maybe I am being dumb here, but what about the examples given in the HF dataset? See the examples below:

[Screenshot: examples from the HF dataset, 2024-02-12 at 4:32:13 PM]

Contributor:

yep these examples are the same, the default instruction for the generate script is actually "Answer the question", and then the input is the question to be answered. That's the same as "Convert the given equation" (instruction) and "3x+5y=9" (input).

With your change the instruction becomes "What is some cool music from the 1920s?" with no input much like the first two examples, vs "Answer the question. What is some cool music from the 1920s?". They're both valid so this change is ok actually, but wanted to point out the slight nuance.
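For reference, the standard Alpaca prompt template composes the two fields roughly like this (a sketch; not necessarily the exact code in recipes.alpaca_generate):

```python
def alpaca_prompt(instruction: str, input_text: str = "") -> str:
    """Compose an Alpaca-style prompt; `input_text` corresponds to the --input flag."""
    if input_text:
        return (
            "Below is an instruction that describes a task, paired with an input that "
            "provides further context. Write a response that appropriately completes "
            "the request.\n\n"
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )

# Passing the question as the instruction (no input) vs. as the input after the
# default "Answer the question" instruction yields two slightly different prompts.
print(alpaca_prompt("What is some cool music from the 1920s?"))
print(alpaca_prompt("Answer the question", "What is some cool music from the 1920s?"))
```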

ebsmothers (Author):

Ah thanks for clarifying, I missed the default value of "Answer the question." for the instruction. Sounds like this is not technically wrong then? In that case I will revert the change, but imo this is kinda unintuitive and we should revisit.

the loaded :code:`state_dict` are as expected. TorchTune's LoRA recipe does this by default via
:func:`torchtune.modules.peft.validate_state_dict_for_lora`.

Once we've loaded the base model weights, we also want to set only LoRA parameters to trainable.
Contributor:

Why wouldn't we just do this automatically when initializing lora_model?

ebsmothers (Author):

Good question. So while that may save lines of code, my philosophy here is that (a) it is better to be explicit than implicit, and (b) we shouldn't integrate details of training into modeling components any more than is strictly necessary (otherwise our modeling components become hard to extend). So the model builder will return the architecture, but it doesn't do stuff like load weights, freeze base model parameters, wrap in FSDP, or any of that. All of that should be done in the recipe. This way a user who wants things to "just run" can use the recipe and not have to worry about which params are trainable, while a user who wants to customize or extend things can use our modeling components out of the box more easily.
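For what it's worth, a minimal sketch of that explicit freezing step in a recipe (illustrative only; it identifies LoRA parameters by name, whereas the actual recipe may use dedicated helpers from torchtune.modules.peft instead of string matching):

```python
import torch.nn as nn

def freeze_all_but_lora(model: nn.Module) -> None:
    """Illustrative only: freeze the base model, leaving just LoRA params trainable.

    Assumes LoRA parameters can be identified by "lora" in their name; the real
    recipe may rely on torchtune.modules.peft utilities rather than string matching.
    """
    for name, param in model.named_parameters():
        param.requires_grad = "lora" in name

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```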


.. code-block:: bash

tune --nnodes 1 --nproc_per_node 2 lora_finetune --config alpaca_llama2_lora_finetune
Contributor:

Out of curiosity why do our tutorials use 2 GPUs? Is it so that we can show off distributed?

ebsmothers (Author):

Yeah, I don't think we have a clear story here; we should come up with a better philosophy across all our tutorials. My heuristic here is basically "if you're running on reasonable hardware (read: 4090), this won't OOM". However, the full finetune tutorial suggests running on 4 devices, which with a 4090 should OOM (I think). I guess there are two types of problems we want to avoid here:

(1) We make it seem like a given finetune won't work when it actually does (e.g. in this case things will run fine on 1x A100, but that may not be obvious from the command)
(2) We give a command that will OOM on certain hardware but don't make that clear enough

Unless we aggressively define supported hardware types or explicitly enumerate a ton of caveats, I feel like the best solution here is to continue beefing up the supported hardware table in our readme (maybe move some version to our tutorials page), and point to that. At the same time that's also one extra bit of indirection we have to do each time we give a CLI command.

Either way, maybe it's worth adding a separate tutorial around distributed and some of our other training utilities so that we explicitly show usage of e.g. single-device, no FSDP runs contrasted with multi-device runs with FSDP enabled.

Contributor:

Previous comment isn't a blocker, I was just curious.

That said, I tried running this command and got an error that wandb wasn't found. We should make sure it's included in our default install.

ebsmothers (Author):

Thanks for flagging this. Actually I think we decided to not include wandb in our core install, so this is an issue with the YAML. I am changing the default logger to disk in #347

Contributor:

should we detail what type of GPUs are used here (esp VRAM)? also would be nice to see a comparison with the full finetune command to show that indeed you can use LoRA to fine-tune the same model with less resources

ebsmothers (Author):

> should we detail what type of GPUs are used here (esp VRAM)?

Yeah we could say "on two GPUs (each having VRAM >= 23GB)" instead of "on two GPUs", wdyt?

> would be nice to see a comparison with the full finetune command to show that indeed you can use LoRA to fine-tune the same model with less resources

I agree. The problem is we don't have any profiling utilities that can be easily integrated/demonstrated in a tutorial (if you can think of a way to do it let me know). But this is absolutely something I want to include in a follow-up.
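One lightweight option for that follow-up (just a sketch using stock PyTorch CUDA memory APIs, not an existing TorchTune utility) could be something like:

```python
import torch

def report_peak_memory(step_fn, *args, **kwargs):
    """Run one training step and report peak GPU memory (single-device sketch)."""
    torch.cuda.reset_peak_memory_stats()
    out = step_fn(*args, **kwargs)
    torch.cuda.synchronize()
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"peak memory allocated: {peak_gib:.2f} GiB")
    return out
```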

Contributor:

This is a great point. I think we should explicitly call out the hardware we use for the tutorial, i.e. this assumes we use N A100s with 80GB memory. To map this to your setup, please look at this table.

# and add to the original model's outputs
return frozen_out + (self.alpha / self.rank) * lora_out

There are some other details around initialization which we omit here, but otherwise that's
Contributor:

Link to lora.py for users interested in the full implementation?
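For readers of this thread, a minimal sketch of such a layer along the lines of the quoted snippet (not TorchTune's actual lora.py, and initialization details are omitted here as well):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Sketch of a LoRA linear layer: a frozen base projection plus a low-rank update."""

    def __init__(self, in_dim: int, out_dim: int, rank: int, alpha: float):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        # Stand-in for the pretrained weight; in practice this is loaded from the
        # base model checkpoint and never trained.
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim), requires_grad=False)
        # The trainable low-rank factors (careful initialization omitted here).
        self.lora_a = nn.Linear(in_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        frozen_out = F.linear(x, self.weight)
        lora_out = self.lora_b(self.lora_a(x))
        # Scale the low-rank update and add it to the frozen output, as in the snippet above.
        return frozen_out + (self.alpha / self.rank) * lora_out
```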

RdoubleA (Contributor) left a review comment:

overall excellent tutorial and very pleasant to read and follow along. I just have many nit suggestions for beefing it up a bit

@@ -56,7 +56,7 @@ To run the recipe without any changes on 4 GPUs, launch a training run using Tun

.. code-block:: bash

tune --nnodes 1 --nproc_per_node 4 --config alpaca_llama2_full_finetune
tune --nnodes 1 --nproc_per_node 4 full_finetune --config alpaca_llama2_full_finetune
Contributor:

good catch

Contributor:

guilty as charged :(

This guide will teach you about `LoRA <https://arxiv.org/abs/2106.09685>`_, a parameter-efficient finetuning technique,
and show you how you can use TorchTune to finetune a Llama2 model with LoRA.
If you already know what LoRA is and want to get straight to running
your own LoRA finetune in TorchTune, you can jump to :ref:`this section<lora_recipe_label>`.
Contributor:

nit:

Suggested change:
- your own LoRA finetune in TorchTune, you can jump to :ref:`this section<lora_recipe_label>`.
+ your own LoRA finetune in TorchTune, you can jump to :ref:`the recipe<lora_recipe_label>`.

ebsmothers (Author):

hmmm this isn't the recipe though, it's the section of the tutorial showing how to run the recipe. I could directly use the section title instead, e.g.

Suggested change:
- your own LoRA finetune in TorchTune, you can jump to :ref:`this section<lora_recipe_label>`.
+ your own LoRA finetune in TorchTune, you can jump to :ref:`LoRA finetuning recipe in TorchTune<lora_recipe_label>`.

What is LoRA?
-------------

`LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds a trainable
Contributor:

Suggested change:
- `LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds a trainable
+ `LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds trainable

-------------

`LoRA <https://arxiv.org/abs/2106.09685>`_ is a parameter-efficient finetuning technique that adds a trainable
low-rank decomposition to different layers of a neural network, then freezes
Contributor:

Suggested change:
- low-rank decomposition to different layers of a neural network, then freezes
+ low-rank decomposition matrices to different layers of a neural network, then freezes

low-rank decomposition to different layers of a neural network, then freezes
the network's remaining parameters. LoRA is most commonly applied to
transformer models, in which case it is common to add the low-rank matrices
to some of the self-attention projections in each transformer layer.
Contributor:

I'd want to emphasize that it's parallel to linear layers, since many don't know what attention projections are

Suggested change:
- to some of the self-attention projections in each transformer layer.
+ to some of the linear projections in each transformer layer's self attention.

and V projections. This means a LoRA decomposition of rank :code:`r=8` will reduce the number of trainable
parameters for a given projection from :math:`4096 * 4096 \approx 16.8M` to :math:`8 * 8192 \approx 65K`, a
reduction of over 99%.
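As a quick sanity check of the numbers quoted above (back-of-envelope only):

```python
# Trainable parameters in a full 4096 x 4096 projection vs. its rank-8 LoRA factors.
embed_dim, rank = 4096, 8
full_params = embed_dim * embed_dim            # 16,777,216 (~16.8M)
lora_params = rank * (embed_dim + embed_dim)   # 65,536 (~65K)
print(f"reduction: {100 * (1 - lora_params / full_params):.2f}%")  # ~99.6%
```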

Contributor:

consider adding a few sentences about why we shouldn't just always go with LoRA fine-tuning (idk this answer) and when you would want to do full-finetuning vs LoRA. may be a bit out of scope for the tutorial but I think it's important to convey since we have both these recipes that users will have to choose from

ebsmothers (Author):

Sorry to be a broken record 😅. Again, I don't wanna make any general claims or suggest best practices here; tbh I don't think we have trained enough models on enough datasets to give that guidance. I agree this would be useful, but maybe we can have another tutorial that is more focused on a modeling deep-dive (think something like this tutorial from Sebastian Raschka).

(Feel free to verify this for yourself.)

Why does this matter? TorchTune makes it easy to load checkpoints for LoRA directly from our Llama2
model without any wrappers or custom checkpoint conversion logic.
rohan-varma (Contributor):

love this point

ebsmothers (Author):

Thanks @rohan-varma 😄



tune --nnodes 1 --nproc_per_node 2 lora_finetune --config alpaca_llama2_lora_finetune \
  --override lora_attn_modules='q_proj,k_proj,v_proj,output_proj' \
  lora_rank=16 output_dir=./lora_experiment_1
Contributor:

non-blocking, but it would be awesome if we could show comparable loss curves or eval metrics compared to full-finetuning. probably in a future followup once the eval story is more clear

ebsmothers (Author):

Yeah I was actually trying to do eval and/or generation as part of this but as I alluded to in the summary the eval is not ready and the generation UX is clunky. Re loss curves: I would like to do this, but then I introduce a tensorboard/wandb dep in the tutorial. Is this something we're OK with? I guess we can add them with the caveat "you can install this optional dep to reproduce these"

Contributor:

We can show the curves and then comment on how we generated them without explicitly adding the dependency?

ebsmothers (Author):

Yep this sounds good to me
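For the curves themselves, a minimal sketch of how we might generate them with torch's built-in TensorBoard writer (the exact logger wiring in the recipe is not shown here, and `tensorboard` itself would still be an optional install; the loss values and log_dir below are placeholders):

```python
from torch.utils.tensorboard import SummaryWriter  # needs `pip install tensorboard`

writer = SummaryWriter(log_dir="./lora_experiment_1/logs")

# Inside the training loop: log the scalar loss each step so the curves shown in
# the tutorial can be reproduced by anyone who opts into the dependency.
for step, loss in enumerate([2.31, 2.10, 1.98]):  # placeholder loss values
    writer.add_scalar("loss/train", loss, global_step=step)

writer.close()
```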


.. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites

* Be familiar with the :ref:`overview of TorchTune<overview_label>`
Contributor:

Suggested change:
- * Be familiar with the :ref:`overview of TorchTune<overview_label>`
+ * Be familiar with :ref:`TorchTune<overview_label>`

transformer models, in which case it is common to add the low-rank matrices
to some of the self-attention projections in each transformer layer.

By finetuning with LoRA (as opposed to finetuning all model parameters),
Contributor:

Maybe link in the full finetuning tutorial here?


By finetuning with LoRA (as opposed to finetuning all model parameters),
you can expect to see memory savings due to a substantial reduction in the
number of gradient parameters. When using an optimizer with momentum,
Contributor:

Hmm is gradient parameters a very common term? Why not "learnable parameters"?

ebsmothers (Author):

Oh sorry, I mean the actual grad values, not the parameters. I am trying to explicitly distinguish between model params (the total # of which increases slightly when using LoRA) and memory used by gradients (which decreases dramatically).
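To put rough numbers on that distinction, a back-of-envelope sketch (the 7B and ~4M trainable-parameter counts and the fp32 assumption are illustrative, not measured):

```python
# Rough comparison of gradient + AdamW state memory: full finetune vs. LoRA.
BYTES_PER_FP32 = 4

full_trainable = 7_000_000_000   # ~7B trainable params in a full Llama2-7B finetune
lora_trainable = 4_000_000       # illustrative: a few million LoRA params (rank-dependent)

def grad_plus_adamw_state_gib(n_params: int) -> float:
    # gradients + two AdamW moment buffers, all assumed to be fp32
    return 3 * n_params * BYTES_PER_FP32 / 1024**3

print(f"full finetune: ~{grad_plus_adamw_state_gib(full_trainable):.0f} GiB")   # ~78 GiB
print(f"LoRA:          ~{grad_plus_adamw_state_gib(lora_trainable):.2f} GiB")   # ~0.04 GiB
```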

How does LoRA work?
-------------------

LoRA replaces weight update matrices with a low-rank approximation. In general, weight updates
Contributor:

Extreme nit: "In general, weight updates for a given linear layer mapping dimension in_dim to dimension out_dim can have rank" reads a bit weirdly. At least I was confused. Maybe something like:

"In general, weight updates for a given linear layer Linear(in_dim, out_dim) can have rank" makes it easier to understand? Of course this assumes that people are aware of the PyTorch syntax. Feel free to disregard.

ebsmothers (Author):

I think your version is clearer. Can also probably link to nn.Linear docs just in case it's not clear
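For readers who want to convince themselves of the rank claim, a tiny self-contained check (plain PyTorch; dimensions borrowed from the 7B attention projections):

```python
import torch

in_dim, out_dim, r = 4096, 4096, 8
A = torch.randn(r, in_dim)
B = torch.randn(out_dim, r)

# B @ A is a full (out_dim, in_dim) update, but its rank can be at most r.
delta_w = B @ A
print(delta_w.shape)                      # torch.Size([4096, 4096])
print(torch.linalg.matrix_rank(delta_w))  # tensor(8)
```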

# The default settings for lora_llama2_7b will match those for llama2_7b
# We just need to define which layers we want LoRA applied to.
# We can choose from ["q_proj", "k_proj", "v_proj", and "output_proj"]
lora_model = lora_llama2_7b(lora_attn_modules=["q_proj", "v_proj"])
Contributor:

Sorry if I misremember, but I thought one of the tutorials claimed that applying lora to q, k and v should be the default. Is that not true?

ebsmothers (Author):

So yes, this tutorial says that the best performance comes when applied to all layers. At the same time, a lot of references use Q and V as defaults, e.g. lit-gpt, HF PEFT (that 2nd one took some digging). So I am torn on what to do here, but feel it is better to do the intuitive thing than the "best" thing for our default values. Lmk if you disagree here
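For anyone who wants to follow that reference and apply LoRA more broadly, the builder from the snippet above already allows it, e.g. (the import path here is an assumption):

```python
# Import path is an assumption; adjust to wherever lora_llama2_7b lives in TorchTune.
from torchtune.models import lora_llama2_7b

# Apply LoRA to every supported attention projection instead of just Q and V.
lora_model = lora_llama2_7b(
    lora_attn_modules=["q_proj", "k_proj", "v_proj", "output_proj"]
)
```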

kartikayk (Contributor):

Thanks for putting this together! Overall this looks great and is one of the higher quality tutorials we have. It would have been awesome to actually show how some of the params impact eval, memory footprint etc in greater detail. But I don't think we have the tooling setup for that. Something to add to the backlog though. I think LoRA and QLoRA are the show stoppers and so the easier we make it for users to understand these, the more we'll have people use them. Left some nits. Accepting the PR, but I'll let you address the ongoing comments.

@ebsmothers merged commit dd009da into pytorch:main on Feb 14, 2024 (17 checks passed)
@ebsmothers deleted the lora-tutorial branch on February 14, 2024 at 01:10