Added adapter_only option to LoRA #1220

Merged
merged 13 commits into pytorch:main on Jul 29, 2024

Conversation

@spider-man-tm (Contributor) commented Jul 25, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Changelog

In #1210, the "adapter_only" (boolean) option was added to the save_checkpoint method of each Checkpointer class. When this option is set to True, only the adapter weights are saved instead of the entire model weights.
This PR applies that change to LoRA fine-tuning.
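
To make the change concrete, here is a minimal sketch of what this looks like at a recipe's checkpoint-saving step. The helper name and the recipe object are illustrative, not the exact code in the diff; only the shape of the call is taken from the PR.

# Sketch only: a LoRA recipe forwarding the adapter_only flag that #1210
# added to the checkpointer's save_checkpoint.
def save_epoch_checkpoint(recipe, checkpoint_dict: dict, epoch: int) -> None:
    is_intermediate_epoch = epoch + 1 < recipe.total_epochs
    recipe._checkpointer.save_checkpoint(
        checkpoint_dict,
        epoch=epoch,
        intermediate_checkpoint=is_intermediate_epoch,
        # As first proposed in this PR: adapter weights only for intermediate
        # epochs, full model weights on the final epoch regardless of the flag.
        adapter_only=recipe._adapter_only and is_intermediate_epoch,
    )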

Test plan

Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help.)

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
    • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

pytorch-bot (bot) commented Jul 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1220

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit eb7d41d with merge base f0a15c5:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

Hi @spider-man-tm!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot added the "CLA Signed" label (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.) on Jul 25, 2024
@spider-man-tm marked this pull request as ready for review on July 25, 2024 09:52
@spider-man-tm (Contributor, Author):

To the reviewer,

Please let me know if there are any missing parts in my implementation or if there are any parts that should be removed. It's perfectly fine if the maintainer edits it directly.

@codecov-commenter

Codecov Report

Attention: Patch coverage is 0% with 20 lines in your changes missing coverage. Please review.

Project coverage is 70.19%. Comparing base (7eb89e2) to head (aeedd92).
Report is 2 commits behind head on main.

Files                                    Patch %   Lines
recipes/lora_dpo_distributed.py          0.00%     5 Missing ⚠️
recipes/lora_dpo_single_device.py        0.00%     5 Missing ⚠️
recipes/lora_finetune_distributed.py     0.00%     5 Missing ⚠️
recipes/lora_finetune_single_device.py   0.00%     5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1220      +/-   ##
==========================================
+ Coverage   67.81%   70.19%   +2.37%     
==========================================
  Files         219      220       +1     
  Lines        9908     9957      +49     
==========================================
+ Hits         6719     6989     +270     
+ Misses       3189     2968     -221     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

intermediate_checkpoint=intermediate_checkpoint,
)
# If the option was True, save only the adapter except for the last epoch
is_intermediate_epoch = epoch + 1 < self.total_epochs
Collaborator:
Thanks so much for your contribution : )
Small q: do we want to support the case where merged weights are never saved, even in the final epoch, and let the user merge them? Might be helpful when working with particularly large models. @joecummings
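
For reference, merging LoRA adapter weights back into a base linear layer is the standard update W ← W + (alpha / rank) · B · A. Below is a minimal plain-PyTorch sketch of that merge; the key names are illustrative and this is not any specific torchtune utility.

import torch

def merge_lora_weights(
    base_weight: torch.Tensor,   # [out_dim, in_dim] frozen base weight
    lora_a: torch.Tensor,        # [rank, in_dim] LoRA A matrix
    lora_b: torch.Tensor,        # [out_dim, rank] LoRA B matrix
    alpha: float,
    rank: int,
) -> torch.Tensor:
    """Return base_weight with the LoRA update folded in: W + (alpha / rank) * B @ A."""
    return base_weight + (alpha / rank) * (lora_b @ lora_a)

# Usage sketch: fold each adapter pair into the matching base weight.
# (Hypothetical key names; real checkpoints use model-specific naming.)
base_sd = {"layer.weight": torch.randn(16, 32)}
adapter_sd = {
    "layer.lora_a.weight": torch.randn(4, 32),
    "layer.lora_b.weight": torch.randn(16, 4),
}
base_sd["layer.weight"] = merge_lora_weights(
    base_sd["layer.weight"],
    adapter_sd["layer.lora_a.weight"],
    adapter_sd["layer.lora_b.weight"],
    alpha=8.0,
    rank=4,
)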

Contributor:

Yep, I think @SalmanMohammadi's point is in line with what I was thinking here. This does mean that we need to do a better job explaining how to merge weights or what exactly to do with them afterwards.

@SalmanMohammadi (Collaborator) commented Jul 25, 2024

Concretely, should we fix adapter_only=self._save_adapter_only here, and add something to the effect of:

if not is_intermediate_epoch:
    log.info(
        "Saving final model checkpoint. Please note that you have set save_adapter_only=True, so only adapter weights will be saved." 
        "You will need to merge the adapter weights into your base model for further use. See {where do we point them for now??}")

EDIT: could we also move this into the checkpointer's save_checkpoint?

# adapter_only option
# Set to True to save only the adapter weights for intermediate epochs.
# For the final epoch, the entire model weights will be saved regardless of this option.
adapter_only: False
Contributor:

nit: Consider save_adapter_weights_only instead of adapter_only for clarity.

I know it's different from the param name on the checkpointer, but the extra context isn't needed there b/c you already know you're dealing with saving a checkpoint.

cc @SalmanMohammadi and @pbontrager for thoughts.

@@ -143,6 +143,7 @@ def __init__(self, cfg: DictConfig) -> None:
self.global_step = 0

self._resume_from_checkpoint = cfg.resume_from_checkpoint
self._adapter_only = cfg.get("adapter_only", False)
Contributor:

Can you add a logging INFO statement saying that only the adapter weights will be saved so the user knows right away?

@SalmanMohammadi (Collaborator) commented Jul 25, 2024

I see your excellent point, and raise you another (dubious quality) point.

In line with moving towards outsourcing config validation/handholding into utilities from recipes, why not have save_adapter_weights_only as a property of the checkpointer? We could then throw an INFO in the checkpointer constructor.

Your config will then be:

checkpointer:
  ...
  save_adapter_weights_only: True

This would fit if we're not doing any additional logic checks on saving the adapter weights only, i.e. if save_adapter_weights_only: True we don't merge weights in the last epoch.

edit: sorry, as usual, I'm adding complexity to everything I touch

@joecummings (Contributor) commented Jul 25, 2024

Hmmmm, this does make sense to me. I'd much rather have the error propagated from the checkpointer, rather than the recipe file.

I think the complexity arises in how we actually instantiate the checkpointer. I don't think we want save_adapter_weights_only as a param on the initialization of the checkpointer b/c the user should be able to take the component and save adapters or the whole file if they want without re-initializing it. Therefore, it has to be a param on the save method.

So if we don't want it being passed to the constructor, we would have to parse out the save_adapter_weights_only from the config file before creating the checkpointer, which feels a little messy. Maybe we push the warning to the save method of the checkpointer? The only issue with this is that then the user could accidentally have the save_adapter_weights_only=True, train their whole model, and then save the checkpoints without actually wanting that feature. But that might be too hand-holdy to worry about?

Thanks for coming on this journey of vomiting all my thoughts on this PR. I think my TL;DR is that this config variable should be separate from the instantiation of the checkpointer, but maybe we push the logging info to the save method instead of in the main recipe.

Thoughts?

Collaborator:

maybe we push the logging info to the save method instead of in the main recipe.

The save_checkpoint method of the checkpointer, right? I agree here - we also validate adapter_only there.

Contributor:

Generally agree with @joecummings here.. the checkpointer class itself should mostly care about stuff that's global across a given fine-tuning run (input and output checkpoint formats). The flag to save only adapter weights is a more local thing since it can in theory vary depending on where we are in training. So imo it makes sense to expose only in save_checkpoint and not in init.

A separate (but still relevant) point: our checkpointer naming still seems to imply it's only used for full fine-tuning. This is confusing and another reason why it'd look weird to put save_adapter_weights_only in its init. (I don't think that's a good reason to keep save_adapter_weights_only outside the checkpointer init, I actually think we should just rename the checkpointer instead)
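
A rough sketch of the shape this converges on, using a simplified stand-in class rather than the real checkpointer API: the flag is accepted by save_checkpoint instead of __init__, and the user-facing notice is logged there so recipes don't have to duplicate it.

import logging

log = logging.getLogger(__name__)


class CheckpointerSketch:
    """Simplified stand-in for the checkpointer classes (not the real torchtune API)."""

    def __init__(self, output_dir: str) -> None:
        # Only run-global settings live here (paths, formats), not per-save flags.
        self._output_dir = output_dir

    def save_checkpoint(
        self,
        state_dict: dict,
        epoch: int,
        intermediate_checkpoint: bool = False,
        adapter_only: bool = False,
    ) -> None:
        # The per-save decision, and the warning about it, live on the save call,
        # so the same checkpointer instance can save adapters or full weights.
        if adapter_only:
            log.info(
                "Saving adapter weights only; merge them into the base model before inference."
            )
        # ... write state_dict (or just its adapter keys) under self._output_dir ...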

@joecummings (Contributor) left a comment:

This is looking great - thanks so much :)

Just a couple comments.

@spider-man-tm (Contributor, Author):

Thank you to all reviewers. I have made the following revisions:

  • The name of the option adapter_only was changed to save_adapter_weights_only. While adapter_only is understandable within the save_checkpoint method, it becomes ambiguous when used independently.
  • To avoid complications, the option is retained as a method option for save_checkpoint rather than being included in the Checkpointer constructor.
  • Adjusted the code to maintain consistency by removing the special case where all weights were saved only for the last epoch.
  • Added log output to inform the user that only adapter weights are being saved.
    • Due to a line-too-long error when running pre-commit run --all-files, the log message is split across 4 lines.

Feel free to review the changes and provide further feedback!
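
Taken together, these revisions mean the recipe-side flow looks roughly like the sketch below (an illustrative stand-in, not the exact diff): the renamed flag is read once from the config and passed through on every epoch, with no last-epoch special case, while the checkpointer emits the adapter-weights-only notice itself.

class LoRARecipeSketch:
    """Illustrative stand-in for the LoRA recipe classes touched by this PR."""

    def __init__(self, cfg, checkpointer, total_epochs: int) -> None:
        self._checkpointer = checkpointer
        self.total_epochs = total_epochs
        # Renamed option, read once from the recipe config (defaults to False).
        self._save_adapter_weights_only = cfg.get("save_adapter_weights_only", False)

    def save_checkpoint(self, checkpoint_dict: dict, epoch: int) -> None:
        # Passed through on every epoch; no special handling for the final epoch.
        self._checkpointer.save_checkpoint(
            checkpoint_dict,
            epoch=epoch,
            intermediate_checkpoint=(epoch + 1 < self.total_epochs),
            adapter_only=self._save_adapter_weights_only,
        )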

@ebsmothers (Contributor) left a comment:

Really appreciate you adding this! Just two minor things then I think this is good to go

Comment on lines 528 to 534
if not is_intermediate_epoch:
    log.info(
        "Saving final model checkpoint."
        "Please note that you have set save_adapter_weights_only=True, so only adapter weights will be saved."
        "You need to merge the adapter weights into your base model for further use. "
        f"See {type(self._checkpointer).__name__}"
    )
Contributor:

Maybe I'm being dense but is this log correct? I don't see where we actually check that save_adapter_weights_only=True prior to logging this (similar comment in the other recipes too)

Collaborator:

Good catch. Sorry, my original suggestion missed this.

Do we still think it's a good idea to move this into the checkpointer, rather than recipes?

@spider-man-tm (Contributor, Author) commented Jul 27, 2024

@SalmanMohammadi

Indeed, rather than writing similar code each time we create a new recipe, it might be better to handle this in the checkpointer. What do you think?

commit: e8e8757

Contributor:

Indeed, rather than writing similar code each time we create a new recipe, it might be better to handle this in the checkpointer. What do you think?

Good point, I'm inclined to agree with this (kinda similar to my other comment, it's nice to keep the recipe files themselves as clean as possible)

@@ -143,6 +143,9 @@ def __init__(self, cfg: DictConfig) -> None:
self.global_step = 0

self._resume_from_checkpoint = cfg.resume_from_checkpoint
self._save_adapter_weights_only = cfg.get("save_adapter_weights_only", False)
log.info(f"save_adapter_weights_only: {self._save_adapter_weights_only}")
Contributor:

This is more of a nit since I know @joecummings gave contrary advice here already, but I don't like logging config fields in the recipe like this. We already log the full config, no need to just directly re-log an individual config field unless there is something more non-trivial happening (e.g. we use some feature that depends on a particular version of PyTorch or something).

@spider-man-tm (Contributor, Author) commented Jul 27, 2024

@ebsmothers

Thank you for your review! I have made the corrections based on your feedback. 7dec6ef

Maybe I'm being dense but is this log correct? I don't see where we actually check that save_adapter_weights_only=True prior to logging this

You’re absolutely right, this could indeed confuse the user. I’ve now modified the code to log the message based on the value of the option.

            if not is_intermediate_epoch:
                log.info("Saving final epoch checkpoint.")
                if self._save_adapter_weights_only:
                    log.info(
                        "Please note that you have set save_adapter_weights_only=True, so only adapter weights will be saved. "
                        "You need to merge the adapter weights into your base model for further use. "
                        f"See {type(self._checkpointer).__name__}"
                    )
                else:
                    log.info(
                        "The full model checkpoint, including all weights and configurations, has been saved successfully. "
                        "You can now use this checkpoint for further training or inference."
                    )

We already log the full config, no need to just directly re-log an individual config field unless there is something more non-trivial happening

I’ve removed the log output for that section. I agree that it’s best not to deviate too much from the overall logging pattern for the other options.

@spider-man-tm (Contributor, Author):

I moved the log output to the checkpointer.
e8e8757

@SalmanMohammadi (Collaborator):

LGTM. Thanks so much for your contribution, and your patience in addressing our comments : )

@spider-man-tm (Contributor, Author):

@SalmanMohammadi
PyTorch is my favorite ML package, and I'm happy to have contributed to it. Thanks so much!

@joecummings (Contributor) left a comment:

Awesome work!

@joecummings merged commit f34b5b0 into pytorch:main on Jul 29, 2024
29 checks passed