
support multiple LoRAs in batched inference scenario #903

Closed
wants to merge 13 commits

Conversation

@pacman100 (Contributor) commented Sep 5, 2023

What does this PR do?

  1. Support multiple LoRAs in a batched inference setting. Here, we create sub-batches via groupby(adapter_name)-style logic; see the sketch below.
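
A minimal sketch of the sub-batching idea (the names and signature here are illustrative, not the PR's actual API): group sample indices by adapter name, run each sub-batch through its adapter, then scatter the results back into the original batch order.

```python
from collections import defaultdict

import torch

def grouped_forward(x, adapter_names, adapter_layers):
    """Hypothetical helper: x is a (batch, ...) tensor, adapter_names holds
    one adapter name per sample, adapter_layers maps names to modules."""
    # Group sample indices by the adapter that should process them.
    groups = defaultdict(list)
    for idx, name in enumerate(adapter_names):
        groups[name].append(idx)

    output = None
    for name, indices in groups.items():
        idx = torch.tensor(indices)
        sub_out = adapter_layers[name](x[idx])  # forward the sub-batch
        if output is None:
            output = sub_out.new_zeros(x.shape[0], *sub_out.shape[1:])
        output[idx] = sub_out  # scatter back to the original positions
    return output

# e.g. grouped_forward(torch.randn(3, 8), ["a", "b", "a"],
#                      {"a": torch.nn.Linear(8, 8), "b": torch.nn.Linear(8, 8)})
```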

How to use:

[To Do]

ToDos:

  • Tests
  • Example Notebook
  • Documentation
  • Support all layer types, such as Conv, Embedding, and quantized 8-bit and 4-bit layers.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@BenjaminBossan (Member)

Thank you for tackling this, Sourab. I think this is a feature that many users would find useful, and adding support for it should be a high priority.

I'm just leaving some early comments for now, so that we can discuss the design:

I was also thinking about how to solve this problem, in particular how to pass the information about which adapter to use for which sample to the forward method of the respective adapters. The solution presented here is that this information is part of the kwargs when calling the PeftModel/LoraModel; it is then immediately removed from the kwargs and saved as an attribute on the adapter layer, where it is retrieved during the forward call.

I'm not a big fan of this approach, as it works via side effects. That makes it hard to reason about and more difficult to debug. Furthermore, it requires a lot of care to handle correctly. As an example, say a user calls generate with adapter_indices; they are then set as an attribute. Let's assume that somewhere in the call, there is an error. The adapter_indices won't be cleaned up. The next time the user calls without adapter_indices, the attribute is still set, so they get incorrect outputs without any indication that something went wrong. So at the very least, we have to take more care to do the cleanup correctly.
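
To illustrate the pitfall (the names here are hypothetical, not the PR's actual code): wrapping the call in try/finally guarantees the cleanup runs even when an exception interrupts the pass, so no stale state leaks into the next call.

```python
def generate_with_adapter_indices(model, inputs, adapter_indices):
    # Stash the per-sample adapter assignment on every adapter layer
    # (model.adapter_layers is a hypothetical attribute).
    for layer in model.adapter_layers:
        layer.adapter_indices = adapter_indices
    try:
        return model.generate(**inputs)
    finally:
        # Runs even if generate() raises, so the next call cannot pick up
        # a stale adapter_indices value.
        for layer in model.adapter_layers:
            layer.adapter_indices = None
```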

Another potential issue is that at the moment, adapter_indices is cleaned up at the end of the forward call of the adapter layer. But what happens if a single forward or generate call on the LoraModel/PeftModel requires multiple forward calls for each layer? Then only the first call is correct and the subsequent ones ignore the information.

Overall, this approach looks very brittle to me and if possible, I would like to find a better approach.

I assume that you also considered just passing down the kwargs with the adapter_indices but that this wouldn't work, because transformers/some model architectures cannot handle arbitrary arguments being added. So this more straightforward way is not possible.

An alternative idea could be to work with pre-forward hooks using with_kwargs=True, as they should allow us to add more arguments to the kwargs before they are passed to forward. This is still a side-effect with all the associated problems, and it would still require us to ensure that the hooks are properly cleaned up, but at least we get the handles, which only require a .remove() call to be cleaned up.
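
A rough sketch of the hook-based alternative (assuming PyTorch >= 2.0, where register_forward_pre_hook accepts with_kwargs=True; the helper names are illustrative): the hook injects adapter_indices into each layer's kwargs, and cleanup is just a .remove() on each returned handle.

```python
from functools import partial

def _inject_adapter_indices(module, args, kwargs, adapter_indices):
    # Pre-forward hook: add adapter_indices to the kwargs that will be
    # passed on to the layer's forward.
    kwargs["adapter_indices"] = adapter_indices
    return args, kwargs

def register_adapter_indices(layers, adapter_indices):
    hook = partial(_inject_adapter_indices, adapter_indices=adapter_indices)
    return [layer.register_forward_pre_hook(hook, with_kwargs=True)
            for layer in layers]

# Usage: register, run the forward pass, then clean up deterministically.
# handles = register_adapter_indices(model.adapter_layers, adapter_indices)
# try:
#     output = model(**inputs)
# finally:
#     for handle in handles:
#         handle.remove()
```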

Other than that, I'm open to ideas and discussion; maybe there is a better solution we just haven't considered yet.


@pacman100 pacman100 marked this pull request as ready for review September 5, 2023 14:09
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this Oct 22, 2023
@github-actions github-actions bot closed this Oct 31, 2023
@github-actions github-actions bot closed this Nov 9, 2023
@BenjaminBossan (Member)

No, bad bot!


github-actions bot commented Dec 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@BenjaminBossan (Member)

not stale...

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@github-actions github-actions bot closed this Jan 7, 2024
@BenjaminBossan BenjaminBossan reopened this Jan 8, 2024
@github-actions github-actions bot closed this Jan 16, 2024
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Mar 13, 2024
This PR tries to revive the work by Sourab in huggingface#903. The core logic is
the same between the two PRs. This one should be more complete.

The main idea is to allow the user to mix different LoRA adapters in the
same batch. This is useful when the user wants to perform inference with a
batch that uses different LoRA adapters. Without this, each batch would
have to be restricted to the same LoRA adapter(s).

This PR should encompass:

- all task types
- all LoRA layer types
- bnb layers

Extensive tests were added, as well as documentation.
BenjaminBossan added a commit that referenced this pull request Mar 18, 2024
This PR revives the work by Sourab in #903. The core logic is
the same between the two PRs. This one should be more complete.

The main idea is to allow the user to mix different LoRA adapters in the
same batch. This is useful when the user wants to perform inference with a
batch that uses different LoRA adapters. Without this, each batch would
have to be restricted to the same LoRA adapter(s).

This PR should encompass:

- all task types
- all LoRA layer types
- bnb layers

Extensive tests were added, as well as documentation.

---------

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
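
For reference, a sketch of how the merged follow-up feature can be used, based on the PEFT documentation (model paths and adapter names are placeholders; check your PEFT version for the exact API). Each entry in adapter_names selects the adapter for the corresponding sample, and "__base__" selects the base model without any adapter.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

model = PeftModel.from_pretrained(base, "path/to/adapter_a",
                                  adapter_name="adapter_a")
model.load_adapter("path/to/adapter_b", adapter_name="adapter_b")

inputs = tokenizer(["prompt 1", "prompt 2", "prompt 3"],
                   return_tensors="pt", padding=True)
# One adapter name per sample; "__base__" means "no adapter" for that sample.
adapter_names = ["adapter_a", "adapter_b", "__base__"]
output = model.generate(**inputs, adapter_names=adapter_names,
                        max_new_tokens=20)
```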