
Multi adapter weight conflicts when services are concurrent #804

Closed
Nipi64310 opened this issue Aug 8, 2023 · 22 comments

@Nipi64310

Nipi64310 commented Aug 8, 2023

Feature request

PEFT's set_adapter function is effectively a global mode switch. Weight conflicts arise when I use multiple adapters to build two demos with different functionality and someone uses both demos at the same time. For example, process A uses the LoRA A weights up to layer 10; at that point, process B changes the model's active_adapter to LoRA B, forcing layers 11 through the last layer of process A's forward pass to be computed with LoRA B.

I can use accelerator.wait_for_everyone() to prevent this scenario during training, but I'm not sure whether there is a better approach to avoid the conflicts caused by this kind of concurrency during inference. My current workaround is to add an adapter_name parameter to PEFT's LoRA forward function; if this parameter is supplied, the LoRA weights corresponding to active_adapter are not used (the named adapter is used instead). However, such a fix requires changing the corresponding model file (modeling_xxxx.py) to take effect.
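
To make the failure mode concrete, here is a minimal sketch of the race (purely illustrative: `model` and `tokenizer` are assumed to be an already-loaded PeftModel with two adapters named "lora_A" and "lora_B", and `serve` is a made-up helper):

```python
# Two threads share one PeftModel; set_adapter flips the single active_adapter
# state on every LoRA layer, so a switch issued by one thread can land in the
# middle of the other thread's forward pass.
import threading

def serve(peft_model, tokenizer, adapter_name, prompt):
    peft_model.set_adapter(adapter_name)            # global switch on the shared model
    inputs = tokenizer(prompt, return_tensors="pt")
    return peft_model.generate(**inputs)            # may partly run with the other adapter

t1 = threading.Thread(target=serve, args=(model, tokenizer, "lora_A", "demo A prompt"))
t2 = threading.Thread(target=serve, args=(model, tokenizer, "lora_B", "demo B prompt"))
t1.start(); t2.start(); t1.join(); t2.join()
```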

Motivation

#518

Your contribution

If adding this parameter is an acceptable approach, or if someone has a better idea, I can submit a PR to take care of it.

@BenjaminBossan
Member

So I assume that you're serving the model via some web server. It sounds like the solution would have to be applied at that level, by taking some kind of lock on access to the model while it is still processing the output or switching the adapter.
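
For illustration, such a guard could be as simple as the following sketch (the `generate_with_adapter` helper is made up; the lock serializes all requests, which is exactly the blocking trade-off discussed below):

```python
# Coarse lock so that switching the adapter and running generation happen as
# one atomic unit per request; concurrent requests wait their turn.
import threading

_model_lock = threading.Lock()

def generate_with_adapter(peft_model, adapter_name, **generate_kwargs):
    with _model_lock:
        peft_model.set_adapter(adapter_name)
        return peft_model.generate(**generate_kwargs)
```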

@Nipi64310
Author

Yes, I use a WebUI-like service, but with a process lock each request must acquire the lock before running the model's inference, so the service cannot be parallelized.

@BenjaminBossan
Member

If the service is using true multiprocessing, each process should have a copy of the model (in fact, of the whole Python program), so it would be strange if the error you described occurred there. It sounds more like concurrent usage of the same object (multi-threading) is happening here. If that's the case, locking seems like the best solution, but as you said it means blocking for other users.

@Nipi64310
Author

Thank you for responding. The issue I encountered is indeed related to the concurrency situation you mentioned. Is there a better solution than locking? I don't want to block. The current approach is to pass an adapter name parameter to the forward function, but this requires modifying the modeling_xxxx.py file, which is ugly and not universal.

@BenjaminBossan
Member

> The current approach is to pass an adapter name parameter to the forward function, but this requires modifying the modeling_xxxx.py file, which is ugly and not universal.

You mean so that the model can decide on the fly which adapter to use, based on the passed adapter name? IMO, that doesn't sound like such a bad solution. It is more work, but if it achieves the desired result, that seems acceptable.

Alternatively, you could use true multiprocessing, with a sufficient number of processes to handle the number of concurrent users you have (horizontal scaling). But that means the memory requirement for your app is multiplied by the number of processes, so it can get quite expensive.
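
Roughly, the multiprocessing variant would look like this sketch (`load_model_and_adapters` is a hypothetical per-process loader; each worker pays the full memory cost of its own model copy):

```python
# Each process loads its own copy of the base model plus adapters, so switching
# the adapter in one worker cannot interfere with another worker's forward pass.
import multiprocessing as mp

def worker(adapter_name, requests, responses):
    model, tokenizer = load_model_and_adapters()   # hypothetical loader, one copy per process
    model.set_adapter(adapter_name)
    for prompt in iter(requests.get, None):        # a None request shuts the worker down
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs)
        responses.put(tokenizer.decode(output_ids[0]))

if __name__ == "__main__":
    requests, responses = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(name, requests, responses))
               for name in ("lora_A", "lora_B")]
    for w in workers:
        w.start()
```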

@Nipi64310 Nipi64310 changed the title Multi adapter weight conflict under multiprocess Multi adapter weight conflict under when services are concurrent Aug 8, 2023
@Nipi64310 Nipi64310 changed the title Multi adapter weight conflict under when services are concurrent Multi adapter weight conflict when services are concurrent Aug 8, 2023
@Nipi64310 Nipi64310 changed the title Multi adapter weight conflict when services are concurrent Multi adapter weight conflicts when services are concurrent Aug 8, 2023
@Nipi64310
Author

Nipi64310 commented Aug 8, 2023

Hi @BenjaminBossan,
Yes, true multiprocessing consumes too many resources, so we can only do this for now and hope for a better solution in the future.
I'm not sure whether I need to submit a PR to add an adapter_name parameter to each LoRA-related class; even though this isn't something PEFT exposes directly, someone might need it.

@BenjaminBossan
Member

> I'm not sure whether I need to submit a PR to add an adapter_name parameter to each LoRA-related class; even though this isn't something PEFT exposes directly, someone might need it.

I'm not sure how exactly you solved it, but it's probably not something that we want to add in general, like changing the signature of forward for everyone using LoRA. However, I do think it would be useful for you to post the code here, so that users with similar issues can find it in the future. Maybe we can even figure out some improvements if you do.

@BenjaminBossan
Member

Thanks @Nipi64310. One thing I wonder about is that the code you provide does not appear to do anything very different from what happens when set_adapter is called. Still, as you report, it works, whereas set_adapter does not. I'm not sure what the cause is; if I had to guess, it's probably that the loop in set_adapter is not atomic:

```python
for module in self.model.modules():
    if isinstance(module, LoraLayer):
        if module.merged:
            warnings.warn("Adapter cannot be set when the model is merged. Unmerging the model first.")
            module.unmerge()
        module.active_adapter = adapter_name
```

Regarding a more general adoption of your proposal: I could see adding an optional `adapter_name: Optional[str]` argument to all the LoRA forward methods, and then inside the methods, doing `active_adapter = self.active_adapter if adapter_name is None else adapter_name`. However, the problem, as you describe, is still that the caller needs to pass additional arguments down, which we cannot guarantee (it seems that it's not generally true for transformers models).

ping @pacman100 WDYT?
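
For illustration, a self-contained sketch of that idea (this is not PEFT's actual lora.Linear; the class name and constructor here are invented, and only the `adapter_name` override in `forward` mirrors the proposal above):

```python
# Toy LoRA-wrapped Linear whose forward accepts an optional adapter_name that
# overrides the globally active adapter for this one call.
from typing import Optional

import torch
import torch.nn as nn

class LoraLinearSketch(nn.Module):
    def __init__(self, base: nn.Linear, adapters: dict, lora_alpha: int = 16):
        super().__init__()
        self.base = base
        # adapters maps adapter name -> rank r
        self.lora_A = nn.ModuleDict({n: nn.Linear(base.in_features, r, bias=False) for n, r in adapters.items()})
        self.lora_B = nn.ModuleDict({n: nn.Linear(r, base.out_features, bias=False) for n, r in adapters.items()})
        self.scaling = {n: lora_alpha / r for n, r in adapters.items()}
        self.active_adapter = next(iter(adapters))

    def forward(self, x: torch.Tensor, adapter_name: Optional[str] = None) -> torch.Tensor:
        # Fall back to the globally set adapter only when no explicit name is passed down.
        active = self.active_adapter if adapter_name is None else adapter_name
        result = self.base(x)
        return result + self.lora_B[active](self.lora_A[active](x)) * self.scaling[active]
```

The remaining issue is exactly the one mentioned above: every caller in the chain (including the transformers model code) would have to forward adapter_name down to these layers.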

@Nipi64310
Author

> Thanks @Nipi64310. One thing I wonder about is that the code you provide does not appear to do anything very different from what happens when set_adapter is called. Still, as you report, it works, whereas set_adapter does not. I'm not sure what the cause is; if I had to guess, it's probably that the loop in set_adapter is not atomic:
>
> ```python
> for module in self.model.modules():
>     if isinstance(module, LoraLayer):
>         if module.merged:
>             warnings.warn("Adapter cannot be set when the model is merged. Unmerging the model first.")
>             module.unmerge()
>         module.active_adapter = adapter_name
> ```
>
> Regarding a more general adoption of your proposal: I could see adding an optional `adapter_name: Optional[str]` argument to all the LoRA forward methods, and then inside the methods, doing `active_adapter = self.active_adapter if adapter_name is None else adapter_name`. However, the problem, as you describe, is still that the caller needs to pass additional arguments down, which we cannot guarantee (it seems that it's not generally true for transformers models).
>
> ping @pacman100 WDYT?

The code I just posted is really crude, just to give an example. In addition, if adapter_name is provided at inference time, I think set_adapter should no longer matter. I am looking at the PEFT source code to see whether there is a better way to handle this; if so, I will paste working code here.

@aaajun-czb

> > Thanks @Nipi64310. One thing I wonder about is that the code you provide does not appear to do anything very different from what happens when set_adapter is called. Still, as you report, it works, whereas set_adapter does not. I'm not sure what the cause is; if I had to guess, it's probably that the loop in set_adapter is not atomic:
> >
> > ```python
> > for module in self.model.modules():
> >     if isinstance(module, LoraLayer):
> >         if module.merged:
> >             warnings.warn("Adapter cannot be set when the model is merged. Unmerging the model first.")
> >             module.unmerge()
> >         module.active_adapter = adapter_name
> > ```
> >
> > Regarding a more general adoption of your proposal: I could see adding an optional `adapter_name: Optional[str]` argument to all the LoRA forward methods, and then inside the methods, doing `active_adapter = self.active_adapter if adapter_name is None else adapter_name`. However, the problem, as you describe, is still that the caller needs to pass additional arguments down, which we cannot guarantee (it seems that it's not generally true for transformers models).
> >
> > ping @pacman100 WDYT?
>
> The code I just posted is really crude, just to give an example. In addition, if adapter_name is provided at inference time, I think set_adapter should no longer matter. I am looking at the PEFT source code to see whether there is a better way to handle this; if so, I will paste working code here.

I am currently preparing a project to run multiple PEFT models in parallel when device resources are limited. My assumption is that it is possible to load one pretrained model on a single card and then load the PEFT parameters of multiple downstream tasks at the same time (i.e., GPU memory > pretrained model + all adapter parameters).
Then, when running inference in parallel for multiple models based on the same pretrained model, each request would automatically be routed to its own PEFT parameters.

My idea comes from the following paper: PetS: A Unified Framework for Parameter-Efficient Transformers Serving.

I think your requirement for parallel inference is very similar to this, aiming to improve GPU utilization. I believe that simply modifying the PEFT source code may not be enough to accomplish this task; the paper's authors customized the inference path with their own CUDA code, so one may have to write the relevant library oneself. But I am a beginner in CUDA programming. If there is any research progress, please feel free to share it.

@Nipi64310
Author

Hi @aaajun-czb,
I don't know whether this [repository](https://github.com/sabetAI/BLoRA) is helpful to you. I have solved my problem by modifying vLLM.

@BenjaminBossan
Member

Note that we're also currently working on supporting mixed LoRA batches in #903. It's not exactly the same thing as OP is asking for, but could still provide a viable solution.

@aaajun-czb

> Hi @aaajun-czb, I don't know whether this [repository](https://github.com/sabetAI/BLoRA) is helpful to you. I have solved my problem by modifying vLLM.

Thank you! It really gives me some ideas about how to do it.

As my paper deadline is Nov 30, I may share my code around then, after my work is done.

I will focus on how to support parallel inference for multiple PEFT models. I hope my work will help someone.

> Note that we're also currently working on supporting mixed LoRA batches in #903. It's not exactly the same thing as OP is asking for, but could still provide a viable solution.

I admit that I'm not very familiar with the structure of PEFT, but my work may build on it. I will keep an eye on that issue.

@Nipi64310
Author

> Note that we're also currently working on supporting mixed LoRA batches in #903. It's not exactly the same thing as OP is asking for, but could still provide a viable solution.

This is good and can already solve the problem, but the PR's implementation may significantly slow down inference.

@BenjaminBossan
Member

Mixing adapters within a batch is expected to be slower than having only a single adapter per batch. If you have some proposal to speed up the proposed implementation, please let us know.

@Nipi64310
Author

> Mixing adapters within a batch is expected to be slower than having only a single adapter per batch. If you have some proposal to speed up the proposed implementation, please let us know.

Sorry, I'm not sure how to optimize it, but I think the `tensor[mask]`-style indexing should be avoided because it is very slow:

```python
sub_batch = x[sub_batch_indices_list[i]].to(lora_A.weight.dtype)
lora_output = lora_B(lora_A(dropout(sub_batch))) * scaling
final_output[sub_batch_indices_list[i]] += lora_output.to(previous_dtype)
```

@BenjaminBossan
Member

From internal testing, we know that the suggested implementation will be faster than having batch size 1 and switching adapters between each sample. So there is some progress, even though there might be room for speed ups. If we, or someone else, comes up with a better suggestion, we're happy to make the change.

@whybeyoung

whybeyoung commented Oct 14, 2023

Can we reopen this?

@huggingface huggingface deleted a comment from github-actions bot Nov 20, 2023
@younesbelkada younesbelkada reopened this Nov 20, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

@codernew007

codernew007 commented Dec 25, 2023

I saw a solution in S-LoRA, but most Hugging Face models are not supported.

@sunjunlishi

@Nipi64310 Does PEFT support vLLM speed-up?
Does it support vLLM to realize `model = PeftModel.from_pretrained(model, adapter_to_resume, is_trainable=False)`?

@BenjaminBossan
Member

> Note that we're also currently working on supporting mixed LoRA batches in #903. It's not exactly the same thing as OP is asking for, but could still provide a viable solution.

We merged #1558, a follow-up to #903. If you want to give this a try and report back whether the performance is good for you, that would be great.
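
Rough usage sketch (adapter and variable names here are placeholders; it assumes a PeftModel with two adapters already loaded, and that the `adapter_names` argument takes one entry per sample, with "__base__" selecting the plain base model):

```python
# One adapter name per row of the batch; mixed adapters are handled within a
# single generate call (assumed behavior of the adapter_names argument).
prompts = ["question for the first adapter", "question for the second adapter", "plain base model question"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
adapter_names = ["adapter_one", "adapter_two", "__base__"]
outputs = peft_model.generate(**inputs, adapter_names=adapter_names, max_new_tokens=20)
```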
