Multi adapter weight conflicts when services are concurrent #804
Comments
So I assume that you're serving the model via some web server. It sounds like the solution would have to be applied on that level, by applying some kind of lock on access to the model while it is still processing the output or switching the adapter.
Yes, I use a webui-like service, but if you use a process lock, each process must acquire the lock in order to run the model's inference, so the service cannot be parallelized.
If the service is using true multiprocessing, each process should have a copy of the model (in fact, of the whole Python program), so it would be strange if the error you described occurred there. It sounds more like concurrent usage of the same object (multi-threading) is happening here. If that's the case, locking seems like the best solution, but as you said it means blocking for other users.
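To make the locking idea concrete, here is a minimal sketch, assuming the model is a PeftModel served from a single process with multiple threads; the adapter names, tokenizer handling, and `generate()` call are illustrative:

```python
import threading

import torch

# A single lock shared by all request handlers. `model` is assumed to be a
# PeftModel with adapters "adapter_a" and "adapter_b" already loaded; the
# names here are illustrative.
_adapter_lock = threading.Lock()

def generate_with_adapter(model, tokenizer, adapter_name, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Hold the lock across both the adapter switch and the forward pass so a
    # concurrent request cannot change the active adapter mid-generation.
    with _adapter_lock:
        model.set_adapter(adapter_name)
        with torch.no_grad():
            output_ids = model.generate(**inputs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

As discussed above, this serializes inference, so throughput is limited to one request at a time.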
Thank you for responding. The issue I encountered is indeed the concurrency situation you mentioned. Is there a better solution than locking? I don't want to block. The current method is to pass the adapter name parameter to the forward function, but this necessitates modifying the modeling_xxxx.py file, which is ugly and not universal.
You mean so that the model can decide on the fly what adapter to use based on the passed adapter name? IMO, that doesn't sound like such a bad solution. Yes, it is more work, but if it achieves the desired result, it doesn't sound too bad. Alternatively, you could use true multiprocessing, with a sufficient number of processes to handle the number of concurrent users you have (horizontal scaling). But that means the memory requirement for your app is multiplied by the number of processes, so it can get quite expensive.
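For completeness, a rough sketch of the horizontal-scaling alternative, where each worker process holds its own copy of the model; `load_model_with_adapters` is a hypothetical loader, and the queue plumbing is only illustrative:

```python
import multiprocessing as mp

def worker(request_queue, result_queue):
    # Each process loads its own full copy of the base model plus adapters,
    # so set_adapter() never races with another request.
    model, tokenizer = load_model_with_adapters()  # hypothetical loader
    while True:
        request_id, adapter_name, prompt = request_queue.get()
        model.set_adapter(adapter_name)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs)
        result_queue.put((request_id, tokenizer.decode(output_ids[0], skip_special_tokens=True)))

if __name__ == "__main__":
    request_queue, result_queue = mp.Queue(), mp.Queue()
    # Memory cost grows linearly with the number of workers, as noted above.
    workers = [mp.Process(target=worker, args=(request_queue, result_queue)) for _ in range(2)]
    for p in workers:
        p.start()
```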
Hi @BenjaminBossan,
I'm not sure how exactly you solved it, but probably it's not something that we want to add generally, like changing the signature of
Thanks @Nipi64310. One thing I wonder about is that the code you provide does not appear to do anything too different from what would happen with the existing code (lines 513 to 518 in ed396a6).
Regarding a more general adoption of your proposal: I could see adding an optional parameter for this. Ping @pacman100, WDYT?
The code I just posted is really crude, just an example. In addition, if the adapter_name is provided at inference time, I think set_adapter should not matter. I am looking at the PEFT source code to see whether there is a better way to handle this; if so, I will paste working code.
I am currently preparing to launch a project to run multiple PEFT models in parallel under limited device resources. My assumption is that it is possible to load a pretrained model on a single card and then load the parameters of multiple downstream tasks' PEFT adapters simultaneously (meaning that GPU memory > pretrained model + all adapter parameters). My idea comes from the following paper: PetS: A Unified Framework for Parameter Efficient Transformers Serving. I think your requirement for parallelized inference is very similar to this one, aiming to improve GPU utilization. I believe that simply modifying the PEFT source code may not be enough to accomplish this. The authors of the paper customized the inference-related interface through CUDA, so perhaps one should consider writing the relevant library oneself. But I am a beginner in CUDA programming. If there is any research progress, please feel free to share.
Hi @aaajun-czb,
Note that we're also currently working on supporting mixed LoRA batches in #903. It's not exactly the same thing as OP is asking for, but could still provide a viable solution.
Thank you! It really gives me some ideas about how to do it. As my paper deadline is 11.30, I may share my code around that day, after my work is done. I will focus on how to support parallel inference for multiple PEFT models. I hope my work will help someone.
I admit that I'm not very familiar with the structure of PEFT, but my work may incorporate it. I will keep an eye on this issue.
This is good and can already solve the problem, but the PR's implementation may significantly slow down inference.
Mixing adapters within a batch is expected to be slower than having only a single adapter per batch. If you have some proposal to speed up the proposed implementation, please let us know.
Sorry, I'm not sure how to optimize, but I think the tensor[mask] operation should be avoided because it is very slow.
From internal testing, we know that the suggested implementation will be faster than having batch size 1 and switching adapters between each sample. So there is some progress, even though there might be room for speed ups. If we, or someone else, comes up with a better suggestion, we're happy to make the change.
Can we reopen this?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I saw a solution in S-LoRA, but most Hugging Face models are not supported.
@Nipi64310, does PEFT support vLLM speedup?
We merged #1558, a follow up to #903. If you want to give this a try and report back if the performance is good for you, that would be great.
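For anyone landing here later, the mixed-batch feature referenced above allows passing a per-sample list of adapter names at inference time. The sketch below follows the documented usage, with placeholder model and adapter paths; exact behavior may depend on your PEFT version:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder identifiers; replace with your base model and LoRA checkpoints.
base = AutoModelForCausalLM.from_pretrained("base-model-id")
tokenizer = AutoTokenizer.from_pretrained("base-model-id")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

model = PeftModel.from_pretrained(base, "path/to/lora_a", adapter_name="lora_a")
model.load_adapter("path/to/lora_b", adapter_name="lora_b")

prompts = ["request for demo A", "request for demo B", "plain request"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
# One adapter per sample instead of a global set_adapter(); "__base__"
# selects the base model without any adapter for that sample.
outputs = model.generate(**inputs, adapter_names=["lora_a", "lora_b", "__base__"])
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```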
Feature request
PEFT's set_adapter function acts as a global setting. Weight conflicts arise when I use multiple adapters to build two demos with different functionality and someone uses both demos at the same time. For example, process A uses LoRA A weights to compute up to layer 10; at that point, process B changes the model's active_adapter to LoRA B, forcing layers 11 through the last layer of process A's forward pass to use LoRA B.
I can use accelerator.wait_for_everyone() to prevent this scenario during training, but I'm not sure whether there is a better approach to avoid the conflicts created by this kind of concurrency during inference. The current workaround is to add an adapter_name parameter to PEFT's LoRA forward function: if this parameter is supplied, the LoRA weights corresponding to the active_adapter are not used. However, such a fix requires changing the corresponding model file (modeling_xxxx.py) to take effect.
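To illustrate the workaround, here is a simplified stand-in, not PEFT's actual layer code; the attribute names mirror PEFT's LoRA layers, but the class is only an example:

```python
import torch.nn as nn

class LoraLinearWithExplicitAdapter(nn.Module):
    """Simplified illustration of a LoRA linear layer whose forward accepts an
    explicit adapter_name, so concurrent callers do not depend on the globally
    active adapter. Not PEFT's actual implementation."""

    def __init__(self, base_layer, adapter_names, r=8, alpha=16):
        super().__init__()
        in_f, out_f = base_layer.in_features, base_layer.out_features
        self.base_layer = base_layer
        self.lora_A = nn.ModuleDict({n: nn.Linear(in_f, r, bias=False) for n in adapter_names})
        self.lora_B = nn.ModuleDict({n: nn.Linear(r, out_f, bias=False) for n in adapter_names})
        self.scaling = {n: alpha / r for n in adapter_names}
        self.active_adapter = adapter_names[0]

    def forward(self, x, adapter_name=None):
        # When adapter_name is passed explicitly, the global active_adapter is
        # ignored, which is the point of the workaround.
        name = adapter_name or self.active_adapter
        out = self.base_layer(x)
        return out + self.lora_B[name](self.lora_A[name](x)) * self.scaling[name]
```

The drawback, as noted, is that adapter_name has to be threaded through every forward call, which means editing modeling_xxxx.py.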
Motivation
#518
Your contribution
If adding parameters works or if someone has a better idea, I can submit a PR to take care of this.