Multi adapter weight conflicts when services are concurrent #804
Comments
So I assume that you're serving the model via some web server. It sounds like the solution would have to be applied on that level, by applying some kind of lock on access to the model while it is still processing the output or switching the adapter.
Yes, I use a webui-like service, but if you use a process lock, each process must acquire the lock in order to run the model's inference, so the service cannot be parallelized.
If the service is using true multiprocessing, each process should have a copy of the model (in fact, of the whole Python program), so it would be strange if the error you described occurred there. It sounds more like concurrent usage of the same object (multi-threading) is happening here. If that's the case, locking seems like the best solution, but as you said it means blocking for other users.
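To make the locking idea concrete, here is a minimal sketch, assuming the model is a PeftModel served from a single process with multiple threads; the adapter names, tokenizer handling, and `generate()` call are illustrative:

```python
import threading

import torch

# A single lock shared by all request handlers. `model` is assumed to be a
# PeftModel with adapters "adapter_a" and "adapter_b" already loaded; the
# names here are illustrative.
_adapter_lock = threading.Lock()

def generate_with_adapter(model, tokenizer, adapter_name, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Hold the lock across both the adapter switch and the forward pass so a
    # concurrent request cannot change the active adapter mid-generation.
    with _adapter_lock:
        model.set_adapter(adapter_name)
        with torch.no_grad():
            output_ids = model.generate(**inputs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

As discussed above, this serializes inference, so throughput is limited to one request at a time.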
Thank you for responding. The issue I encountered is indeed the concurrency situation you mentioned. Is there a better solution than locking? I don't want to block. The current method is to pass the adapter name parameter to the forward function, but this necessitates modifying the modeling_xxxx.py file, which is ugly and not universal.
You mean so that the model can decide on the fly what adapter to use based on the passed adapter name? IMO, that doesn't sound like such a bad solution. Yes, it is more work, but if it achieves the desired result, it doesn't sound too bad. Alternatively, you could use true multiprocessing, with a sufficient number of processes to handle the number of concurrent users you have (horizontal scaling). But that means the memory requirement for your app is multiplied by the number of processes, so it can get quite expensive.
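For completeness, a rough sketch of the horizontal-scaling alternative, where each worker process holds its own copy of the model; `load_model_with_adapters` is a hypothetical loader, and the queue plumbing is only illustrative:

```python
import multiprocessing as mp

def worker(request_queue, result_queue):
    # Each process loads its own full copy of the base model plus adapters,
    # so set_adapter() never races with another request.
    model, tokenizer = load_model_with_adapters()  # hypothetical loader
    while True:
        request_id, adapter_name, prompt = request_queue.get()
        model.set_adapter(adapter_name)
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs)
        result_queue.put((request_id, tokenizer.decode(output_ids[0], skip_special_tokens=True)))

if __name__ == "__main__":
    request_queue, result_queue = mp.Queue(), mp.Queue()
    # Memory cost grows linearly with the number of workers, as noted above.
    workers = [mp.Process(target=worker, args=(request_queue, result_queue)) for _ in range(2)]
    for p in workers:
        p.start()
```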
Hi @BenjaminBossan,
I'm not sure how exactly you solved it, but probably it's not something that we want to add generally, like changing the signature of
Thanks @Nipi64310. One thing I wonder about is that the code you provide does not appear to do anything too different from what would happen with the existing code (lines 513 to 518 in ed396a6).
Regarding a more general adoption of your proposal: I could see adding an optional parameter for this. Ping @pacman100, WDYT?
The code I just posted is really crude, just an example. In addition, if the adapter_name is provided at inference time, I think set_adapter should not matter. I am looking at the PEFT source code to see whether there is a better way to handle this; if so, I will paste working code.
I am currently preparing to launch a project to run multiple PEFT models in parallel under limited device resources. My assumption is that it is possible to load a pretrained model on a single card and then load the parameters of multiple downstream tasks' PEFT adapters simultaneously (meaning that GPU memory > pretrained model + all adapter parameters). My idea comes from the following paper: PetS: A Unified Framework for Parameter Efficient Transformers Serving. I think your requirement for parallelized inference is very similar to this one, aiming to improve GPU utilization. I believe that simply modifying the PEFT source code may not be enough to accomplish this. The authors of the paper customized the inference-related interface through CUDA, so perhaps one should consider writing the relevant library oneself. But I am a beginner in CUDA programming. If there is any research progress, please feel free to share.
Hi @aaajun-czb,
Note that we're also currently working on supporting mixed LoRA batches in #903. It's not exactly the same thing as OP is asking for, but could still provide a viable solution.
Thank you! It really gives me some ideas about how to do it. As my paper deadline is 11.30, I may share my code around that day, after my work is done. I will focus on how to support parallel inference for multiple PEFT models. I hope my work will help someone.
I admit that I'm not very familiar with the structure of PEFT, but my work may incorporate it. I will keep an eye on this issue.
This is good and can already solve the problem, but the PR's implementation may significantly slow down inference.
Mixing adapters within a batch is expected to be slower than having only a single adapter per batch. If you have some proposal to speed up the proposed implementation, please let us know.
Sorry, I'm not sure how to optimize, but I think the tensor[mask] operation should be avoided because it is very slow.
From internal testing, we know that the suggested implementation will be faster than having batch size 1 and switching adapters between each sample. So there is some progress, even though there might be room for speed ups. If we, or someone else, comes up with a better suggestion, we're happy to make the change.
Can we reopen this?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I saw a solution in S-LoRA, but most Hugging Face models are not supported.
@Nipi64310, does PEFT support vLLM speedup?
We merged #1558, a follow up to #903. If you want to give this a try and report back if the performance is good for you, that would be great.
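For anyone landing here later, the mixed-batch feature referenced above allows passing a per-sample list of adapter names at inference time. The sketch below follows the documented usage, with placeholder model and adapter paths; exact behavior may depend on your PEFT version:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder identifiers; replace with your base model and LoRA checkpoints.
base = AutoModelForCausalLM.from_pretrained("base-model-id")
tokenizer = AutoTokenizer.from_pretrained("base-model-id")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token

model = PeftModel.from_pretrained(base, "path/to/lora_a", adapter_name="lora_a")
model.load_adapter("path/to/lora_b", adapter_name="lora_b")

prompts = ["request for demo A", "request for demo B", "plain request"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
# One adapter per sample instead of a global set_adapter(); "__base__"
# selects the base model without any adapter for that sample.
outputs = model.generate(**inputs, adapter_names=["lora_a", "lora_b", "__base__"])
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```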
Feature request
PEFT's set_adapter function acts as a global setting. Weight conflicts arise when I use multiple adapters to build two demos with different functionality and someone uses both demos at the same time. For example, process A uses LoRA A weights to compute up to layer 10; at that point, process B changes the model's active_adapter to LoRA B, forcing layers 11 through the last layer of process A's forward pass to use LoRA B.
I can use accelerator.wait_for_everyone() to prevent this scenario during training, but I'm not sure whether there is a better approach to avoid the conflicts created by this kind of concurrency during inference. The current workaround is to add an adapter_name parameter to PEFT's LoRA forward function: if this parameter is supplied, the LoRA weights corresponding to the active_adapter are not used. However, such a fix requires changing the corresponding model file (modeling_xxxx.py) to take effect.
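To illustrate the workaround, here is a simplified stand-in, not PEFT's actual layer code; the attribute names mirror PEFT's LoRA layers, but the class is only an example:

```python
import torch.nn as nn

class LoraLinearWithExplicitAdapter(nn.Module):
    """Simplified illustration of a LoRA linear layer whose forward accepts an
    explicit adapter_name, so concurrent callers do not depend on the globally
    active adapter. Not PEFT's actual implementation."""

    def __init__(self, base_layer, adapter_names, r=8, alpha=16):
        super().__init__()
        in_f, out_f = base_layer.in_features, base_layer.out_features
        self.base_layer = base_layer
        self.lora_A = nn.ModuleDict({n: nn.Linear(in_f, r, bias=False) for n in adapter_names})
        self.lora_B = nn.ModuleDict({n: nn.Linear(r, out_f, bias=False) for n in adapter_names})
        self.scaling = {n: alpha / r for n in adapter_names}
        self.active_adapter = adapter_names[0]

    def forward(self, x, adapter_name=None):
        # When adapter_name is passed explicitly, the global active_adapter is
        # ignored, which is the point of the workaround.
        name = adapter_name or self.active_adapter
        out = self.base_layer(x)
        return out + self.lora_B[name](self.lora_A[name](x)) * self.scaling[name]
```

The drawback, as noted, is that adapter_name has to be threaded through every forward call, which means editing modeling_xxxx.py.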
Motivation
#518
Your contribution
If adding parameters works or if someone has a better idea, I can submit a PR to take care of this.