An extension for oobabooga/text-generation-webui that allows the currently loaded model to automatically unload itself immediately after a prompt is processed, thereby freeing up VRAM for use in other programs. It automatically reloads the last model upon sending another prompt.
This should theoretically help systems with limited VRAM run multiple VRAM-dependent programs in parallel.
The following scenario should paint a clearer picture of what the extension does.
- User creates prompt. System generates a response.
- Model Ducking unloads the current model, freeing up VRAM.
- System runs a local TTS program (e.g. XTTSv2) to automatically voice the system's response.
- User runs Stable Diffusion to generate an image based on the response.
- User creates another prompt.
- Model Ducking reloads the last model. VRAM is taken up by the model again.
- System generates a response.
- Model Ducking unloads the model again, freeing up VRAM.
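To make the cycle above concrete, here is a minimal sketch of the duck-and-reload idea using text-generation-webui's standard extension hooks. It assumes the webui's internal `modules.models.load_model`/`unload_model` helpers and an illustrative `params['activate']` flag; it is not the extension's actual implementation.

```python
# script.py — minimal sketch of the duck/reload cycle, assuming
# text-generation-webui's standard extension hooks and its internal
# modules.models helpers; not the extension's actual code.
from modules import shared
from modules.models import load_model, unload_model

params = {'activate': True}   # toggled by the "Activate Model Ducking" checkbox
last_model = None             # name of the model that was ducked

def input_modifier(string, state, is_chat=False):
    """Before generation: reload the previously ducked model."""
    global last_model
    if params['activate'] and shared.model is None and last_model is not None:
        shared.model, shared.tokenizer = load_model(last_model)
    return string

def output_modifier(string, state, is_chat=False):
    """After generation: unload the model to free up VRAM."""
    global last_model
    if params['activate'] and shared.model is not None:
        last_model = shared.model_name
        unload_model()
    return string
```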
Enable Model Ducking by appending it to the `--extensions` parameter on startup, or by enabling it from the Session tab.
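For example, assuming the extension's folder under `extensions/` is named `model_ducking` (adjust if yours differs):

```
python server.py --extensions model_ducking
```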
Once the extension is enabled, Model Ducking has to be activated from the Chat tab by checking the "Activate Model Ducking" checkbox.
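For context, text-generation-webui extensions typically wire such a toggle through the `ui()` hook. This is a hedged sketch using Gradio and the illustrative `params` flag from the sketch above, not the extension's actual code:

```python
import gradio as gr

def ui():
    # Checkbox that flips the illustrative params['activate'] flag
    activate = gr.Checkbox(value=params['activate'], label='Activate Model Ducking')
    activate.change(lambda x: params.update({'activate': x}), activate, None)
```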
To use Model Ducking with an external frontend through the API, check the "Using API" checkbox in the Chat tab.
Important Note: "Using API" must NOT be checked when you prompt from the text-generation-webui itself. Only have it checked when prompting from an external frontend (e.g. SillyTavern).
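For illustration, an external program could prompt through the webui's OpenAI-compatible API (server started with `--api`; the host, port, and parameters below are defaults and may differ for your setup). With "Using API" checked, Model Ducking handles unloading and reloading around calls like this:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 200,
    },
    timeout=300,  # leave headroom for the model to reload
)
print(resp.json()["choices"][0]["message"]["content"])
```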
Subsequent prompts incur an obvious additional latency while the last model is reloaded. Do note that this latency is limited to reloading the model and does not affect generation speed itself (i.e. tokens per second).
Because of this side effect, I do not recommend turning on features that automatically ask the AI to "continue" its response (e.g. SillyTavern's Auto-Continue feature), as Model Ducking will unload and reload your model between the initially generated response and each subsequent attempt to continue.