~4.6x speedup, Huge memory optimizations: EXL2 backend #37
Conversation
Nice work! Overall, it looks great. I made some changes that I think improve the implementation and pushed them to the PR for you to check out. Changes:
Hm, I'm a bit conflicted on exactly what to do with max_length. GGUF allocates the whole cache immediately (HF I think does too?), so having some sort of settable value makes sense. For actual new tokens, it'll probably be useful to add a check that the shape of input_ids + max_new_tokens < cache length (HF and EXL2 will probably raise an error on their own, but having unifying language is good; not sure what llama.cpp would do).
Does that sound good?
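Something like this is what I have in mind for that shared check (the function name and the error wording are just placeholders, not a final API):
def check_generation_length(input_ids, max_new_tokens, cache_length):
    # Hypothetical unified length check shared across backends.
    prompt_len = input_ids.shape[-1]
    if prompt_len + max_new_tokens > cache_length:
        raise ValueError(
            f"Prompt of {prompt_len} tokens + {max_new_tokens} new tokens "
            f"exceeds the allocated cache length of {cache_length}."
        )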
Speaking of streaming, I was planning to open a PR for it soon. Do you have any thoughts? My main concern is that wav_tokenizer decoding in chunks often adds "clicks" (also that small chunks sound bad, but that can be avoided). Proof of concept: outestreaming.mp4
In terms of methods, I think adding a separate one for streaming would work. Everything else looks good.
That sounds fine. We can use the max size from the model config as you suggested, and the condition check you described also sounds reasonable.
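For the max size, something along these lines could work (assuming the HF config exposes max_position_embeddings, as Llama-style configs do):
from transformers import AutoConfig

config = AutoConfig.from_pretrained("OuteAI/OuteTTS-0.2-500M")
# Fall back to 4096 if the field isn't present (field name is an assumption).
max_seq_len = getattr(config, "max_position_embeddings", 4096)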
The issue is that most audio encoders tend to struggle with small decoding chunks. For instance, decoding just a few words at a time often results in degraded quality: clicks and changes in intonation between chunks. From my testing, the WavTokenizer actually performed the best even with smaller chunks; while the words were understandable, there were still occasional changes in intonation or odd audio artifacts.
One way to address this would be to find a sweet spot for chunking, such as decoding 5–6 words at a time. This should also dynamically adjust based on the total output. For example, if the output is only 6 words, chunking it into 4 + 2 or 3 + 3 might make the end result sound more natural (see the sketch below). For generating audio, we could detect the word boundaries in the token stream.
I feel like this logic should live outside the backend code so that it can be easily integrated with the other backends.
Let me know when this PR is ready, and I'll merge it :)
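Here's roughly the kind of dynamic chunking I mean (the target of 5 words per chunk is just an illustration, not a tuned value):
import math

def chunk_words(words, target=5):
    # Split into roughly equal chunks of at most `target` words,
    # so a 6-word output becomes 3 + 3 instead of 5 + 1.
    if not words:
        return []
    n_chunks = math.ceil(len(words) / target)
    base, extra = divmod(len(words), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(words[start:start + size])
        start += size
    return chunks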
Coincidentally, that's about exactly what I've done in my proof of concept. 8 tokens seems like a decent sweet spot, though the clicking seems basically unavoidable when the audio transitions from one chunk to the next. I can reduce it partially with fading in and out, but it's not perfect, so I also do a bit of additional chunk logic to try to compensate, especially with long sentences. The decode token count should definitely be settable via a param too.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob, ExLlamaV2Sampler
from outetts.version.v1.prompt_processor import PromptProcessor
from outetts.wav_tokenizer.audio_codec import AudioCodec
import json
import torch
import sounddevice as sd
import asyncio
import numpy as np
import time
import threading
model_path = "./OuteTTS-0.2-500M-exl2-b6.0-hb8"
speaker_path = "male.json"
speaker_inserts = []
audio_codec = AudioCodec("cuda", None)
processor = PromptProcessor("OuteAI/OuteTTS-0.2-500M", ["en"])
with open(speaker_path, "r") as f:
    speaker = json.load(f)
config = ExLlamaV2Config(model_path)
config.arch_compat_overrides()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache, progress=False)
tokenizer = ExLlamaV2Tokenizer(config)
audio_queue = []
# Background playback thread: plays queued audio chunks in order.
def player():
    global audio_queue
    while True:
        time.sleep(0.01)
        while audio_queue != []:
            sd.play(audio_queue[0], samplerate=24000)
            sd.wait()
            audio_queue.pop(0)

threading.Thread(target=player).start()
def fade_out(audio, length):
    length = min(length, len(audio) - 1)
    for edx, i in enumerate(range(len(audio) - 1 - length, len(audio) - 1, 1)):
        audio[i] *= (((length - edx) / length) ** (1 / 3))
    return audio

def fade_in(audio, length):
    length = min(length, len(audio) - 1)
    for edx, i in enumerate(range(0, length, 1)):
        audio[i] *= ((edx / length) ** (1 / 3))
    return audio
# Read a prompt, split it into sentences, and synthesize each one.
while True:
    prompt = input(">")
    prompts = prompt.split(".")
    for prompt in prompts:
        prompt = processor.get_completion_prompt(prompt, "en", speaker)
        input_ids = processor.tokenizer.encode(
            prompt,
            add_special_tokens=False,
            return_tensors="pt",
        ).to("cpu")
        job = ExLlamaV2DynamicJob(input_ids=input_ids, max_new_tokens=4096)
        generator = ExLlamaV2DynamicGenerator(
            model = model,
            cache = cache,
            tokenizer = tokenizer,
            gen_settings = ExLlamaV2Sampler.Settings(token_repetition_penalty=1.1, temperature=0.25, min_p=0.1)
        )
        generator.enqueue(job)

        def stream():
            # Yields audio token ids; yields True whenever a chunk should be decoded.
            eos = False
            do_yield = 0
            yield_thresh = 10
            while not eos:
                results = generator.iterate()
                for result in results:
                    assert result["job"] == job
                    if result["stage"] == "streaming":
                        #tokens.append(int(result.get("token_ids", "")[0][0]))
                        outtext = result.get("text", "").strip()
                        if outtext != "":
                            print(outtext)
                            do_yield += 1
                            if do_yield == yield_thresh:
                                if yield_thresh < 50:
                                    yield_thresh = int(yield_thresh ** 1.25)
                                yield True
                                do_yield = 0
                        else:
                            yield int(result.get("token_ids", "")[0][0])
                        if int(result.get("token_ids", "")[0][0]) == tokenizer.eos_token_id:
                            yield True
                            return

        def decoder():
            # Collects audio tokens and decodes them into a faded waveform chunk.
            tokens = []
            for i in stream():
                if isinstance(i, bool):
                    output = processor.extract_audio_from_tokens(tokens)
                    if output != []:
                        output = audio_codec.decode(torch.tensor([output], dtype=torch.int64).to(audio_codec.device)).squeeze().cpu().numpy()
                        yield fade_in(fade_out(output, 500), 500)
                    tokens = []
                else:
                    tokens.append(i)

        def player():
            # Feeds decoded chunks to the playback thread's queue.
            global audio_queue
            for i in decoder():
                audio_queue.append(i)

        player()
Isn't 32768 the max for input_ids + output concatenated? My thinking was that max_positional_embeddings/max_seq_len is 32768 and the maximum output ids per forward pass is 4096 (i.e., Llama's context is 128k, but it can only output 4096 per forward pass; this also happens to make its actual max_positional_embeddings 131k).
I'll make a final pass now and let you know once I'm done 🚀
I'm gonna comment it out.
Okay, should be good to merge once max_positional_embeddings is resolved. I've tested with HF, GGUF, and EXL2.
This value essentially sets the limit for how many tokens the model can process without a significant drop in performance. In this case, the model was trained with a batch size of 4096 tokens, which is why perplexity and performance start to degrade beyond that point.
I think 8192 would be a reasonable default for cache_size, but I'll let you decide the final behavior.
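For example, something along these lines on the EXL2 side, mirroring the loading code above (the loader name and signature are just an illustration, not the final interface):
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config

def load_exl2(model_path, cache_size=8192):
    # User-overridable cache size, defaulting to 8192 tokens.
    config = ExLlamaV2Config(model_path)
    config.arch_compat_overrides()
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, max_seq_len=cache_size, lazy=True)
    model.load_autosplit(cache, progress=False)
    return model, cache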
I think the initial idea you had sounds good. Now, onto streaming: when you're ready, make a PR, and I'll help out. I was also thinking that using a
For actual usage, I was going to leave it up to the user and just provide a param.
The main issue here is that not all tokens are equal length-wise, though I think (current chunk length) × (real-time factor) × (some constant less than one) would work fine. I think the best approach would be to reliably remove the clicks between audio chunks, as then there's no need to dynamically chunk. I'll mess around with this a bit and see if I can come up with anything good.
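One direction for the clicks could be overlapping and crossfading adjacent chunks instead of hard-cutting between them; a rough sketch (the 500-sample overlap is a guess, matching the fade length used above):
import numpy as np

def crossfade_concat(prev, nxt, overlap=500):
    # Blend the tail of the previous chunk with the head of the next one
    # linearly, instead of hard-concatenating them.
    overlap = min(overlap, len(prev), len(nxt))
    if overlap == 0:
        return np.concatenate([prev, nxt])
    fade = np.linspace(0.0, 1.0, overlap)
    blended = prev[-overlap:] * (1.0 - fade) + nxt[:overlap] * fade
    return np.concatenate([prev[:-overlap], blended, nxt[overlap:]])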
Great, I'll start looking into this as well :)
This adds EXL2 as a backend, along with slight refactoring of the original backend code. The example code now runs in 2.76 seconds versus 12.6 seconds with the HF implementation (measuring only model generation, not model load), roughly a 4.6x speedup, with much lower memory requirements as well. It crosses the real-time factor threshold.
HF:
hf_example.mp4
EXL2:
exl2_example.mp4
Example code: