~4.6x speedup, Huge memory optimizations: EXL2 backend #37

Merged · 22 commits · Nov 30, 2024

Conversation

Ednaordinary
Contributor

This adds EXL2 as a backend, along with slight refactoring of the original backend code. The example below now runs in 2.76 seconds versus 12.6 seconds with the HF implementation (measuring only model generation, not model load), with much lower memory requirements. That crosses the real-time factor threshold.

HF:

hf_example.mp4

EXL2:

exl2_example.mp4

Example code:

import outetts

model_config = outetts.EXL2ModelConfig_v1(
    model_path="OuteTTS-0.2-500M-exl2-b6.0-hb8",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
    max_length=4096,
)
interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    temperature=0.1,
    repetition_penalty=1.1,
    speaker=speaker,
)

output.save("output.wav")

@Ednaordinary Ednaordinary marked this pull request as ready for review November 27, 2024 23:14
@edwko
Owner

edwko commented Nov 29, 2024

Nice work! Overall, it looks great. I made some changes that I think improve the implementation and pushed them to the PR for you to check out.

Changes:

  • Added max_length to prevent exceeding limits. The values are set by default to keep compatibility with previous versions.
  • Added checks for max_length to raise an error if it’s None or goes over the maximum value.
  • Some interface cleanup.
  • Updated ExLlamaV2DynamicGenerator to accept an additional setting, additional_dynamic_generator_config (rough sketch below). For example, paged attention might not be supported on ROCm, so users may need to set paged = False.
  • Streaming isn’t necessary for now; we can just use generator.generate.
  • Updated the cache: self.cache = ExLlamaV2Cache(self.model, max_seq_len=32768, lazy=True) shouldn’t be hardcoded. Using something like max_length * 2 seems better, though I’m not sure if this is the best approach.
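
A rough sketch of how that passthrough could look (the additional_dynamic_generator_config name comes from the change above; the helper and its wiring are illustrative, not the final implementation):

from exllamav2.generator import ExLlamaV2DynamicGenerator

def build_generator(model, cache, tokenizer, additional_dynamic_generator_config=None):
    # Any extra keyword arguments are forwarded verbatim to the generator,
    # e.g. {"paged": False} on ROCm where paged attention may be unavailable.
    extra = additional_dynamic_generator_config or {}
    return ExLlamaV2DynamicGenerator(
        model=model,
        cache=cache,
        tokenizer=tokenizer,
        **extra,
    )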

@Ednaordinary
Contributor Author

Added max_length to prevent exceeding limits. The values are set by default to keep compatibility with previous versions.

Hm... I'm a bit conflicted about exactly what to do with max_length. GGUF allocates the whole cache immediately (I think HF does too?), so a settable value in GGUFModel for the context length would be helpful (this would also help HF, since I think it currently allocates the full 32768 tokens). It's currently set to 4096, which is a bad default because it includes input_ids. It matters less for EXL2, since the lazy cache allocates as needed; I'll push a change so the EXL2 cache just takes its size from the model config.

For actual new tokens, it'll probably be useful to add a check that the length of input_ids plus max_new_tokens is less than the cache length (HF and EXL2 will probably raise an error on their own, but having unified error messages is good; not sure what llama.cpp would do).

cache_size -> defaults to the max tokens in the model
max_length -> defaults to 4096 -> throws an error if over 4096 or over (cache_size - input_ids)

Does that sound good?
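
A rough sketch of those checks (the helper name, defaults, and messages are illustrative, not the PR's final code):

def validate_generation_length(input_ids_len, max_length, cache_size, default_cap=4096):
    # max_length must be set and must not exceed the default cap.
    if max_length is None:
        raise ValueError("max_length must be set")
    if max_length > default_cap:
        raise ValueError(f"max_length {max_length} exceeds the supported maximum of {default_cap}")
    # The prompt plus the requested new tokens must fit inside the cache.
    if input_ids_len + max_length > cache_size:
        raise ValueError(
            f"prompt ({input_ids_len} tokens) plus max_length ({max_length}) "
            f"exceeds cache_size ({cache_size})"
        )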

Streaming isn’t necessary for now, we can just use generator.generate.

Speaking of streaming, I was planning to open a PR for it soon. Do you have any thoughts? My main concern is that wav_tokenizer decoding in chunks often adds "clicks" (also, small chunks sound bad, but that can be avoided). Proof of concept:

outestreaming.mp4

In terms of methods, I think adding a separate generate_stream method that yields numpy arrays playable with sounddevice would suffice? One worry I have is that anything inside the for i in interface.generate_stream(): loop would block (even with an async implementation), so that should be made explicitly clear or somehow avoided.
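
For example, a caller might consume it like this (generate_stream and the shape of its yielded chunks are the proposal above, not an existing method):

import sounddevice as sd

for chunk in interface.generate_stream(
    text="Speech synthesis is the artificial production of human speech.",
    speaker=speaker,
    temperature=0.1,
):
    # Each chunk would be a numpy float array of samples at 24 kHz.
    sd.play(chunk, samplerate=24000)
    sd.wait()  # blocks until this chunk finishes playing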

Everything else looks good

@edwko
Owner

edwko commented Nov 30, 2024

That sounds fine. We can use the max size from the model config as you suggested: "cache_size -> defaults to the max tokens in the model." I've also noticed that I haven't updated "max_position_embeddings" in the model config, which technically only supports 4096, not 32768. I think we can retain the rest of the "max_length" checks I've already added.

For the condition "input_ids + max_new_tokens < cache length", we can add this. However, it will throw an error anyway for llama.cpp, HF, and EXL2, so it might not be necessary.

Speaking of streaming, I was planning to open a PR for it soon. Do you have any thoughts? My main concern is that wav_tokenizer decoding in chunks often adds "clicks" (also that small chunks sound bad but that can be avoided). Proof of concept

The issue is that most audio encoders tend to struggle with small decoding chunks. For instance, decoding just a few words at a time often results in degraded quality, clicks, and changes in intonation between chunks. From my testing, WavTokenizer actually performed the best even with smaller chunks; while the words were understandable, there were still occasional intonation shifts or odd audio artifacts.

One way to address this would be to find a sweet spot for chunking, such as decoding 5–6 words at a time. This should also dynamically adjust based on the total output. For example, if the output is only 6 words, chunking it into 4, 2 or 3, 3 might make the end result sound more natural.

For generating audio, we could detect the <|code_end|> token during generation, add the associated word with its audio tags to a buffer list, and check if the buffer has reached the chunk size. Once it does, we could retrieve the audio tokens and spin up a thread to play the audio while the model continues generating and buffering subsequent chunks.

I feel like this logic should live outside so that it can be easily integrated with HF and llama.cpp.
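
A rough, backend-agnostic sketch of that buffering idea (the token_stream iterable, the decode_chunk callable, and the way the end-of-word marker is surfaced are stand-ins, not the project's actual API):

def chunked_audio(token_stream, decode_chunk, chunk_words=6, end_marker="<|code_end|>"):
    # Buffer audio tokens word by word; each end_marker closes one word.
    # Once chunk_words words have accumulated, decode them into one audio chunk.
    buffer = []
    words_in_buffer = 0
    for token_text, token_id in token_stream:
        buffer.append(token_id)
        if token_text == end_marker:
            words_in_buffer += 1
            if words_in_buffer >= chunk_words:
                yield decode_chunk(buffer)
                buffer, words_in_buffer = [], 0
    if buffer:
        # Decode whatever is left at the end of generation.
        yield decode_chunk(buffer)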

Let me know when this PR is ready, and I’ll merge it :)

@Ednaordinary
Contributor Author

Coincidentally, that's about exactly what I've done in my proof of concept. 8 tokens seems like a decent sweet spot, though the clicking seems basically unavoidable when the audio transitions from one chunk to the next. I can reduce it partially by fading in and out, but it's not perfect, so I also do a bit of additional chunk logic to compensate, especially for long sentences. The decode token count should definitely be settable via a parameter too.

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2DynamicJob, ExLlamaV2Sampler
from outetts.version.v1.prompt_processor import PromptProcessor
from outetts.wav_tokenizer.audio_codec import AudioCodec
import json
import torch
import sounddevice as sd
import time
import threading

model_path = "./OuteTTS-0.2-500M-exl2-b6.0-hb8"
speaker_path = "male.json"

audio_codec = AudioCodec("cuda", None)
processor = PromptProcessor("OuteAI/OuteTTS-0.2-500M", ["en"])
with open(speaker_path, "r") as f:
    speaker = json.load(f)

config = ExLlamaV2Config(model_path)
config.arch_compat_overrides()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)
model.load_autosplit(cache, progress=False)
tokenizer = ExLlamaV2Tokenizer(config)

audio_queue = []

def player():
    # Playback thread: polls the shared list and plays queued chunks in order.
    global audio_queue
    while True:
        time.sleep(0.01)
        while audio_queue != []:
            sd.play(audio_queue[0], samplerate=24000)
            sd.wait()
            audio_queue.pop(0)

threading.Thread(target=player, daemon=True).start()

def fade_out(audio, length):
    # Cube-root fade over the last `length` samples to soften the chunk boundary.
    length = min(length, len(audio) - 1)
    for edx, i in enumerate(range(len(audio) - 1 - length, len(audio) - 1)):
        audio[i] *= ((length - edx) / length) ** (1 / 3)
    return audio

def fade_in(audio, length):
    # Mirror of fade_out for the first `length` samples.
    length = min(length, len(audio) - 1)
    for edx, i in enumerate(range(length)):
        audio[i] *= (edx / length) ** (1 / 3)
    return audio

while True:
    prompt = input(">")

    prompts = prompt.split(".")

    for prompt in prompts:
        prompt = processor.get_completion_prompt(prompt, "en", speaker)
        input_ids = processor.tokenizer.encode(
            prompt,
            add_special_tokens=False,
            return_tensors="pt",
        ).to("cpu")
        job = ExLlamaV2DynamicJob(input_ids=input_ids, max_new_tokens=4096)
        generator = ExLlamaV2DynamicGenerator(
            model = model,
            cache = cache,
            tokenizer = tokenizer,
            gen_settings = ExLlamaV2Sampler.Settings(token_repetition_penalty=1.1, temperature=0.25, min_p=0.1)
        )
        generator.enqueue(job)

        def stream():
            # Yields audio token ids, and yields True at chunk boundaries and at EOS.
            eos = False
            do_yield = 0
            yield_thresh = 10
            while not eos:
                results = generator.iterate()
                for result in results:
                    assert result["job"] == job
                    if result["stage"] == "streaming":
                        outtext = result.get("text", "").strip()
                        if outtext != "":
                            # Text tokens (word markers) count toward the chunk threshold.
                            print(outtext)
                            do_yield += 1
                            if do_yield == yield_thresh:
                                if yield_thresh < 50:
                                    # Grow the chunk size so later chunks decode less often.
                                    yield_thresh = int(yield_thresh ** 1.25)
                                yield True
                                do_yield = 0
                        else:
                            yield int(result.get("token_ids", "")[0][0])
                        if int(result.get("token_ids", "")[0][0]) == tokenizer.eos_token_id:
                            yield True
                            return
        def decoder():
            # Collects audio token ids until a chunk boundary, then decodes and fades the chunk.
            tokens = []
            for i in stream():
                if isinstance(i, bool):
                    output = processor.extract_audio_from_tokens(tokens)
                    if output != []:
                        output = audio_codec.decode(torch.tensor([output], dtype=torch.int64).to(audio_codec.device)).squeeze().cpu().numpy()
                        yield fade_in(fade_out(output, 500), 500)
                        tokens = []
                else:
                    tokens.append(i)

        def feed_queue():
            # Feeds decoded chunks into the playback queue consumed by the player thread.
            global audio_queue
            for i in decoder():
                audio_queue.append(i)

        feed_queue()

I've also noticed that I haven't updated "max_position_embeddings" in the model config, which technically only supports 4096, not 32768

Isn't 32768 the max for input_ids and output concatenated? My thinking was that max_position_embeddings/max_seq_len was 32768 and the maximum output ids per forward pass was 4096 (i.e. Llama's context is 128k, but it can only output 4096 tokens per forward pass, which also happens to make its actual max_position_embeddings 131k).

Let me know when this PR is ready, and I’ll merge it :)

I'll make a final pass now and let you know once I'm done 🚀

@Ednaordinary
Contributor Author

I'm gonna comment out check_max_length(cfg.cache_size, config["max_length"]) for the moment, until max_position_embeddings is clearer

@Ednaordinary
Contributor Author

Okay, this should be good to merge once max_position_embeddings is resolved. I've tested with HF, GGUF, and EXL2.

@edwko
Owner

edwko commented Nov 30, 2024

Isn't 32768 the max for input_ids and output concatenated? My thinking was that max_position_embeddings/max_seq_len was 32768 and the maximum output ids per forward pass was 4096 (i.e. Llama's context is 128k, but it can only output 4096 tokens per forward pass, which also happens to make its actual max_position_embeddings 131k).

max_position_embeddings refers to the maximum number of tokens the model can process in a single sequence. This includes both input tokens and any generated output tokens if they are concatenated. So, if it's set to 4096, the model cannot handle sequences up to 32,768 tokens without updating this parameter.

This value essentially sets the limit for how many tokens the model can process without a significant drop in performance. In this case, the model was trained with a batch size of 4096 tokens, which is why perplexity and performance start to degrade beyond that point.

@Ednaordinary
Contributor Author

I think 8192 would be a reasonable default for cache_size, but I'll let you decide the final behavior.

@edwko edwko merged commit ffd8179 into edwko:main Nov 30, 2024
@edwko
Owner

edwko commented Nov 30, 2024

I think the initial idea you had, "cache_size -> defaults to the max tokens in the model," works well. It also avoids errors when max new tokens and cache size are the same. So let’s go with that. I’ve made the final tweaks and removed the cache settings, opting to use max_seq_length instead for a broader setting. I’ve merged it with these changes.

Now, onto streaming: when you're ready, make a PR and I'll help out. I was also thinking that using a while loop to detect audio might not be the best approach; we should use an atomic queue for the audio instead. We could also start with an initial chunk size of 8 words, as you mentioned, and then allow larger chunks dynamically depending on GPU processing speed. As the GPU finishes generating the initial chunk, it could handle larger outputs, like 15 words, which should help mitigate these issues. We'd also need to set a minimum amount to buffer for when the GPU isn't ready yet, based on generation speed. I'll need to run some tests to figure out how this can be added.
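
A minimal sketch of the queue-based playback idea (queue.Queue stands in for the "atomic queue"; generate_stream is the proposed method discussed above, not an existing one):

import queue
import threading
import sounddevice as sd

audio_queue = queue.Queue()

def player():
    # Blocks on the queue instead of busy-waiting; a None sentinel stops playback.
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            break
        sd.play(chunk, samplerate=24000)
        sd.wait()

threading.Thread(target=player, daemon=True).start()

for chunk in interface.generate_stream(text=text, speaker=speaker):
    audio_queue.put(chunk)
audio_queue.put(None)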

@Ednaordinary
Contributor Author

I was also thinking that using a while loop to detect audio might not be the best approach; we should use an atomic queue for the audio instead.

For actual usage, I was going to leave it up to the user and just provide a generate_stream() method, though for the readme example an atomic queue works.

and then allow larger chunks dynamically depending on the GPU processing speed

The main issue here is that not all tokens are equal length-wise. Though I think the current chunk length times the real-time factor times some constant less than one would work fine.
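
Roughly, that heuristic might look like this (the names and the safety constant are illustrative):

def next_chunk_tokens(current_chunk_tokens, real_time_factor, safety=0.8):
    # real_time_factor: seconds of audio produced per second of generation time.
    # While the current chunk plays, generation can get this far ahead;
    # the constant below one leaves a safety margin.
    return max(1, int(current_chunk_tokens * real_time_factor * safety))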

I think the best approach would be to reliably remove the clicks between audio chunks, since then there's no need to chunk dynamically. I'll mess around with this a bit and see if I can come up with anything good.

@edwko
Owner

edwko commented Nov 30, 2024

Great, I’ll start looking into this as well :)
