Finetuning models for audio_ctx support #1951
Wow! This looks like very important work. Would love to give this a try at some point. Any reason to prefer …? Do the fine-tuned models work only for a specific value of …?
@ggerganov The audio context can be varied from roughly 100 all the way up to 1500. You can sometimes use very low values, but they may produce sketchy results or fall into a repetition loop in the same way. More short examples in the training data may help mitigate this issue; I used contexts as low as …

The reason I used …

This doesn't mean the normal model will always work just fine if you just use …
Was the original tiny model's audio_ctx scaled from 0 to 1500 just from the audio length being 0 to 30 s? It would be interesting to see the results of scaling 0 to 1500 but adding 256 (capped at 1500), as that is what I'm using and it seems to work pretty well.
@soupslurpr I did some preliminary tests on this. It seems like the tiny.en model doesn't react well to just +256; it needs +512 to finally get to something usable, whereas the finetuned model stays roughly stable. (The top and bottom graphs show identical data, just zoomed differently; 2048 is the baseline, and it all gets clamped to 1500.) Of course, this is all evaluated with the HF Transformers implementation, which probably differs from whisper.cpp in many respects, but I'd say it's a good indication for the finetuned models.
Hm, interesting. So the whisper.cpp implementation might be more resilient to lower audio_ctx. Edit: actually, maybe +512 is needed for whisper.cpp as well. I didn't notice because I was including silence at the end, which was increasing the audio_ctx used.
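Putting the numbers from this exchange together: Whisper's encoder maps 30 s of audio to 1500 positions (50 per second), and the comments above suggest adding a safety margin on top of the frames the clip actually needs (+256 for the finetuned models, +512 for the stock tiny.en) before clamping to 1500. A minimal sketch of that calculation (the function name and default margin are illustrative, not from the repo):

```python
import math

def compute_audio_ctx(audio_seconds: float, margin: int = 256, max_ctx: int = 1500) -> int:
    """Estimate an audio_ctx value for a clip of the given length.

    Whisper's encoder maps 30 s of audio to 1500 positions (50 per second).
    `margin` is the extra headroom discussed above: +256 for the finetuned
    models, while +512 seems to be needed for the stock tiny.en model.
    """
    frames = math.ceil(audio_seconds * 50)  # 1500 / 30 = 50 positions per second
    return min(frames + margin, max_ctx)
```

For example, `compute_audio_ctx(10.0)` gives 756, while a full 30 s clip clamps to 1500.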
@abb128, thanks so much. It's very helpful for this PoC: the performance of real-time transcription on a Xiaomi 14 was improved very significantly. Before fine-tune: … After fine-tune: … But this fine-tune also brings an unexpected side effect: whisper.cpp would produce incorrect/repeated tokens, or the app would crash suddenly.

By the way, sorry to interrupt: I really do NOT know the meaning of the above code. Could you help point out what/where the problem is in the above code? Thanks so much.
It's possible to fine-tune models to be able to use audio_ctx more freely, without affecting their knowledge too much.
Example with default settings (notice the ~3x speed difference):
Example with greedy search and no timestamps (notice it doesn't repeat itself):
Models and method are available here: https://github.com/futo-org/whisper-acft
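To try this out with whisper.cpp's example CLI, the audio context is set with the `-ac` / `--audio-ctx` flag. A sketch of building such an invocation, reusing the 50-positions-per-second rule from the thread (the binary and model paths are placeholders, not from the linked repo):

```python
def build_whisper_cmd(model: str, wav: str, audio_seconds: float,
                      margin: int = 256, max_ctx: int = 1500) -> list[str]:
    """Build a whisper.cpp CLI command with a reduced audio context.

    whisper.cpp's example CLI accepts `-ac N` (`--audio-ctx N`) to limit the
    encoder context. The paths here are illustrative placeholders.
    """
    audio_ctx = min(round(audio_seconds * 50) + margin, max_ctx)  # 50 positions/s
    return ["./main", "-m", model, "-f", wav, "-ac", str(audio_ctx)]
```

For an 11 s clip this yields `-ac 806`, well below the default 1500, which is where the speedup comes from.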
Feedback and comments are welcome! The finetuning method probably isn't perfect; it may need fewer epochs, more data, or less aggressive random subtraction from the context, but it still produces good results.
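The "randomly subtracting from context" step can be sketched roughly as follows. This is a guess at the idea, not the repo's actual code: for each training example, draw an audio_ctx uniformly between the minimum the clip needs and the full 1500, so the model sees many shortened contexts during finetuning.

```python
import math
import random

def sample_training_audio_ctx(audio_seconds: float, rng: random.Random,
                              max_ctx: int = 1500) -> int:
    """Pick a randomly reduced audio_ctx for one training example.

    Illustrative only: draws uniformly between the positions the clip
    actually occupies (50 per second) and the full context, so the model
    is exposed to shortened contexts of varying sizes.
    """
    needed = min(math.ceil(audio_seconds * 50), max_ctx)
    return rng.randint(needed, max_ctx)
```

Subtracting too much (i.e., sampling below `needed`) would truncate the audio itself, which may be part of why overly aggressive subtraction hurts.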
Related to #137, but I thought I'd open a new issue to discuss this specific method.
(Edit: The original results were from an older version of whisper.cpp, which showed a 10x speed difference with default beam search. I have updated the results to a56f435; the speed difference is no longer as significant, but is still there.)