Finetuning models for audio_ctx support #1951
Wow! This looks like very important work. Would love to give this a try at some point. Any reason to prefer …? Do the fine-tuned models work only for a specific value of …?
@ggerganov The audio context can be varied from roughly 100 all the way up to 1500. You can sometimes use very low values, but they may produce sketchy results or fall into a repetition loop in the same way. More short examples in the training data may help mitigate this issue; I used contexts as low as …

The reason I used …

This doesn't mean the normal model will always work just fine if you just use …
Was the original tiny model's audio_ctx scaled from 0 to 1500 just from the audio length being 0 to 30 s? It would be interesting to see the results of scaling 0 to 1500 but adding 256 (capped at 1500), as that is what I'm using and it seems to work pretty well.
@soupslurpr I did some preliminary tests on this. It seems like the tiny.en model doesn't react well to just +256; it needs +512 to finally get to something usable, whereas the finetuned model stays roughly stable. (The top and bottom graphs show identical data, just zoomed differently; 2048 is the baseline, and it all gets clamped to 1500.) Of course, this is all evaluated with the HF Transformers implementation, which probably differs from whisper.cpp in many respects, but I'd say it's a good indication for the finetuned models.
Hm, interesting. So the whisper.cpp implementation might be more resilient to lower audio_ctx. Edit: actually, maybe +512 is needed for whisper.cpp as well. I didn't notice because I was including silence at the end, which was increasing the audio_ctx used.
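Putting the numbers from this exchange together: Whisper's encoder maps 30 s of audio to 1500 positions (50 per second), and the comments above suggest adding a safety margin on top of the frames the clip actually needs (+256 for the finetuned models, +512 for the stock tiny.en) before clamping to 1500. A minimal sketch of that calculation (the function name and default margin are illustrative, not from the repo):

```python
import math

def compute_audio_ctx(audio_seconds: float, margin: int = 256, max_ctx: int = 1500) -> int:
    """Estimate an audio_ctx value for a clip of the given length.

    Whisper's encoder maps 30 s of audio to 1500 positions (50 per second).
    `margin` is the extra headroom discussed above: +256 for the finetuned
    models, while +512 seems to be needed for the stock tiny.en model.
    """
    frames = math.ceil(audio_seconds * 50)  # 1500 / 30 = 50 positions per second
    return min(frames + margin, max_ctx)
```

For example, `compute_audio_ctx(10.0)` gives 756, while a full 30 s clip clamps to 1500.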
@abb128, thanks so much. It's very helpful for this PoC: the performance of real-time transcription on a Xiaomi 14 was improved very significantly. Before fine-tune: … After fine-tune: … But this fine-tune also brings an unexpected side effect: whisper.cpp would produce incorrect/repeated tokens, or the app would crash suddenly.

By the way, sorry to interrupt: I really do NOT know the meaning of the above code. Could you help point out what/where the problem is in the above code? Thanks so much.
It's possible to fine-tune models to be able to use audio_ctx more freely, without affecting their knowledge too much.
Example with default settings (notice the ~3x speed difference):
Example with greedy search and no timestamps (notice it doesn't repeat itself):
Models and method are available here: https://github.com/futo-org/whisper-acft
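To try this out with whisper.cpp's example CLI, the audio context is set with the `-ac` / `--audio-ctx` flag. A sketch of building such an invocation, reusing the 50-positions-per-second rule from the thread (the binary and model paths are placeholders, not from the linked repo):

```python
def build_whisper_cmd(model: str, wav: str, audio_seconds: float,
                      margin: int = 256, max_ctx: int = 1500) -> list[str]:
    """Build a whisper.cpp CLI command with a reduced audio context.

    whisper.cpp's example CLI accepts `-ac N` (`--audio-ctx N`) to limit the
    encoder context. The paths here are illustrative placeholders.
    """
    audio_ctx = min(round(audio_seconds * 50) + margin, max_ctx)  # 50 positions/s
    return ["./main", "-m", model, "-f", wav, "-ac", str(audio_ctx)]
```

For an 11 s clip this yields `-ac 806`, well below the default 1500, which is where the speedup comes from.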
Feedback and comments are welcome! The finetuning method probably isn't perfect; it may need fewer epochs, more data, or less aggressive random subtraction from the context, but it still produces good results.
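The "randomly subtracting from context" step can be sketched roughly as follows. This is a guess at the idea, not the repo's actual code: for each training example, draw an audio_ctx uniformly between the minimum the clip needs and the full 1500, so the model sees many shortened contexts during finetuning.

```python
import math
import random

def sample_training_audio_ctx(audio_seconds: float, rng: random.Random,
                              max_ctx: int = 1500) -> int:
    """Pick a randomly reduced audio_ctx for one training example.

    Illustrative only: draws uniformly between the positions the clip
    actually occupies (50 per second) and the full context, so the model
    is exposed to shortened contexts of varying sizes.
    """
    needed = min(math.ceil(audio_seconds * 50), max_ctx)
    return rng.randint(needed, max_ctx)
```

Subtracting too much (i.e., sampling below `needed`) would truncate the audio itself, which may be part of why overly aggressive subtraction hurts.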
Related to #137, but I thought I'd open a new issue to discuss this specific method.
(Edit: The original results were from an older version of whisper.cpp, which showed a 10x speed difference with default beam search. I have updated the results to a56f435; the speed difference is no longer as significant, but is still there.)