Faster streaming support #137
Oh my god! I just tested that and it seems to work o.O Very interesting...
Here is the diff if someone wants to play with it:
diff --git a/whisper.cpp b/whisper.cpp
index 7078863..df47bff 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -1053,6 +1053,7 @@ static bool whisper_model_load(const std::string & fname, whisper_context & wctx
return false;
}
}
+ model.e_pe->ne[1] /= 2;
fin.close();
@@ -1076,7 +1077,7 @@ static bool whisper_encode(
const auto & mel_inp = wctx.mel;
const auto & hparams = model.hparams;
- const int n_ctx = hparams.n_audio_ctx;
+ const int n_ctx = hparams.n_audio_ctx/2;
const int n_state = hparams.n_audio_state;
const int n_head = hparams.n_audio_head;
const int n_layer = hparams.n_audio_layer;
@@ -1474,7 +1475,7 @@ static bool whisper_decode(
const int n_layer = hparams.n_text_layer;
const int N = n_tokens;
- const int M = hparams.n_audio_ctx;
+ const int M = hparams.n_audio_ctx/2;
struct ggml_init_params params = {
.mem_size = wctx.buf_compute.size(),
@ggerganov amazing work on this project 👍
Noise removal, such as with rnnoise (https://jmvalin.ca/demo/rnnoise/), also increases Whisper's accuracy, even if the audio is sped up a bit. I also had a good experience with "audio companding / compression" in sox along with a band-pass filter. Maybe simple audio filtering could help accuracy when running at 2x... Here is an example: https://github.com/audo-ai/magic-mic/tree/main/src-native
Thanks for the ideas - these are very interesting to me.

Initially, I was very hopeful about the idea of increasing the tempo combined with partially evaluating the encoder, as described in the original comment above by @ameenba. It seemed that it would be possible, for example, to process 4-second chunks in real time by speeding up the tempo to get 2-second chunks and running the Encoder with a correspondingly reduced context. Unfortunately, my experiments show that when I "trim" the Encoder that much, the Decoding fails miserably - usually repeating the first word over and over again. I tried to resolve that by "stitching" sequential Encoder outputs, hoping to stabilize the Decoder, but it didn't work out for some reason. In fact, using anything less than about half of the original audio context degrades the results noticeably.

Sure, you can use the speed-up to process long audio (i.e. > 30 sec) faster, but for that use case you can simply use 3rd-party software to pre-process the audio (e.g. VAD plus a tempo increase) and feed it to Whisper in the standard way.

Still, the original idea above for audio context reduction combined with the tempo speed-up remains available to experiment with.

Next, I want to add an asynchronous interface to the C API, which will make it easier to create streaming applications for iOS and WebAssembly. With the current performance, I expect to be able to create some nice real-time transcription demos on these platforms.
@ggerganov thanks for the investigation! I also found 5-10s to be close to the limit. I haven't fully investigated from my end in Python, but I'm curious about what you've tried as well. My next steps were:
I haven't added the -su option at the moment; I'm trying to investigate this streaming approach in isolation before adding loss from other sources. Although, like you said, I'm sure there's a lot of preprocessing that could be done to the input, such as speeding up the speech or cutting noise from outside the vocal range.

I'd like to clarify the current streaming approach too: if you were to encode in 5s increments for a 15s audio sample, you would effectively generate 3 encodings with 5, 10, and 15s of relevant audio each (expanding window), with a padded spectrogram, so each chunk costs the same computation as a 30s encoding, and for the final translation only the last encoding is relevant. In this example, the encoder is 3x less efficient for streaming, while the decoding is roughly the same.
Also, one important note about the encoder - for each new chunk I use the corresponding position embeddings, extracted as a sub-tensor from the full position embedding tensor (see whisper.cpp, lines 1148 to 1151 at commit 1ac2665).
Not sure if this is correct, but I think it makes sense, because this is the way to tell the transformer how the audio is ordered in time. Another thing I have in mind: appending the encoder outputs in this way is obviously not equivalent to processing the audio all together. The question is - does this produce at least an approximation of the original result, or is there a better strategy to mix them?
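For illustration, here is a rough sketch of how such a sub-tensor view could be taken with ggml. This is not the exact whisper.cpp code - the helper name and the `offset` parameter are just illustrative - but it shows the idea of viewing only the chunk's slice of the pre-computed position embeddings:

```cpp
#include "ggml.h"

// Rough sketch, not the exact whisper.cpp code: view the slice of the
// pre-computed position embedding tensor that corresponds to the current
// audio chunk. e_pe has shape [n_state, n_audio_ctx]; n_ctx is the reduced
// per-chunk context and offset is the index of the chunk's first frame.
static struct ggml_tensor * view_position_embeddings(
        struct ggml_context * ctx0,
        struct ggml_tensor  * e_pe,
        int n_ctx,
        int offset) {
    return ggml_view_2d(ctx0, e_pe,
            e_pe->ne[0],              // n_state (embedding dimension)
            n_ctx,                    // number of frames in this chunk
            e_pe->nb[1],              // row stride is unchanged
            offset * e_pe->nb[1]);    // skip the frames of previous chunks
}
```

The returned view can then be added to the convolution output for the chunk, the same way the full embedding tensor is used in the non-chunked path.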
Correct. To clarify, the decoding time is proportional to the number of decoded tokens.
Yes, I agree.
I guess the logic below may work better for a streaming application:

pcmf32.resize(n_samples_new);
#if 0
#endif
// Run inference here
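As a rough sketch of this kind of sliding-buffer handling for streaming (a guess under assumptions; names such as push_audio and max_samples are illustrative and not taken from whisper.cpp):

```cpp
#include <vector>
#include <cstddef>

// Keep at most 'max_samples' of the most recent audio: append the new samples,
// drop the oldest ones if the buffer grows too large, then run inference on
// the current buffer (e.g. with whisper_full).
static void push_audio(std::vector<float> & pcmf32,
                       const float * samples, size_t n_new, size_t max_samples) {
    pcmf32.insert(pcmf32.end(), samples, samples + n_new);
    if (pcmf32.size() > max_samples) {
        // discard the oldest samples so the buffer never exceeds 'max_samples'
        pcmf32.erase(pcmf32.begin(), pcmf32.begin() + (pcmf32.size() - max_samples));
    }
    // run inference on pcmf32 here
}
```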
Hi guys, I've been tinkering with whisper.cpp for a while to explore how well it works for streaming/real-time applications. So far I'm able to get decent transcription quality with reasonable CPU usage, and I would like to share my progress. First, a demo video of the result (out.mp4). The CPU usage on an M1 MacBook Air is around 100% to 200% (1 core = 100%), leaving much of the resources for other applications that people may want to run alongside, for example a video call. Note that I wrapped whisper.cpp in an Electron app with node-addon-api, so the actual %CPU used by whisper should be even lower. Here are the tricks I'm using:
Here is my code if you want to check it out: https://github.com/tableos/mina/blob/main/native/stt_whisper.cc#L109. At this moment I'm satisfied with the quality-performance balance of this setup for streaming/real-time use, and I'm mostly wondering if there is other low-hanging fruit to try to further improve quality or performance (for example, maybe using a more accurate VAD such as Silero VAD).
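For anyone curious what a very basic VAD gate looks like, here is a minimal energy-based sketch (much cruder than Silero VAD, and not the code from the repository linked above):

```cpp
#include <vector>
#include <cmath>

// Minimal energy-based VAD sketch: returns true if the chunk's RMS energy
// exceeds a threshold, i.e. the chunk probably contains speech. The threshold
// value is arbitrary and would need tuning for a real microphone setup.
static bool chunk_has_speech(const std::vector<float> & pcmf32, float threshold = 0.01f) {
    if (pcmf32.empty()) return false;
    double energy = 0.0;
    for (float s : pcmf32) energy += (double) s * s;
    const double rms = std::sqrt(energy / pcmf32.size());
    return rms > threshold;
}
```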
Is it possible to use other available languages in this?
@dragonman225 One idea could be to use the text from the previous sentence as a prompt / context for the currently transcribed sentence. This is prototyped in the stream example (whisper.cpp/examples/stream/stream.cpp, line 111 at commit 7282e21). This is something that might or might not help improve the quality of the transcription - I haven't done much investigation.
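As a rough illustration of this idea with the whisper.cpp C API (a sketch only, assuming the prompt_tokens / prompt_n_tokens fields of whisper_full_params and the whisper_tokenize helper, which may differ between versions):

```cpp
#include "whisper.h"
#include <string>
#include <vector>

// Transcribe the new audio while passing the previously transcribed sentence
// as prompt context. pcmf32 holds 16 kHz mono float samples.
static int transcribe_with_context(
        struct whisper_context * ctx,
        const std::string & prev_text,
        const std::vector<float> & pcmf32) {
    std::vector<whisper_token> prompt(1024);
    const int n = whisper_tokenize(ctx, prev_text.c_str(), prompt.data(), (int) prompt.size());

    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    if (n > 0) {
        params.prompt_tokens   = prompt.data();   // tokens of the previous sentence
        params.prompt_n_tokens = n;
    }
    return whisper_full(ctx, params, pcmf32.data(), (int) pcmf32.size());
}
```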
@ggerganov thanks for your suggestions! I tried changing the settings you suggested.

It's a good idea to use previously transcribed text as the prompt, but soon after trying it I found it may not be needed in my app. Since my app uses VAD, when it's transcribing a sentence it keeps accumulating audio chunks for that sentence and transcribes them as a whole. This gives whisper the opportunity to correct the transcription of old audio chunks as new ones come in, until the end of the sentence is detected by VAD (you can see this in action by looking at the blue sentences in my demo video). Ideally, if the VAD is accurate, I think every sentence would be pretty independent of the others, and keeping context in whisper is maybe not necessary. In sum, I'll keep this option in mind.
Yes, you can change the language.
It processes any non-multiple-of-32/64 leftovers without using SIMD - see lines 737 to 740 at commit 4e0b206.
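The general pattern looks roughly like this (a sketch, not the actual ggml code): the bulk of the data is processed in SIMD-width blocks, and the remaining few elements are handled with a plain scalar loop at the end.

```cpp
// Sketch of SIMD-style blocking with a scalar tail for leftovers.
static float dot_product(const float * x, const float * y, int n) {
    const int n32 = n & ~31;   // largest multiple of 32 that fits in n
    float sum = 0.0f;

    for (int i = 0; i < n32; i += 32) {
        // in ggml this block is vectorized with AVX/NEON intrinsics;
        // shown here as a plain loop for clarity
        for (int j = 0; j < 32; ++j) {
            sum += x[i + j] * y[i + j];
        }
    }
    // leftover elements (n % 32) are handled without SIMD
    for (int i = n32; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```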
Quick question, is it possible to use
Noise removal is a strange one, as an absolutely great noise removal tool at the RTX Voice level, https://github.com/Rikorose/DeepFilterNet, seems to reduce Whisper accuracy, which is strange as the results are superb to the human ear. RNNoise is pretty stinky IMO - the model and tech were a very early method and it often fails. I have a hunch that filtering unknown noise is likely not the way to go, and that, like Google's VoiceFilter-Lite, BSS (blind source separation) extracting the signal containing a known KW is likely to give better low-load solutions. Similarly there are monophonic BSS methods, but the multichannel ones based on TDOA seem to provide better results for the load.
In my experience there is no pre-processing that increases Whisper accuracy - at least I haven't seen one so far. Noise robustness is already great, and my theory is that any additional processing that wasn't part of the training process introduces artifacts, even if unnoticeable to humans 🤔.
I am not sure what you are testing, but noise robustness is not great, as it's not something Whisper is designed to do. Whisper is tuned to human hearing and drops the low end - likely 80 Hz to 8 kHz, split into 40 bins of the spectrogram. There is a whole range of speech enhancement techniques that can increase Whisper accuracy, as they increase the SNR and reduce noise and reverberation: https://arxiv.org/pdf/2202.05993.pdf

I haven't seen a Wav2Vec2.cpp, but maybe one exists, and it's worth checking out PyTorch Mobile or some of the ONNX implementations. I didn't really do much of an empirical test with DeepFilterNet and my memory is hazy - thinking about it, it could have been the load on the RK3588 testbed, as I'm struggling to remember if I tried it on my Xeon desktop.
Ok, let me rephrase that: it's pretty great compared to other open ASR systems I've tested ;-). Btw I'm testing it simply by using it in real-world scenarios, like my SEPIA smart speaker in the living room with all kinds of noise from the street (people, cars, trains etc.) and occasional background from the TV.
If they are open-source and run in real-time on a consumer CPU with a single or dual-mic setup, I'd be happy to test them :-). Google's VoiceFilter-Lite looks interesting - are there code examples?
The WER vs SNR results are in https://cdn.openai.com/papers/whisper.pdf, page 8, and you can see it's not drastically different from the others they tested. There is an mp3 on the GitHub page demonstrating DeepFilterNet noise reduction, and likely you could create a training dataset by mixing the LibriSpeech 960h dataset with noise using something like https://github.com/microsoft/MS-SNSD and then processing it with DeepFilterNet. However, Whisper doesn't like DeepFilterNet, and I presume others could be the same, but I just haven't tried. I was replying to the post about RNNoise as I have used it before and have heard the artefacts it can produce, so I was surprised that it works with Whisper as said above.
Nope, some brief papers, but they have kept it to themselves, and I think it runs on both their phones and their smart speaker devices.
I think the main difference here to my real-world experience is that librispeech-clean is simply a bad benchmark. It's the benchmark you use when you want to show your best WER numbers 😅. Maybe my subjective experience is different because I start at a point where WER in general is already much higher, or maybe it's not even the noise but the robustness of Whisper when working with low-quality recordings and voice-assistant-specific vocabulary 🤷. In my tests Whisper small already outperforms Nvidia's CTC Conformer models btw.
Maybe we could fine-tune Whisper on DeepFilterNet-filtered audio? 🤔
They didn't use plain librispeech-clean - they used librispeech-clean mixed with white noise, and librispeech-clean mixed with noise via https://code.soundsoftware.ac.uk/projects/audio-degradation-toolbox, to create noisy datasets to benchmark.
It's interesting though, as Whisper's accuracy is a bit of a cheat: it merely feeds tokens into an LLM with its 30-sec beam search, and it recreates the sentence as the most plausible one in the model. That is why it hallucinates and WER rockets on short command sentences. Still, load and more empirical testing are needed, and maybe KW command sentences could be created by TTS and added to the dataset. As said, Whisper is great at what it does, but for smart-speaker type applications I have become less of a fan.
Yes, I know, but my point is that librispeech-clean is just a bad baseline, and when you mix it with noise you still build on perfectly clean recordings. Maybe your WER drops from 4% to 6% or even 8%; that doesn't matter when your real-life baseline for Whisper is, let's say, ~10% and the CTC Conformer starts at ~15%. My impression of noise-robustness then is simply general robustness to bad recordings.
Hallucination is a problem, that's true, but not as relevant as you might think when your input is captured in a well-defined window. Also, short commands work pretty well. I'm almost exclusively testing on 3-6s recordings and it still beats everything else in this area. Whisper small can capture most of what you need for daily voice-assistant usage, especially in English, from timers to shopping lists, from smart home control to music, navigation and general QA.
That's bothering me as well :-/
You are having another heat-sink moment. You have to have a clean base so that when you mix in noise you know exactly how much noise is in the dataset, making the test empirical and worthwhile. OpenAI would look a bit foolish to say it seems OK with noise and leave it at that...
That is great, but it means absolutely nothing, as a 'bad recording' is a recording with noise and other imperfections, and OpenAI or myself would also like to apply a metric to test that - in terms of how bad - by adding noise at various dB levels.
That means absolutely nothing without specific empirical testing and quoting what it beats. What exactly is "everything else in this area"? Are we talking wav2vec or the latest Conformers that actually beat Whisper's WER, which are general purpose and could be trained with command sentences and LMs? I get that you have added Whisper to your SEPIA framework and you're a fan, but without true empirical testing I will prefer to keep to what OpenAI posts and my own tests rather than what you may believe. wav2vec2 with a command-domain dataset and LM would likely beat Whisper, and newer Conformer models such as https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_transducer_xlarge definitely would. There are a number of Conformer models out in the wild - have you tested and trained any of these and tried to create some sort of documented empirical test that contains metrics?

The Google Conformer or stt_conformer referenced in the above OpenAI paper (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_transducer_xlarge) already beats Whisper in this field. It was Google who first wrote the paper https://arxiv.org/pdf/2005.08100.pdf, but I think WeNet 2 is Conformer-based and a ready-made end-to-end ASR ready to test: https://github.com/wenet-e2e/wenet. I don't know what its WER levels are, but a few exist on GitHub and Nvidia, and which is best I don't know.

It would be so good if Georgi could take a full-blown open-source transformer model like the newer Conformers and apply optimised ggml code, as likely every aspect of Whisper - from WER to memory footprint to performance - could be beaten considerably, trained with speech-enhancement preprocessing, noise levels pushed below 0 dB, and domain-specific datasets and LMs.
It seems you need some kind of scientific documentation for everything I say, and at the moment I can't give you that, because I don't have the time to collect statistically relevant data and write down the results the way I would write scientific work. Fair enough, I see the problem.
The main problem I have with Whisper is speed, which ultimately brings us back to the title of this thread, "Faster streaming support" 😉.
No, I think it's relevant, as streaming support is fast if run on Apple silicon, a first-class citizen - optimized via ARM NEON, the Accelerate framework and Core ML. For smart speakers/assistants, multi-language in one model is rarely needed anyway, so much smaller, more efficient single-language models can be used, loaded by something like Silero VAD; in the confines of a home, multilingual use is unlikely, but VAD can load the last language used.

Also, streaming itself isn't the natural way the model works - it's a pure hack, because it is an LLM and it wants an optimised default of a 30-sec beam search to create context, even though that can be shortened. An Apple M1 can take a full command sentence and return with very little latency in a non-streaming, race-till-idle scenario, and that single unit can be shared amongst various rooms and users. So for me, faster streaming is not an issue with whisper.cpp - it's just the hardware that some are trying to use.

The only reason I came into this issue is that I noticed RNNoise was mentioned, whilst I have found Whisper doesn't seem to like its input. DeepFilterNet has a LADSPA plugin that can be used with ALSA, PipeWire or PulseAudio, and maybe it was the setup I had, or maybe just load, that caused the problems, but I have really come to doubt that Whisper itself is the solution, so I never progressed.

Also, I doubt cloning consumer smart assistants is valid when a single unit can service an array of distributed wireless mics/KWS and deliver audio over modern wireless true-audio systems, rather than relatively poor mono 'smart speaker' systems that require a complete additional unit just to do stereo equally poorly. So you need a single central ASR to service multiple zones, input is by wireless mic, and embedded is the wrong platform for whisper.cpp.

So, as I say, streaming works on the right hardware, and likely it's not needed, as race-till-idle on that equipment produces such small latency that hacking streaming probably isn't needed. Once again: streaming is fast enough on the right hardware, and if you have the right hardware it's likely you don't need streaming input anyway, but hey. There is nothing wrong with the current streaming or performance of whisper.cpp, apart from some trying to force a square peg into a round hole.
1. Understand the existing Whisper architecture: Before we can modify the decoder/encoder blocks in Whisper, we need a clear understanding of how the existing architecture works. This includes understanding the data flow, the transformer model, and how the audio chunks are processed.
2. Modify the decoder/encoder blocks: Once we have a clear understanding of the existing architecture, we can modify the decoder/encoder blocks to handle audio chunks of 10-200 ms. This may involve changing the block size, modifying the input/output buffers, and adjusting the processing pipeline.
3. Update the transformer model's weights: To update the transformer model's weights with every audio chunk, we need to implement an online learning algorithm. This involves computing the gradient of the loss function with respect to the model parameters for each audio chunk and updating the parameters using stochastic gradient descent or a similar algorithm.
4. Take advantage of SIMD instruction sets: To optimize the performance of the real-time streaming functionality, we can take advantage of SIMD instruction sets in C++. This involves using vectorized operations to process multiple audio chunks in parallel, which can significantly improve processing speed.
5. Test and refine the implementation: Once we have implemented the real-time streaming functionality, we need to test it thoroughly to ensure that it works as expected. We may need to refine the implementation based on the test results and user feedback.
Have you tried building the spectrogram and encoder output in smaller chunks and appending? I think the spectrogram should generate fairly easily with minimal noise depending on the size of the chunk, and the encoder output can also be appended with sufficiently large chunks.
So the encoder instead takes in Bx80xN as its input and outputs Bx(N/2)x(embedding size); if you wanted to send 1s of audio into the tiny.en model, for example, 1x80x100 -> 1x50x384. This should result in much faster processing for short clips (when the audio clip is < 30s) and allows real-time streaming without much wasted computation (like having to calculate a full x1500 encoding for each chunk of audio).
Some noise may be introduced at various chunk sizes (the spectrogram chunk size can be independent of the encoder chunk size), and some overlap of the spectrogram input/encoder output may help further reduce that noise. This allows for better scheduled deployment, where the decoder, encoder, and spectrogram can run on different threads at the same time to produce our transcription.
Choosing when to decode will be another challenge as you don't want to decode if a full word is not complete in the encoding, but there are definitely solutions around that as well.
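To make the proposed appending pipeline concrete, here is a rough C++ sketch. All types and names here are hypothetical placeholders for illustration - none of them are whisper.cpp APIs - and the dimensions simply mirror the tiny.en example above.

```cpp
#include <vector>

// Hypothetical container for the running encoder output.
struct EncodedChunk {
    std::vector<float> data;   // n_frames * n_state values
    int n_frames = 0;          // encoder output frames accumulated so far
    int n_state  = 384;        // embedding size (384 for tiny.en)
};

// Append a freshly encoded chunk to the running encoder output, so the decoder
// can operate on everything encoded so far without re-encoding old audio.
static void append_encoded(EncodedChunk & all, const EncodedChunk & chunk) {
    all.data.insert(all.data.end(), chunk.data.begin(), chunk.data.end());
    all.n_frames += chunk.n_frames;
    all.n_state   = chunk.n_state;
}
```

With overlapping chunks, the overlapping frames would be dropped (or blended) before appending, which is one way to reduce the boundary noise mentioned above.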