Faster streaming support #137

Open
ameenba opened this issue Nov 10, 2022 · 27 comments
Labels
ideas Interesting ideas for experimentation

Comments

@ameenba

ameenba commented Nov 10, 2022

Have you tried building the spectrogram and encoder output in smaller chunks and appending? I think the spectrogram should generate fairly easily with minimal noise depending on the size of the chunk, and the encoder output can also be appended with sufficiently large chunks.

So the encoder would instead take Bx80xN as its input and output Bx(N/2)x(embedding size). If you wanted to send 1s of audio into the tiny.en model, for example, that would be 1x80x100->1x50x384. This should result in much faster processing for short clips (when the audio clip is <30s), and allows real-time streaming without much wasted computation (like having to calculate a full x1500 encoding for each chunk of audio).
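
The shape arithmetic above can be sketched as follows. This is only an illustration, assuming 100 mel frames per second of audio and an encoder that halves the time axis, as in the 1x80x100->1x50x384 example; `mel_frames` and `encoder_frames` are hypothetical helper names, not whisper.cpp functions.

```cpp
#include <cassert>

// Assumption: 16 kHz audio with a 10 ms hop, i.e. 100 mel frames per second.
constexpr int kMelFramesPerSecond = 100;

// Length N of the mel spectrogram (Bx80xN) for a clip of the given duration.
int mel_frames(int seconds) {
    assert(seconds >= 0);
    return seconds * kMelFramesPerSecond;
}

// The encoder halves the time axis, so Bx80xN becomes Bx(N/2)x(embedding size).
int encoder_frames(int n_mel_frames) {
    assert(n_mel_frames >= 0);
    return n_mel_frames / 2;
}
```

With these helpers, 1 s of audio gives 100 mel frames and 50 encoder frames, and a full 30 s window gives the usual 1500.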

Some noise may be introduced at various chunk sizes (the spectrogram chunk size can be independent of the encoder chunk size), and some overlap of the spectrogram input/encoder output may further help to reduce that noise. This allows for a better-scheduled deployment where the decoder, encoder, and spectrogram can run on different threads at the same time to produce our transcription.

Choosing when to decode will be another challenge as you don't want to decode if a full word is not complete in the encoding, but there are definitely solutions around that as well.

@ggerganov ggerganov added the ideas Interesting ideas for experimentation label Nov 10, 2022
@ggerganov
Owner

Oh my god! I just tested that and it seems to work o.O
I reduced the audio context by half and the performance doubled. jfk.wav transcribed correctly!

Very interesting...
I have to go now, but I think this is a big breakthrough. Need to double check.

@ggerganov
Owner

Here is the diff if someone wants to play with it:

git diff
diff --git a/whisper.cpp b/whisper.cpp
index 7078863..df47bff 100644
--- a/whisper.cpp
+++ b/whisper.cpp
@@ -1053,6 +1053,7 @@ static bool whisper_model_load(const std::string & fname, whisper_context & wctx
             return false;
         }
     }
+    model.e_pe->ne[1] /= 2;
 
     fin.close();
 
@@ -1076,7 +1077,7 @@ static bool whisper_encode(
     const auto & mel_inp = wctx.mel;
     const auto & hparams = model.hparams;
 
-    const int n_ctx   = hparams.n_audio_ctx;
+    const int n_ctx   = hparams.n_audio_ctx/2;
     const int n_state = hparams.n_audio_state;
     const int n_head  = hparams.n_audio_head;
     const int n_layer = hparams.n_audio_layer;
@@ -1474,7 +1475,7 @@ static bool whisper_decode(
     const int n_layer = hparams.n_text_layer;
 
     const int N = n_tokens;
-    const int M = hparams.n_audio_ctx;
+    const int M = hparams.n_audio_ctx/2;
 
     struct ggml_init_params params = {
             .mem_size   = wctx.buf_compute.size(),

@gitslav

gitslav commented Nov 11, 2022

@ggerganov amazing work on this project 👍
Another demo idea: Typing comments while doing an asciinema cast is so 2021...

@trholding
Contributor

trholding commented Nov 16, 2022

#10 (comment)

Noise removal such as with rnnoise https://jmvalin.ca/demo/rnnoise/ also increases accuracy of whisper even if sped up a bit. Also had a good experience with "audio companding / compression" in sox along with band pass. Maybe simple audio filtering could help in accuracy when run 2x...

Here is an example: https://github.com/audo-ai/magic-mic/tree/main/src-native

@ggerganov
Owner

@trholding

Thanks for the ideas - these are very interesting to me.

Initially, I was very hopeful about the idea of increasing the tempo combined with partially evaluating the encoder, as described in the original comment above by @ameenba. It seemed that it would be possible, for example, to process 4 second chunks in real-time by speeding up the tempo to get 2 second chunks and running the Encoder with a context of N/2=100 (i.e. 2 sec) instead of the original N/2=1500 (30 sec), which would in theory give about x15 speed-up for the Encoder. This could easily run in real-time using the base model on a Raspberry Pi 4, for example.
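
The arithmetic behind that estimate can be written out as a small sketch. It assumes 50 encoder frames per second of audio and an encoder cost that grows linearly with the context, which is the idealized "in theory" case; both helper names are hypothetical.

```cpp
#include <cassert>

// Encoder frames needed for a chunk after a tempo change.
// Assumption: 50 encoder frames per second of (sped-up) audio.
int frames_after_tempo(double chunk_seconds, double tempo) {
    assert(chunk_seconds >= 0.0 && tempo > 0.0);
    return static_cast<int>(chunk_seconds / tempo * 50.0);
}

// Idealized speed-up, assuming encoder cost grows linearly with the context.
double ideal_encoder_speedup(int full_ctx, int reduced_ctx) {
    assert(full_ctx > 0 && reduced_ctx > 0);
    return static_cast<double>(full_ctx) / reduced_ctx;
}
```

A 4 s chunk at 2x tempo needs 100 frames, and going from a context of 1500 to 100 gives the x15 figure quoted above.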

Unfortunately, my experiments show that when I "trim" the Encoder so much, the Decoding fails miserably - usually repeating the first word over and over again. I tried to resolve that by "stitching" sequential Encoder outputs hoping to stabilize the Decoder, but it didn't work out for some reason.

In fact, using an audio context smaller than N/2 = 512 (i.e. ~10 sec) seems to be unstable, and at this point the idea of a tempo increase becomes much less attractive in the context of real-time streaming.

Sure, you can use it to process long audio (i.e. > 30 sec) faster, but for this use case you can simply use 3rd-party software to pre-process the audio (e.g. sox, ffmpeg, etc.) before feeding it to Whisper, so it no longer makes sense to support it at the whisper.cpp level.

VAD and rnnoise could be useful, but the same argument as above applies here as well. In general, unless it is some very simple and robust pre-processing algorithm that can be easily implemented in C/C++ without any extra dependencies, I don't have interest in adding it to the project.

Still, the original idea above for audio context reduction with N/2 = 512 (without tempo increase) so far seems to be useful. For example, it is now possible to run the tiny model at 4 second time steps in real-time on RPi4, which I think is getting close to something that can have actual real-world applications.

Next, I want to add an asynchronous interface to the C API, which will make it easier to create streaming applications for iOS and WebAssembly. With the current performance, I expect to be able to create some nice real-time transcription demos on these platforms.

@ameenba
Author

ameenba commented Nov 17, 2022

@ggerganov thanks for the investigation!

I also found 5-10s to be close to the limit. I haven't fully investigated on my end in Python, but I'm curious about what you've tried as well. My next steps were:

  • see how much overlap (of mel or encoder) would reduce noise
  • maybe there's a smarter way instead of overlapping? I'm not an audio expert
  • allow KV audio cache size to be dynamic, and investigate effects of changing encoding on each decode cycle? I.e. if the cache size is changing on each new encoding, so first set of decodes have a 5s cache, 1x250x384, and the next set has 10s due to appending, do we need to start over with decoding? Will an offset prevent this? Can we treat this as a totally new decode set with maybe some overlap to prevent words being cut off?

I haven't added the -su option at the moment, I'm trying to investigate this streaming approach in isolation before adding loss from other sources. Although I'm sure like you said, there's a lot of preprocessing that could be done to the input, such as speeding up the speech or cutting noise from outside the vocal range?

I'd like to clarify the current streaming approach too: if you were to encode in 5s increments for a 15s audio sample, you would effectively generate 3 encodings with 5, 10, and 15s of relevant audio each (expanding window), with a padded spectrogram, so each chunk would cost the same computation as a 30s encoding, and for the final translation, only the last encoding is relevant. In this example, the encoder is 3x less efficient for streaming, while the decoding is roughly the same.
So the first big jump possible would be a dynamically sized encoder that allows for a 5s, 10s, and 15s encoding (afterwards you pad the encoding output directly before decoding, rather than padding the spectrogram), which will roughly translate to the same computation as a single 30s encoding. And of course the dream is to be able to run this as three independent 5s encodings, effectively reducing our computation to 50% of a single 30s encoding, or however long the audio is divided by 30s.
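
The three strategies can be compared with a rough cost model. This is a sketch under the assumption that encoder cost is proportional to the length of audio encoded, measured here in "seconds of audio pushed through the encoder"; the function names are hypothetical.

```cpp
#include <cassert>

// Current approach: every chunk is padded to the full window (30 s).
int cost_padded(int n_chunks, int window_seconds) {
    assert(n_chunks >= 0 && window_seconds > 0);
    return n_chunks * window_seconds;
}

// Dynamically sized encoder on an expanding window: 5 s, 10 s, 15 s, ...
int cost_expanding(int n_chunks, int chunk_seconds) {
    assert(n_chunks >= 0 && chunk_seconds > 0);
    int total = 0;
    for (int i = 1; i <= n_chunks; ++i) {
        total += i * chunk_seconds;
    }
    return total;
}

// Independent chunks: each chunk is encoded exactly once.
int cost_independent(int n_chunks, int chunk_seconds) {
    assert(n_chunks >= 0 && chunk_seconds > 0);
    return n_chunks * chunk_seconds;
}
```

For three 5 s chunks this gives 90, 30 and 15 units: three full 30 s encodings, the equivalent of one 30 s encoding, and half of one 30 s encoding.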

@ggerganov
Owner

@ameenba

  • Haven't performed many tests regarding the overlap - I experimented mostly with 0 ms overlap of audio between sequential chunks, because it is most obvious how to append the encoder outputs one after the other.
  • I tested only with fixed-size KV cache, padding with 0s for the chunks that are not yet available. Always starting the decoding from the beginning. I'll probably test your dynamic cache-size suggestion when I have some time

Also, one important note about the encoder - for each new chunk I use the corresponding position embeddings, by extracting a sub-tensor from the encoder.positional_embedding tensor:

whisper.cpp/whisper.cpp

Lines 1148 to 1151 in 1ac2665

const size_t e_pe_stride = model.e_pe->ne[0]*ggml_element_size(model.e_pe);
const size_t e_pe_offset = model.e_pe->ne[0]*ggml_element_size(model.e_pe)*n_ctx*iter;
struct ggml_tensor * e_pe = ggml_view_2d(ctx0, model.e_pe, model.e_pe->ne[0], n_ctx, e_pe_stride, e_pe_offset);

Not sure if this is correct, but I think it makes sense because this is the way to tell the transformer how the audio is ordered in time.
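
The same sub-tensor selection can be illustrated without ggml. This is a minimal sketch that assumes the positional embedding is stored row-major as one row of `n_state` floats per time position; `positional_rows` is a hypothetical helper, and the offset is counted in elements rather than bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the rows [iter*n_ctx, (iter+1)*n_ctx) of a row-major
// positional embedding with n_state values per row.
std::vector<float> positional_rows(const std::vector<float>& pe,
                                   std::size_t n_state,
                                   std::size_t n_ctx,
                                   std::size_t iter) {
    assert(n_state > 0 && n_ctx > 0);
    const std::size_t count  = n_state * n_ctx;         // elements in one chunk
    const std::size_t offset = n_state * n_ctx * iter;  // start of chunk `iter`
    assert(offset + count <= pe.size());

    const auto first = pe.begin() + static_cast<std::ptrdiff_t>(offset);
    return std::vector<float>(first, first + static_cast<std::ptrdiff_t>(count));
}
```

Chunk 0 gets the first `n_ctx` positions, chunk 1 the next `n_ctx`, and so on, so each chunk sees the positions that match its place in time.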

Another thing that I have in mind is that it is obvious that appending the encoder outputs in this way is not equivalent to processing the audio all together. The question is - does this produce at least an approximation of the original result, or is there a better strategy to mix them?

I haven't added the -su option at the moment, I'm trying to investigate this streaming approach in isolation before adding loss from other sources.

The -su option might become useful only if the encoding starts working with ~1 second chunks. So I wouldn't worry about it for now.

I'd like to clarify the current streaming approach too: if you were to encode in 5s increments for a 15s audio sample, you would effectively generate 3 encodings with 5, 10, and 15s of relevant audio each (expanding window), with a padded spectrogram, so each chunk would cost the same computation as a 30s encoding, and for the final translation, only the last encoding is relevant. In this example, the encoder is 3x less efficient for streaming, while the decoding is roughly the same.

Correct. To clarify, the decoding time is proportional to the number of decoded tokens.

So the first big jump possible would be a dynamically sized encoder that allows for a 5s, 10s, and 15s encoding (afterwards you pad the encoding output directly before decoding, rather than padding the spectrogram), which will roughly translate to the same computation as a single 30s encoding. And of course the dream is to be able to run this as three independent 5s encodings, effectively reducing our computation to 50% of a single 30s encoding, or however long the audio is divided by 30s.

Yes, I agree.
An alternative real-time "adaptive" strategy that I had in mind is like this:

  • Run tiny.en on each 5 s chunk to get a rough transcription fast - i.e. for t=[0,5], t=[5, 10], t=[10, 15]
  • Then, while processing t=[15, 20], run a bigger model in parallel on t=[0, 15] and improve the transcription retrospectively

@nyadla-sys

nyadla-sys commented Dec 7, 2022

I guess the logic below may work better for the stream application:

pcmf32.resize(n_samples_new);

SDL_DequeueAudio(g_dev_id_in, pcmf32.data(), n_samples_new*sizeof(float));

#if 0
// variant 1: fixed 30 s buffer (480000 samples at 16 kHz), shifted left once it is full
static int total_samples = 0;
total_samples += n_samples_new;
if (total_samples <= 480000) {
    memcpy(&pcmf32_to_keep[total_samples - n_samples_new], pcmf32.data(), n_samples_new * sizeof(float));
} else {
    memmove(&pcmf32_to_keep[0], &pcmf32_to_keep[n_samples_new], (n_samples_30s - n_samples_new) * sizeof(float));
    memcpy(&pcmf32_to_keep[n_samples_30s - n_samples_new], pcmf32.data(), n_samples_new * sizeof(float));
}
#else
// variant 2: append the new samples, then keep only the most recent 30 s
static int n_buffer_samples = 0;

// grow the buffer whenever the new samples would not fit
if ((int) pcmf32_to_keep.size() < n_buffer_samples + n_samples_new) {
    pcmf32_to_keep.resize(n_buffer_samples + n_samples_new);
}

for (int i = 0; i < n_samples_new; i++) {
    pcmf32_to_keep[n_buffer_samples + i] = pcmf32[i];
}

n_buffer_samples += n_samples_new;

if (n_buffer_samples >= n_samples_30s) {
    // note: this overwrites pcmf32 with the last 30 s of audio
    pcmf32 = std::vector<float>(pcmf32_to_keep.end() - n_samples_30s, pcmf32_to_keep.end());
    pcmf32_to_keep = pcmf32;
    n_buffer_samples = n_samples_30s;
}
#endif

// generate the spectrogram
const auto processor_count = std::thread::hardware_concurrency();
if (!log_mel_spectrogram(pcmf32_to_keep.data(), pcmf32_to_keep.size(), WHISPER_SAMPLE_RATE, WHISPER_N_FFT, WHISPER_HOP_LENGTH, WHISPER_N_MEL, processor_count, filters, mel)) {
    fprintf(stderr, "%s: failed to compute mel spectrogram\n", __func__);
    return -1;
}

// run inference here
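
The "keep only the most recent 30 s" idea can also be isolated into a small standalone helper. This is a sketch, not part of the stream example; `append_and_trim` is a hypothetical name and the buffer limit is passed in as a sample count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Appends the incoming samples and keeps only the most recent max_samples.
void append_and_trim(std::vector<float>& history,
                     const std::vector<float>& incoming,
                     std::size_t max_samples) {
    assert(max_samples > 0);
    history.insert(history.end(), incoming.begin(), incoming.end());
    if (history.size() > max_samples) {
        // Drop the oldest samples from the front.
        history.erase(history.begin(),
                      history.end() - static_cast<std::ptrdiff_t>(max_samples));
    }
}
```

Calling it once per captured block with `max_samples` set to 30 s worth of samples would keep the buffer bounded without tracking a separate sample counter.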

@dragonman225

Hi guys, I've been tinkering with whisper.cpp for a while to explore how well it works for streaming/real-time applications. So far I've been able to get decent transcription quality with reasonable CPU usage, and I would like to share my progress.

First, a demo video of the result:

out.mp4

The CPU usage on an M1 MacBook Air is around 100% to 200% (1 core = 100%), leaving much of the resources for other applications that people may want to run alongside, for example, a video call. Note that I wrapped whisper.cpp in an Electron app with node-addon-api, so the actual %CPU used by whisper should be even lower.

Here are the tricks I'm using:

  • I use the smallest tiny.en model.
  • Whisper runs at 400ms step.
  • 4 threads seem to be the sweet spot for the 8-core M1. Using more or fewer results in worse performance.
  • Setting audio_ctx to 750 (half of the default), as explored earlier, does speed up transcription by 2x. In my case, this shows up as about 2x lower CPU usage.
  • Use simple VAD like in the "stream" example to decide when a sentence ends. My parameters for the VAD function are: 3 seconds of audio, detecting silence in the last 450ms, vad_thold 0.3, freq_thold 200.
    • If no sentence end is detected for too long (14 seconds), for example, when a person speaks so fast that there's no long enough gap, force an end and listen for a new sentence.
  • Each sentence is preceded by 0.2 seconds of audio of the previous sentence, in case the cut is in the middle of a word.
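
The segmentation rules in this list can be sketched roughly as below. The energy check is a simplified stand-in that compares the average level of the trailing window with that of the whole buffer and skips the frequency threshold entirely; it is an assumption for illustration, not the stream example's VAD function, and all three helper names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True when the last `last_ms` of audio is much quieter than the whole buffer.
// Assumption: plain average of absolute values, no high-pass filtering.
bool trailing_silence(const std::vector<float>& pcm, int sample_rate,
                      int last_ms, float vad_thold) {
    assert(sample_rate > 0 && last_ms > 0);
    const std::size_t n_last = static_cast<std::size_t>(sample_rate) * last_ms / 1000;
    if (n_last == 0 || n_last >= pcm.size()) {
        return false;  // not enough audio to decide
    }
    double energy_all = 0.0;
    double energy_last = 0.0;
    for (std::size_t i = 0; i < pcm.size(); ++i) {
        const double a = std::fabs(pcm[i]);
        energy_all += a;
        if (i >= pcm.size() - n_last) {
            energy_last += a;
        }
    }
    energy_all /= static_cast<double>(pcm.size());
    energy_last /= static_cast<double>(n_last);
    return energy_last <= vad_thold * energy_all;
}

// End the sentence on a detected pause, or force an end after max_ms.
bool should_end_sentence(bool pause_detected, int sentence_ms, int max_ms = 14000) {
    return pause_detected || sentence_ms >= max_ms;
}

// Tail of the finished sentence that gets prepended to the next one.
std::vector<float> carry_over(const std::vector<float>& sentence,
                              int sample_rate, int overlap_ms = 200) {
    assert(sample_rate > 0 && overlap_ms >= 0);
    std::size_t n = static_cast<std::size_t>(sample_rate) * overlap_ms / 1000;
    if (n > sentence.size()) {
        n = sentence.size();
    }
    return std::vector<float>(sentence.end() - static_cast<std::ptrdiff_t>(n),
                              sentence.end());
}
```

With the parameters from the list, `trailing_silence` would be called on the last 3 seconds of audio with `last_ms = 450` and `vad_thold = 0.3`, and at 16 kHz `carry_over` keeps 3200 samples.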

Here is my code if you want to check it out: https://github.com/tableos/mina/blob/main/native/stt_whisper.cc#L109

At this moment I'm satisfied with the quality-performance balance of this setup for streaming/real-time applications, and I'm mostly wondering if there are other low-hanging fruits to try to further improve quality or performance. (For example, maybe, using a more accurate VAD such as Silero VAD.)

@stevevaius2015

Is it possible to use other available languages in this?

@ggerganov
Owner

@dragonman225
Thanks for sharing this!
You might want to use audio_ctx equal to 768 - this will make processing slightly faster since it is a multiple of 32 / 64.
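
Rounding a context size up to the next such multiple is a one-liner. The helper below is a hypothetical illustration, not a whisper.cpp function.

```cpp
#include <cassert>

// Rounds value up to the nearest multiple of `multiple`.
int round_up_to_multiple(int value, int multiple) {
    assert(value >= 0 && multiple > 0);
    return (value + multiple - 1) / multiple * multiple;
}
```

For 750 this yields 768 with either 32 or 64, and 10 seconds of audio (500 encoder frames) rounds up to 512.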

One idea could be to use the text from the previous sentence as a prompt / context for the currently transcribed sentence. This is prototyped in the stream example via the --keep-context command-line argument:

fprintf(stderr, " -kc, --keep-context [%-7s] keep context between audio chunks\n", params.no_context ? "false" : "true");

This is something that might or might not help improve the quality of the transcription. I haven't done much investigation.

@dragonman225

@ggerganov thanks for your suggestions!

I tried changing audio_ctx to 768, no noticeable change in CPU usage, but it makes sense. I wonder what whisper.cpp does when audio_ctx isn't a multiple of 32 / 64, does it need to pad zeros to make it a multiple, therefore wasting some computation in the neural network?

It's a good idea to use previously transcribed text as the prompt, but soon after I tried it I found it may not be needed in my app. Since my app uses VAD, when it's transcribing a sentence, it keeps accumulating audio chunks for that sentence and transcribes them as a whole. This gives whisper the opportunity to correct the transcription of old audio chunks as new ones come in, until the end of a sentence is detected by VAD. (You can see this in action by looking at the blue sentences in my demo video.) Ideally, if VAD is accurate, I think the sentences would be pretty independent of each other, and keeping context in whisper may not be necessary. In sum, I'll keep this option in mind.

@dragonman225

@stevevaius2015

Is it possible to use other available languages in this?

Yes, you can change the language parameter like in all other examples provided by whisper.cpp. (a list of all available languages)

@ggerganov
Owner

@dragonman225

I tried changing audio_ctx to 768, no noticeable change in CPU usage, but it makes sense. I wonder what whisper.cpp does when audio_ctx isn't a multiple of 32 / 64, does it need to pad zeros to make it a multiple, therefore wasting some computation in the neural network?

It processes any non-multiple-of-32/64 leftovers without using SIMD:

whisper.cpp/ggml.c

Lines 737 to 740 in 4e0b206

// leftovers
for (int i = np; i < n; ++i) {
    sumf += x[i]*y[i];
}
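
To show where such a leftover loop sits, here is a portable sketch of a dot product that handles whole blocks first and the remainder afterwards. The block width of 32 and the function name are assumptions for illustration; the inner block loop stands in for the SIMD code path and is plain scalar C++.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

float dot_with_leftovers(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    constexpr std::size_t kStep = 32;        // assumed block width
    const std::size_t n  = x.size();
    const std::size_t np = n - (n % kStep);  // largest multiple of kStep <= n

    float sumf = 0.0f;

    // Whole blocks: this is the part a SIMD implementation would vectorize.
    for (std::size_t i = 0; i < np; i += kStep) {
        float block_sum = 0.0f;
        for (std::size_t j = 0; j < kStep; ++j) {
            block_sum += x[i + j] * y[i + j];
        }
        sumf += block_sum;
    }

    // leftovers: at most kStep - 1 elements, processed one by one
    for (std::size_t i = np; i < n; ++i) {
        sumf += x[i] * y[i];
    }
    return sumf;
}
```

When the length is a multiple of the block width (768, for example), the leftover loop does not run at all.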

@fquirin

fquirin commented Apr 23, 2023

Quick question, is it possible to use audio_ctx with main as well? I've only seen it in the stream script so far.
I'd like to try it on short files (<30s) instead of microphone input.

@StuartIanNaylor

StuartIanNaylor commented May 1, 2023

#10 (comment)

Noise removal such as with rnnoise https://jmvalin.ca/demo/rnnoise/ also increases accuracy of whisper even if sped up a bit. Also had a good experience with "audio companding / compression" in sox along with band pass. Maybe simple audio filtering could help in accuracy when run 2x...

Here is an example: https://github.com/audo-ai/magic-mic/tree/main/src-native

Noise removal is a strange one, as an absolutely great noise removal (aka RTXVoice level), https://github.com/Rikorose/DeepFilterNet, seems to reduce Whisper accuracy, which is strange as the results to the human ear are superb.

RnnNoise is pretty stinky IMO; the model and tech were a very early method and often fail.
DTLN is sort of in between DeepFilterNet and RnnNoise in terms of load (if the single-threaded tract ML framework was dropped for another, then DeepFilterNet would likely run on lesser hardware)
https://github.com/breizhn/DTLN

I have a hunch that filtering unknown noise is likely not the way to go and that, like Google Voice-Filter-Lite, BSS (blind source separation) and extracting the signal containing a known KW is likely to give better low-load solutions.
I think the esp32-s3-box does this, where it's just a basic BSS alg for a 2-channel input with a KWS on each signal.
But GitHub contains many repos, e.g. https://github.com/fakufaku/fast_bss_eval

Similarly, there are monophonic BSS methods, but the multichannel ones based on TDOA seem to provide better results for the load.
But a stereo ADC is only $13 https://plugable.com/products/usb-audio and you need a single device or you will get clock drift

@fquirin

fquirin commented May 25, 2023

DeepFilterNet seems to reduce Whisper accuracy, which is strange as the results to the human ear are superb.

In my experience there is no preprocessing that increases Whisper accuracy, at least I haven't seen one so far. Noise robustness is already great and my theory is that any additional processing that wasn't part of the training process introduces artifacts, even ones unnoticeable to humans 🤔.
If you have multiple mics and can separate the sources that is a different story.

@StuartIanNaylor

StuartIanNaylor commented May 26, 2023

Noise robustness is already great

I am not sure what you are testing, but noise robustness is not great, as it's not something Whisper is designed to do.
https://cdn.openai.com/papers/whisper.pdf has the WER vs SNR, and it's an improvement on Wav2Vec2 but still quickly degrades at relatively low levels of noise.

Whisper is tuned to human hearing; I forget the low end, but it is likely 80 Hz-8 kHz, which is split into 40 bins of the spectrogram.
That is the weird thing about DeepFilterNet, as we should be able to hear or at least see the artefacts in the spectrogram, but its fullband filter does such an amazing job that it would seem artefact free.
By artefacts I mean the narrow band of usual codec error sounds: pops, whistles and noise bleed-through.
It's a curious one, as I have tested DeepFilterNet with KWS, and the noise levels you can run at can likely beat consumer units, and yeah, whatever it is, it's unnoticeable by humans, and for ASR that really doesn't make much sense :)

There is a whole range of speech enhancement techniques that can increase Whisper accuracy as they increase the SNR and reduce noise and reverberation.
The multiple-mic algs often work in the time domain of mixed signals by comparing tracks and provide simpler and more lightweight filters than DeepFilterNet or RTXVoice.
Likely the targeted voice extraction of Google's VoiceFilter-Lite provides the best results on modest hardware by extracting the known rather than attenuating the unknown of noise.
https://arxiv.org/pdf/2106.02934.pdf
Filters, from source separation to dereverberation, are an active area of dev and often seem to be single-channel designs whose only detriment is that they also tend to be a bit heavy for embedded devices.

https://arxiv.org/pdf/2202.05993.pdf Haven't seen a Wav2Vec2.cpp, but maybe one exists, and I should really check out PyTorch Mobile or some of the ONNX implementations.
The advantage being that with full open source you can train the models; plus custom domain-specific LMs can add much accuracy in terms of command ASR and known language subsets such as a media library, and are fairly light to train as opposed to the full model.

I didn't really do much of an empirical test with DeepFilterNet and my memory is hazy; thinking about it, it could have been load, as it was on the testbed of an RK3588 and I am struggling to remember if I tried it on my Xeon desktop.
I didn't bother too much though, as likely you could feed a central Whisper via BSS+KWS on cheap and lightweight microcontrollers, as Espressif do, and create wireless distributed zonal streams already separated for Whisper.

@fquirin

fquirin commented May 26, 2023

I am not sure what you are testing but Noise robustness is not great as its not something Whisper is designed to do.

Ok let me rephrase that: It's pretty great compared to other open ASR systems I've tested ;-). Btw I'm testing it simply by using it in real world scenarios like my SEPIA smart-speaker in the living-room with all kinds of noise from the street (people, cars, train etc.) and occasional background from TV etc.

There is a whole range of speech enhancement techniques that can increase Whisper accuracy as they increase the SNR and reduce noise and reverberation.

If they are open-source and run in real-time on a consumer CPU and single or dual-mic setup I'd be happy to test them :-). "Googles VoicefilterLite" looks interesting, are there code examples?

@StuartIanNaylor

It's pretty great compared to other open ASR systems I've tested

The WER vs SNR is in https://cdn.openai.com/papers/whisper.pdf page 8 and you can see it's not drastically different to the others they tested.
The SNR really needs to be above 10 dB (a signal amplitude about 3.2x that of the noise) as it quickly fails below that, so it can be quickly swamped by noise, as noise from the street should be low level and hopefully, when you are in the house, 10 dB or less.
DeepFilterNet is sort of comparable to RTXVoice, and I am not sure what the artefacts are, whereas those from, say, RnNoise or DTLN can be heard and clearly seen in the spectrogram. I guess they must be more subtle, but Whisper doesn't like them.
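
For reference, the conversion between a dB figure and a ratio can be sketched as below, using the standard definitions (dB = 10·log10 of a power ratio, or 20·log10 of an amplitude ratio). The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Power ratio corresponding to a level difference in dB.
double db_to_power_ratio(double db) {
    assert(std::isfinite(db));
    return std::pow(10.0, db / 10.0);
}

// Amplitude ratio corresponding to a level difference in dB.
double db_to_amplitude_ratio(double db) {
    assert(std::isfinite(db));
    return std::pow(10.0, db / 20.0);
}
```

An SNR of 10 dB therefore corresponds to 10x the power, or roughly 3.16x the amplitude, of the noise.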

There is an mp3 on the GitHub page demonstrating DeepFilterNet noise reduction, and likely you could build a training dataset by using https://github.com/microsoft/MS-SNSD to add noise to the Librispeech 960h dataset and then processing the result with DeepFilterNet.

However, Whisper doesn't like DeepFilterNet and I presume others could be the same, but I just haven't tried. I was replying to the post about RnNoise, as I have used it before and have heard the artefacts it can produce, so I was surprised that it works with Whisper as said above.

Googles VoicefilterLite

Nope, there are some brief papers but they have kept it to themselves, and I think it runs on both phones and their smart speaker devices.
There are some other examples on GitHub of targeted speech separation but nothing as light as what Google seem to state.
Even if it was available, I have a hunch Google train in its effects by preprocessing a dataset, and whatever filter / alg you use, you might have to do the same or get an increase in WER.

@fquirin

fquirin commented May 26, 2023

The WER vs SNR is in https://cdn.openai.com/papers/whisper.pdf page 8 and you can see it's not drastically different to the others they tested.

I think the main difference here to my real world experience is that librispeech-clean is simply a bad benchmark. It's the benchmark you use when you want to show your best WER numbers 😅. Maybe my subjective experience is different because I start at a point where WER in general is already much higher or maybe it's not even the noise but the robustness of Whisper when working with low-quality recordings and voice assistant specific vocabulary 🤷. In my tests Whisper small already outperforms Nvidia's CTC Conformer models btw.

However Whisper doesn't like DeepFilterNet

Maybe we could fine-tune Whisper on DeepFilterNet filtered audio? 🤔

@StuartIanNaylor

StuartIanNaylor commented May 27, 2023

librispeech-clean is simply a bad benchmark

They didn't use librispeech-clean as is; they used librispeech-clean mixed with white noise and librispeech-clean mixed using https://code.soundsoftware.ac.uk/projects/audio-degradation-toolbox to create noisy datasets to benchmark.
librispeech-clean was picked as a base so they could control the SNR and make a relevant dataset.
As said it is in https://cdn.openai.com/papers/whisper.pdf page 8
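
Controlling the SNR when building such a test set comes down to scaling the noise before adding it to the clean recording. The sketch below is a generic illustration of that idea, not the procedure from the paper; `mean_power` and `mix_at_snr` are hypothetical names, and both signals are assumed to have the same length.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean power (mean of squared samples) of a signal.
double mean_power(const std::vector<float>& signal) {
    assert(!signal.empty());
    double sum = 0.0;
    for (float s : signal) {
        sum += static_cast<double>(s) * s;
    }
    return sum / static_cast<double>(signal.size());
}

// Mixes noise into speech so that the mix has the requested SNR in dB.
std::vector<float> mix_at_snr(const std::vector<float>& speech,
                              const std::vector<float>& noise,
                              double snr_db) {
    assert(speech.size() == noise.size());
    const double p_speech = mean_power(speech);
    const double p_noise  = mean_power(noise);
    assert(p_noise > 0.0);

    // Pick gain g so that p_speech / (g*g*p_noise) == 10^(snr_db/10).
    const double gain = std::sqrt(p_speech / (p_noise * std::pow(10.0, snr_db / 10.0)));

    std::vector<float> mixed(speech.size());
    for (std::size_t i = 0; i < speech.size(); ++i) {
        mixed[i] = static_cast<float>(speech[i] + gain * noise[i]);
    }
    return mixed;
}
```

Starting from a clean base is what makes the gain computation meaningful: the speech power is known, so the noise level in the mix is known too.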

Maybe we could fine-tune Whisper on DeepFilterNet filtered audio?
Maybe https://wandb.ai/parambharat/whisper_finetuning/reports/Fine-tuning-Whisper-ASR-models---VmlldzozMTEzNDE5 is a good example that links to https://huggingface.co/blog/fine-tune-whisper

It's interesting though, as Whisper accuracy is a bit of a cheat: it merely feeds tokens into an LLM with its 30 sec beam search and recreates the sentence as the most plausible in the model. That is why it hallucinates and WER rockets on short command sentences.
My interest is also smart-speaker type applications and IMO Whisper is a bad fit, so I haven't bothered exploring that further.
What @ggerganov has done with ggml and transformers is amazing, but Whisper itself as this all-in-one pseudo open-source model is a bit clunky, especially for smart-speaker type applications.

Still, load and more empirical testing are needed, and maybe KW command sentences can be created by TTS and added to the dataset. As said though, Whisper is great at what it does, but for smart-speaker type applications I have become less of a fan.

@fquirin

fquirin commented May 27, 2023

They didn't use librispeech-clean they used librispeech-clean mixed with white noise

Yes, I know, but my point is that librispeech-clean is just a bad baseline and when you mix it with noise you still build on perfectly clean recordings. Maybe your WER rises from 4% to 6% or even 8%; that doesn't matter when your real-life baseline for Whisper is, let's say, ~10% and CTC conformer starts at ~15%. My impression of noise-robustness then is simply general robustness to bad recordings.

That is why it hallucinates and WER rockets on short command sentences.

Hallucination is a problem, that's true, but not as relevant as you might think when your input is captured in a well defined window. Also short commands work pretty well. I'm almost exclusively testing on 3-6s recordings and it still beats everything else in this area. Whisper small can capture most of what you need for daily voice-assistant usage, especially in English language, from timers, to shopping-lists, from smart home control to music, navigation and general QA.

Whisper itself as this all-in-one pseudo open-source model

That's bothering me as well :-/

@StuartIanNaylor

StuartIanNaylor commented May 27, 2023

when you mix it with noise you still build on perfectly clean recordings.

You are having another heat sink moment. You have to have a clean base so when you mix noise you know exactly how much noise is in the dataset, so the test is empirical and worthwhile. OpenAI would look a bit foolish to say it seems ok with noise and leave it at that...

My impression of noise-robustness then is simply general robustness to bad recordings.

That is great but it means absolutely nothing, as a 'bad recording' is a recording with noise, and others such as OpenAI or myself would also like to apply a metric to test that in terms of how bad, by adding noise at various dB levels.

Also short commands work pretty well. I'm almost exclusively testing on 3-6s recordings and it still beats everything else in this area.

That means absolutely nothing without specific empirical testing and quoting what it beats. What exactly is everything else in this area? Are we talking wav2vec or the latest conformers that actually beat Whisper WER? Those are general purpose and could be trained with command sentences and LMs.

I get you have added Whisper to your SEPIA framework and you're a fan with no true empirical testing, and I will prefer to keep to what OpenAI posts and my own tests rather than what you may believe.
I was trying 'turn on the light'/'turn off the light' repetitively and at different distances from a mic and it wasn't great; not empirical and not worth much as a test, but the last point about the all-in-one pseudo open-source model stops me even bothering.
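
One way to put a number on a test like this is the word error rate: the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The sketch below is a plain textbook implementation with whitespace tokenization and no text normalization; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits text into whitespace-separated words.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}

// WER = (substitutions + deletions + insertions) / reference word count.
double word_error_rate(const std::string& reference, const std::string& hypothesis) {
    const std::vector<std::string> ref = split_words(reference);
    const std::vector<std::string> hyp = split_words(hypothesis);
    assert(!ref.empty());

    // Two-row Levenshtein distance over words.
    std::vector<std::size_t> prev(hyp.size() + 1);
    std::vector<std::size_t> cur(hyp.size() + 1);
    for (std::size_t j = 0; j <= hyp.size(); ++j) {
        prev[j] = j;
    }
    for (std::size_t i = 1; i <= ref.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= hyp.size(); ++j) {
            const std::size_t substitution = prev[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            const std::size_t deletion     = prev[j] + 1;
            const std::size_t insertion    = cur[j - 1] + 1;
            cur[j] = std::min({substitution, deletion, insertion});
        }
        std::swap(prev, cur);
    }
    return static_cast<double>(prev[hyp.size()]) / static_cast<double>(ref.size());
}
```

Hearing "turn off the light" when "turn on the light" was said counts as one substitution in four words, a WER of 25%, which shows how heavily a single wrong word weighs on short commands.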

wav2vec2 with a command domain dataset and LM would likely beat Whisper, and newer conformer models such as https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_transducer_xlarge definitely would.

There are a number of conformer models out in the wild; have you tested and trained any of these and tried to create some sort of documented empirical test that contains metrics? Google conformer or stt_conformer in the above OpenAI paper https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_transducer_xlarge beats Whisper already in this field. It was Google who first wrote the paper https://arxiv.org/pdf/2005.08100.pdf but I think wenet2 is conformer based and a ready-made end2end ASR ready to test. https://github.com/wenet-e2e/wenet Dunno what its WER levels are, but a few exist on GitHub and Nvidia; which is best I dunno.

It would be so good if Georgi could take a full-blown open-source transformer model like the newer conformers and apply optimised ggml code, as it's likely every aspect of Whisper, from WER and memory footprint to performance, could be beaten considerably, and it could be trained with speech enhancement preprocessing, upping noise levels to sub 0 dB, with domain-specific datasets and LMs.

@fquirin

fquirin commented May 27, 2023

It seems you need some kind of scientific documentation for everything I say and atm I can't give you that, because I don't have the time to collect statistically relevant data and write down results in a way I would write scientific work. Fair enough, I see the problem.
Here is what I did to get to "my belief" and you can decide for yourself if you want to take it seriously or not:

  • I've recorded a few dozen 3-6s long voice assistant specific commands with 3 different systems, different distances from 1-3m in English and German
  • I've tested those against all the systems from my speech-recognition-experiments repository and a few others to get empirical results for performance (speed and accuracy). For example: Vosk, Coqui, Nvidia NeMo CTC in all sizes, with and without LMs, speechcatcher (an ESPNet streaming conformer), Sherpa NCNN (next-gen Kaldi) and Whisper (several variants).
  • Most of these systems don't even have proper German models. The best one besides Whisper is probably NeMo DE conformer transducer large (it still seems better than the new fast conformer models). Whisper small beats it in roughly 8 out of 10 cases while often being 100% correct. All the CTC models generate weird vocabulary at a frequency that is just too high. I'm still trying to fix this with LMs, with mixed results.
  • Whisper small for English reaches 100% accuracy in most of my test files with a few occasional outliers. NeMo fast conformer is pretty good too, very fast (indeed), but has the typical CTC glitches, so it is often close, but hardly ever 100% correct.
  • Vosk with a custom LM based on a small 2000-sentence corpus is my favorite German base model for SEPIA right now, but it will obviously fail for large-vocab tasks like playing music by artists or navigating to a certain place. I plan to fall back to Whisper for these tasks right now.
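
For anyone who wants to put a number on comparisons like these, here is a minimal sketch of a word error rate (WER) calculation in Python. It is not the script used for the tests above; the function name `word_error_rate` and the simple lowercase/whitespace normalization are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix handled
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A transcript counts as "100% correct" in the sense used above when this returns 0.0. A single wrong word in a four-word command such as "turn on the light" gives 0.25.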

The main problem I have with Whisper is speed. Which ultimately brings us back to the title of this thread "Faster streaming support" 😉.
I'm happy to discuss Whisper results further but maybe we should move this to some place else.

@StuartIanNaylor

No, I think it's relevant, as streaming support is fast if run on Apple silicon, which is a first-class citizen: optimized via ARM NEON, Accelerate framework and Core ML.
It's when people try to squeeze Whisper onto platforms that are not really capable that it becomes an issue.

For smart speakers/assistants, multiple languages in one model are rarely needed anyway, so much smaller, more efficient single-language models can be used. Within the confines of a home, use is unlikely to be multilingual, and something like Silero VAD can load the model for the last language used.

Also, streaming isn't the natural way the model works; it's a pure hack, because it is an LLM and it wants an optimised default of 30 seconds of beam search to create context, even though that can be shortened.
Many ASR, LLM and NLP models are not natural streamers, and cloning single-use consumer devices is not needed when a central, non-streaming system with the right equipment can, thanks to diversification of use, take full samples, race-till-idle, and queue any other calls.

An Apple M1 can take a full command sentence and return with very little latency in a non-streaming race-till-idle scenario, and that single unit can be shared among various rooms and users.
It's only when someone tries to force-fit single-use streaming onto embedded devices like a Pi4, or an equally low-power architecture that often lacks the more modern instructions that ArmV8.2 gives, such as Mat/Prod, as well as co-processors and GPUs, that it becomes pretty awful.
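
As a rough illustration of the non-streaming, race-till-idle idea described above, here is a minimal Python sketch of a single central worker that takes complete utterances from a queue and handles them one at a time. The names `run_central_asr` and `transcribe`, the `(room, audio)` job tuple and the `None` sentinel are all hypothetical; `transcribe` stands in for whatever ASR call is actually used.

```python
import queue


def run_central_asr(jobs, transcribe):
    """Drain a queue of complete utterances, one at a time, in arrival order.

    jobs: a queue.Queue of (room, audio) tuples, terminated by None.
    transcribe: hypothetical callable that turns a full audio clip into text.
    """
    results = []
    while True:
        job = jobs.get()  # blocks (idles) until a room submits a clip
        if job is None:   # assumed sentinel meaning "no more work"
            break
        room, audio = job
        # The whole clip is processed at full speed before the next one
        # is picked up; other rooms simply wait in the queue.
        results.append((room, transcribe(audio)))
    return results
```

Each room's mic would put one finished recording on the queue, so the shared unit only does work when there is a full command to process.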

So for me faster streaming is not an issue with Whisper.cpp; it's just the hardware that some are trying to use.

The only reason I came into this issue is that I noticed RNNoise was mentioned, whereas I have found that Whisper doesn't seem to like its output as input. DeepFilterNet has a LADSPA plugin that can be used with ALSA, PipeWire or PulseAudio. Maybe it was the setup I had, or maybe just load, that caused the problems, but I have really come to doubt that Whisper itself is the solution, so I never progressed.

Also, I doubt cloning consumer smart assistants is valid when a single unit can service an array of distributed wireless mics/KWS and deliver audio over modern wireless true audio systems, rather than relatively poor mono 'smart speaker' systems that require a complete additional unit just to do stereo equally poorly.
The model big data has provided is relatively short-lived e-waste, and I don't think it's one open source should clone anyway.

So you need a single central ASR to service multiple zones, with input by wireless mic, and embedded is the wrong platform for Whisper.cpp.
Similarly, the assumption that the same ASR model is going to service command-based speech as well as conversational speech is likely equally wrong, and that is likely true of skill types as well.
That is where the all-in-one nature of Whisper becomes problematic. Partitioning into multiple skill-based language-subset models is the way to go, where a simple predicate ASR routes to a specific skill-based ASR. Again, you need much more than embedded hardware, but we are reaching a hardware level where a single shared unit is cost-effective and also more energy-efficient.
There are really only 3 main skill-based ASR needs: command, media and conversational. The first 2 are extremely domain-specific and should have domain-specific datasets, with LMs created on the fly for additional commands or additions to local media; only the last, conversational, needs to be a full-blown general-purpose language model.
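
To make the routing idea concrete, here is a minimal Python sketch of a predicate stage that picks one of the three skill-based recognizers. Everything in it is hypothetical: `make_router`, the `predicate` callable and the recognizer callables are placeholders for real models, and falling back to the conversational model for unknown domains is an assumption.

```python
from typing import Callable, Dict

Recognizer = Callable[[bytes], str]


def make_router(
    predicate: Callable[[bytes], str],
    recognizers: Dict[str, Recognizer],
    fallback: str = "conversational",
) -> Recognizer:
    """Build a function that sends audio to the recognizer for its domain.

    predicate: hypothetical lightweight classifier returning a domain name
               such as "command", "media" or "conversational".
    recognizers: maps each domain name to its skill-specific ASR callable.
    """
    if fallback not in recognizers:
        raise KeyError(f"no recognizer registered for fallback {fallback!r}")

    def route(audio: bytes) -> str:
        domain = predicate(audio)
        # Unknown domains go to the general-purpose model (assumption).
        recognizer = recognizers.get(domain, recognizers[fallback])
        return recognizer(audio)

    return route
```

Only the small predicate model runs on every utterance; the larger conversational model is used only when the first two domains don't match.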

So as I say, streaming works on the right hardware, and it's likely not needed, as race-till-idle on that equipment produces such small latency that hacking in streaming probably isn't necessary.
When it comes to filters, it seems Whisper is as picky as any ASR might be about the signature that speech enhancement creates, so you would likely have to train and fine-tune Whisper.
If you have to do that, why bother? You could instead create far more optimised, domain-specific models yourself with datasets for the language you use...

So once again, streaming is fast enough on the right hardware, and if you do have the right hardware it's likely you don't need streaming input anyway, but hey.
Also, to match consumer-level smart assistant noise tolerances you need to employ a filter and fine-tune Whisper, which is likely as much work as creating smaller, more efficient, fully open-source models of different types that also allow domain- and skill-specific ASR...

There is nothing wrong with current streaming or performance in Whisper.cpp, apart from some people trying to force a square peg into a round hole.

@johnhoang-star

Understand the existing Whisper architecture: Before we can modify the decoder/encoder blocks in Whisper, we need to have a clear understanding of how the existing architecture works. This includes understanding the data flow, the transformer model, and how the audio chunks are processed.

Modify the decoder/encoder blocks: Once we have a clear understanding of the existing architecture, we can modify the decoder/encoder blocks to handle audio chunks of 10-200 ms. This may involve changing the block size, modifying the input/output buffers, and adjusting the processing pipeline.
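
As a small illustration of what 10-200 ms chunks mean in practice, here is a Python sketch that splits a buffer of samples into fixed-duration chunks. It assumes 16 kHz mono input, which is the sample rate Whisper expects; the function name `split_into_chunks` is hypothetical, and this only covers buffering, not any change to the encoder/decoder themselves.

```python
def split_into_chunks(samples, sample_rate=16000, chunk_ms=100):
    """Split a sequence of audio samples into fixed-duration chunks.

    The last chunk may be shorter than the others if the buffer length
    is not a multiple of the chunk size.
    """
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_len = sample_rate * chunk_ms // 1000  # samples per chunk
    if chunk_len == 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]
```

At 16 kHz, a 10 ms chunk is 160 samples and a 200 ms chunk is 3200 samples, so one second of audio yields between 5 and 100 chunks across that range.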

Update the transformer model's weights: To update the transformer model's weights with every audio chunk, we need to implement an online learning algorithm. This involves computing the gradient of the loss function with respect to the model parameters for each audio chunk and updating the parameters using stochastic gradient descent or a similar algorithm.

Take advantage of SIMD instruction sets: To optimize the performance of the real-time streaming functionality, we can take advantage of SIMD instruction sets in C++. This involves using vectorized operations to process multiple audio chunks in parallel, which can significantly improve the processing speed.

Test and refine the implementation: Once we have implemented the real-time streaming functionality, we need to test it thoroughly to ensure that it works as expected. We may need to refine the implementation based on the test results and user feedback.
