
Can real-time transcription be achieved? #1653

Open
AimoneAndex opened this issue Dec 18, 2023 · 7 comments
Labels
question Further information is requested

Comments

@AimoneAndex

Hi there!
Thanks for your hard work!
Due to work requirements, I need to transcribe speech into text in real time, with precision like that of Whisper.
Could we use some simple method to achieve real-time transcription?
Is there corresponding code?
Or can other methods be used to achieve it?
Thank you!


themanyone commented Dec 19, 2023

I am the author of Caption Anything and Whisper Dictation, which transcribe either from the recording monitor device (what you hear) or from the microphone. They can connect to whisper, whisper-jax, whisper.cpp, or similar servers for the back-end.

The challenge was making timely recordings: buffering enough audio to produce cogent text while keeping recording times short enough to minimize delay. I had to implement a queuing system to maintain smooth streaming and prevent delays when transcribing long clips. Incoming audio clips are stored in a queue and processed one by one as resources become available, so the system transcribes continuously without being overwhelmed by too many simultaneous tasks. AI assistants were helpful in pointing out methods of doing this. Caption Anything uses a flood of 2-second clips from a mix of the actively-playing desktop streams, while Whisper Dictation uses sound-level detection to record longer clips from the microphone for better accuracy.
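A minimal sketch of that queuing approach, assuming hypothetical `capture_clip()` and `transcribe()` helpers that wrap your recorder and your Whisper backend (neither name comes from the projects above):

```python
# Producer/consumer queue: record clips on one thread, transcribe on another,
# so a slow transcription never blocks the recorder.
import queue
import threading

clip_queue = queue.Queue()  # incoming audio clips, oldest first

def recorder(capture_clip):
    """Producer: push fixed-length clips (e.g. 2 s) as they are captured."""
    while True:
        clip_queue.put(capture_clip())

def transcriber(transcribe):
    """Consumer: transcribe one clip at a time as resources become available."""
    while True:
        clip = clip_queue.get()  # blocks until a clip is ready
        print(transcribe(clip), flush=True)
        clip_queue.task_done()

# Wire up your own capture_clip/transcribe, then:
# threading.Thread(target=recorder, args=(capture_clip,), daemon=True).start()
# threading.Thread(target=transcriber, args=(transcribe,), daemon=True).start()
```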

Whisper falls behind on slow systems, and increasing the number of threads won't help, so real-time transcription isn't achievable everywhere. But speed can be improved by compiling with acceleration such as cuBLAS or CLBlast, using the tiny models, or using a client-server setup. Quantizing the models to 4-bit also cuts transcription time in half on my old laptop, so it can keep up. For speed, it is critical that the loaded model fits easily in available RAM.
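As a rough illustration of that "fits in RAM" check, here's a sketch using the third-party psutil library; `model_fits` and the safety margin are assumptions for illustration, not anything from whisper.cpp:

```python
# Check whether a model file would fit comfortably in available RAM
# before loading it. The 1.5x margin is an arbitrary safety factor.
import os
import psutil

def model_fits(model_path, margin=1.5):
    """True if the model file size times a margin fits in available RAM."""
    available = psutil.virtual_memory().available
    return os.path.getsize(model_path) * margin < available
```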


dgm3333 commented Dec 19, 2023

Did you look at the examples?

- This is a naive example of performing real-time inference on audio from your microphone.
- Real-time transcription in the browser using WebAssembly

There are possible problems with accuracy: #1641

This pull request suggests someone else is at least working on the code: #1649


themanyone commented Dec 19, 2023

Thanks for the links. Those examples should work well if you have a fairly fast machine. The problems with accuracy are probably due to buffering; that was a big challenge I had to overcome. Once they get that sorted, I would be happy to use real-time inference in my code. But for now, I will be doing the audio buffering myself. :)

Whisper Dictation does "everything else" outside the scope of whisper, such as pasting the resulting text into the active window or terminal, communicating with stable-diffusion and chat servers, speaking the results out loud, launching programs, and running editing commands with voice control. Good accuracy matters for all of this, so we do everything we can to achieve the best possible result.


dgm3333 commented Dec 19, 2023

Yeah, that's all useful. I did the automation a while ago for an internal app using native C++, libcurl, and imgui, but it would probably be a helpful integration for the Python community. I haven't looked at LangChain, AutoGPT, etc. recently, but would they benefit?

Incidentally, I've just updated some other PC components and shifted to Win 11, and real-time inference is now working perfectly, so fingers crossed it lasts. I'll have to test on some other machines to see what makes the difference.

@themanyone

Good ideas. By "buffering" I mean getting clean clips of audio, which depends a great deal on having the microphone input volume set appropriately. To improve accuracy, voice activity detection (VAD) waits for silence before sending the clip off to be transcribed, instead of cutting off in the middle. You may have fixed your mic volume level.
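For illustration, a minimal sketch of that silence-gated recording, assuming a hypothetical `next_chunk()` source of float32 audio chunks; the threshold values are placeholders, not anything from Whisper Dictation:

```python
# Keep appending chunks until the level stays quiet for a while,
# then hand the whole clip to the transcriber instead of cutting
# off mid-speech.
import numpy as np

SILENCE_RMS = 0.01    # assumed level threshold; depends on mic gain
SILENT_CHUNKS = 10    # consecutive quiet chunks that end a clip

def record_until_silence(next_chunk):
    """next_chunk() yields float32 audio arrays; returns one clip ending in silence."""
    clip, quiet = [], 0
    while quiet < SILENT_CHUNKS:
        chunk = next_chunk()
        clip.append(chunk)
        rms = float(np.sqrt(np.mean(chunk ** 2)))
        quiet = quiet + 1 if rms < SILENCE_RMS else 0
    return np.concatenate(clip)
```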

The example uses -vth 0.6 for silence detection. But audio clipping from having the mic volume set too high can severely reduce accuracy, as can setting the volume too low, which picks up only the loudest portions of speech and incorrectly identifies parts of it as silence.

Setting the volume can be tricky, and that too is outside of scope! But it should be possible for the software to detect clipping, or volume that is overall too low, attempt to be smart about it and adjust the mic mix, or, failing that, warn the user via stderr.
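A hedged sketch of that level check, assuming float32 audio in [-1, 1]; `check_levels` and both thresholds are hypothetical and would need tuning per device:

```python
# Warn on stderr if a clip looks clipped or too quiet, per the
# suggestion above. Thresholds are guesses, not measured values.
import sys
import numpy as np

def check_levels(clip, clip_thresh=0.99, low_thresh=0.02):
    """Inspect a float32 clip's peak level and warn about bad mic volume."""
    peak = float(np.max(np.abs(clip)))
    if peak >= clip_thresh:
        print("warning: input is clipping; lower the mic volume", file=sys.stderr)
    elif peak < low_thresh:
        print("warning: input is very quiet; raise the mic volume", file=sys.stderr)
```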

@bobqianic bobqianic added the question Further information is requested label Dec 23, 2023
@AimoneAndex
Author

> I am the author of Caption Anything and Whisper Dictation … (quoting @themanyone's comment above)

This is very important for my work, and I will check it carefully in my spare time. Thank you!

@AimoneAndex
Author

> Did you look at the examples? … (quoting @dgm3333's comment above)

Thank you so much!!!
