
Can real-time transcription be achieved? #1653

Open
AimoneAndex opened this issue Dec 18, 2023 · 7 comments
Labels
question Further information is requested

Comments

@AimoneAndex

Hi there!
Thanks for your hard work!
Due to work requirements, I need to transcribe speech into text in real time, with precision like that of Whisper.
Could we use some simple method to achieve real-time transcription?
Is there corresponding code?
Or can other methods be used to achieve it?
Thank you!


themanyone commented Dec 19, 2023

I am the author of Caption Anything and Whisper Dictation, which transcribe either from the recording monitor device (what you hear) or from the microphone. They can connect to whisper, whisper-jax, whisper.cpp, or similar servers for the back-end.

The challenge was making timely recordings: buffering enough audio to produce cogent text while keeping recording times short enough to minimize delay. I had to implement a queuing system to maintain smooth streaming and prevent delays when transcribing long clips. Incoming audio clips are stored in a queue and processed one by one as resources become available, so the system transcribes continuously without being overwhelmed by too many simultaneous tasks. AI assistants were helpful in pointing out methods of doing this. Caption Anything uses a flood of 2-second clips from a mix of the actively-playing desktop streams, while Whisper Dictation uses sound-level detection to record longer clips from the microphone for better accuracy.
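A minimal sketch of that queuing approach, assuming hypothetical `capture_clip()` and `transcribe()` helpers that wrap your recorder and your Whisper backend (neither name comes from the projects above):

```python
# Producer/consumer queue: record clips on one thread, transcribe on another,
# so a slow transcription never blocks the recorder.
import queue
import threading

clip_queue = queue.Queue()  # incoming audio clips, oldest first

def recorder(capture_clip):
    """Producer: push fixed-length clips (e.g. 2 s) as they are captured."""
    while True:
        clip_queue.put(capture_clip())

def transcriber(transcribe):
    """Consumer: transcribe one clip at a time as resources become available."""
    while True:
        clip = clip_queue.get()  # blocks until a clip is ready
        print(transcribe(clip), flush=True)
        clip_queue.task_done()

# Wire up your own capture_clip/transcribe, then:
# threading.Thread(target=recorder, args=(capture_clip,), daemon=True).start()
# threading.Thread(target=transcriber, args=(transcribe,), daemon=True).start()
```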

Whisper falls behind on slow systems, and increasing the number of threads won't help, so real-time transcription isn't achievable everywhere. But speed can be improved by compiling with acceleration such as cuBLAS or CLBlast, using the tiny models, or using a client-server setup. Quantizing the models to 4-bit also cuts transcription time in half on my old laptop, so it can keep up. For speed, it is critical that the loaded model fits easily in available RAM.
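As a rough illustration of that "fits in RAM" check, here's a sketch using the third-party psutil library; `model_fits` and the safety margin are assumptions for illustration, not anything from whisper.cpp:

```python
# Check whether a model file would fit comfortably in available RAM
# before loading it. The 1.5x margin is an arbitrary safety factor.
import os
import psutil

def model_fits(model_path, margin=1.5):
    """True if the model file size times a margin fits in available RAM."""
    available = psutil.virtual_memory().available
    return os.path.getsize(model_path) * margin < available
```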


dgm3333 commented Dec 19, 2023

Did you look at the examples?

- This is a naive example of performing real-time inference on audio from your microphone.
- Real-time transcription in the browser using WebAssembly

There are possible problems with accuracy: #1641

This pull request suggests someone else is at least working on the code: #1649


themanyone commented Dec 19, 2023

Thanks for the links. Those examples should work well if you have a fairly fast machine. The problems with accuracy are probably due to buffering; that was a big challenge I had to overcome. Once they get that sorted, I would be happy to use real-time inference in my code. But for now, I will be doing the audio buffering myself. :)

Whisper Dictation does "everything else" outside the scope of whisper, such as pasting the resulting text into the active window or terminal, communicating with stable-diffusion and chat servers, speaking the results out loud, launching programs, and running editing commands with voice control. Good accuracy matters for all of this, so we do everything we can to achieve the best possible result.


dgm3333 commented Dec 19, 2023

Yeah, that's all useful. I did the automation a while ago for an internal app using native C++, libcurl, and imgui, but it would probably be a helpful integration for the Python community. I haven't looked at LangChain, AutoGPT, etc. recently, but would they benefit?

Incidentally, I've just updated some other PC components and shifted to Win 11, and real-time inference is now working perfectly, so fingers crossed it lasts. I'll have to test on some other machines to see what makes the difference.

@themanyone

Good ideas. By "buffering" I mean getting clean clips of audio, which depends a great deal on having the microphone input volume set appropriately. To improve accuracy, voice activity detection (VAD) waits for silence before sending the clip off to be transcribed, instead of cutting off in the middle. You may have fixed your mic volume level.
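For illustration, a minimal sketch of that silence-gated recording, assuming a hypothetical `next_chunk()` source of float32 audio chunks; the threshold values are placeholders, not anything from Whisper Dictation:

```python
# Keep appending chunks until the level stays quiet for a while,
# then hand the whole clip to the transcriber instead of cutting
# off mid-speech.
import numpy as np

SILENCE_RMS = 0.01    # assumed level threshold; depends on mic gain
SILENT_CHUNKS = 10    # consecutive quiet chunks that end a clip

def record_until_silence(next_chunk):
    """next_chunk() yields float32 audio arrays; returns one clip ending in silence."""
    clip, quiet = [], 0
    while quiet < SILENT_CHUNKS:
        chunk = next_chunk()
        clip.append(chunk)
        rms = float(np.sqrt(np.mean(chunk ** 2)))
        quiet = quiet + 1 if rms < SILENCE_RMS else 0
    return np.concatenate(clip)
```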

The example uses -vth 0.6 for silence detection. But audio clipping from having the mic volume set too high can severely reduce accuracy, as can setting the volume too low, which picks up only the loudest portions of speech and incorrectly identifies parts of it as silence.

Setting the volume can be tricky, and that too is outside of scope! But it should be possible for the software to detect clipping, or volume that is overall too low, attempt to be smart about it and adjust the mic mix, or, failing that, warn the user via stderr.
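A hedged sketch of that level check, assuming float32 audio in [-1, 1]; `check_levels` and both thresholds are hypothetical and would need tuning per device:

```python
# Warn on stderr if a clip looks clipped or too quiet, per the
# suggestion above. Thresholds are guesses, not measured values.
import sys
import numpy as np

def check_levels(clip, clip_thresh=0.99, low_thresh=0.02):
    """Inspect a float32 clip's peak level and warn about bad mic volume."""
    peak = float(np.max(np.abs(clip)))
    if peak >= clip_thresh:
        print("warning: input is clipping; lower the mic volume", file=sys.stderr)
    elif peak < low_thresh:
        print("warning: input is very quiet; raise the mic volume", file=sys.stderr)
```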

@bobqianic bobqianic added the question Further information is requested label Dec 23, 2023
@AimoneAndex
Author

> I am the author of Caption Anything and Whisper Dictation … (quoting @themanyone's comment above)

This is very important for my work, and I will check it carefully in my spare time. Thank you!

@AimoneAndex
Author

> Did you look at the examples? … (quoting @dgm3333's comment above)

Thank you so much!!!
