When I am using real-time transcription and I am not talking, it seems to parse random text. #4
This can be fixed with VAD (voice activity detection) support, but VAD is not yet implemented.
I am trying to apply VAD in the C++ source of my project, getting ideas from this file: https://github.com/vilassn/whisper_android/blob/master/app/src/main/cpp/silent_detection.cpp I tried calculating the dB level for each input audio chunk (of BUFFER_SIZE samples), then keeping only the chunks that contain speech and inserting them into outputBuffer. Then I use this vector to compute log_mel_spectrogram(...). However, the test results gave me a completely different sentence from the original. These are the results with thresholds of -45.0, -40.0, and -35.0 (screenshots followed here):
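For reference, here is a minimal Kotlin sketch of the dB-gating approach described above. This is not the project's actual code: `chunkDb` and `gateChunks` are hypothetical names, and the samples are assumed to be floats normalized to [-1.0, 1.0].

```kotlin
import kotlin.math.log10
import kotlin.math.sqrt

// Compute the RMS level of one PCM chunk in dBFS (0 dB = full scale).
fun chunkDb(samples: FloatArray): Double {
    val rms = sqrt(samples.map { it.toDouble() * it }.average())
    return if (rms > 0.0) 20.0 * log10(rms) else -160.0 // floor for digital silence
}

// Keep only chunks whose level exceeds the threshold (e.g. -40.0 dB).
fun gateChunks(chunks: List<FloatArray>, thresholdDb: Double): List<FloatArray> =
    chunks.filter { chunkDb(it) > thresholdDb }
```

One plausible reason for the garbled transcriptions, regardless of the threshold chosen: splicing non-adjacent chunks together before computing the log-mel spectrogram destroys the temporal continuity Whisper expects, so even correctly detected speech chunks can decode into a different sentence.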
Yeah, but I don't understand how VAD can fix the random text being detected. I will check what audio is recorded and report back.
@heromanofe 512 samples are taken as a window to determine silence over 31.25 ms. If there is a sequence of silent windows, say 16 in a row, then consider there to be no voice activity (i.e. silence). In short, check for 500 ms of silence instead of 31.25 ms; 500 ms means 16 windows in sequence. I hope this works. I should check this too.
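The 500 ms rule above can be sketched like this in Kotlin. This is a hypothetical illustration, not code from the repo: `isSilent` is a naive peak-amplitude check and its threshold is an assumed value.

```kotlin
val WINDOW_SIZE = 512          // samples per window (~31 ms at 16 kHz)
val SILENT_WINDOWS_NEEDED = 16 // 16 consecutive windows ~= 500 ms

// Naive per-window silence check: no sample exceeds the amplitude threshold.
fun isSilent(window: FloatArray, threshold: Float = 0.01f): Boolean =
    window.all { kotlin.math.abs(it) < threshold }

// Returns true if the buffer ends with at least 16 consecutive silent windows.
fun endsInSilence(samples: FloatArray): Boolean {
    var run = 0
    var i = 0
    while (i + WINDOW_SIZE <= samples.size) {
        run = if (isSilent(samples.copyOfRange(i, i + WINDOW_SIZE))) run + 1 else 0
        i += WINDOW_SIZE
    }
    return run >= SILENT_WINDOWS_NEEDED
}
```

The key point is that the run counter resets on any speech window, so a single 31.25 ms dip in energy is not mistaken for end-of-utterance.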
I've noticed an interesting thing: I have a multi-lang model, and it translates my speech when I think it shouldn't.
@heromanofe Yes, this is the default behaviour for other languages: if the input language is other than English, it translates to English. We need to regenerate the model with the required configuration.
Speaking of which, I would be interested in generating those .bin and .tflite files myself, or at least having some place where I can download other models. I will check in 1-2 hours what Whisper receives from the recorder.
https://1drv.ms/u/s!AgXqUQNVnl-xmZ07Nq71pVUibaZUOg?e=blb6zR <-- OneDrive link; if you want, I can send the file another way. 2023-12-07 18:18:47.100 16170-16184
Okay, you were so right :D I remembered that I had looked into VAD before. I implemented https://github.com/gkonovalov/android-vad with `VadYamnet vad = Vad.builder()` ... `SoundCategory soundCategory = vad.classifyAudio(samples);` and the result is this: 2023-12-07 19:27:45.835 7830-8027 Recorder com. Here is the OneDrive link to the file:
Has your problem been solved?
It was a VAD problem, though I wouldn't be celebrating just yet. I noticed there is some speech it detected as silence instead :D I need to fine-tune it, but then it's working 100% :P Thanks for your work.
Can you guide me on how to run the project from the repo https://github.com/gkonovalov/android-vad? I ran it, but when I clicked record, even though I was talking, the result was "Noise detected". I don't understand how it works.
Quick update on my situation: I decided to write Kotlin code for real-time recognition. It works very simply: I take your recording system and just leave out the 1-second-chunks part, and then in my code I have a system for tracking a timeout.

2023-12-11 19:53:26.300 30867-31012 WHISPER: New State com.ERPStudio.ErpDroid W READY

I am making a 2bl app, and I need both: TTS, which like here can be slow, and commands (like "start X, do Y"), which ideally should be very quick; but this 3-second delay is too much for me. What can you suggest for speed optimization? Keep in mind I am currently using whisper-tiny.tflite, i.e. the multi-lang model. Would using the English-only model speed things up?
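The timeout tracking described above could look something like the following sketch. `ChunkFlusher` is an invented name, the 500 ms value is an assumption, and the per-window VAD decision is assumed to come from elsewhere (e.g. the android-vad library mentioned earlier).

```kotlin
// Accumulate audio windows and flush the whole utterance to the transcriber
// once speech has been seen and then gone quiet for `timeoutMs`.
class ChunkFlusher(
    private val timeoutMs: Long,
    private val onFlush: (List<FloatArray>) -> Unit
) {
    private val buffer = mutableListOf<FloatArray>()
    private var lastSpeechMs = -1L // -1 means no speech seen since last flush

    // Call once per audio window with a VAD decision and a timestamp.
    fun feed(window: FloatArray, isSpeech: Boolean, nowMs: Long) {
        buffer.add(window)
        if (isSpeech) lastSpeechMs = nowMs
        // Flush only after speech occurred and then stayed quiet long enough.
        if (lastSpeechMs >= 0 && nowMs - lastSpeechMs >= timeoutMs) {
            onFlush(buffer.toList())
            buffer.clear()
            lastSpeechMs = -1L
        }
    }
}
```

Flushing whole utterances instead of fixed 1-second chunks avoids cutting words in half, which is one of the things that makes fixed chunking produce odd transcriptions.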
Transcription time varies from device to device. On a high-end device, transcription time will be lower. You can debug what is taking the most time.
Hi, first of all, thanks for the hard work. Is there a solution to the silence issue? I don't speak, there is complete silence, and words still come back to me.
@matanel-6over6 scroll up for screenshots; here is the library: https://github.com/gkonovalov/android-vad
@heromanofe Thanks for the quick reply. What should I take from the project I mentioned into Vilassn's project?
You add that library as a dependency in Gradle.
Do I need to add what you marked to the recorder class?
It's in the screenshot above; the implementation line goes in Gradle (app/build.gradle).
@heromanofe Yes, I understand, thank you very much.
@heromanofe Working great. Thank you very much.
I was able to set up the model and it works really great. My code is:
```kotlin
private fun testAudio() {
    // Initialize Whisper
    val mWhisper = Whisper(this) // Create Whisper instance

    // Load model and vocabulary for Whisper
    val basePath = Global.fileOperations.getOutputDirectory("/Models", this)!!.path
    val modelPath = basePath + "/whisper-tiny.tflite" // Provide model file path

    // Set a listener for Whisper to handle updates and results
    // Set a listener for Recorder to handle updates and audio data
    mRecorder.setListener(object : IRecorderListener {
        override fun onUpdateReceived(message: String) {
            // Handle Recorder status updates
        }
        // (rest of the listener was cut off in the original comment)
    })
}
```
It seemed to return:

[audioRecordData][fine] 5s(f:5014 m:0 s:0) : pid 8824 uid 10419 sessionId 41305 sr 16000 ch 1 fmt 1
I'll make a hole in the hole.

then 2 times this:

[audioRecordData][fine] 10s(f:10000 m:0 s:0) : pid 8824 uid 10419 sessionId 41305 sr 16000 ch 1 fmt 1

then:

I'll be back with a little .... <== repeated a lot
Thanks for your hard work :P