clip_timestamps does not work across multiple files [faster-whisper 1.0.2] #839
Comments
I tried to trace how the relevant variables are passed in transcribe.py by printing the length of clip_timestamps (the argument passed to transcribe()), options.clip_timestamps, and seek_clips.
It seems that on the second file, options.clip_timestamps is not correctly updated from the passed clip_timestamps argument, which results in the wrong seek_clips being used.
Changed the code to update the options object instead of the TranscriptionOptions class, which was likely the cause of the unexpected behaviour.
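To make the failure mode concrete, here is a self-contained sketch of the class-versus-instance pitfall. The names mirror faster-whisper's TranscriptionOptions, but the snippet is a simplified stand-in for illustration, not the library's actual code:

```python
from typing import List, NamedTuple, Union

class TranscriptionOptions(NamedTuple):
    clip_timestamps: Union[str, List[float]]

def parse_clips(options: TranscriptionOptions) -> List[float]:
    if isinstance(options.clip_timestamps, str):
        parsed = [float(ts) for ts in options.clip_timestamps.split(",")]
        # Buggy pattern: assigning to the CLASS replaces the NamedTuple field
        # descriptor with a plain list. NamedTuple instances have no per-instance
        # __dict__, so every later instance's .clip_timestamps lookup now returns
        # this stale value instead of its own tuple slot.
        TranscriptionOptions.clip_timestamps = parsed
        # Fixed pattern: derive a new instance instead of mutating the class:
        # options = options._replace(clip_timestamps=parsed)
    return options.clip_timestamps

print(parse_clips(TranscriptionOptions("0,10")))   # [0.0, 10.0]
print(parse_clips(TranscriptionOptions("20,30")))  # also [0.0, 10.0]: the second
# call sees the stale class attribute, never the new string, so its own
# timestamps are silently ignored
```

This matches the reported symptom: the first file's parsed timestamps persist at class level and shadow whatever is passed for subsequent files.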
Was able to make a simple fix for this problem and created pull request #842.
@nonnoxer, thanks for your PR. I merged it.
I was testing faster-whisper with 2 very long audio files (about 30 min each). Both were generated using the GigaSpeech dataset, with long.wav being many audio files concatenated into a continuous file and silence.wav being audio files joined with 3-minute-long complete silences in between. I then used Silero VAD externally to generate speech timestamps for each file before passing these timestamps through the clip_timestamps parameter.
When testing this functionality on just silence.wav, the transcript generated was as expected. However, when running the model on long.wav first (where there was some hallucination) and then on silence.wav, the silence.wav transcription was completely hallucinated and the provided clip_timestamps were also not used.
Code:
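A minimal sketch of the reproduction (the model size, file paths, and the comma-separated timestamp values below are placeholders, not the exact script used):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Speech timestamps generated externally with Silero VAD, flattened into the
# "start0,end0,start1,end1,..." string form that transcribe() accepts for
# clip_timestamps. The values below are placeholders.
long_clips = "0.0,12.5,30.2,55.8"
silence_clips = "180.0,210.4,400.1,430.9"

# Transcribing long.wav first is what triggers the bug on silence.wav.
for path, clips in [("long.wav", long_clips), ("silence.wav", silence_clips)]:
    segments, info = model.transcribe(path, clip_timestamps=clips)
    for segment in segments:
        print(path, segment.start, segment.end, segment.text)
```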
Files:
https://drive.google.com/file/d/1mfDWNDmcZvW9M-zMaNWFy9tVUOdUrZnA/view?usp=sharing
Environment:
faster-whisper==1.0.2
onnxruntime-gpu==1.17.1
torch==2.2.2+cu121
torchaudio==2.2.2+cu121
torchvision==0.17.2+cu121
Python 3.10.12
CUDA 12.1
requirements.txt
Expected output:
The provided clip_timestamps are used. This is the behaviour observed when silence.wav is run by itself.
Actual output:
The model transcribes silence (the audio only starts at 180 s), completely disregarding the given clip_timestamps. This is the same audio file as above; the only difference is that another audio file was transcribed before it.
I feel this issue is worth raising, as running the same audio should give exactly the same outputs every time, which is inexplicably not happening here. Additionally, the clip_timestamps parameter should cause the model to use the given timestamps when transcribing. Any advice or help as to why this happens and how it can be addressed would be greatly appreciated.