clip_timestamps does not work across multiple files [faster-whisper 1.0.2] #839

Closed
nonnoxer opened this issue May 15, 2024 · 4 comments

@nonnoxer
Contributor

I was testing faster-whisper with two very long audio files (about 30 minutes each). Both were generated from the GigaSpeech dataset: long.wav is many audio files concatenated into one continuous file, and silence.wav is audio files joined with 3-minute stretches of complete silence in between. I used Silero VAD externally to generate speech timestamps for each file, then passed these timestamps through the clip_timestamps parameter.

When I tested this on silence.wav alone, the generated transcript was as expected. However, when I ran the model on long.wav first (where there was some hallucination) and then on silence.wav, the silence.wav transcription was completely hallucinated and the provided clip_timestamps were ignored.

Code:

from faster_whisper import WhisperModel
from faster_whisper.audio import decode_audio
import os
import torch


DATA_DIR = "test_data/wav/processed"
OUTPUT_DIR = "test_data/output/asr"
FS = 16000

model = WhisperModel("base")

vad_parameters = {
    "threshold": 0.28,
    "min_speech_duration_ms": 250,
    "max_speech_duration_s": 10,
    "min_silence_duration_ms": 100,
    "window_size_samples": 1536,
    "speech_pad_ms": 30
}

vad, utils = torch.hub.load("snakers4/silero-vad", model="silero_vad", onnx=True)
(get_speech_timestamps,
save_audio,
read_audio,
VADIterator,
collect_chunks) = utils

for file_name in sorted(os.listdir(DATA_DIR)):  # sorted so long.wav is processed before silence.wav
    output_file = os.path.splitext(file_name)[0] + ".txt"
    output_path = os.path.join(OUTPUT_DIR, output_file)

    file_path = os.path.join(DATA_DIR, file_name)
    
    # clip_timestamps generated here
    audio = decode_audio(file_path)
    speech_timestamps = get_speech_timestamps(torch.tensor(audio), vad, sampling_rate=FS, **vad_parameters)
    clip_timestamps_list = []
    for entry in speech_timestamps:
        clip_timestamps_list.append(str(entry["start"] / FS))
        clip_timestamps_list.append(str(entry["end"] / FS))
    clip_timestamps = ",".join(clip_timestamps_list)
    print(clip_timestamps_list)

    with open(output_path, "w") as f:
        segments, info = model.transcribe(
            audio,
            beam_size=5,
            vad_filter=False,
            clip_timestamps=clip_timestamps
        )

        for segment in segments:
            # Strip surrounding whitespace before writing the segment text
            segment_text = segment.text.strip()
            f.write(f"{segment.start:.2f}\t{segment.end:.2f}\t{segment_text}\n")

Files:
https://drive.google.com/file/d/1mfDWNDmcZvW9M-zMaNWFy9tVUOdUrZnA/view?usp=sharing

Environment:
faster-whisper==1.0.2
onnxruntime-gpu==1.17.1
torch==2.2.2+cu121
torchaudio==2.2.2+cu121
torchvision==0.17.2+cu121

python 3.10.12
cuda 12.1
requirements.txt

Expected output:
clip_timestamps are used. This is what happens when silence.wav is transcribed by itself.
Screenshot 2024-05-15 160611

Actual output:
The model transcribes silence (audio only starts at 180s), completely disregarding the given clip_timestamps. This is the same audio file as above; the only difference is that another audio file was transcribed before it.
Screenshot 2024-05-15 160719

I feel this issue is worth raising because transcribing the same audio with the same settings should give the same output every time, which is not happening here. Additionally, the clip_timestamps parameter should make the model use the given timestamps when transcribing. Any advice on why this happens and how it can be addressed would be greatly appreciated.

@nonnoxer
Contributor Author

I tried to examine how the relevant variables are passed through the source code in transcribe.py by printing the lengths of clip_timestamps (the argument passed to transcribe()), options.clip_timestamps, and seek_clips:

long.wav
Len clip_timestamps list 714
Len clip_timestamps passed to transcribe() 5894 type <class 'str'>
Len options.clip_timestamps before split 5894 type <class 'str'>
Len options.clip_timestamps after split 714 type <class 'list'>
Len seek_clips 357
silence.wav
Len clip_timestamps list 26
Len clip_timestamps passed to transcribe() 214 type <class 'str'>
Len options.clip_timestamps before split 714 type <class 'list'>
Len options.clip_timestamps after split 714 type <class 'list'>
Len seek_clips 357

It seems that on the second file, options.clip_timestamps is not updated from the clip_timestamps argument passed to transcribe(): it still holds the 714-entry list from long.wav, so the wrong seek_clips are used.
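For context, the drop from 714 timestamps to 357 seek_clips is what one would expect if the timestamps are paired into (start, end) clips. Below is a minimal sketch of that pairing. It is not copied from transcribe.py: the helper name `to_seek_clips` is hypothetical, and the value of 100 feature frames per second is an assumption.

```python
from typing import List, Tuple

FRAMES_PER_SECOND = 100  # assumption: 16 kHz audio with a 160-sample hop


def to_seek_clips(clip_timestamps: str, content_frames: int) -> List[Tuple[int, int]]:
    """Pair comma-separated offsets (in seconds) into (start_frame, end_frame) clips."""
    timestamps = [float(ts) for ts in clip_timestamps.split(",")] if clip_timestamps else []
    seek_points = [round(ts * FRAMES_PER_SECOND) for ts in timestamps]
    if not seek_points:
        # No timestamps given: start from the beginning of the audio.
        seek_points.append(0)
    if len(seek_points) % 2 == 1:
        # A trailing start without an end runs to the end of the audio.
        seek_points.append(content_frames)
    return list(zip(seek_points[::2], seek_points[1::2]))


clips = to_seek_clips("180.0,185.5,190.0", content_frames=20000)
print(clips)
```

Under this pairing, the 26 timestamps generated for silence.wav would give 13 clips, not the 357 reported above, which is consistent with stale values from long.wav being reused.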

nonnoxer added a commit to nonnoxer/faster-whisper that referenced this issue May 16, 2024
Changed the code to update the options object instead of the TranscriptionOptions class, which was likely the cause of the unexpected behaviour
@nonnoxer
Contributor Author

I was able to make a simple fix for this problem and created pull request #842.

nonnoxer added a commit to nonnoxer/faster-whisper that referenced this issue May 16, 2024
@nonnoxer changed the title from "clip_timestamps does not work, cross audio hallucination [faster-whisper 1.0.2]" to "clip_timestamps does not work across multiple files [faster-whisper 1.0.2]" on May 16, 2024
@trungkienbkhn
Collaborator

@nonnoxer, thanks for your PR. I merged it.

@liwangd

liwangd commented Sep 11, 2024

For future reference, the issue was caused by setting NamedTuple fields at the class level.

Once a NamedTuple field is assigned at the class level, every instance is affected, whether it was created before or after the assignment: accessing instance.field always returns the class-level value.
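A minimal sketch of this behaviour, under stated assumptions: `Options` is a hypothetical one-field stand-in for TranscriptionOptions, and `buggy` and `fixed` are illustrative helpers, not the library's code. The `fixed` variant uses `_replace` to build a new instance, which is one way to avoid touching the class; the merged PR may differ in detail.

```python
from typing import List, NamedTuple, Union


class Options(NamedTuple):
    # Hypothetical stand-in for TranscriptionOptions, reduced to one field.
    clip_timestamps: Union[str, List[float]]


def parse(text: str) -> List[float]:
    return [float(ts) for ts in text.split(",")] if text else []


def fixed(options: Options) -> Options:
    # Build a new instance; the class itself is left untouched.
    if isinstance(options.clip_timestamps, str):
        options = options._replace(clip_timestamps=parse(options.clip_timestamps))
    return options


def buggy(options: Options) -> Options:
    # Assigning on the class replaces the field accessor for every instance.
    if isinstance(options.clip_timestamps, str):
        Options.clip_timestamps = parse(options.clip_timestamps)
    return options


first = Options("0.0,1.5,3.0,4.5")   # plays the role of long.wav
second = Options("180.0,185.0")      # plays the role of silence.wav

fixed_second = fixed(second).clip_timestamps  # parsed from second's own string

buggy_first = buggy(first).clip_timestamps
buggy_second = buggy(second).clip_timestamps  # first file's list leaks through

print(fixed_second)
print(buggy_second)
print(second[0])  # the tuple still stores the original string
```

After `buggy(first)` runs, attribute access on any instance resolves to the class-level list, so the `isinstance(..., str)` check fails for the second file and its own timestamps are never parsed. Indexing the tuple (`second[0]`) still shows the string that was actually passed in.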

