diff --git a/README.md b/README.md
index 3dd9f01..fbc69f6 100644
--- a/README.md
+++ b/README.md
@@ -13,13 +13,18 @@ Multilingual Automatic Speech Recognition with word-level timestamps and confide
 * [Usage](#usage)
   * [Python](#python)
   * [Command line](#command-line)
+  * [Utility Functions](#utility-functions)
   * [Plot of word alignment](#plot-of-word-alignment)
   * [Example output](#example-output)
-  * [Options that may improve results](#options-that-may-improve-results)
-    * [Accurate Whisper transcription](#accurate-whisper-transcription)
-    * [Running Voice Activity Detection (VAD) before sending to Whisper](#running-voice-activity-detection-vad-before-sending-to-whisper)
-    * [Detecting disfluencies](#detecting-disfluencies)
-* [Acknowlegment](#acknowlegment)
+* [API Reference](#api-reference)
+  * [Main Transcription Function](#main-transcription-function)
+  * [Utility Functions](#utility-functions-1)
+  * [File Writing Functions](#file-writing-functions)
+* [Options that may improve results](#options-that-may-improve-results)
+  * [Accurate Whisper transcription](#accurate-whisper-transcription)
+  * [Running Voice Activity Detection (VAD) before sending to Whisper](#running-voice-activity-detection-vad-before-sending-to-whisper)
+  * [Detecting disfluencies](#detecting-disfluencies)
+* [Acknowledgment](#acknowledgment)
 * [Citations](#citations)
 
 ## Description
@@ -141,7 +146,7 @@ Besides, the default decoding options are different to favour efficient decoding
 There are also additional options related to word alignement.
 
 In general, if you import `whisper_timestamped` instead of `whisper` in your Python script and use `transcribe(model, ...)` instead of `model.transcribe(...)`, it should do the job:
-```
+```python
 import whisper_timestamped as whisper
 
 audio = whisper.load_audio("AUDIO.wav")
@@ -155,7 +160,7 @@ print(json.dumps(result, indent = 2, ensure_ascii = False))
 ```
 
 Note that you can use a finetuned Whisper model from HuggingFace or a local folder by using the `load_model` method of `whisper_timestamped`. For instance, if you want to use [whisper-large-v2-nob](https://huggingface.co/NbAiLab/whisper-large-v2-nob), you can simply do the following:
-```
+```python
 import whisper_timestamped as whisper
 
 model = whisper.load_model("NbAiLab/whisper-large-v2-nob", device="cpu")
@@ -197,6 +202,30 @@ Note that you can use a fine-tuned Whisper model from HuggingFace or a local fol
 whisper_timestamped --model NbAiLab/whisper-large-v2-nob <...>
 ```
 
+### Utility Functions
+
+In addition to the main `transcribe` function, whisper-timestamped provides some utility functions:
+
+#### `remove_non_speech`
+
+Remove non-speech segments from audio using Voice Activity Detection (VAD).
+
+```python
+from whisper_timestamped import remove_non_speech
+
+audio_speech, segments, convert_timestamps = remove_non_speech(audio, method="silero")
+```
+
+#### `load_model`
+
+Load a Whisper model from a given name or path, including support for fine-tuned models from HuggingFace.
+
+```python
+from whisper_timestamped import load_model
+
+model = load_model("NbAiLab/whisper-large-v2-nob", device="cpu")
+```
+
 ### Plot of word alignment
 
 Note that you can use the `plot_word_alignment` option of the `whisper_timestamped.transcribe()` Python function or the `--plot` option of the `whisper_timestamped` CLI to see the word alignment for each segment.
@@ -309,11 +338,354 @@ If the language is not specified (e.g. without option `--language fr` in the CLI
 }
 ```
 
-### Options that may improve results
+## API Reference
+
+### Main Transcription Function
+
+#### `transcribe_timestamped(model, audio, **kwargs)`
+
+Transcribe audio using a Whisper model and compute word-level timestamps.
+
+##### Parameters:
+
+- `model`: Whisper model instance
+  The Whisper model to use for transcription.
+
+- `audio`: Union[str, np.ndarray, torch.Tensor]
+  The path to the audio file to transcribe, or the audio waveform as a NumPy array or PyTorch tensor.
+
+- `language`: str, optional (default: None)
+  The language of the audio. If None, language detection will be performed.
+
+- `task`: str, default "transcribe"
+  The task to perform: either "transcribe" for speech recognition or "translate" for translation to English.
+
+- `vad`: Union[bool, str, List[Tuple[float, float]]], optional (default: False)
+  Whether to use Voice Activity Detection (VAD) to remove non-speech segments. Can be:
+  - True/False: Enable/disable VAD (uses Silero VAD by default)
+  - "silero": Use Silero VAD
+  - "auditok": Use Auditok VAD
+  - List of (start, end) timestamps: Explicitly specify speech segments
+
+- `detect_disfluencies`: bool, default False
+  Whether to detect and mark disfluencies (hesitations, filler words, etc.) in the transcription.
+
+- `trust_whisper_timestamps`: bool, default True
+  Whether to rely on Whisper's timestamps for initial segment positions.
+
+- `compute_word_confidence`: bool, default True
+  Whether to compute confidence scores for words.
+
+- `include_punctuation_in_confidence`: bool, default False
+  Whether to include punctuation probability when computing word confidence.
+
+- `refine_whisper_precision`: float, default 0.5
+  How much to refine Whisper segment positions, in seconds. Must be a multiple of 0.02.
+
+- `min_word_duration`: float, default 0.02
+  Minimum duration of a word, in seconds.
+
+- `plot_word_alignment`: bool or str, default False
+  Whether to plot the word alignment for each segment. If a string, save the plot to the given file.
+
+- `word_alignement_most_top_layers`: int, optional (default: None)
+  Number of top layers to use for word alignment. If None, use all layers.
+
+- `remove_empty_words`: bool, default False
+  Whether to remove words with no duration occurring at the end of segments.
+
+- `naive_approach`: bool, default False
+  Force the naive approach of decoding twice (once for transcription, once for alignment).
+
+- `use_backend_timestamps`: bool, default False
+  Whether to use word timestamps provided by the backend (openai-whisper or transformers), instead of the ones computed by more complex heuristics of whisper-timestamped.
+
+- `temperature`: Union[float, List[float]], default 0.0
+  Temperature for sampling. Can be a single value or a list for fallback temperatures.
+
+- `compression_ratio_threshold`: float, default 2.4
+  If the gzip compression ratio is above this value, treat the decoding as failed.
+
+- `logprob_threshold`: float, default -1.0
+  If the average log probability is below this value, treat the decoding as failed.
+
+- `no_speech_threshold`: float, default 0.6
+  Probability threshold for <|nospeech|> tokens.
+
+- `condition_on_previous_text`: bool, default True
+  Whether to provide the previous output as a prompt for the next window.
+
+- `initial_prompt`: str, optional (default: None)
+  Optional text to provide as a prompt for the first window.
+
+- `suppress_tokens`: str, default "-1"
+  Comma-separated list of token ids to suppress during sampling.
+
+- `fp16`: bool, optional (default: None)
+  Whether to perform inference in fp16 precision.
+
+- `verbose`: bool or None, default False
+  Whether to display the text being decoded to the console. If True, displays all details. If False, displays minimal details. If None, does not display anything.
+
+##### Returns:
+
+A dictionary containing:
+- `text`: str - The full transcription text
+- `segments`: List[dict] - List of segment dictionaries, each containing:
+  - `id`: int - Segment ID
+  - `seek`: int - Start position of the decoding window in the audio (in mel-spectrogram frames, i.e. 10 ms steps)
+  - `start`: float - Start time of the segment (in seconds)
+  - `end`: float - End time of the segment (in seconds)
+  - `text`: str - Transcribed text for the segment
+  - `tokens`: List[int] - Token IDs for the segment
+  - `temperature`: float - Temperature used for this segment
+  - `avg_logprob`: float - Average log probability of the segment
+  - `compression_ratio`: float - Compression ratio of the segment
+  - `no_speech_prob`: float - Probability of no speech in the segment
+  - `confidence`: float - Confidence score for the segment
+  - `words`: List[dict] - List of word dictionaries, each containing:
+    - `start`: float - Start time of the word (in seconds)
+    - `end`: float - End time of the word (in seconds)
+    - `text`: str - The word text
+    - `confidence`: float - Confidence score for the word (if computed)
+- `language`: str - Detected or specified language
+- `language_probs`: dict - Language detection probabilities (if applicable)
+
+##### Exceptions:
+
+- `RuntimeError`: If the VAD method is not properly installed or configured.
+- `ValueError`: If the `refine_whisper_precision` is not a positive multiple of 0.02.
+- `AssertionError`: If the audio duration is shorter than expected or if there are inconsistencies in the number of segments.
+
+##### Notes:
+
+- The function uses the Whisper model to transcribe the audio and then performs additional processing to generate word-level timestamps and confidence scores.
+- The VAD feature can significantly improve transcription accuracy by removing non-speech segments, but it requires additional dependencies (e.g., torchaudio and onnxruntime for Silero VAD).
+- The `naive_approach` parameter can be useful for debugging or when dealing with specific audio characteristics, but it may be slower than the default approach.
+- When `use_efficient_by_default` is True, some parameters like `best_of`, `beam_size`, and `temperature_increment_on_fallback` are set to None by default for more efficient processing.
+- The function supports both OpenAI Whisper and Transformers backends, which can be specified when loading the model.
+
+### Utility Functions
+
+#### `remove_non_speech(audio, **kwargs)`
+
+Remove non-speech segments from audio using Voice Activity Detection (VAD).
+
+##### Parameters:
+
+- `audio`: torch.Tensor
+  Audio data as a PyTorch tensor.
+
+- `use_sample`: bool, default False
+  If True, return start and end times in samples instead of seconds.
+
+- `min_speech_duration`: float, default 0.1
+  Minimum duration of a speech segment in seconds.
+
+- `min_silence_duration`: float, default 1
+  Minimum duration of a silence segment in seconds.
+
+- `dilatation`: float, default 0.5
+  How much to enlarge each speech segment detected by VAD, in seconds.
+
+- `sample_rate`: int, default 16000
+  Sample rate of the audio.
+
+- `method`: str or List[Tuple[float, float]], default "silero"
+  VAD method to use. Can be "silero", "auditok", or a list of timestamps.
+
+- `avoid_empty_speech`: bool, default False
+  If True, avoid returning an empty speech segment.
+
+- `plot`: Union[bool, str], default False
+  If True, plot the VAD results. If a string, save the plot to the given file.
+
+##### Returns:
+
+A tuple containing:
+1. torch.Tensor: Audio with non-speech segments removed
+2. List[Tuple[float, float]]: List of (start, end) timestamps for speech segments, in seconds (or in samples if `use_sample` is True)
+3. Callable: Function to convert timestamps from the new audio to the original audio
+
+##### Exceptions:
+
+- `ImportError`: If the required VAD library (e.g., auditok) is not installed.
+- `ValueError`: If an invalid VAD method is specified.
+
+##### Notes:
+
+- This function is particularly useful for improving transcription accuracy by removing silence and non-speech segments from the audio before processing.
+- The choice of VAD method can affect the accuracy and speed of the non-speech removal process.
+
+#### `load_model(name, device=None, backend="openai-whisper", download_root=None, in_memory=False)`
+
+Load a Whisper model from a given name or path.
+
+##### Parameters:
+
+- `name`: str
+  Name of the model or path to the model. Can be:
+  - OpenAI Whisper identifier: "large-v3", "medium.en", etc.
+  - HuggingFace identifier: "openai/whisper-large-v3", "distil-whisper/distil-large-v2", etc.
+  - File name: "path/to/model.pt", "path/to/model.ckpt", "path/to/model.bin"
+  - Folder name: "path/to/folder"
+
+- `device`: Union[str, torch.device], optional (default: None)
+  Device to use. If None, use CUDA if available, otherwise CPU.
+
+- `backend`: str, default "openai-whisper"
+  Backend to use. Either "transformers" or "openai-whisper".
+
+- `download_root`: str, optional (default: None)
+  Root folder to download the model to. If None, use the default download root.
+
+- `in_memory`: bool, default False
+  Whether to preload the model weights into host memory.
+
+##### Returns:
+
+The loaded Whisper model.
+
+##### Exceptions:
+
+- `ValueError`: If an invalid backend is specified.
+- `ImportError`: If the transformers library is not installed when using the "transformers" backend.
+- `RuntimeError`: If the model cannot be found or downloaded from the specified source.
+- `OSError`: If there are issues reading the model file or accessing the specified path.
+
+##### Notes:
+
+- When using a local model file, ensure that the file format is compatible with the selected backend.
+- For HuggingFace models, an internet connection may be required to download the model if it's not already cached locally.
+- The function supports loading both OpenAI Whisper and Transformers models, providing flexibility in model selection.
+
+#### `get_alignment_heads(model, max_top_layer=3)`
+
+Get the alignment heads for the given model.
+
+##### Parameters:
+
+- `model`: Whisper model instance
+  The Whisper model for which to retrieve alignment heads.
+
+- `max_top_layer`: int, default 3
+  Maximum number of top layers to consider for alignment heads.
+
+##### Returns:
+
+A sparse tensor representing the alignment heads.
+
+##### Notes:
+
+- This function is used internally to optimize the word alignment process.
+- The alignment heads are model-specific and are used to improve the accuracy of word-level timestamps.
+
+### File Writing Functions
+
+The following functions are available for writing transcripts to various file formats:
+
+#### `write_csv(transcript, file, sep=",", text_first=True, format_timestamps=None, header=False)`
+
+Write transcript data to a CSV file.
+
+##### Parameters:
+
+- `transcript`: List[dict]
+  List of transcript segment dictionaries.
+
+- `file`: file-like object
+  File to write the CSV data to.
+
+- `sep`: str, default ","
+  Separator to use in the CSV file.
+
+- `text_first`: bool, default True
+  If True, write text column before start/end times.
+
+- `format_timestamps`: Callable, optional (default: None)
+  Function to format timestamp values.
+
+- `header`: Union[bool, List[str]], default False
+  If True, write default header. If a list, use as custom header.
+
+##### Exceptions:
+
+- `IOError`: If there are issues writing to the specified file.
+- `ValueError`: If the transcript data is not in the expected format.
+
+##### Notes:
+
+- This function is useful for exporting transcription results in a tabular format for further analysis or processing.
+- The `format_timestamps` parameter allows for custom formatting of timestamp values, which can be helpful for specific use cases or data analysis requirements.
+
+#### `write_srt(transcript, file)`
+
+Write transcript data to an SRT (SubRip Subtitle) file.
+
+##### Parameters:
+
+- `transcript`: List[dict]
+  List of transcript segment dictionaries.
+
+- `file`: file-like object
+  File to write the SRT data to.
+
+##### Exceptions:
+
+- `IOError`: If there are issues writing to the specified file.
+- `ValueError`: If the transcript data is not in the expected format.
+
+##### Notes:
+
+- SRT is a widely supported subtitle format, making this function useful for creating subtitles for videos based on the transcription.
+
+#### `write_vtt(transcript, file)`
+
+Write transcript data to a VTT (WebVTT) file.
+
+##### Parameters:
+
+- `transcript`: List[dict]
+  List of transcript segment dictionaries.
+
+- `file`: file-like object
+  File to write the VTT data to.
+
+##### Exceptions:
+
+- `IOError`: If there are issues writing to the specified file.
+- `ValueError`: If the transcript data is not in the expected format.
+
+##### Notes:
+
+- WebVTT is a W3C standard for displaying timed text in connection with HTML5, making this function useful for web-based applications.
+
+#### `write_tsv(transcript, file)`
+
+Write transcript data to a TSV (Tab-Separated Values) file.
+
+##### Parameters:
+
+- `transcript`: List[dict]
+  List of transcript segment dictionaries.
+
+- `file`: file-like object
+  File to write the TSV data to.
+
+##### Exceptions:
+
+- `IOError`: If there are issues writing to the specified file.
+- `ValueError`: If the transcript data is not in the expected format.
+
+##### Notes:
+
+- TSV files are useful for importing transcription data into spreadsheet applications or other data analysis tools.
+
+## Options that may improve results
 
 Here are some options that are not enabled by default but might improve results.
 
-#### Accurate Whisper transcription
+### Accurate Whisper transcription
 
 As mentioned earlier, some decoding options are disabled by default to offer better efficiency. However, this can impact the quality of the transcription. To run with the options that have the best chance of providing a good transcription, use the following options.
 * In Python:
@@ -325,7 +697,7 @@ results = whisper_timestamped.transcribe(model, audio, beam_size=5, best_of=5, t
 whisper_timestamped --accurate ...
 ```
 
-#### Running Voice Activity Detection (VAD) before sending to Whisper
+### Running Voice Activity Detection (VAD) before sending to Whisper
 
 Whisper models can "hallucinate" text when given a segment without speech. This can be avoided by running VAD and gluing speech segments together before transcribing with the Whisper model. This is possible with `whisper-timestamped`.
 * In Python:
@@ -358,7 +730,7 @@ It will show the VAD results on the input audio signal as following (x-axis is t
 | :---: | :---: | :---: |
 | ![Example VAD](figs/VAD_silero_v4.0.png) | ![Example VAD](figs/VAD_silero_v3.1.png) | ![Example VAD](figs/VAD_auditok.png) |
 
-#### Detecting disfluencies
+### Detecting disfluencies
 
 Whisper models tend to remove speech disfluencies (filler words, hesitations, repetitions, etc.). Without precautions, the disfluencies that are not transcribed will affect the timestamp of the following word: the timestamp of the beginning of the word will actually be the timestamp of the beginning of the disfluencies. `whisper-timestamped` can have some heuristics to avoid this.
 * In Python:
@@ -371,8 +743,7 @@ whisper_timestamped --detect_disfluencies True ...
 ```
 
 **Important:** Note that when using these options, possible disfluencies will appear in the transcription as a special "`[*]`" word.
-
-## Acknowlegment
+## Acknowledgment
 
 * [whisper](https://github.com/openai/whisper): Whisper speech recognition (License MIT).
 * [dtw-python](https://pypi.org/project/dtw-python): Dynamic Time Warping (License GPL v3).
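
The return structure listed under `transcribe_timestamped` in the API Reference hunk can be walked with plain dictionary access. The sketch below is not part of this patch or of the library: `low_confidence_words` is a hypothetical helper, and `sample_result` is a hand-written dictionary that only mimics the documented `segments`, `words` and `confidence` fields so that the snippet runs without a model.

```python
from typing import Dict, List

# Hand-written sample shaped like the documented return value.
# This is an assumption for illustration, not real model output.
sample_result = {
    "text": " Bonjour tout le monde",
    "segments": [
        {
            "id": 0,
            "start": 0.5,
            "end": 2.1,
            "text": " Bonjour tout le monde",
            "confidence": 0.82,
            "words": [
                {"text": "Bonjour", "start": 0.5, "end": 1.0, "confidence": 0.95},
                {"text": "tout", "start": 1.0, "end": 1.3, "confidence": 0.41},
                {"text": "le", "start": 1.3, "end": 1.5, "confidence": 0.88},
                {"text": "monde", "start": 1.5, "end": 2.1, "confidence": 0.90},
            ],
        }
    ],
    "language": "fr",
}


def low_confidence_words(result: Dict, threshold: float = 0.5) -> List[Dict]:
    """Hypothetical helper: collect words whose confidence is below `threshold`."""
    flagged = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            # `confidence` is documented as present only when it was computed.
            confidence = word.get("confidence")
            if confidence is not None and confidence < threshold:
                flagged.append(
                    {
                        "segment_id": segment["id"],
                        "text": word["text"],
                        "start": word["start"],
                        "end": word["end"],
                        "confidence": confidence,
                    }
                )
    return flagged


flagged_words = low_confidence_words(sample_result, threshold=0.5)
for word in flagged_words:
    print(f'{word["start"]:.2f}-{word["end"]:.2f}s  {word["text"]}  ({word["confidence"]:.2f})')
```

With real output, `sample_result` would be replaced by the dictionary returned by `whisper_timestamped.transcribe(model, audio)`. The threshold of 0.5 is an arbitrary value chosen for illustration.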