[feat] update distil-whisper #557
Conversation
@nguyendc-systran, it would be great if you could have a look.
@metame-none, hello. How did you generate the CT2 conversion model from the HF distil-whisper model? Also, I found that the config.json file in your HF model has an alignment_heads field, but it differs from the config.json of the openai whisper large-v2 model. Can you explain how you generated it?
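(For context, a conversion like this is typically done with CTranslate2's Transformers converter. The snippet below is only an illustrative sketch; the model name, copied files, output directory, and quantization are assumptions, not details taken from this PR.)

# Illustrative sketch: convert an HF distil-whisper checkpoint to CTranslate2 format.
# Model name, copied files, output directory, and quantization are assumed values.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "distil-whisper/distil-large-v2",                           # HF model to convert (assumed)
    copy_files=["tokenizer.json", "preprocessor_config.json"],  # files faster-whisper also reads
)
converter.convert("faster-distil-whisper-large-v2", quantization="float16")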
faster_whisper/transcribe.py
Outdated
@@ -623,6 +636,10 @@ def generate_with_fallback(
        max_initial_timestamp_index = int(
            round(options.max_initial_timestamp / self.time_precision)
        )
        if options.max_new_tokens is not None:
            max_length = min(self.max_length, len(prompt) + options.max_new_tokens)
@metame-none,
Can you re-check this logic? I think we should use the max_length option as input instead of creating a new param max_new_tokens.
We can even simplify it like:
max_length = options.max_length if options.max_length is not None else self.max_length
And please rebase your branch on the latest version of master to avoid conflicts.
Thanks for your advice. I use a new max_new_tokens just to keep the same logic as distil-whisper, where the setting is:
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
torch_dtype=torch_dtype,
device=device,
)
and the default max_length seems to include the length of the prompt. So, to make the two compatible, I introduced a new param. Not sure if this is a good way, glad to hear your advice. Thanks a lot.
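(Just to make the difference concrete, a tiny sketch of how the two limits relate; the token ids and values below are made up for illustration.)

# The prompt tokens count toward max_length but not toward max_new_tokens.
prompt = [50361, 50362, 50363]             # example prompt token ids (illustrative)
max_new_tokens = 128                       # the distil-whisper pipeline setting shown above
max_length = len(prompt) + max_new_tokens  # the equivalent budget expressed as max_length
print(max_length)                          # -> 131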
I removed the new param max_new_tokens and use options.max_length instead, thanks.
@metame-none, hello.
I think we should roll back to using both the max_new_tokens and max_length params, then update this logic similar to what HF transformers does:
# If the user passes `max_new_tokens`, increase its number to account for the prompt
if kwargs.get("max_new_tokens", None) is not None:
kwargs["max_new_tokens"] += len(text_prompt_ids)
if kwargs["max_new_tokens"] >= self.config.max_target_positions:
raise ValueError(
f"The length of the sliced `prompt_ids` is {len(text_prompt_ids)}, and the `max_new_tokens` "
f"{kwargs['max_new_tokens'] - len(text_prompt_ids)}. Thus, the combined length of the sliced "
f"`prompt_ids` and `max_new_tokens` is: {kwargs['max_new_tokens']}. This exceeds the "
f"`max_target_positions` of the Whisper model: {self.config.max_target_positions}. "
"You should either reduce the length of your prompt, or reduce the value of `max_new_tokens`, "
f"so that their combined length is less that {self.config.max_target_positions}."
)
=> raise error if max_new_tokens + len(prompt) > max_length
For more info, please check this PR.
Besides, I also agree that max_length includes the length of the prompt: https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.max_length
Please take a look. Thanks.
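(A rough sketch of how that check could look inside generate_with_fallback, mirroring the diff above; this is not the merged code, just an illustration of the proposed validation.)

# Sketch of the proposed logic (fragment meant to live inside generate_with_fallback):
# derive max_length from max_new_tokens when given, then validate against the model limit.
if options.max_new_tokens is not None:
    max_length = len(prompt) + options.max_new_tokens
else:
    max_length = options.max_length if options.max_length is not None else self.max_length

if max_length > self.max_length:
    raise ValueError(
        f"The combined length of the prompt ({len(prompt)} tokens) and max_new_tokens "
        f"exceeds the model's maximum length ({self.max_length}). Reduce the prompt "
        "or the value of max_new_tokens."
    )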
@trungkienbkhn hi,
I rolled back to the version with max_new_tokens, please help to review, thanks.
Hi, at the time I converted, the ctranslate2 version I used was 3.21; the latest is 3.22, and it doesn't seem to have been tested on distil-whisper. As for config.json, the alignment_heads follows the logic in the code. I also tried to figure out what the alignment heads for distil-whisper should be, sadly with no luck. But I tested word_timestamps with this setting, and it seems to work well.
@sanchit-gandhi, thanks for your interesting repo, we are working to integrate it into faster-whisper. I'd like to clarify the alignment_heads field.
I found that this field does not exist in the generation_config.json file of your HF model, while it does exist in the original openai model. Could this affect the word-level timestamps of the transcription results?
I found two issues with faster-distil-whisper:
faster_whisper/transcribe.py
Outdated
@@ -642,6 +647,10 @@ def generate_with_fallback(
        max_initial_timestamp_index = int(
            round(options.max_initial_timestamp / self.time_precision)
        )
        if options.max_length is not None:
            max_length = min(self.max_length, len(prompt) + options.max_length)
@metame-none, hello. Why do you need to take the min in this logic? If we use a max_length greater than the default (448 for faster-whisper), will it cause negative effects? If not, I think we can make this logic more flexible by removing the min function.
You are right. I don't know if it will cause negative effects; I'll remove it.
@metame-none, for #557 (comment):
Besides, I think we should set chunk_length=15 in the preprocessor_config.json file:
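(For illustration, one way to apply that change to a converted model directory; the directory name below is a placeholder, not something defined in this PR.)

import json

# Set chunk_length to 15 seconds in the converted model's preprocessor_config.json.
config_path = "faster-distil-whisper-large-v2/preprocessor_config.json"  # placeholder path
with open(config_path) as f:
    config = json.load(f)
config["chunk_length"] = 15  # the chunk length suggested in the comment above
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)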
@trungkienbkhn for #557 (comment)
Conclusion from a few non-extensive tests:
It is likely that the distillation was done without prompting/conditioning on previous text.
Hey @trungkienbkhn, @metame-none, @Purfview, @funboarder13920! First of all, thank you for maintaining this amazing repository! It's such a valuable resource for the open-source speech community 🙌 Super excited to see how it improves over the coming months 🚀 Answering your questions sequentially below:
I think I tried that and the model was outputting empty transcriptions.
I didn't, but with the previous-text conditioning it was outputting non-stop pure hallucinations after the first chunk.
Not the case for faster-whisper. People would want to use distil because it's faster, but this effect is insignificant with faster-whisper. So with a smaller faster-whisper model you get the desired speedup.
@sanchit-gandhi Wanted to ask for some clarification here:
It was a month or two ago, prior to distil-whisper, but I tested this and found that the results posted in "insanely fast whisper" are heavily dependent on batching. If you keep the batch size at 1, I did not see transformers whisper outperforming faster-whisper. Am I missing something? Are there any other resources confirming that transformers whisper is indeed faster?
FYI, I tested my distil-whisper conversion model with the original alignment heads, but I encountered this error: This error only occurs if I set
After verifying, I found that it was stuck due to this logic:
Maybe this is a bug in ctranslate2's align function.
Does anyone have latency figures for batch size 1 faster distil-whisper on 2s audio clips (or at least an Ns-audio speed comparison between faster-whisper and faster distil-whisper)?
Hi, @trungkienbkhn, which alignment heads did you use? I tested with the default alignment heads and it seems OK. The original alignment heads for "large-v2" seem to start at layer 10, while distil-whisper only has two decoder layers. According to the author of whisper from openai, openai/whisper#1388 (comment)
So I suppose the default alignment heads should work just fine.
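(For reference, the usual default rule is to take every attention head in the second half of the decoder layers; the sketch below illustrates that rule, with distil-large-v2's layer and head counts assumed rather than taken from this PR.)

# Sketch: default alignment heads = all heads in the second half of the decoder layers.
# num_layers=2 and num_heads=20 are assumed values for distil-large-v2.
def default_alignment_heads(num_layers: int, num_heads: int) -> list:
    return [
        [layer, head]
        for layer in range(num_layers // 2, num_layers)
        for head in range(num_heads)
    ]

print(default_alignment_heads(2, 20))  # -> all 20 heads of decoder layer 1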
Hi, @SinanAkkoyun, I tested with an RTX 3090 on a 10s audio sample:
@metame-none Thank you so, so much!!
Hi, @trungkienbkhn, I added chunk_length as an input of transcribe, thanks.
Have you tried this with an initial prompt? I've been trying it out and it seems to break the transcripts when I provide a prompt. |
Hi, @metame-none. I used the original alignment heads starting from layer 10, which caused an error because the decoder only has 2 layers. I agree with you that we should use the default alignment heads. Thanks.
@trungkienbkhn please help to generate the corresponding Systran models on the Hub.
@metame-none, hello. FYI, we uploaded the Systran Faster Distil-Whisper conversion models to the Hugging Face Hub: Please verify, then update those models in this project's utils in your PR.
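(For reference, once those repos are registered in utils.py they should load like any other faster-whisper model; a small usage sketch, where the repo id, device, and compute type are assumptions.)

# Sketch: load a Systran faster-distil-whisper conversion from the Hugging Face Hub.
from faster_whisper import WhisperModel

model = WhisperModel(
    "Systran/faster-distil-whisper-large-v2",  # assumed Hub repo id
    device="cuda",
    compute_type="float16",
)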
Hi, just want to bump this comment.
hi, @ostegm |
hi, @trungkienbkhn |
README.md
Outdated
| Implementation | Precision | Beam size | Time | Gigaspeech WER |
| --- | --- | --- | --- | --- |
| distil-whisper/distil-large-v2 | fp16 | 4 | - | 10.36 |
| [faster-distil-large-v2](https://huggingface.co/metame/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
Can you please update these links according to the info in utils.py?
Thanks, I updated the README.
Thanks for your contribution, I'll merge it now.
reference issue: #533
features:
allowing the user to specify the chunk length and also the maximum generation length
usage:
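(An illustrative sketch of the intended call, assuming the transcribe parameters added in this PR; the model repo, audio path, and parameter values are placeholders.)

# Sketch: transcribe with the chunk length and generation limit introduced by this PR.
from faster_whisper import WhisperModel

model = WhisperModel("metame/faster-distil-whisper-large-v2", compute_type="float16")
segments, info = model.transcribe("audio.wav", chunk_length=15, max_new_tokens=128)
for segment in segments:
    print(segment.start, segment.end, segment.text)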
tested OK with converted models from CTranslate2 version 3.22 (Hugging Face models)