[feat] update distil-whisper #557

Merged
merged 1 commit into from
Jan 24, 2024

Conversation

metame-none
Contributor

@metame-none metame-none commented Nov 11, 2023

reference issue: #533

features: allowing the user to specify the chunk length and also the maximum generation length

usage:

from faster_whisper import WhisperModel

model_size = "distil-large-v2"
# model_size = "distil-medium.en"
model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5,
    language="en", max_new_tokens=128, condition_on_previous_text=False)

tested OK with a model converted with version 3.22 of ctranslate2 (Hugging Face models)

@Purfview
Contributor

@nguyendc-systran , would be great if you could have a look.

@trungkienbkhn
Collaborator

trungkienbkhn commented Nov 27, 2023

@metame-none , hello. How did you generate the ct2 conversion model from the hf distil-whisper model?
I tried to convert the distil-whisper model from the Hugging Face hub to ctranslate2, but the hf-to-ct2 script returns an error. More details can be found in issue-1564.
After discussing with @nguyendc-systran and @minhthuc2502, we created a new PR, pr-1565 in ctranslate2, to fix this script.

Besides, I found that the config.json file in your hf model has an alignments_head field, but it is different from the one in the config.json file of the openai whisper large-v2 model. Can you explain how you generated it?
Thanks in advance.

@@ -623,6 +636,10 @@ def generate_with_fallback(
        max_initial_timestamp_index = int(
            round(options.max_initial_timestamp / self.time_precision)
        )
        if options.max_new_tokens is not None:
            max_length = min(self.max_length, len(prompt) + options.max_new_tokens)
Collaborator

@trungkienbkhn trungkienbkhn Nov 27, 2023

@metame-none ,
Can you re-check this logic ? I think we should use max_length option as input instead of creating new param max_new_tokens

Collaborator

We can even simplify it like this:
max_length = options.max_length if options.max_length is not None else self.max_length

And please rebase your branch on the last version of master to avoid conflict.

Contributor Author

thx for your advice. I added a new max_new_tokens just to keep the same logic as distil-whisper, where the setting is

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

and the default max_length seems to include the length of the prompt. So, to make these two compatible, I introduced a new param. Not sure if this is a good way; glad to hear your advice. thx a lot.

Contributor Author

I removed the new param max_new_tokens and now use options.max_length instead, thx

Collaborator

@metame-none, hello.
I think we should roll back to using both the max_new_tokens and max_length params, then update this logic to be similar to what HF Transformers is doing:

# If the user passes `max_new_tokens`, increase its number to account for the prompt
if kwargs.get("max_new_tokens", None) is not None:
    kwargs["max_new_tokens"] += len(text_prompt_ids)
    if kwargs["max_new_tokens"] >= self.config.max_target_positions:
        raise ValueError(
            f"The length of the sliced `prompt_ids` is {len(text_prompt_ids)}, and the `max_new_tokens` "
            f"{kwargs['max_new_tokens'] - len(text_prompt_ids)}. Thus, the combined length of the sliced "
            f"`prompt_ids` and `max_new_tokens` is: {kwargs['max_new_tokens']}. This exceeds the "
            f"`max_target_positions` of the Whisper model: {self.config.max_target_positions}. "
            "You should either reduce the length of your prompt, or reduce the value of `max_new_tokens`, "
            f"so that their combined length is less that {self.config.max_target_positions}."
        )

=> raise error if max_new_tokens + len(prompt) > max_length
For more info, please check this pr.

Besides, I also agree that max_length includes the length of the prompt: https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.max_length

Please take a look. Tks.
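The rule discussed above (max_length counts the prompt, max_new_tokens does not) can be sketched as a small helper. This is a minimal illustration only, not the code merged in this PR: the function name `resolve_max_length` is hypothetical, the default of 448 is the faster-whisper default mentioned later in this thread, and the `>` comparison follows the one-line summary above (the quoted HF snippet uses `>=`).

```python
from typing import Optional


def resolve_max_length(
    prompt_len: int,
    max_new_tokens: Optional[int],
    model_max_length: int = 448,  # assumption: default cited in this thread
) -> int:
    """Return the total decoding limit, i.e. prompt tokens plus new tokens."""
    if max_new_tokens is None:
        # No per-call limit requested: fall back to the model-wide limit.
        return model_max_length

    total = prompt_len + max_new_tokens
    if total > model_max_length:
        raise ValueError(
            f"prompt length ({prompt_len}) + max_new_tokens ({max_new_tokens}) "
            f"= {total} exceeds the model limit of {model_max_length}"
        )
    return total
```

With a 4-token prompt and max_new_tokens=128, the helper returns 132; with no max_new_tokens it returns the model limit unchanged.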

Contributor Author

@trungkienbkhn hi,
I rolled back to the version with max_new_tokens, plz help to review, thx

@metame-none
Contributor Author

ctranslate2

hi, at the time I converted, the version I used was 3.21 for ctranslate2; the latest is 3.22, and it seems it wasn't tested on distil-whisper.

And for the config.json, the alignments_head follows the same logic as the code. I also tried to figure out what the alignments_head for distil-whisper should be, sadly with no luck. But I tested word_timestamps with this setting, and it seems to work well.

@trungkienbkhn
Collaborator

@sanchit-gandhi, tks for your interesting repo; we are working on integrating it into faster-whisper. I'd like to clarify the alignment_heads field.
For info about alignment_heads:

To allow the Hugging Face version of Whisper to predict word-level timestamps, a new property alignment_heads must be added to the GenerationConfig object. This is a list of [layer, head] pairs that select the cross-attention heads that are highly correlated to word-level timing.

I found that this field does not exist in the generation_config.json file of your HF model, while it does exist in that of the original openai model. Could this affect the word-level timing of the transcription results?

@metame-none
Contributor Author

I find two issues with faster-distil-whisper:

  1. the running speed of faster-distil-whisper is not as fast as I expected, especially compared with sending a numpy array to result = pipe(sample) as in link; the transformers version seems faster (tested on a 3090: roughly 300 ms for transformers vs 400 ms for faster-distil-whisper on a 10 s audio sample)
  2. faster-distil-whisper works worse in the long-audio setting (tested with link), and the param condition_on_previous_text makes a huge impact.

@@ -642,6 +647,10 @@ def generate_with_fallback(
        max_initial_timestamp_index = int(
            round(options.max_initial_timestamp / self.time_precision)
        )
        if options.max_length is not None:
            max_length = min(self.max_length, len(prompt) + options.max_length)
Collaborator

@trungkienbkhn trungkienbkhn Nov 30, 2023

@metame-none, hello. Why do you need the min in this logic? If we use max_length > default (448 for faster whisper), will it cause any significant negative effects? If not, I think we can make this logic more flexible by removing the min function.

Contributor Author

You are right, I don't know whether it will cause negative effects; I'll remove it.

@trungkienbkhn
Collaborator

@metame-none , for #557 (comment):

  1. I'm not sure about that. In my tests on both short-form and long-form audio, fw-distilled-whisper was always faster than the original fw-large-v2.
  2. I also confirm this. But if we use condition_on_previous_text=False, transcription quality improves quite a lot.

Besides, I think we should set chunk_length=15 in the preprocessor_config.json file:

To enable chunking, pass the chunk_length_s parameter to the pipeline. For Distil-Whisper, a chunk length of 15-seconds is optimal.

@metame-none
Contributor Author

@trungkienbkhn for #557 (comment)

  1. could you please share your testing code? I would like to re-check with that.
  2. Yes, condition_on_previous_text=False improves it a lot.
  3. you mean we should set chunk_length=15 as the default setting? I just copied it from the original HF repo.
    thx for your time.

@Purfview
Contributor

Purfview commented Dec 2, 2023

  1. fw-distilled is ~1.5x faster in my tests; that would be OK if transcription accuracy didn't suffer so much...
  2. why is fw-distilled broken with condition_on_previous_text=True?
  3. imho, shorter chunk = less accurate

Conclusion from a few non-extensive tests:
Instead of distill, just use a smaller fw model: much faster than distill, and with a similar reduction in transcription accuracy... [I didn't look at timestamp accuracy]

@trungkienbkhn
Collaborator

@metame-none,

  1. Below is my testing code:
    model = WhisperModel('/tmp/distil-whisper-large-v2-ct2', device="cuda")
    segments, info = model.transcribe(jfk_path, word_timestamps=True, condition_on_previous_text=False, max_length=128)
    I tried converting the large v2 model from distil-whisper and got a model similar to your HF model.
  2. Based on experience, I find that setting condition_on_previous_text=False (default is True) gives much better quality. Hope someone can find the reason.
  3. I think it's best to make chunk_length an input variable when calling the transcribe function
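To make the effect of a configurable chunk_length concrete, here is a minimal sketch of how a chunk length in seconds maps to sample ranges. It is a plain fixed-size split for illustration only: it is neither the Transformers chunking algorithm (which overlaps chunks) nor faster-whisper's internal seeking logic. The function name `chunk_bounds` is hypothetical, and the 16 kHz sample rate and 30 s default are assumptions based on Whisper's usual input format.

```python
from typing import List, Tuple


def chunk_bounds(
    num_samples: int,
    sample_rate: int = 16000,   # assumption: Whisper-style 16 kHz audio
    chunk_length_s: int = 30,   # assumption: Whisper's usual 30 s window
) -> List[Tuple[int, int]]:
    """Split [0, num_samples) into consecutive (start, end) sample ranges."""
    if sample_rate <= 0 or chunk_length_s <= 0:
        raise ValueError("sample_rate and chunk_length_s must be positive")

    chunk_size = sample_rate * chunk_length_s
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]
```

For a 40 s clip, chunk_length_s=15 yields three ranges (15 s, 15 s, 10 s), while the 30 s default yields two.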

@funboarder13920

2. why fw-distilled is broken with `condition_on_previous_text=True`?

It is likely that the distillation was done without prompting/conditioning on previous text.

@metame-none
Contributor Author

@metame-none,

  1. Below is my testing code:
    model = WhisperModel('/tmp/distil-whisper-large-v2-ct2', device="cuda")
    segments, info = model.transcribe(jfk_path, word_timestamps=True, condition_on_previous_text=False, max_length=128)
    I tried converting the large v2 model from distil-whisper and got a model similar to your HF model.
  2. Based on experience, I find that setting condition_on_previous_text=False (default is True) gives much better quality. Hope someone can find the reason.
  3. I think it's best to make chunk_length an input variable when calling the transcribe function

@trungkienbkhn

  1. instead of sending the filepath, you can try loading the audio with librosa and sending the numpy array, and do the same with the original distil model, because I found that sending an array to the transformers distil model is much faster.
  2. sounds good, I may look into that when I get time.

@sanchit-gandhi
Contributor

Hey @trungkienbkhn, @metame-none, @Purfview, @funboarder13920!

First of all, thank you for maintaining this amazing repository! It's such a valuable resource for the open-source speech community 🙌 Super excited to see how it improves over the coming months 🚀

Answering your questions sequentially below:

  1. Alignment heads: I've not looked at the specific cross-attention plots, but we should be able to use the same alignment heads as in the original large-v2 (resp. medium.en) model for distil-large-v2 (resp. distil-medium.en). c.f. Retrieve alignment_heads of a fine-tuned model openai/whisper#1733 (reply in thread)
  2. The running speed of faster-distil-whisper is not that fast as I was expected: have you benchmarked Transformers Whisper vs faster-whisper? We're consistently seeing that Transformers Whisper is faster than faster-whisper, so it's likely this trend also holds for Transformers Distil-Whisper vs faster-distil-whisper. c.f. https://github.com/Vaibhavs10/insanely-fast-whisper#insanely-fast-whisper
  3. condition_on_previous_text: the model is trained with conditioning on previous context. However, not to the same proportion as the original Whisper model. It's also very hard to maintain the same distribution of previous context tokens as the original Whisper, since we don't have access to the original dataset details 😅 We tried our best to implement this using our intuition, but it might well be that it doesn't perform as well as the original Whisper model with prev text conditioning. Has anyone evaluated the word error rate difference with / without using prev text conditioning? Would be curious to see what the difference is, and possibly provide some insight as to how we can improve this!
  4. Chunk size 15s: this was set using the Transformers chunking algorithm. For your sequential long-form algorithm, it might well be that 30s is sufficient. This should probably be tuned by evaluating over a corpus of long-form audio and tuning the hyper parameter (as we did for our chunked algorithm)
  5. Instead of distill just use smaller fw model: Distil-Whisper should be significantly more performant than smaller checkpoints, with faster latency! We found this held across the board using our Transformers chunking algorithm.
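For the open question in point 3 about word error rate with and without previous-text conditioning, a minimal WER sketch is shown below. It assumes whitespace tokenization and no text normalization, which real evaluations usually add; the function name `word_error_rate` is hypothetical and this is not the evaluation code used by any of the projects discussed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

Running the same audio once with condition_on_previous_text=True and once with False, then comparing the two WER values against a shared reference transcript, would give the comparison asked for above.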

@Purfview
Contributor

Purfview commented Dec 5, 2023

@sanchit-gandhi

we should be able to use the same alignment heads as in the original

I think I tried that and the model was outputting empty transcriptions.

Has anyone evaluated the word error rate difference with / without using prev text conditioning?

I didn't, but with the prev text conditioning, after the first chunk it was outputting non-stop pure hallucinations.

5... Distil-Whisper should be significantly more performant than smaller checkpoints, with faster latency!

Not the case for faster-whisper. People would want to use distill because it's faster, but this effect is insignificant with fw. So with a smaller fw model you get the wanted speedup.
Frankly, I just quickly tested "fw-distil-medium.en", "fw-small.en", "fw-medium.en". Btw, I don't remember if conditioning was enabled for the fw non-distill models; anyway, that would be a fair test.

@ostegm

ostegm commented Dec 5, 2023

@sanchit-gandhi Wanted to ask for some clarification here:

  1. The running speed of faster-distil-whisper is not that fast as I was expected: have you benchmarked Transformers Whisper vs faster-whisper? We're consistently seeing that Transformers Whisper is faster than faster-whisper, so it's likely this trend also holds for Transformers Distil-Whisper vs faster-distil-whisper. c.f. https://github.com/Vaibhavs10/insanely-fast-whisper#insanely-fast-whisper

It was a month or two ago, prior to distill whisper, but I tested this and found that the results posted in "insanely fast whisper" are heavily dependent on batching. If you keep batch size at 1 I did not see transformers whisper outperforming faster-whisper. Am I missing something? Are there any other resources confirming that transformers-whisper is indeed faster?

@trungkienbkhn
Collaborator

we should be able to use the same alignment heads as in the original

FYI, I tested my distil-whisper conversion model with the original alignment heads. But I encountered this error:

Screenshot from 2023-12-12 14-33-25

This error only occurs if I set word_timestamps=True :

model = WhisperModel('distil-large-v2', device='cuda')
segments, info = model.transcribe(jfk_path, condition_on_previous_text=False, word_timestamps=True)

After verifying, I found that it was stuck due to this logic:

result = self.model.align(
    encoder_output,
    tokenizer.sot_sequence,
    [text_tokens],
    num_frames,
    median_filter_width=median_filter_width,
)[0]

Maybe this is a bug in the align function of ctranslate2.

@SinanAkkoyun

SinanAkkoyun commented Dec 16, 2023

Does anyone have latency figures for batch size 1 faster distill whisper on 2s audio clips (or at least an Ns audio speed comparison between faster whisper and faster distill whisper)?

@metame-none
Contributor Author

we should be able to use the same alignment heads as in the original

FYI, I tested my distil-whisper conversion model with the original alignment heads. But I encountered this error:

Hi, @trungkienbkhn, which alignment heads did you use? I tested with the default alignment heads and it seems OK. Also, the original alignment heads for "large-v2" seem to start at layer 10, while distil-whisper only has two decoder layers.

According to the author of whisper from openai, openai/whisper#1388 (comment)

The heads were not specifically designed or constrained to be monotonically aligned, but some heads in the cross attention layers naturally learned to have the attention weights matching the time alignment. The _ALIGNMENT_HEADS masks are obtained post-training by rather manually selecting the heads that have clean alignment patterns.

So I suppose, the default alignment heads should work just fine.

@metame-none
Contributor Author

metame-none commented Jan 2, 2024

Does anyone have latency figures on batch size 1 faster distill whisper on 2s audio clips (or at least Ns audio speed comparison between faster whisper and faster distill whisper)

hi, @SinanAkkoyun

I tested on an RTX 3090 with a 10 s audio sample: faster-distill-large-v2 takes 430 ms, faster-whisper-large-v2 takes 960 ms, and faster-whisper-medium-en takes 710 ms.
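Latency figures like these depend on how the call is timed. The following is a minimal sketch of a timing helper, not the script used for the numbers above; the name `time_call` and the warmup and repeat defaults are illustrative assumptions.

```python
import statistics
import time
from typing import Callable, List


def time_call(fn: Callable[[], object], warmup: int = 1, repeats: int = 5) -> float:
    """Return the median wall-clock latency of fn() in milliseconds."""
    # Warmup runs absorb one-off costs such as model loading or GPU init.
    for _ in range(warmup):
        fn()

    samples_ms: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples_ms)
```

When timing faster-whisper, note that `model.transcribe` returns a lazy generator of segments, so the timed callable should consume it, for example `lambda: list(model.transcribe(audio)[0])`; otherwise only the setup work is measured.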

@SinanAkkoyun

@metame-none Thank you so so much!!

@metame-none
Contributor Author

  1. I think it's best to put chunk_length as an input variable when calling the transcribe function

hi, @trungkienbkhn , I added the chunk_length as an input of transcribe, thx.

@ostegm

ostegm commented Jan 4, 2024

Have you tried this with an initial prompt? I've been trying it out and it seems to break the transcripts when I provide a prompt.

@trungkienbkhn
Collaborator

what the alignment heads you use?

Hi, @metame-none. I used the original alignment heads starting from layer 10, which caused an error because the decoder only has 2 layers. I agree with you that we should use default alignment heads. Tks.
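The mismatch described here (heads taken from a deeper decoder than the distilled model has) can be caught before calling align with a simple bounds check. This is a hedged sketch: `invalid_alignment_heads` is a hypothetical helper, the head pairs in the usage note are illustrative rather than the full large-v2 list, and the layer and head counts are assumptions supplied by the caller.

```python
from typing import Iterable, List, Tuple


def invalid_alignment_heads(
    heads: Iterable[Tuple[int, int]],
    num_decoder_layers: int,
    num_heads: int,
) -> List[Tuple[int, int]]:
    """Return the (layer, head) pairs that fall outside the decoder's shape."""
    return [
        (layer, head)
        for layer, head in heads
        if not (0 <= layer < num_decoder_layers and 0 <= head < num_heads)
    ]
```

For example, with illustrative pairs such as (10, 12) and (13, 17), a 2-layer decoder rejects both, while a 32-layer, 20-head decoder (assumed shape for large-v2) accepts both.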

@nguyendc-systran
Collaborator

@trungkienbkhn please help to generate corresponding Systran models in the HUB.
@metame-none can you please add a note in the readme regarding impact of the "condition_on_previous_text=False" option when using the distil model?
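A README note on this option could be accompanied by a usage sketch along the following lines. The option values are taken from the usage shown in the PR description; the helper names `distil_transcribe_options` and `transcribe_with_distil` are hypothetical, and the sketch assumes faster-whisper is installed and a CUDA device is available.

```python
from typing import Any, Dict, List, Tuple


def distil_transcribe_options(max_new_tokens: int = 128) -> Dict[str, Any]:
    """Options from the PR description, with previous-text conditioning off."""
    return {
        "beam_size": 5,
        "language": "en",
        "max_new_tokens": max_new_tokens,
        # Reported in this thread to improve quality a lot for distil models.
        "condition_on_previous_text": False,
    }


def transcribe_with_distil(
    audio_path: str, model_size: str = "distil-large-v2"
) -> Tuple[List[str], Any]:
    """Transcribe one file and return (segment texts, transcription info)."""
    # Imported lazily so the options helper works without faster-whisper.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, info = model.transcribe(audio_path, **distil_transcribe_options())
    # segments is a generator; iterating it runs the actual transcription.
    return [segment.text for segment in segments], info
```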

@trungkienbkhn
Collaborator

@metame-none , hello. FYI, we uploaded the Systran Faster Distil-Whisper conversion models to the HuggingFace hub:

Please verify them, then update those models in this project's utils in your pr.
And could you squash your commits? The logic now looks good to me. Tks.

@ostegm

ostegm commented Jan 19, 2024

Have you tried this with an initial prompt? I've been trying it out and it seems to break the transcripts when I provide a prompt.

Hi, just want to bump this comment.

@metame-none
Contributor Author

Have you tried this with an initial prompt? I've been trying it out and it seems to break the transcripts when I provide a prompt.

hi, @ostegm
Empirically, condition_on_previous_text=True degrades performance on long audio, and I think the initial prompt will likely have the same problem, since the two are logically the same.

@metame-none
Contributor Author

hi, @trungkienbkhn
I updated utils.py and the readme with the Systran models, and tested OK.
btw, the alignment heads for the small.en model are wrong in distil-whisper/distil-small.en, so I made a PR in distil-small.en. thx

@metame-none metame-none force-pushed the distil-whisper branch 2 times, most recently from 9294673 to ade1476 Compare January 20, 2024 08:26
README.md Outdated
| Implementation | Precision | Beam size | Time | Gigaspeech WER |
| --- | --- | --- | --- | --- |
| distil-whisper/distil-large-v2 | fp16 | 4 | - | 10.36 |
| [faster-distil-large-v2](https://huggingface.co/metame/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
Collaborator

@nguyendc-systran nguyendc-systran Jan 23, 2024

Can you please update these links according to info in utils.py

Contributor Author

Can you please update these links according to info in utils.py

thx, I updated the readme.

Collaborator

Thanks for your contribution. I'm merging it now.
