[feat] update distil-whisper #557
Conversation
@nguyendc-systran, it would be great if you could have a look.
@metame-none, hello. How did you generate the CT2 conversion model from the HF distil-whisper model? Also, I found that the config.json file in your HF model has an alignment_heads field, but it differs from the config.json of the openai whisper large-v2 model. Can you explain how you generated it?
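(For context, a conversion like this is typically done with CTranslate2's Transformers converter. The snippet below is only an illustrative sketch; the model name, copied files, output directory, and quantization are assumptions, not details taken from this PR.)

# Illustrative sketch: convert an HF distil-whisper checkpoint to CTranslate2 format.
# Model name, copied files, output directory, and quantization are assumed values.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "distil-whisper/distil-large-v2",                           # HF model to convert (assumed)
    copy_files=["tokenizer.json", "preprocessor_config.json"],  # files faster-whisper also reads
)
converter.convert("faster-distil-whisper-large-v2", quantization="float16")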
faster_whisper/transcribe.py
Outdated
@@ -623,6 +636,10 @@ def generate_with_fallback(
        max_initial_timestamp_index = int(
            round(options.max_initial_timestamp / self.time_precision)
        )
        if options.max_new_tokens is not None:
            max_length = min(self.max_length, len(prompt) + options.max_new_tokens)
@metame-none,
Can you re-check this logic? I think we should use the max_length option as input instead of creating a new param max_new_tokens.
We can even simplify it like:
max_length = options.max_length if options.max_length is not None else self.max_length
And please rebase your branch on the latest version of master to avoid conflicts.
Thanks for your advice. I use a new max_new_tokens just to keep the same logic as distil-whisper, where the setting is:
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
torch_dtype=torch_dtype,
device=device,
)
and the default max_length seems to include the length of the prompt. So, to make the two compatible, I introduced a new param. Not sure if this is a good way, glad to hear your advice. Thanks a lot.
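(Just to make the difference concrete, a tiny sketch of how the two limits relate; the token ids and values below are made up for illustration.)

# The prompt tokens count toward max_length but not toward max_new_tokens.
prompt = [50361, 50362, 50363]             # example prompt token ids (illustrative)
max_new_tokens = 128                       # the distil-whisper pipeline setting shown above
max_length = len(prompt) + max_new_tokens  # the equivalent budget expressed as max_length
print(max_length)                          # -> 131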
I removed the new param max_new_tokens and use options.max_length instead, thanks.
@metame-none, hello.
I think we should roll back to using both the max_new_tokens and max_length params, then update this logic similar to what HF transformers does:
# If the user passes `max_new_tokens`, increase its number to account for the prompt
if kwargs.get("max_new_tokens", None) is not None:
kwargs["max_new_tokens"] += len(text_prompt_ids)
if kwargs["max_new_tokens"] >= self.config.max_target_positions:
raise ValueError(
f"The length of the sliced `prompt_ids` is {len(text_prompt_ids)}, and the `max_new_tokens` "
f"{kwargs['max_new_tokens'] - len(text_prompt_ids)}. Thus, the combined length of the sliced "
f"`prompt_ids` and `max_new_tokens` is: {kwargs['max_new_tokens']}. This exceeds the "
f"`max_target_positions` of the Whisper model: {self.config.max_target_positions}. "
"You should either reduce the length of your prompt, or reduce the value of `max_new_tokens`, "
f"so that their combined length is less that {self.config.max_target_positions}."
)
=> raise error if max_new_tokens + len(prompt) > max_length
For more info, please check this PR.
Besides, I also agree that max_length includes the length of the prompt: https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig.max_length
Please take a look. Thanks.
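(A rough sketch of how that check could look inside generate_with_fallback, mirroring the diff above; this is not the merged code, just an illustration of the proposed validation.)

# Sketch of the proposed logic (fragment meant to live inside generate_with_fallback):
# derive max_length from max_new_tokens when given, then validate against the model limit.
if options.max_new_tokens is not None:
    max_length = len(prompt) + options.max_new_tokens
else:
    max_length = options.max_length if options.max_length is not None else self.max_length

if max_length > self.max_length:
    raise ValueError(
        f"The combined length of the prompt ({len(prompt)} tokens) and max_new_tokens "
        f"exceeds the model's maximum length ({self.max_length}). Reduce the prompt "
        "or the value of max_new_tokens."
    )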
@trungkienbkhn hi,
I rolled back to the version with max_new_tokens, please help to review, thanks.
Hi, at the time I converted, the ctranslate2 version I used was 3.21; the latest is 3.22, and it doesn't seem to have been tested on distil-whisper. As for config.json, the alignment_heads follows the logic in the code. I also tried to figure out what the alignment heads for distil-whisper should be, sadly with no luck. But I tested word_timestamps with this setting, and it seems to work well.
@sanchit-gandhi, thanks for your interesting repo, we are working to integrate it into faster-whisper. I'd like to clarify the alignment_heads field.
I found that this field does not exist in the generation_config.json file of your HF model, while it does exist in the original openai model. Could this affect the word-level timestamps of the transcription results?
I found two issues with faster-distil-whisper:
faster_whisper/transcribe.py
Outdated
@@ -642,6 +647,10 @@ def generate_with_fallback(
        max_initial_timestamp_index = int(
            round(options.max_initial_timestamp / self.time_precision)
        )
        if options.max_length is not None:
            max_length = min(self.max_length, len(prompt) + options.max_length)
@metame-none, hello. Why do you need to take the min in this logic? If we use a max_length greater than the default (448 for faster-whisper), will it cause negative effects? If not, I think we can make this logic more flexible by removing the min function.
You are right. I don't know if it will cause negative effects; I'll remove it.
@metame-none, for #557 (comment):
Besides, I think we should set chunk_length=15 in the preprocessor_config.json file:
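(For illustration, one way to apply that change to a converted model directory; the directory name below is a placeholder, not something defined in this PR.)

import json

# Set chunk_length to 15 seconds in the converted model's preprocessor_config.json.
config_path = "faster-distil-whisper-large-v2/preprocessor_config.json"  # placeholder path
with open(config_path) as f:
    config = json.load(f)
config["chunk_length"] = 15  # the chunk length suggested in the comment above
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)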
@trungkienbkhn for #557 (comment)
Conclusion from a few non-extensive tests:
It is likely that the distillation was done without prompting/conditioning on previous text.
Hey @trungkienbkhn, @metame-none, @Purfview, @funboarder13920! First of all, thank you for maintaining this amazing repository! It's such a valuable resource for the open-source speech community 🙌 Super excited to see how it improves over the coming months 🚀 Answering your questions sequentially below:
I think I tried that and the model was outputting empty transcriptions.
I didn't, but with the previous-text conditioning it was outputting non-stop pure hallucinations after the first chunk.
Not the case for faster-whisper. People would want to use distil because it's faster, but this effect is insignificant with faster-whisper. So with a smaller faster-whisper model you get the desired speedup.
@sanchit-gandhi Wanted to ask for some clarification here:
It was a month or two ago, prior to distil-whisper, but I tested this and found that the results posted in "insanely fast whisper" are heavily dependent on batching. If you keep the batch size at 1, I did not see transformers whisper outperforming faster-whisper. Am I missing something? Are there any other resources confirming that transformers whisper is indeed faster?
FYI, I tested my distil-whisper conversion model with the original alignment heads, but I encountered this error: This error only occurs if I set
After verifying, I found that it was stuck due to this logic:
Maybe this is a bug in ctranslate2's align function.
Does anyone have latency figures for batch size 1 faster distil-whisper on 2s audio clips (or at least an Ns-audio speed comparison between faster-whisper and faster distil-whisper)?
Hi, @trungkienbkhn, which alignment heads did you use? I tested with the default alignment heads and it seems OK. The original alignment heads for "large-v2" seem to start at layer 10, while distil-whisper only has two decoder layers. According to the author of whisper from openai, openai/whisper#1388 (comment)
So I suppose the default alignment heads should work just fine.
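(For reference, the usual default rule is to take every attention head in the second half of the decoder layers; the sketch below illustrates that rule, with distil-large-v2's layer and head counts assumed rather than taken from this PR.)

# Sketch: default alignment heads = all heads in the second half of the decoder layers.
# num_layers=2 and num_heads=20 are assumed values for distil-large-v2.
def default_alignment_heads(num_layers: int, num_heads: int) -> list:
    return [
        [layer, head]
        for layer in range(num_layers // 2, num_layers)
        for head in range(num_heads)
    ]

print(default_alignment_heads(2, 20))  # -> all 20 heads of decoder layer 1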
Hi, @SinanAkkoyun, I tested with an RTX 3090 on a 10s audio sample:
@metame-none Thank you so, so much!!
Hi, @trungkienbkhn, I added chunk_length as an input of transcribe, thanks.
Have you tried this with an initial prompt? I've been trying it out and it seems to break the transcripts when I provide a prompt. |
Hi, @metame-none. I used the original alignment heads starting from layer 10, which caused an error because the decoder only has 2 layers. I agree with you that we should use the default alignment heads. Thanks.
@trungkienbkhn please help to generate the corresponding Systran models on the Hub.
@metame-none, hello. FYI, we uploaded the Systran Faster Distil-Whisper conversion models to the Hugging Face Hub: Please verify, then update those models in this project's utils in your PR.
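(For reference, once those repos are registered in utils.py they should load like any other faster-whisper model; a small usage sketch, where the repo id, device, and compute type are assumptions.)

# Sketch: load a Systran faster-distil-whisper conversion from the Hugging Face Hub.
from faster_whisper import WhisperModel

model = WhisperModel(
    "Systran/faster-distil-whisper-large-v2",  # assumed Hub repo id
    device="cuda",
    compute_type="float16",
)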
Hi, just want to bump this comment.
hi, @ostegm |
hi, @trungkienbkhn |
README.md
Outdated
| Implementation | Precision | Beam size | Time | Gigaspeech WER |
| --- | --- | --- | --- | --- |
| distil-whisper/distil-large-v2 | fp16 | 4 | - | 10.36 |
| [faster-distil-large-v2](https://huggingface.co/metame/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
Can you please update these links according to the info in utils.py?
Thanks, I updated the README.
Thanks for your contribution, I'll merge it now.
reference issue: #533
features:
allowing the user to specify the chunk length and also the maximum generation length
usage:
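(An illustrative sketch of the intended call, assuming the transcribe parameters added in this PR; the model repo, audio path, and parameter values are placeholders.)

# Sketch: transcribe with the chunk length and generation limit introduced by this PR.
from faster_whisper import WhisperModel

model = WhisperModel("metame/faster-distil-whisper-large-v2", compute_type="float16")
segments, info = model.transcribe("audio.wav", chunk_length=15, max_new_tokens=128)
for segment in segments:
    print(segment.start, segment.end, segment.text)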
tested OK with converted models from CTranslate2 version 3.22 (Hugging Face models)