Fix Whisper Conversion Script: Correct decoder_attention_heads and _download function #26834
Conversation
Fix error in convert_openai_to_hf.py: "_download() missing 1 required positional argument: root"
Fix error in convert_openai_to_hf.py: "TypeError: byte indices must be integers or slices, not str"
Correct the assignment for `decoder_attention_heads` in the conversion script for the Whisper model.
Thanks for the fix, quite surprised that this went unnoticed for this long.
Seems like the docstring of the config is also wrong: the tiny model has 6 heads and 4 layers, not 6 layers and 4 heads (if you want to update this as well!)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
```python
checkpoint_path = _download(_MODELS[checkpoint_path], root)
original_checkpoint = torch.load(checkpoint_path, map_location="cpu")
```
Actually, with this we end up loading the checkpoint twice, which can be a problem for bigger models. Maybe just adding the root path can do the trick, WDYT?
Good point. We are reading the model file twice. Before, the `_download()` function returned raw bytes, not a dict, so the following code did not work:
```python
def _download(url: str, root: str) -> bytes:
    # [...]
    return model_bytes


def convert_openai_whisper_to_tfms(checkpoint_path, pytorch_dump_folder_path):
    if ".pt" not in checkpoint_path:
        original_checkpoint = _download(_MODELS[checkpoint_path])
    else:
        original_checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # Here, it gave the "byte indices must be integers or slices, not str" error mentioned:
    dimensions = original_checkpoint["dims"]
    state_dict = original_checkpoint["model_state_dict"]
```
I have modified the `_download()` code to reuse the read `model_bytes` by converting it to `io.BytesIO`, and then calling `torch.load()`:
```python
def _download(url: str, root: str) -> dict:
    # [...]
    return torch.load(io.BytesIO(model_bytes))
```
Let me know if you know or find a better approach.
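For illustration, here is a minimal self-contained sketch of what such a `_download()` could look like; the urllib-based download body is an assumption, since the real script's logic is elided above as `# [...]`:

```python
import io
import os
import urllib.request

import torch


def _download(url: str, root: str) -> dict:
    """Download `url` into `root` (if needed) and deserialize the checkpoint."""
    os.makedirs(root, exist_ok=True)
    target = os.path.join(root, os.path.basename(url))
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)  # hypothetical download step
    with open(target, "rb") as f:
        model_bytes = f.read()
    # Load straight from the in-memory bytes so the caller does not need to
    # read the checkpoint file from disk a second time:
    return torch.load(io.BytesIO(model_bytes), map_location="cpu")
```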
Thanks for the review!

Nice catch! Indeed, the default values are also wrong. I updated the code with the following:

- Correct encoder/decoder layers and attention heads count.
- Update model width (`d_model`) to 384.

Let me know if you don't agree with any of the changes and thanks again for all the feedback!
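For reference, a minimal sketch of the corrected defaults (the tiny architecture: 4 layers and 6 attention heads per stack, width 384, per the discussion above); the keyword list is illustrative, not the full `WhisperConfig` signature:

```python
from transformers import WhisperConfig

# Defaults matching the Whisper tiny architecture:
config = WhisperConfig(
    encoder_layers=4,
    encoder_attention_heads=6,
    decoder_layers=4,
    decoder_attention_heads=6,
    d_model=384,
)
```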
Thanks. @sanchit-gandhi the default config changes are actually breaking the default initialization, but it's a bug fix. How do you feel about it?
"""Converts a Whisper model in OpenAI format to Hugging Face format. | ||
|
||
Example: | ||
|
||
```bash | ||
# Converts the model from OpenAI to Hugging Face format: | ||
python convert_openai_to_hf.py \ | ||
--checkpoint_path tiny \ | ||
--pytorch_dump_folder_path whisper-tiny-hf | ||
``` | ||
|
||
```python | ||
>>> import torchaudio | ||
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration | ||
>>> from transformers.models.whisper.convert_openai_to_hf import convert_openai_whisper_to_tfms | ||
|
||
>>> # Converts the model from OpenAI to Hugging Face format: | ||
>>> convert_openai_whisper_to_tfms("tiny.en", "whisper-tiny.en-hf") # doctest: +IGNORE_RESULT | ||
|
||
>>> # Select an audio file: | ||
>>> audio_path = "https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav" | ||
|
||
>>> # Load the Whisper model in Hugging Face format: | ||
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en") | ||
>>> model = WhisperForConditionalGeneration.from_pretrained("whisper-tiny.en-hf") | ||
>>> model.config.forced_decoder_ids = None | ||
|
||
>>> # Select an audio file: | ||
>>> waveform, sampling_rate = torchaudio.load(audio_path) | ||
|
||
>>> # Use the model and processor to transcribe the audio: | ||
>>> input_features = processor( | ||
... waveform.squeeze().numpy(), sampling_rate=sampling_rate, return_tensors="pt" | ||
... ).input_features | ||
|
||
>>> # Generate token ids | ||
>>> predicted_ids = model.generate(input_features) | ||
|
||
>>> # Decode token ids to text | ||
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) | ||
|
||
>>> transcription[0] | ||
' Chapter 16. I might have told you of the beginning of this liaison in a few lines' | ||
``` | ||
""" |
Mmmm this should rather go in the `whisper.md` if not already there! I'd rather not have it here.
Sure! I moved it to the `whisper.md` documentation file. I checked the file with doctest, and it passed.
Config changes LGTM, but I would be in favour of directly promoting the pre-trained models on the Hub, rather than the conversion script.
Note that we have converted all official OpenAI checkpoints to Transformers already, so users should be able to download them pre-trained from the Hub!
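For instance, a minimal sketch of pulling an already-converted checkpoint straight from the Hub (using the `openai/whisper-tiny.en` repo referenced elsewhere in this thread):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# No conversion script needed: the official checkpoints already live on the Hub.
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
```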
docs/source/en/model_doc/whisper.md
Outdated
Here is a step-by-step guide to transcribing an audio sample using a pre-trained Whisper model:

```python
>>> import torchaudio
```
We should avoid using extra dependencies like `torchaudio` in our code snippets - could you maybe refactor this to only use dependencies in the Transformers library?
Changed to use `datasets.load_dataset`.
docs/source/en/model_doc/whisper.md
Outdated
```python
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration

>>> # Select an audio file:
>>> audio_path = "https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav"
```
Let's use a shorter audio file - this will take a long time to transcribe
Changed to use the `hf-internal-testing/librispeech_asr_dummy` dataset.
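For example, a minimal sketch of loading a short clip from that dataset with `datasets` instead of `torchaudio` (the exact snippet that landed in `whisper.md` may differ):

```python
from datasets import load_dataset

# A tiny dummy split with short LibriSpeech clips, good for quick examples:
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]
waveform, sampling_rate = audio["array"], audio["sampling_rate"]
```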
docs/source/en/model_doc/whisper.md
Outdated
```python
>>> # Load the Whisper model in Hugging Face format:
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
>>> model.config.forced_decoder_ids = None
```
No need - this API is deprecated now (cc @ArthurZucker):

```python
>>> model.config.forced_decoder_ids = None
```
I didn't know. Removed.
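As an aside, a hedged sketch of the generation API that supersedes `forced_decoder_ids` in recent Transformers versions; the multilingual `openai/whisper-tiny` checkpoint is assumed here, since the English-only `.en` models take neither `language` nor `task`:

```python
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]
input_features = processor(
    audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
).input_features

# Pass language/task to generate() instead of setting forced_decoder_ids:
predicted_ids = model.generate(input_features, language="en", task="transcribe")
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```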
docs/source/en/model_doc/whisper.md
Outdated
```python
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

>>> transcription[0]
' Chapter 16. I might have told you of the beginning of this liaison in a few lines'
```
We get a different result here to the one we got before (`Chapter 16` only)? Is one of them incorrect?
I do not know why, but the transcriptions in HF and OpenAI do not match using this audio. To further explore this, I've created a Kaggle notebook and transcribed the audio here using `tiny.en` from both libraries. This also includes what happens when we convert between formats using the script here (`convert_openai_to_hf.py`) and the one in #26854 (`convert_hf_to_openai.py`).
Here are the transcription results:
| Model | Transcription[:30] | SHA256 Prefix |
|---|---|---|
| OpenAI | `" Chapter 16 I might have told "` | `80e1a202` |
| Hugging Face | `" Chapter 16."` | `c9f30e2e` |
| OpenAI -> HF | `" Chapter 16. I might have told"` | `090c43de` |
| HF -> OpenAI | `" Chapter 16 I might have told "` | `80e1a202` |
From the results, transcriptions using the OpenAI library are consistent, even after conversion from the HF model, suggesting the conversion is accurate. However, the transcriptions using the HF library differ, both with your model at https://huggingface.co/openai/whisper-tiny.en and after converting the original `tiny.en` from OpenAI.

I don't know the reason for this; maybe it's some post-processing step. I have also not tested other model sizes or audios.

Should I open a new ticket with this?
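For reproducibility, a small hypothetical helper showing how the table's SHA256 prefixes could be derived from the transcription strings (the exact hashing scheme used in the notebook is an assumption):

```python
import hashlib


def sha256_prefix(text: str, n: int = 8) -> str:
    """Return the first n hex characters of the SHA-256 digest of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:n]


# Identical prefixes indicate byte-identical transcriptions:
print(sha256_prefix(" Chapter 16 I might have told "))
```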
docs/source/en/model_doc/whisper.md
Outdated
```python
' Chapter 16. I might have told you of the beginning of this liaison in a few lines'
```

This step is not usually required if we are using the models already [provided by OpenAI in the Hugging Face Hub](https://huggingface.co/openai).
Let's just promote this directly instead and remove the notes on the conversion script: there is no need for users to have to use the conversion script since we've already converted all official Whisper checkpoints on the Hub https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
OK! Conversion example completely removed.
```diff
@@ -151,7 +155,7 @@ def convert_openai_whisper_to_tfms(checkpoint_path, pytorch_dump_folder_path):
         encoder_layers=dimensions["n_audio_layer"],
         encoder_attention_heads=dimensions["n_audio_head"],
         decoder_layers=dimensions["n_text_layer"],
-        decoder_attention_heads=dimensions["n_text_state"],
```
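For context, the replacement this PR introduces uses the head count, matching the historical line quoted in the blame discussion below:

```diff
+        decoder_attention_heads=dimensions["n_text_head"],
```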
Surprised we managed to convert the original checkpoints with this bug @ArthurZucker 🤔 The state dicts surely won't have matched? Maybe we hardcoded this before?
Yeah, but I think I hardcoded the values when converting and then later on made it automatic. I checked by actually re-running the script and seeing that this was a nice typo 🤣 but it's a good sign that no one else tried to convert the checkpoints!
Not sure about the history behind this. But looking at the blame, the script was correct once:

```python
decoder_attention_heads=dimensions["n_text_head"],
```

Then it was deleted and recovered in #20600. That's where the problem seems to come from. So the original script you used may have worked properly.
Nice digging! Yep, I think I uploaded an old version, late by a few commits.
Thanks for the review, @sanchit-gandhi. I think I have made all the changes requested. Could you please check our comments on the transcription differences?
LGTM - thanks for fixing this @zuazo and for the explanation on the differences! Since the logits for a single forward pass are equivalent, this is most likely the accumulation of small numerical errors.
If you can just fix the conflict @zuazo we'll be able to merge!
Thanks for the wait!
Fix Whisper Conversion Script: Correct decoder_attention_heads and _download function (huggingface#26834)

* Fix error in convert_openai_to_hf.py: "_download() missing 1 required positional argument: root"
* Fix error in convert_openai_to_hf.py: "TypeError: byte indices must be integers or slices, not str"
* Fix decoder_attention_heads value in convert_openai_to_hf.py. Correct the assignment for `decoder_attention_heads` in the conversion script for the Whisper model.
* Black reformat convert_openai_to_hf.py file.
* Fix Whisper model configuration defaults (for Tiny). - Correct encoder/decoder layers and attention heads count. - Update model width (`d_model`) to 384.
* Add docstring to the convert_openai_to_hf.py script with a doctest
* Add shebang and +x permission to the convert_openai_to_hf.py
* convert_openai_to_hf.py: reuse the read model_bytes in the _download() function
* Move convert_openai_to_hf.py doctest example to whisper.md
* whisper.md: Add an inference example to the Conversion section.
* whisper.md: remove `model.config.forced_decoder_ids` from examples (deprecated)
* whisper.md: Remove "## Format Conversion" section; not used by users
* whisper.md: Use librispeech_asr_dummy dataset and load_dataset()
What does this PR do?

This PR addresses two issues in the `convert_openai_to_hf.py` script for it to work correctly.

It corrects the `decoder_attention_heads` value. This did not produce any error, but the converted models did not transcribe correctly.

It also fixes the `_download()` function: the missing `root` parameter previously gave the error `_download() missing 1 required positional argument: root`.

Before submitting

Who can review?