Fix Whisper Conversion Script: Correct decoder_attention_heads and _download function #26834
Conversation
Fix error in convert_openai_to_hf.py: "_download() missing 1 required positional argument: root"
Fix error in convert_openai_to_hf.py: "TypeError: byte indices must be integers or slices, not str"
Correct the assignment for `decoder_attention_heads` in the conversion script for the Whisper model.
Thanks for the fix, quite surprised that this went unnoticed for this long.
Seems like the docstring of the config is also wrong: the tiny model has 6 heads and 4 layers, not 6 layers and 4 heads (if you want to update this as well!)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
```python
checkpoint_path = _download(_MODELS[checkpoint_path], root)
original_checkpoint = torch.load(checkpoint_path, map_location="cpu")
```
Actually, with this we end up loading the checkpoint twice, which can be a problem for bigger models. Maybe just adding the root path can do the trick, WDYT?
Good point. We are reading the model file twice. Before, the `_download()` function returned raw bytes, not a dict, so the following code did not work:
```python
def _download(url: str, root: str) -> bytes:
    # [...]
    return model_bytes


def convert_openai_whisper_to_tfms(checkpoint_path, pytorch_dump_folder_path):
    if ".pt" not in checkpoint_path:
        original_checkpoint = _download(_MODELS[checkpoint_path])
    else:
        original_checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # Here, it gave the "byte indices must be integers or slices, not str" error mentioned:
    dimensions = original_checkpoint["dims"]
    state_dict = original_checkpoint["model_state_dict"]
```
I have modified the `_download()` code to reuse the read `model_bytes` by converting it to `io.BytesIO`, and then calling `torch.load()`:
```python
def _download(url: str, root: str) -> dict:
    # [...]
    return torch.load(io.BytesIO(model_bytes))
```
Let me know if you know or find a better approach.
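For illustration, here is a minimal self-contained sketch of what such a `_download()` could look like; the urllib-based download body is an assumption, since the real script's logic is elided above as `# [...]`:

```python
import io
import os
import urllib.request

import torch


def _download(url: str, root: str) -> dict:
    """Download `url` into `root` (if needed) and deserialize the checkpoint."""
    os.makedirs(root, exist_ok=True)
    target = os.path.join(root, os.path.basename(url))
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)  # hypothetical download step
    with open(target, "rb") as f:
        model_bytes = f.read()
    # Load straight from the in-memory bytes so the caller does not need to
    # read the checkpoint file from disk a second time:
    return torch.load(io.BytesIO(model_bytes), map_location="cpu")
```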
Thanks for the review!

Nice catch! Indeed, the default values are also wrong. I updated the code with the following:

- Correct encoder/decoder layers and attention heads count.
- Update model width (`d_model`) to 384.

Let me know if you don't agree with any of the changes and thanks again for all the feedback!
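For reference, a minimal sketch of the corrected defaults (the tiny architecture: 4 layers and 6 attention heads per stack, width 384, per the discussion above); the keyword list is illustrative, not the full `WhisperConfig` signature:

```python
from transformers import WhisperConfig

# Defaults matching the Whisper tiny architecture:
config = WhisperConfig(
    encoder_layers=4,
    encoder_attention_heads=6,
    decoder_layers=4,
    decoder_attention_heads=6,
    d_model=384,
)
```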
Thanks. @sanchit-gandhi the default config changes are actually breaking the default initialization, but it's a bug fix. How do you feel about it?
"""Converts a Whisper model in OpenAI format to Hugging Face format. | ||
|
||
Example: | ||
|
||
```bash | ||
# Converts the model from OpenAI to Hugging Face format: | ||
python convert_openai_to_hf.py \ | ||
--checkpoint_path tiny \ | ||
--pytorch_dump_folder_path whisper-tiny-hf | ||
``` | ||
|
||
```python | ||
>>> import torchaudio | ||
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration | ||
>>> from transformers.models.whisper.convert_openai_to_hf import convert_openai_whisper_to_tfms | ||
|
||
>>> # Converts the model from OpenAI to Hugging Face format: | ||
>>> convert_openai_whisper_to_tfms("tiny.en", "whisper-tiny.en-hf") # doctest: +IGNORE_RESULT | ||
|
||
>>> # Select an audio file: | ||
>>> audio_path = "https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav" | ||
|
||
>>> # Load the Whisper model in Hugging Face format: | ||
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en") | ||
>>> model = WhisperForConditionalGeneration.from_pretrained("whisper-tiny.en-hf") | ||
>>> model.config.forced_decoder_ids = None | ||
|
||
>>> # Select an audio file: | ||
>>> waveform, sampling_rate = torchaudio.load(audio_path) | ||
|
||
>>> # Use the model and processor to transcribe the audio: | ||
>>> input_features = processor( | ||
... waveform.squeeze().numpy(), sampling_rate=sampling_rate, return_tensors="pt" | ||
... ).input_features | ||
|
||
>>> # Generate token ids | ||
>>> predicted_ids = model.generate(input_features) | ||
|
||
>>> # Decode token ids to text | ||
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) | ||
|
||
>>> transcription[0] | ||
' Chapter 16. I might have told you of the beginning of this liaison in a few lines' | ||
``` | ||
""" |
Mmmm this should rather go in the `whisper.md` if not already there! I'd rather not have it here.
Sure! I moved it to the `whisper.md` documentation file. I checked the file with doctest, and it passed.
Config changes LGTM, but I would be in favour of directly promoting the pre-trained models on the Hub, rather than the conversion script.
Note that we have converted all official OpenAI checkpoints to Transformers already, so users should be able to download them pre-trained from the Hub!
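For instance, a minimal sketch of pulling an already-converted checkpoint straight from the Hub (using the `openai/whisper-tiny.en` repo referenced elsewhere in this thread):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# No conversion script needed: the official checkpoints already live on the Hub.
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
```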
docs/source/en/model_doc/whisper.md
Outdated
Here is a step-by-step guide to transcribing an audio sample using a pre-trained Whisper model:

```python
>>> import torchaudio
```
We should avoid using extra dependencies like `torchaudio` in our code snippets - could you maybe refactor this to only use dependencies in the Transformers library?
Changed to use `datasets.load_dataset`.
docs/source/en/model_doc/whisper.md
Outdated
```python
>>> from transformers import WhisperProcessor, WhisperForConditionalGeneration

>>> # Select an audio file:
>>> audio_path = "https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav"
```
Let's use a shorter audio file - this will take a long time to transcribe
Changed to use the `hf-internal-testing/librispeech_asr_dummy` dataset.
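For example, a minimal sketch of loading a short clip from that dataset with `datasets` instead of `torchaudio` (the exact snippet that landed in `whisper.md` may differ):

```python
from datasets import load_dataset

# A tiny dummy split with short LibriSpeech clips, good for quick examples:
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]
waveform, sampling_rate = audio["array"], audio["sampling_rate"]
```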
docs/source/en/model_doc/whisper.md
Outdated
```python
>>> # Load the Whisper model in Hugging Face format:
>>> processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
>>> model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en")
>>> model.config.forced_decoder_ids = None
```
No need - this API is deprecated now (cc @ArthurZucker):

```python
>>> model.config.forced_decoder_ids = None
```
I didn't know. Removed.
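As an aside, a hedged sketch of the generation API that supersedes `forced_decoder_ids` in recent Transformers versions; the multilingual `openai/whisper-tiny` checkpoint is assumed here, since the English-only `.en` models take neither `language` nor `task`:

```python
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]
input_features = processor(
    audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt"
).input_features

# Pass language/task to generate() instead of setting forced_decoder_ids:
predicted_ids = model.generate(input_features, language="en", task="transcribe")
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```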
docs/source/en/model_doc/whisper.md
Outdated
```python
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

>>> transcription[0]
' Chapter 16. I might have told you of the beginning of this liaison in a few lines'
```
We get a different result here to the one we got before (`Chapter 16` only)? Is one of them incorrect?
I do not know why, but the transcriptions in HF and OpenAI do not match using this audio. To further explore this, I've created a Kaggle notebook and transcribed the audio here using `tiny.en` from both libraries. This also includes what happens when we convert between formats using the script here (`convert_openai_to_hf.py`) and the one in #26854 (`convert_hf_to_openai.py`).
Here are the transcription results:
| Model | Transcription[:30] | SHA256 Prefix |
|---|---|---|
| OpenAI | `" Chapter 16 I might have told "` | `80e1a202` |
| Hugging Face | `" Chapter 16."` | `c9f30e2e` |
| OpenAI -> HF | `" Chapter 16. I might have told"` | `090c43de` |
| HF -> OpenAI | `" Chapter 16 I might have told "` | `80e1a202` |
From the results, transcriptions using the OpenAI library are consistent, even after conversion from the HF model, suggesting the conversion is accurate. However, the transcriptions using the HF library differ, both with your model at https://huggingface.co/openai/whisper-tiny.en and after converting the original `tiny.en` from OpenAI.

I don't know the reason for this; maybe it's some post-processing step. I have also not tested other model sizes or audios.

Should I open a new ticket with this?
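For reproducibility, a small hypothetical helper showing how the table's SHA256 prefixes could be derived from the transcription strings (the exact hashing scheme used in the notebook is an assumption):

```python
import hashlib


def sha256_prefix(text: str, n: int = 8) -> str:
    """Return the first n hex characters of the SHA-256 digest of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:n]


# Identical prefixes indicate byte-identical transcriptions:
print(sha256_prefix(" Chapter 16 I might have told "))
```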
docs/source/en/model_doc/whisper.md
Outdated
```python
' Chapter 16. I might have told you of the beginning of this liaison in a few lines'
```

This step is not usually required if we are using the models already [provided by OpenAI in the Hugging Face Hub](https://huggingface.co/openai).
Let's just promote this directly instead and remove the notes on the conversion script: there is no need for users to have to use the conversion script since we've already converted all official Whisper checkpoints on the Hub https://huggingface.co/collections/openai/whisper-release-6501bba2cf999715fd953013
OK! Conversion example completely removed.
```diff
@@ -151,7 +155,7 @@ def convert_openai_whisper_to_tfms(checkpoint_path, pytorch_dump_folder_path):
         encoder_layers=dimensions["n_audio_layer"],
         encoder_attention_heads=dimensions["n_audio_head"],
         decoder_layers=dimensions["n_text_layer"],
-        decoder_attention_heads=dimensions["n_text_state"],
```
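For context, the replacement this PR introduces uses the head count, matching the historical line quoted in the blame discussion below:

```diff
+        decoder_attention_heads=dimensions["n_text_head"],
```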
Surprised we managed to convert the original checkpoints with this bug @ArthurZucker 🤔 The state dicts surely won't have matched? Maybe we hardcoded this before?
Yeah, but I think I hardcoded the values when converting and then later on made it automatic. I checked by actually re-running the script and seeing that this was a nice typo 🤣 but it's a good sign that no one else tried to convert the checkpoints!
Not sure about the history behind this. But looking at the blame, the script was correct once:

```python
decoder_attention_heads=dimensions["n_text_head"],
```

Then it was deleted and recovered in #20600. That's where the problem seems to come from. So the original script you used may have worked properly.
Nice digging! Yep, I think I uploaded an old version, late by a few commits.
Thanks for the review, @sanchit-gandhi. I think I have made all the changes requested. Could you please check our comments on the transcription differences?
LGTM - thanks for fixing this @zuazo and for the explanation on the differences! Since the logits for a single forward pass are equivalent, this is most likely the accumulation of small numerical errors.
If you can just fix the conflict @zuazo we'll be able to merge!
Thanks for the wait!
Fix Whisper Conversion Script: Correct decoder_attention_heads and _download function (huggingface#26834)

* Fix error in convert_openai_to_hf.py: "_download() missing 1 required positional argument: root"
* Fix error in convert_openai_to_hf.py: "TypeError: byte indices must be integers or slices, not str"
* Fix decoder_attention_heads value in convert_openai_to_hf.py. Correct the assignment for `decoder_attention_heads` in the conversion script for the Whisper model.
* Black reformat convert_openai_to_hf.py file.
* Fix Whisper model configuration defaults (for Tiny). - Correct encoder/decoder layers and attention heads count. - Update model width (`d_model`) to 384.
* Add docstring to the convert_openai_to_hf.py script with a doctest
* Add shebang and +x permission to the convert_openai_to_hf.py
* convert_openai_to_hf.py: reuse the read model_bytes in the _download() function
* Move convert_openai_to_hf.py doctest example to whisper.md
* whisper.md: Add an inference example to the Conversion section.
* whisper.md: remove `model.config.forced_decoder_ids` from examples (deprecated)
* whisper.md: Remove "## Format Conversion" section; not used by users
* whisper.md: Use librispeech_asr_dummy dataset and load_dataset()
What does this PR do?

This PR addresses two issues in the `convert_openai_to_hf.py` script for it to work correctly.

It corrects the `decoder_attention_heads` value. This did not produce any error, but the converted models did not transcribe correctly.

It also fixes the `_download()` function: the missing `root` parameter previously gave the error `_download() missing 1 required positional argument: root`.

Before submitting

Who can review?