[Whisper] Add conversion script for the tokenizer #27338
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from d1c25fa to deb624a.
Thanks for adding!
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The TensorFlow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts).
The original code can be found [here](https://github.com/openai/whisper).
This was duplicated in #26834.
Thanks for the speedy support @ArthurZucker!
for bpe_tokens in merges:
    writer.write(bpe_tokens + "\n")

hf_tokenizer = WhisperTokenizer(vocab_file, merge_file)
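For context on where `merges` comes from: the full conversion (per the gist linked in the commit history below) recovers the merge list from tiktoken's byte-to-rank table. A minimal sketch of that approach — `token_bytes_to_string`, `bpe`, and `recover_merges` are illustrative names, not the PR's exact code:

```python
from tiktoken.load import load_tiktoken_bpe  # optional dependency of the script
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

BYTE_ENCODER = bytes_to_unicode()  # byte int -> printable character


def token_bytes_to_string(token: bytes) -> str:
    # Map raw bytes to the GPT-2 style printable characters used in merges.txt.
    return "".join(BYTE_ENCODER[b] for b in token)


def bpe(mergeable_ranks: dict, token: bytes, max_rank: int = None) -> list:
    # Re-run BPE on `token` using only merges of rank < max_rank; with
    # max_rank set to the token's own rank, the loop stops one step early
    # and returns exactly the two parts whose merge produced the token.
    parts = [bytes([b]) for b in token]
    while True:
        min_idx, min_rank = None, None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx, min_rank = i, rank
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts


def recover_merges(tiktoken_file: str) -> list:
    mergeable_ranks = load_tiktoken_bpe(tiktoken_file)  # bytes -> rank
    merges = []
    for token, rank in mergeable_ranks.items():
        if len(token) == 1:
            continue  # single bytes form the base vocabulary, not merges
        pair = bpe(mergeable_ranks, token, max_rank=rank)
        assert len(pair) == 2
        merges.append(" ".join(token_bytes_to_string(part) for part in pair))
    return merges
```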
Do we need to convert the fast tokenizer as well? Or all good with just the slow?
The fast tokenizer can always be converted from the slow one when loading with `AutoTokenizer`, so no need I'd say, but I can add a comment.
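To illustrate why converting only the slow tokenizer suffices, a sketch (file and folder names assumed) of the slow-to-fast conversion happening at load time:

```python
from transformers import AutoTokenizer, WhisperTokenizer

# Assume vocab.json and merges.txt were produced by the conversion script.
slow_tokenizer = WhisperTokenizer("vocab.json", "merges.txt")
slow_tokenizer.save_pretrained("converted-whisper")

# AutoTokenizer builds the fast tokenizer from the slow files on the fly,
# so the conversion script only needs to produce the slow files.
fast_tokenizer = AutoTokenizer.from_pretrained("converted-whisper", use_fast=True)
print(type(fast_tokenizer).__name__)  # WhisperTokenizerFast
```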
args = parser.parse_args()

if args.convert_tokenizer:
To me it's more intuitive to always convert the tokenizer, since we can't use the model without it
Yes, but that wouldn't be backward compatible (BC) because it requires `tiktoken`.
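That is, gating the conversion behind an opt-in flag keeps the script usable without `tiktoken` installed. A sketch of that pattern (flag help text and error message wording assumed):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--convert_tokenizer",
    action="store_true",
    help="Also convert the tokenizer (requires the optional tiktoken package).",
)
args = parser.parse_args()

if args.convert_tokenizer:
    try:
        from tiktoken.load import load_tiktoken_bpe  # noqa: F401
    except ImportError:
        raise ImportError("Converting the tokenizer requires tiktoken: pip install tiktoken")
    # ... run the tokenizer conversion here ...
# The model weight conversion runs either way, keeping the script backward compatible.
```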
else:
    from tiktoken.load import load_tiktoken_bpe

NUM_LANGUAGES_PER_RELEASE = {1: 99, 2: 99, 3: 100}
These could be fetched from the model metadata, no? Rather than having the user input them?
I decided not to use the full model's data to keep it separate; otherwise I'd have to either add a new argument to the conversion function or fetch the full tokenizer, which requires the `whisper` package. This is simpler IMO.
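For illustration, a sketch (the `LANGUAGE_CODES` list and `build_language_tokens` helper are hypothetical) of how a release-to-language-count mapping can drive the special language tokens without touching the model or the `whisper` package:

```python
NUM_LANGUAGES_PER_RELEASE = {1: 99, 2: 99, 3: 100}

# Hypothetical ordered list of Whisper language codes; truncated for the sketch.
LANGUAGE_CODES = ["en", "zh", "de", "es", "ru"]


def build_language_tokens(release: int) -> list:
    # Each release exposes a fixed number of languages as special tokens,
    # so the user supplies the release instead of the script fetching metadata.
    num_languages = NUM_LANGUAGES_PER_RELEASE[release]
    return [f"<|{code}|>" for code in LANGUAGE_CODES[:num_languages]]
```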
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
* draft
* updates
* full conversion taken from `https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee`
* push
* nits
* updates
* more nits
* Add co author Co-authored-by: Joshua Lochner <admin@xenova.com>
* fixup
* cleanup
* styling
* add proper path
* update
* nits
* don't push the exit
* clean
* update whisper doc
* don't error out if tiktoken is not here
* make sure we are BC with conversion
* nit
* Update docs/source/en/model_doc/whisper.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* merge and update
* update markdown

Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
What does this PR do?
Aligned with #27336, this PR adds the conversion of the tokenizer from `tiktoken` to `transformers`.
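Usage would look roughly like the following; the script path and the checkpoint/output flag names are assumptions based on the diff above, not verbatim from the PR:

```bash
pip install tiktoken  # only needed when converting the tokenizer
python src/transformers/models/whisper/convert_openai_to_hf.py \
    --checkpoint_path tiny \
    --pytorch_dump_folder_path ./whisper-tiny-hf \
    --convert_tokenizer
```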