
[Whisper] Add conversion script for the tokenizer #27338

Merged
merged 25 commits into main from whisper-v3-nots
Nov 7, 2023
Conversation

ArthurZucker
Collaborator

What does this PR do?

Aligned with #27336, this PR adds conversion of the tokenizer from tiktoken to transformers.
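At a high level, the conversion re-derives GPT-2-style vocab and merges data from tiktoken's byte-to-rank table. This is a minimal sketch of that idea; the helper names and toy ranks below are illustrative, not the PR's actual code (the real script follows the gist linked later in this thread):

```python
# Sketch: recover an ordered merge list from a tiktoken rank table.
# A tiktoken BPE file maps token bytes -> rank; a slow tokenizer needs a
# vocab (token -> id) plus merge rules in priority order.

def bpe(mergeable_ranks, token, max_rank=None):
    # Re-run BPE over the raw bytes of `token`, stopping before merges
    # whose rank is >= max_rank; this recovers the pair that created it.
    parts = [bytes([b]) for b in token]
    while True:
        min_idx, min_rank = None, None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx, min_rank = i, rank
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts

def convert_tiktoken_to_vocab_and_merges(mergeable_ranks):
    vocab = {token: rank for token, rank in mergeable_ranks.items()}
    merges = []
    for token, rank in mergeable_ranks.items():
        if len(token) == 1:
            continue  # single bytes are base vocab entries, not merges
        pair = tuple(bpe(mergeable_ranks, token, max_rank=rank))
        if len(pair) == 2:
            merges.append(pair)
    return vocab, merges
```

Each multi-byte token's rank doubles as its merge priority, which is why the merge list can be rebuilt without any extra metadata.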

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Nov 7, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker ArthurZucker marked this pull request as ready for review November 7, 2023 11:23
Collaborator

@amyeroberts amyeroberts left a comment

Thanks for adding!

ArthurZucker and others added 2 commits November 7, 2023 13:41
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Comment on lines -37 to -38
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The Tensorflow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts).
The original code can be found [here](https://github.com/openai/whisper).
Collaborator Author

This was duplicated in #26834.

Contributor

@sanchit-gandhi sanchit-gandhi left a comment

Thanks for the speedy support @ArthurZucker!

for bpe_tokens in merges:
writer.write(bpe_tokens + "\n")

hf_tokenizer = WhisperTokenizer(vocab_file, merge_file)
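For context on this snippet: the slow WhisperTokenizer consumes GPT-2-style files, i.e. a vocab.json mapping token string to id and a merges.txt with one space-separated pair per line. A hedged sketch of writing them (the file layout is assumed from the GPT-2 format, not copied from the PR):

```python
import json
import os

def write_tokenizer_files(vocab, merges, save_dir):
    # vocab: dict of token string -> integer id
    # merges: iterable of (first, second) string pairs, in merge-priority order
    vocab_file = os.path.join(save_dir, "vocab.json")
    merge_file = os.path.join(save_dir, "merges.txt")
    with open(vocab_file, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)
    with open(merge_file, "w", encoding="utf-8") as f:
        f.write("#version: 0.2\n")  # conventional header in GPT-2 merges files
        for first, second in merges:
            f.write(f"{first} {second}\n")
    return vocab_file, merge_file
```

The resulting paths can then be handed to WhisperTokenizer(vocab_file, merge_file) as in the snippet above.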
Contributor

Do we need to convert the fast tokenizer as well? Or all good with just the slow?

Collaborator Author

The fast tokenizer can always be converted from the slow one when loading with AutoTokenizer, so I'd say there's no need, but I can add a comment.

args = parser.parse_args()

if args.convert_tokenizer:
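The flag being discussed would look roughly like this (a stdlib sketch; the script's other arguments are omitted, and the function wrapper is for illustration):

```python
import argparse

def parse_args(argv=None):
    # Minimal sketch of the opt-in tokenizer-conversion flag.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--convert_tokenizer",
        action="store_true",
        help="Also convert the tiktoken tokenizer (requires the tiktoken package).",
    )
    return parser.parse_args(argv)
```

With action="store_true", args.convert_tokenizer defaults to False unless the flag is passed, which is what keeps the script usable for weight-only conversion without tiktoken installed.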
Contributor

To me it's more intuitive to always convert the tokenizer, since we can't use the model without it

Collaborator Author

Yes, but that wouldn't be backward compatible, because it requires tiktoken.
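In other words, importing tiktoken unconditionally would break existing users of the script who only convert model weights. A sketch of the lazy-import guard (the function name is illustrative):

```python
def load_bpe_ranks(tiktoken_file):
    # Import tiktoken only when tokenizer conversion is actually requested,
    # so weight-only conversions keep working without the extra dependency.
    try:
        from tiktoken.load import load_tiktoken_bpe
    except ImportError as e:
        raise ImportError(
            "Converting the tokenizer requires tiktoken: `pip install tiktoken`"
        ) from e
    return load_tiktoken_bpe(tiktoken_file)
```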

else:
from tiktoken.load import load_tiktoken_bpe

NUM_LANGUAGES_PER_RELEASE = {1: 99, 2: 99, 3: 100}
Contributor

These could be fetched from the model metadata, no? Rather than having the user input them?

Collaborator

@ArthurZucker ArthurZucker Nov 7, 2023

I decided not to use the full model's data, to keep the tokenizer conversion separate; otherwise I'd have to either add a new argument to the conversion function or fetch the full tokenizer, which requires the whisper package. I think this is simpler.

@ArthurZucker ArthurZucker merged commit 88832c0 into main Nov 7, 2023
@ArthurZucker ArthurZucker deleted the whisper-v3-nots branch November 7, 2023 14:07
@ArthurZucker ArthurZucker restored the whisper-v3-nots branch November 7, 2023 14:08
@ArthurZucker ArthurZucker deleted the whisper-v3-nots branch November 7, 2023 14:08
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 19, 2023
* draft

* updates

* full conversion taken from `https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee`

* push

* nits

* updates

* more nits

* Add co author

Co-authored-by: Joshua Lochner <admin@xenova.com>

* fixup

* cleanup

* styling

* add proper path

* update

* nits

* don't push the exit

* clean

* update whisper doc

* don't error out if tiktoken is not here

* make sure we are BC with conversion

* nit

* Update docs/source/en/model_doc/whisper.md

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* merge and update

* update markdown

* Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>

---------

Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>