[Whisper] Add conversion script for the tokenizer #27338
Conversation
The documentation is not available anymore as the PR was closed or merged.
Force-pushed from d1c25fa to deb624a.
Thanks for adding!
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
This model was contributed by [Arthur Zucker](https://huggingface.co/ArthurZ). The TensorFlow version of this model was contributed by [amyeroberts](https://huggingface.co/amyeroberts).
The original code can be found [here](https://github.com/openai/whisper).
This was duplicated in #26834.
Thanks for the speedy support @ArthurZucker!
for bpe_tokens in merges:
    writer.write(bpe_tokens + "\n")

hf_tokenizer = WhisperTokenizer(vocab_file, merge_file)
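For context on where `merges` comes from: the full conversion (per the gist linked in the commit history below) recovers the merge list from tiktoken's byte-to-rank table. A minimal sketch of that approach — `token_bytes_to_string`, `bpe`, and `recover_merges` are illustrative names, not the PR's exact code:

```python
from tiktoken.load import load_tiktoken_bpe  # optional dependency of the script
from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

BYTE_ENCODER = bytes_to_unicode()  # byte int -> printable character


def token_bytes_to_string(token: bytes) -> str:
    # Map raw bytes to the GPT-2 style printable characters used in merges.txt.
    return "".join(BYTE_ENCODER[b] for b in token)


def bpe(mergeable_ranks: dict, token: bytes, max_rank: int = None) -> list:
    # Re-run BPE on `token` using only merges of rank < max_rank; with
    # max_rank set to the token's own rank, the loop stops one step early
    # and returns exactly the two parts whose merge produced the token.
    parts = [bytes([b]) for b in token]
    while True:
        min_idx, min_rank = None, None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx, min_rank = i, rank
        if min_rank is None or (max_rank is not None and min_rank >= max_rank):
            break
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
    return parts


def recover_merges(tiktoken_file: str) -> list:
    mergeable_ranks = load_tiktoken_bpe(tiktoken_file)  # bytes -> rank
    merges = []
    for token, rank in mergeable_ranks.items():
        if len(token) == 1:
            continue  # single bytes form the base vocabulary, not merges
        pair = bpe(mergeable_ranks, token, max_rank=rank)
        assert len(pair) == 2
        merges.append(" ".join(token_bytes_to_string(part) for part in pair))
    return merges
```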
Do we need to convert the fast tokenizer as well? Or all good with just the slow?
The fast tokenizer can always be converted from the slow one when loading with `AutoTokenizer`, so no need I'd say, but I can add a comment.
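To illustrate why converting only the slow tokenizer suffices, a sketch (file and folder names assumed) of the slow-to-fast conversion happening at load time:

```python
from transformers import AutoTokenizer, WhisperTokenizer

# Assume vocab.json and merges.txt were produced by the conversion script.
slow_tokenizer = WhisperTokenizer("vocab.json", "merges.txt")
slow_tokenizer.save_pretrained("converted-whisper")

# AutoTokenizer builds the fast tokenizer from the slow files on the fly,
# so the conversion script only needs to produce the slow files.
fast_tokenizer = AutoTokenizer.from_pretrained("converted-whisper", use_fast=True)
print(type(fast_tokenizer).__name__)  # WhisperTokenizerFast
```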
args = parser.parse_args()

if args.convert_tokenizer:
To me it's more intuitive to always convert the tokenizer, since we can't use the model without it
Yes, but that wouldn't be backward compatible (BC) because it requires `tiktoken`.
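That is, gating the conversion behind an opt-in flag keeps the script usable without `tiktoken` installed. A sketch of that pattern (flag help text and error message wording assumed):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--convert_tokenizer",
    action="store_true",
    help="Also convert the tokenizer (requires the optional tiktoken package).",
)
args = parser.parse_args()

if args.convert_tokenizer:
    try:
        from tiktoken.load import load_tiktoken_bpe  # noqa: F401
    except ImportError:
        raise ImportError("Converting the tokenizer requires tiktoken: pip install tiktoken")
    # ... run the tokenizer conversion here ...
# The model weight conversion runs either way, keeping the script backward compatible.
```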
else:
    from tiktoken.load import load_tiktoken_bpe

NUM_LANGUAGES_PER_RELEASE = {1: 99, 2: 99, 3: 100}
These could be fetched from the model metadata, no? Rather than having the user input them?
I decided not to use the full model's data to keep it separate; otherwise I'd have to either add a new argument to the conversion function or fetch the full tokenizer, which requires the `whisper` package. This is simpler IMO.
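For illustration, a sketch (the `LANGUAGE_CODES` list and `build_language_tokens` helper are hypothetical) of how a release-to-language-count mapping can drive the special language tokens without touching the model or the `whisper` package:

```python
NUM_LANGUAGES_PER_RELEASE = {1: 99, 2: 99, 3: 100}

# Hypothetical ordered list of Whisper language codes; truncated for the sketch.
LANGUAGE_CODES = ["en", "zh", "de", "es", "ru"]


def build_language_tokens(release: int) -> list:
    # Each release exposes a fixed number of languages as special tokens,
    # so the user supplies the release instead of the script fetching metadata.
    num_languages = NUM_LANGUAGES_PER_RELEASE[release]
    return [f"<|{code}|>" for code in LANGUAGE_CODES[:num_languages]]
```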
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
* draft
* updates
* full conversion taken from `https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee`
* push
* nits
* updates
* more nits
* Add co author Co-authored-by: Joshua Lochner <admin@xenova.com>
* fixup
* cleanup
* styling
* add proper path
* update
* nits
* don't push the exit
* clean
* update whisper doc
* don't error out if tiktoken is not here
* make sure we are BC with conversion
* nit
* Update docs/source/en/model_doc/whisper.md Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* merge and update
* update markdown

Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
What does this PR do?
Aligned with #27336, this PR adds the conversion of the tokenizer from `tiktoken` to `transformers`.
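Usage would look roughly like the following; the script path and the checkpoint/output flag names are assumptions based on the diff above, not verbatim from the PR:

```bash
pip install tiktoken  # only needed when converting the tokenizer
python src/transformers/models/whisper/convert_openai_to_hf.py \
    --checkpoint_path tiny \
    --pytorch_dump_folder_path ./whisper-tiny-hf \
    --convert_tokenizer
```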