Use tiktoken #1044

Merged
merged 9 commits into from
Mar 13, 2023


jongwook
Collaborator

@jongwook jongwook commented Mar 7, 2023

Replacing HuggingFace Tokenizers with tiktoken allows faster tokenization and removes tensorflow as a transitive dependency.

A downside is that tiktoken does not yet provide aarch64 Linux wheels, whereas tokenizers is built even for ppc64le and s390x, so this may be a blocker for some users.

@petterreinholdtsen
Contributor

I am aware of at least one university running whisper on powerpc, so their upgrade path will be blocked until tiktoken supports more architectures.

zackees pushed a commit to zackees/whisper that referenced this pull request May 5, 2023
* use tiktoken==0.3.0

* formatting

* tuple should be safer

* Update whisper/tokenizer.py

Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>

* use tiktoken 0.3.1

* reflecting suggestions

* cleanup

* bypassing load_tiktoken_bpe to avoid blobfile dep

---------

Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>
ilanit1997 pushed a commit to ilanit1997/whisper that referenced this pull request May 16, 2023
vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
ranks = {
    base64.b64decode(token): int(rank)
    for token, rank in (line.split() for line in open(vocab_path) if line)

Here, the file is opened but its handle is never closed.
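One way to address this is to read the vocabulary inside a `with` block, so the handle is closed once parsing finishes. The following is only a sketch: the helper name `load_ranks` is hypothetical and not part of this PR, and it assumes each non-empty line holds a base64-encoded token followed by an integer rank, as the snippet above suggests. It also skips whitespace-only lines explicitly, since a line read from a file keeps its trailing newline and so passes a bare `if line` check.

```python
import base64


def load_ranks(vocab_path):
    """Parse a .tiktoken-style vocabulary file into a {token_bytes: rank} dict.

    Hypothetical helper for illustration; the name is not taken from the PR.
    """
    ranks = {}
    # The with-block closes the file handle even if parsing raises.
    with open(vocab_path, encoding="utf-8") as f:
        for line in f:
            # Skip blank lines; "\n" is truthy, so `if line` alone would not.
            if not line.strip():
                continue
            token, rank = line.split()
            ranks[base64.b64decode(token)] = int(rank)
    return ranks
```

Called as `load_ranks(vocab_path)`, this would produce the same mapping as the dict comprehension under review for a well-formed file, without leaving an open handle behind.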

abyesilyurt pushed a commit to abyesilyurt/whisper that referenced this pull request Nov 13, 2023

4 participants