Use tiktoken #1044

jongwook · 2023-03-07T10:25:10Z

Replacing HuggingFace Tokenizers with tiktoken gives faster tokenization and removes tensorflow as a transitive dependency.

A downside is that tiktoken does not yet provide aarch64 Linux wheels, whereas tokenizers is built even for ppc64le and s390x, so this may be a blocker for some users.

petterreinholdtsen · 2023-04-18T09:32:46Z

I am aware of at least one university running whisper on powerpc, so their upgrade path will be blocked until tiktoken supports more architectures.

* use tiktoken==0.3.0
* formatting
* tuple should be safer
* Update whisper/tokenizer.py
  Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>
* use tiktoken 0.3.1
* reflecting suggestions
* cleanup
* bypassing load_tiktoken_bpe to avoid blobfile dep

---------

Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>

wsysuper · 2023-07-02T04:27:18Z

whisper/tokenizer.py

+    vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
+    ranks = {
+        base64.b64decode(token): int(rank)
+        for token, rank in (line.split() for line in open(vocab_path) if line)


Here, the file is opened but never explicitly closed, which leaves an unclosed file handle.
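One way to address this is to read the vocabulary inside a `with` block, so the handle is closed as soon as the dict is built. The following is a minimal sketch and not necessarily the fix adopted in the repository. The helper names `load_ranks` and `write_demo_vocab` are hypothetical, and the demo file is a made-up two-entry vocabulary in the same "base64 token, space, rank" layout that the quoted lines parse. It also skips whitespace-only lines with `line.strip()`, a small deviation from the original `if line`.

```python
import base64
import os
import tempfile


def load_ranks(vocab_path):
    """Read a .tiktoken-style vocabulary file into a {token_bytes: rank} dict.

    The `with` block guarantees the file handle is closed once the
    comprehension has consumed every line, even if parsing raises.
    """
    with open(vocab_path) as f:
        return {
            base64.b64decode(token): int(rank)
            for token, rank in (line.split() for line in f if line.strip())
        }


def write_demo_vocab(path):
    """Write a tiny, made-up vocabulary file for demonstration only."""
    with open(path, "w") as f:
        f.write(base64.b64encode(b"hello").decode() + " 0\n")
        f.write(base64.b64encode(b" world").decode() + " 1\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.tiktoken")
        write_demo_vocab(demo_path)
        print(load_ranks(demo_path))
```

In CPython the original generator expression usually releases the file when it is garbage-collected, but that timing is an implementation detail and can trigger a `ResourceWarning`; the context manager makes the close deterministic.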

jongwook added 3 commits March 7, 2023 02:24

use tiktoken==0.3.0

5e35893

formatting

39237a3

tuple should be safer

67e8805

Majdoddin reviewed Mar 9, 2023

whisper/tokenizer.py (outdated)

Majdoddin reviewed Mar 10, 2023

whisper/tokenizer.py (outdated)

jongwook and others added 6 commits March 13, 2023 01:18

Update whisper/tokenizer.py

117ed3e

Co-authored-by: Ruhollah Majdoddin <r.majdodin@gmail.com>

Merge branch 'main' into use-tiktoken

06e59be

use tiktoken 0.3.1

2a14e80

reflecting suggestions

6869cd8

cleanup

72e5e67

bypassing load_tiktoken_bpe to avoid blobfile dep

a0bd014

jongwook merged commit 839639a into main Mar 13, 2023

jongwook deleted the use-tiktoken branch March 14, 2023 19:37

debloper mentioned this pull request Mar 20, 2023

docs(readme): remove instructions for installing huggingface tokenizer #1123

Closed

jumon mentioned this pull request Apr 1, 2023

Tokenizer object has no attribute 'tokenizer' jumon/whisper-punctuator#7

Closed

ivan-gorin mentioned this pull request Apr 6, 2023

conversion script pt to ggml not working ggerganov/whisper.cpp#724

Closed

llimllib mentioned this pull request Apr 10, 2023

the ggml conversion script is broken ggerganov/whisper.cpp#741

Closed

wsysuper reviewed Jul 2, 2023

kyakuno mentioned this pull request Dec 28, 2023

Update whisper decoding algorithm axinc-ai/ailia-models#1355

Closed
