
Fix tokenizer write on read only file system #373

Merged: 1 commit merged into master from readonly_support on Jun 3, 2024

Conversation

EricLBuehler (Owner)

No description provided.

@EricLBuehler linked an issue on Jun 3, 2024 that may be closed by this pull request.

github-actions bot commented Jun 3, 2024

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 24          864          731           25          108
 TOML                   15          402          364            1           37
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1056            0          782          274
 |- BASH                 6          203          190            0           13
 |- Python               6          121          110            0           11
 |- Rust                 3          185          172            9            4
 (Total)                           1565          472          791          302
-------------------------------------------------------------------------------
 Rust                   91        29568        26948          424         2196
 |- Markdown            46          467            0          454           13
 (Total)                          30035        26948          878         2209
===============================================================================
 Total                 159        32387        28458         1232         2697
===============================================================================
  

polarathene (Contributor) commented Jun 3, 2024

Not sure if this is relevant to the original reason for the utility, but a recent change in tokenizers ignores added tokens in the decoder, which may affect the relevance of this utility/workaround. 🤷‍♂️


For context on this utility method: while the logic itself is pretty clear, in that it adds the token content field as a lookup key for the associated ID field of the token object in added_tokens, it may be helpful to have a little reference to the file content. So if anyone lands here via git blame, I documented the following here (collapsed section at the end of the linked comment):

[Collapsed section: screenshots of the relevant tokenizer.json file content; see the linked comment]
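
To make the patch logic concrete, here is a minimal sketch of what such a utility could look like, assuming a tokenizer.json with an `added_tokens` array and a `model.vocab` object; the function name and the exact JSON handling are illustrative assumptions, not the actual mistral.rs code:

```rust
use serde_json::Value;

/// Illustrative sketch (not the actual mistral.rs code): insert each
/// `added_tokens[i].content` into `model.vocab` as a lookup key for its
/// `id`, so the tokenizers crate no longer warns about added tokens that
/// are missing from the vocabulary.
fn patch_added_tokens(tokenizer_json: &str) -> serde_json::Result<String> {
    let mut root: Value = serde_json::from_str(tokenizer_json)?;

    // Collect (content, id) pairs from `added_tokens` first, so we do not
    // need two simultaneous borrows into `root`.
    let pairs: Vec<(String, Value)> = root["added_tokens"]
        .as_array()
        .into_iter()
        .flatten()
        .filter_map(|tok| Some((tok["content"].as_str()?.to_owned(), tok["id"].clone())))
        .collect();

    if let Some(vocab) = root["model"]["vocab"].as_object_mut() {
        for (content, id) in pairs {
            // Only add missing keys; never overwrite an existing vocab entry.
            vocab.entry(content).or_insert(id);
        }
    }

    serde_json::to_string(&root)
}
```

The point is only that the patched JSON gains each added token's content as an ordinary vocab lookup for its ID, which is what silences the warnings below.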

The utility itself exists to avoid these log lines (related discussion), which I only seemed to trigger with a llama3 tokenizer.json:

[Screenshots of the warning log lines]

This is similar to what is found in added_tokens.json (that file is at least present in NousResearch/Hermes-2-Pro-Mistral-7B, which lacks a tokenizer.json and has a tokenizer.model instead).
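
Since the PR's stated purpose is fixing the tokenizer write on a read-only file system, a plausible shape for the fix, sketched here under assumptions (the file name and fallback location are made up, and this is not the actual diff), is to stop insisting on writing next to the original file and fall back to a writable location:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Illustrative sketch (not the actual diff): try to persist the patched
/// tokenizer JSON next to the original file; if that write fails, e.g.
/// because the model directory sits on a read-only volume, fall back to
/// the OS temp directory instead of erroring out.
fn persist_patched(original: &Path, patched_json: &str) -> std::io::Result<PathBuf> {
    // Hypothetical file name, chosen for illustration only.
    let preferred = original.with_file_name("tokenizer_patched.json");
    match fs::write(&preferred, patched_json) {
        Ok(()) => Ok(preferred),
        Err(_) => {
            // Read-only file system (or any other write failure):
            // the OS temp dir should still be writable.
            let fallback = std::env::temp_dir().join("tokenizer_patched.json");
            fs::write(&fallback, patched_json)?;
            Ok(fallback)
        }
    }
}
```

Another option that avoids any write at all is to construct the tokenizer directly from the patched bytes in memory, e.g. via the tokenizers crate's `Tokenizer::from_bytes`.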

@EricLBuehler merged commit b6fffdd into master on Jun 3, 2024 (11 checks passed).
@EricLBuehler deleted the readonly_support branch on June 3, 2024 at 09:50.
@ArthurZucker

The added_tokens.json is an old format btw

@EricLBuehler (Owner, Author)

> The added_tokens.json is an old format btw

We do not use added_tokens.json as a backend, but thank you for letting us know!

Successfully merging this pull request may close this issue: Fails on a read-only volume