
Fix tokenizer write on read only file system #373

Merged: 1 commit merged into master from readonly_support on Jun 3, 2024

Conversation

EricLBuehler (Owner)

No description provided.

@EricLBuehler linked an issue on Jun 3, 2024 that may be closed by this pull request.

github-actions bot commented Jun 3, 2024

Code Metrics Report
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 Dockerfile              1           34           25            0            9
 Happy                   1          442          369            0           73
 JSON                    9           21           21            0            0
 Python                 24          864          731           25          108
 TOML                   15          402          364            1           37
-------------------------------------------------------------------------------
 Jupyter Notebooks       1            0            0            0            0
 |- Markdown             1           60           30           22            8
 |- Python               1           96           87            1            8
 (Total)                            156          117           23           16
-------------------------------------------------------------------------------
 Markdown               16         1056            0          782          274
 |- BASH                 6          203          190            0           13
 |- Python               6          121          110            0           11
 |- Rust                 3          185          172            9            4
 (Total)                           1565          472          791          302
-------------------------------------------------------------------------------
 Rust                   91        29568        26948          424         2196
 |- Markdown            46          467            0          454           13
 (Total)                          30035        26948          878         2209
===============================================================================
 Total                 159        32387        28458         1232         2697
===============================================================================
  

polarathene (Contributor) commented Jun 3, 2024

Not sure if this is relevant to the original reason for the utility, but a recent change in tokenizers ignores added tokens in the decoder, which may affect the relevance of this utility/workaround. 🤷‍♂️


For context on this utility method: while the logic itself is pretty clear, in that it adds the token content field as a lookup key for the associated ID field of the token object in added_tokens, it may be helpful to have a little reference to the file content. So if anyone lands here via git blame, I documented the following here (collapsed section at the end of the linked comment):

[Collapsed section: screenshots of the relevant tokenizer.json file content; see the linked comment]
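
To make the patch logic concrete, here is a minimal sketch of what such a utility could look like, assuming a tokenizer.json with an `added_tokens` array and a `model.vocab` object; the function name and the exact JSON handling are illustrative assumptions, not the actual mistral.rs code:

```rust
use serde_json::Value;

/// Illustrative sketch (not the actual mistral.rs code): insert each
/// `added_tokens[i].content` into `model.vocab` as a lookup key for its
/// `id`, so the tokenizers crate no longer warns about added tokens that
/// are missing from the vocabulary.
fn patch_added_tokens(tokenizer_json: &str) -> serde_json::Result<String> {
    let mut root: Value = serde_json::from_str(tokenizer_json)?;

    // Collect (content, id) pairs from `added_tokens` first, so we do not
    // need two simultaneous borrows into `root`.
    let pairs: Vec<(String, Value)> = root["added_tokens"]
        .as_array()
        .into_iter()
        .flatten()
        .filter_map(|tok| Some((tok["content"].as_str()?.to_owned(), tok["id"].clone())))
        .collect();

    if let Some(vocab) = root["model"]["vocab"].as_object_mut() {
        for (content, id) in pairs {
            // Only add missing keys; never overwrite an existing vocab entry.
            vocab.entry(content).or_insert(id);
        }
    }

    serde_json::to_string(&root)
}
```

The point is only that the patched JSON gains each added token's content as an ordinary vocab lookup for its ID, which is what silences the warnings below.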

The utility itself exists to avoid these log lines (related discussion), which I only seemed to trigger with a llama3 tokenizer.json:

[Screenshots of the warning log lines]

This is similar to what is found in added_tokens.json (that file is at least present in NousResearch/Hermes-2-Pro-Mistral-7B, which lacks a tokenizer.json and has a tokenizer.model instead).
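
Since the PR's stated purpose is fixing the tokenizer write on a read-only file system, a plausible shape for the fix, sketched here under assumptions (the file name and fallback location are made up, and this is not the actual diff), is to stop insisting on writing next to the original file and fall back to a writable location:

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Illustrative sketch (not the actual diff): try to persist the patched
/// tokenizer JSON next to the original file; if that write fails, e.g.
/// because the model directory sits on a read-only volume, fall back to
/// the OS temp directory instead of erroring out.
fn persist_patched(original: &Path, patched_json: &str) -> std::io::Result<PathBuf> {
    // Hypothetical file name, chosen for illustration only.
    let preferred = original.with_file_name("tokenizer_patched.json");
    match fs::write(&preferred, patched_json) {
        Ok(()) => Ok(preferred),
        Err(_) => {
            // Read-only file system (or any other write failure):
            // the OS temp dir should still be writable.
            let fallback = std::env::temp_dir().join("tokenizer_patched.json");
            fs::write(&fallback, patched_json)?;
            Ok(fallback)
        }
    }
}
```

Another option that avoids any write at all is to construct the tokenizer directly from the patched bytes in memory, e.g. via the tokenizers crate's `Tokenizer::from_bytes`.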

@EricLBuehler merged commit b6fffdd into master on Jun 3, 2024 (11 checks passed).
@EricLBuehler deleted the readonly_support branch on June 3, 2024 at 09:50.
@ArthurZucker

The added_tokens.json is an old format btw

@EricLBuehler (Owner, Author)

> The added_tokens.json is an old format btw

We do not use added_tokens.json as a backend, but thank you for letting us know!

Successfully merging this pull request may close this issue: Fails on a read-only volume