-
Notifications
You must be signed in to change notification settings - Fork 382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tokenizer ignoring multiple spaces #40
Comments
Interestingly all old model checkpoints also has the same issue if one uses
returns
returns
returns I thought first maybe my contributions, which was to enable I already opened 3 PRs each to the 3B, 7B and 13B models which allows https://huggingface.co/openlm-research/open_llama_13b_600bt/discussions/1 But my main Q is still whether spaces are supposed to be independently tokenized? Ie is 2 spaces just 2 tokens, and 3 spaces = 3 individual space tokens, and not like the original LLAMA where 2 spaces = token id X, 3 spaces = token id Y etc. PS for the meantime, you can use my tokenizers which already implements
|
Thanks @danielhanchen . I was just trying your tokenizers, thanks. They do "solve" the spaces issue. As for the space merging, I think it depends on whether the vocab has a token for multiple spaces or not (and how many). Testing the original llama tokenizer, we can see they do have them up until 16 (!) spaces.
The OpenLLama tokenizer only has the single space though:
I think it is smart to have multiple spaces tokenized as a single token. When it comes to code data for example, it represents an enormous saving of tokens. Just think of all the tokens spent to encode simple indentations... If OpenLLama was indeed trained like this, that's very unfortunate. |
Coolies on trying out my temporary tokenizers! :) Interesting find on llama's support up to 16 spaces! I think Openllama did do the individual digit splitting correctly, just maybe not the spaces. Quote from https://arxiv.org/pdf/2302.13971.pdf:
The original llama paper doesn't really mention on spaces, so presumably it's just treated like other tokens. |
When fine-tuning the code data downstream with https://github.com/young-geng/EasyLM/tree/main, there will be significant issues. Spaces are usually used for indentation. the result is that the indentations disappear. the code without indentation such as
|
@joytianya I coincidentally opened 3 PRs to fix the 3B, 7B and 13B tokenizers :) If you're in a rush, you can temporary use my tokenizers which were the ones I pushed to the Openllama team's repo:
|
This is indeed a mistake on our side, as we have misconfigured the tokenizer to remove repeated spaces. I've updated that configuration and now the tokenizer should preserve all spaces. Please try it out. |
@danielhanchen What are the differences between the 3B, 7B, and 13B tokenizers? I ask because I've been working for a few days to create a client-side JavaScript tokenizer for LLaMA, and I used the 13B tokenizer as a reference. I assumed that the tokenizer is the same for these different LLaMA versions, but maybe it's not? |
When I compare the three tokenizers, they seem to be the same:
|
@belladoreai yep as @codesoap showed, it seems like the OpenLLAMA team most likely trained 1 tokenizer on the entire 1T token RJ dataset, then used all 3 for each of the 3 models. But anyways it seems like @young-geng has successfully fixed the tokenizers - I just checked all 3 (3B, 7B, 13B): For eg:
successfully returns: I also updated the |
We've just release a 7B v2 model with a better tokenizer and pretrained with a lot of code data. Check that out! |
@young-geng Congrats on the 7B v2 release! I can see multiple spaces are now tokenized properly! Good work! |
It appears the tokenizer is ignoring more than one consecutive space.
This behaviour is not observed with the original LLama tokenizer. See examples below.
Is this some issue with the configuration of the HF tokenizer? Or has the model really been trained like this?
This seems like a very big deal for everything concerning code understanding/generation.
OpenLLaMA
Original LLaMA
The text was updated successfully, but these errors were encountered: