
Adding models to the list in convert-hf-to-gguf-update.py #8357

Open
wants to merge 2 commits into base: master

Conversation

@perpendicularai (Author)

Hi,

This pull request adds the ability to specify the model name and/or tokenizer type and URL. The updated code contains an empty model list in convert-hf-to-gguf-update.py, and so avoids performing any unexpected downloads.
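For context, entries in the models list of convert-hf-to-gguf-update.py look roughly like this (a sketch based on the upstream script; the enum values and the example repo URL below are illustrative, not part of this PR):

from enum import IntEnum, auto

class TOKENIZER_TYPE(IntEnum):  # mirrors the enum in the upstream script
    SPM = auto()
    BPE = auto()
    WPM = auto()

# Each entry names a pre-tokenizer and the Hugging Face repo to fetch it
# from. This PR ships the list empty, so nothing is downloaded unless the
# user adds an entry; the repo below is only an example.
models = [
    {"name": "llama-bpe", "tokt": TOKENIZER_TYPE.BPE,
     "repo": "https://huggingface.co/meta-llama/Meta-Llama-3-8B"},
]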

07.07.2024_16.57.55_REC.mp4

@github-actions bot added the python (python script changes) label on Jul 8, 2024
@compilade (Collaborator) left a comment

The updated code contains an empty model list in convert-hf-to-gguf-update.py, and so avoids performing any unexpected downloads.

This seems to remove a lot of recently merged work (e.g. Gemma2 #8156, OpenELM #7359, T5 #8141, JAIS #8118, the snake_case renaming #8305, and others; a sign of a rebase gone wrong), and it also looks like it would make conversion unconditionally fail for models which use BPE tokenizers.

The goal of convert_hf_to_gguf_update.py is to automate updating the hash list in convert_hf_to_gguf.py. If the string used to compute the token hashes changes, having the script fetch all known tokenizers (note that it only downloads what is not already there) and update the list is useful.
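For reference, the script computes those hashes roughly as follows (a simplified sketch: the real test string is much longer and full of pre-tokenizer edge cases, and the tokenizer path is just an example):

from hashlib import sha256
from transformers import AutoTokenizer

# Fingerprint a tokenizer by encoding a fixed test string and hashing the
# resulting token IDs; tokenizers that pre-tokenize differently produce
# different IDs and therefore different hashes.
chktxt = "Hello World!\n \n\n \t 3.14"  # abbreviated stand-in for the real test string
tokenizer = AutoTokenizer.from_pretrained("models/tokenizers/llama-bpe")  # example path
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # this value gets written into the hash list in convert_hf_to_gguf.py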

If each (existing!) pre-tokenizer has to be added manually by anyone wanting to convert a model, I think that defeats its purpose. Especially since any new pre-tokenizer name also needs to be handled in src/llama.cpp, so a user would have to guess a valid name; otherwise the produced model will make llama.cpp crash on load.
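Concretely, conversion maps that hash back to a pre-tokenizer name with a lookup along these lines (a sketch of the shape of get_vocab_base_pre in convert_hf_to_gguf.py; the hash value below is a placeholder, not a real one):

def get_vocab_base_pre(tokenizer) -> str:
    # Recompute the hash of the test string (chktxt and sha256 as in the
    # sketch above) and map it to a name that src/llama.cpp knows about.
    chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
    if chkhsh == "0123abcd...":  # placeholder hash
        return "llama-bpe"
    # an unrecognized hash aborts the conversion, so a user cannot simply
    # invent a name here without also teaching src/llama.cpp about it
    raise NotImplementedError(f"BPE pre-tokenizer was not recognized: {chkhsh}")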

Although making it simpler to add new pre-tokenizers does seem useful, the (accidental) removal of recent new features needs to be fixed to make this usable.

@perpendicularai (Author)

I don't normally entertain comments, but for the sake of this exercise I will make an exception.

Let's say you wanted to convert two different models with the same tokenizer name; you'll find that it is not possible. And so I hope you get the point.

@compilade (Collaborator)

I don't normally entertain comments, but for the sake of this exercise I will make an exception.

Sorry if my comment came across as rude; that was not my intention. You obviously want to make this project better (otherwise you would not have opened a PR), so I respect that, and I apologize if my written tone isn't as friendly as it should be.

Let's say you wanted to convert two different models with the same tokenizer name; you'll find that it is not possible. And so I hope you get the point.

You're right that the current way this works on master is not ideal.

There are some cases for which it works well, but some cases for which it doesn't.

For example, if different models have exactly the same tokenizer (e.g. Mamba uses the same tokenizer as GPT-NeoX, and so does OLMo), they will get assigned the same pre-tokenizer name, which is convenient.

However, if they have different tokenizers which result in different hashes, yet they still share the same pre-tokenizer (e.g. Phi-2, GPT-2, and others all use the GPT-2 pre-tokenizer), then they have to use different pre-tokenizer names, even though they could have used the same one.

llama.cpp/src/llama.cpp, lines 5396 to 5403 at 7d0e23d:

tokenizer_pre == "gpt-2"        ||
tokenizer_pre == "phi-2"        ||
tokenizer_pre == "jina-es"      ||
tokenizer_pre == "jina-de"      ||
tokenizer_pre == "jina-v2-es"   ||
tokenizer_pre == "jina-v2-de"   ||
tokenizer_pre == "jina-v2-code") {
    vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_GPT2;

But sometimes the differences picked up by the hash are useful, like for some workarounds with those jina models:

llama.cpp/src/llama.cpp, lines 5712 to 5714 at 7d0e23d:

if (_contains_any(tokenizer_pre, {"jina-v2-de", "jina-v2-es", "jina-v2-code"})) {
    _set_token_attr("<mask>", LLAMA_TOKEN_ATTR_LSTRIP, true);
} else if (_contains_any(model_name, {"phi-3", "phi3"})) {

To be clear, I think making it easier to add new tokenizers to the pre-tokenizer list is useful, because otherwise it can be a bit of a chore. Doing it interactively like you're proposing can be a nice way to do this.
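For illustration, an interactive flow like the one proposed here could look something like the following (entirely hypothetical; the prompts and variable names are made up for this sketch and are not taken from the PR):

# Hypothetical interactive prompt for adding one model entry; reuses the
# models list and TOKENIZER_TYPE enum from the sketch above.
name = input("Pre-tokenizer name (must be handled in src/llama.cpp): ")
tokt = input("Tokenizer type [SPM/BPE/WPM]: ").strip().upper()
repo = input("Hugging Face repo URL: ")
models.append({"name": name, "tokt": TOKENIZER_TYPE[tokt], "repo": repo})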

But the (likely accidental) removal of code in convert_hf_to_gguf.py should be fixed if you want to get this into a mergeable state, because otherwise this removes support for converting Gemma 2 and other popular models.

@mofosyne added the Review Complexity : Low label (trivial changes to code that most beginner devs, or those who want a break, can tackle; e.g. a UI fix) on Jul 13, 2024