Adding models to the list in convert-hf-to-gguf-update.py #8357
Conversation
The updated code contains a list with no models in convert-hf-to-gguf-update.py, and so avoids performing any unexpected downloads.
This seems to remove a lot of stuff that was recently merged (e.g. Gemma2 #8156, OpenELM #7359, T5 #8141, JAIS #8118, the snake_case renaming #8305, and others; a sign of a rebase gone wrong), and it also looks like it would make the conversion unconditionally fail for models which use BPE tokenizers.
The goal of convert_hf_to_gguf_update.py is to automate the updates of the hash list in convert_hf_to_gguf.py. If the string used to make the token hashes changes, having the script fetch all known tokenizers (note that it only downloads what is not already there) and update the list is useful.
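For context, the fingerprinting works roughly like this (a minimal sketch only; the probe string and model ID below are illustrative placeholders, not the exact values used by the script):

```python
# Sketch of the hashing idea behind convert_hf_to_gguf_update.py:
# encode a fixed probe string and fingerprint the resulting token IDs.
from hashlib import sha256
from transformers import AutoTokenizer

CHK_TXT = "\n \n\n \t\t 3333 repeat 🚀 Hello, y'all!"  # placeholder probe text

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokens = tokenizer.encode(CHK_TXT)
chkhsh = sha256(str(tokens).encode()).hexdigest()
print(chkhsh)  # changes whenever the pre-tokenizer splits the text differently
```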
If each (existing!) pre-tokenizer has to be added manually by anyone wanting to convert a model, I think this defeats its purpose. This is especially true since any new pre-tokenizer name also needs to be handled in src/llama.cpp, so a user would have to guess a valid name; otherwise the produced model will make llama.cpp crash on load.
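To illustrate why a guessed name fails: convert_hf_to_gguf.py resolves the hash to a pre-tokenizer name along these lines (a sketch in the style of its get_vocab_base_pre(); the hash below is a placeholder, not a real entry), and whatever name ends up in the GGUF metadata must also be known to src/llama.cpp:

```python
# Sketch of the hash-to-name resolution in convert_hf_to_gguf.py
# (the hash below is a placeholder, not a real entry from the script).
def get_vocab_base_pre(chkhsh: str) -> str:
    res = None
    if chkhsh == "aaaa0000placeholder":
        # ref: https://huggingface.co/openai-community/gpt2
        res = "gpt-2"
    if res is None:
        # an unrecognized hash aborts the conversion instead of writing a
        # guessed name that would make llama.cpp fail to load the model
        raise NotImplementedError("BPE pre-tokenizer was not recognized!")
    return res
```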
Although making it simpler to add new pre-tokenizers does seem useful, the (accidental) removal of recent new features needs to be fixed to make this usable.
I don't normally entertain comments, but for the sake of this exercise I will make an exception. Let's say you wanted to convert two different models with the same tokenizer name; you'll find that it is not possible. And so I hope you get the point.
Sorry if my comment came out as rude, that was not my intention. You obviously want to make this project better (otherwise you would not have opened a PR), so I respect that, and I apologize if my written tone isn't as friendly as it should be.
You're right that the current way this works has its problems. There are some cases for which it works well, but some cases for which it doesn't. For example, if different models have exactly the same tokenizer (e.g. Mamba uses the same tokenizer as GPT-NeoX, and so does OLMo), they will get assigned the same pre-tokenizer name, which is convenient. However, if they have different tokenizers which result in different hashes, yet they still have the same pre-tokenizer (Phi-2, GPT-2, and others all use the GPT-2 pre-tokenizer), then they have to use different pre-tokenizer names, even though they could have used the same name. (See llama.cpp, lines 5396 to 5403 at 7d0e23d.)
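The "same tokenizer, same hash" case can be checked directly (a sketch; the model IDs are the Hugging Face repos I'd expect for GPT-NeoX and Mamba, and CHK_TXT stands in for the script's actual probe string):

```python
# Sketch: two models sharing a tokenizer should produce the same checksum,
# so they can share a pre-tokenizer name. Model IDs are assumptions.
from hashlib import sha256
from transformers import AutoTokenizer

def pre_tokenizer_hash(model_id: str, text: str) -> str:
    tok = AutoTokenizer.from_pretrained(model_id)
    return sha256(str(tok.encode(text)).encode()).hexdigest()

CHK_TXT = "placeholder probe text"
print(pre_tokenizer_hash("EleutherAI/gpt-neox-20b", CHK_TXT))
print(pre_tokenizer_hash("state-spaces/mamba-2.8b-hf", CHK_TXT))  # expected to match
```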
But sometimes the differences picked up by the hash are useful, like for some workarounds (see llama.cpp, lines 5712 to 5714 at 7d0e23d).
To be clear, I think making it easier to add new tokenizers to the pre-tokenizer list is useful, because otherwise it can be a bit of a chore. Doing it interactively like you're proposing can be a nice way to do this. But the (likely accidental) removal of code in this PR needs to be fixed first.
Hi,
This is a pull request adding the ability to specify the model name and/or tokenizer type and URL. The updated code contains a list with no models in convert-hf-to-gguf-update.py, and so avoids performing any unexpected downloads.
[Screen recording: 07.07.2024_16.57.55_REC.mp4]
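For readers skimming without watching the recording, the user-facing idea is roughly the following (a sketch only; the flag names and list format are assumptions based on the description above, not necessarily the PR's actual interface):

```python
# Sketch of specifying a model by name, tokenizer type, and URL on the
# command line instead of relying on a hard-coded model list.
# Flag names are assumptions, not the PR's actual interface.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("hf_token", help="Hugging Face access token")
parser.add_argument("--model-name", help="pre-tokenizer name to register")
parser.add_argument("--tokt", default="BPE", help="tokenizer type, e.g. BPE or SPM")
parser.add_argument("--url", help="Hugging Face repo to fetch tokenizer files from")
args = parser.parse_args()

models = []  # starts empty, so nothing is downloaded unless explicitly requested
if args.model_name and args.url:
    models.append({"name": args.model_name, "tokt": args.tokt, "repo": args.url})
```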