
Adding models to the list in convert-hf-to-gguf-update.py #8357

Open
wants to merge 2 commits into base: master

Conversation

@perpendicularai (Author)

Hi,

This pull request adds the ability to specify the model name and/or tokenizer type and URL. The updated code contains an empty model list in convert-hf-to-gguf-update.py, and so avoids performing any unexpected downloads.
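For context, entries in the models list of convert-hf-to-gguf-update.py look roughly like this (a sketch based on the upstream script; the enum values and the example repo URL below are illustrative, not part of this PR):

from enum import IntEnum, auto

class TOKENIZER_TYPE(IntEnum):  # mirrors the enum in the upstream script
    SPM = auto()
    BPE = auto()
    WPM = auto()

# Each entry names a pre-tokenizer and the Hugging Face repo to fetch it
# from. This PR ships the list empty, so nothing is downloaded unless the
# user adds an entry; the repo below is only an example.
models = [
    {"name": "llama-bpe", "tokt": TOKENIZER_TYPE.BPE,
     "repo": "https://huggingface.co/meta-llama/Meta-Llama-3-8B"},
]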

07.07.2024_16.57.55_REC.mp4

@github-actions bot added the python (python script changes) label on Jul 8, 2024
@compilade (Collaborator) left a comment

The updated code contains an empty model list in convert-hf-to-gguf-update.py, and so avoids performing any unexpected downloads.

This seems to remove a lot of recently merged work (e.g. Gemma2 #8156, OpenELM #7359, T5 #8141, JAIS #8118, the snake_case renaming #8305, and others; a sign of a rebase gone wrong), and it also looks like it would make conversion unconditionally fail for models which use BPE tokenizers.

The goal of convert_hf_to_gguf_update.py is to automate updating the hash list in convert_hf_to_gguf.py. If the string used to compute the token hashes changes, having the script fetch all known tokenizers (note that it only downloads what is not already there) and update the list is useful.
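For reference, the script computes those hashes roughly as follows (a simplified sketch: the real test string is much longer and full of pre-tokenizer edge cases, and the tokenizer path is just an example):

from hashlib import sha256
from transformers import AutoTokenizer

# Fingerprint a tokenizer by encoding a fixed test string and hashing the
# resulting token IDs; tokenizers that pre-tokenize differently produce
# different IDs and therefore different hashes.
chktxt = "Hello World!\n \n\n \t 3.14"  # abbreviated stand-in for the real test string
tokenizer = AutoTokenizer.from_pretrained("models/tokenizers/llama-bpe")  # example path
chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()
print(chkhsh)  # this value gets written into the hash list in convert_hf_to_gguf.py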

If each (existing!) pre-tokenizer has to be added manually by anyone wanting to convert a model, I think that defeats its purpose. Especially since any new pre-tokenizer name also needs to be handled in src/llama.cpp, so a user would have to guess a valid name; otherwise the produced model will make llama.cpp crash on load.
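Concretely, conversion maps that hash back to a pre-tokenizer name with a lookup along these lines (a sketch of the shape of get_vocab_base_pre in convert_hf_to_gguf.py; the hash value below is a placeholder, not a real one):

def get_vocab_base_pre(tokenizer) -> str:
    # Recompute the hash of the test string (chktxt and sha256 as in the
    # sketch above) and map it to a name that src/llama.cpp knows about.
    chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
    if chkhsh == "0123abcd...":  # placeholder hash
        return "llama-bpe"
    # an unrecognized hash aborts the conversion, so a user cannot simply
    # invent a name here without also teaching src/llama.cpp about it
    raise NotImplementedError(f"BPE pre-tokenizer was not recognized: {chkhsh}")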

Although making it simpler to add new pre-tokenizers does seem useful, the (accidental) removal of recent new features needs to be fixed to make this usable.

@perpendicularai (Author)

I don't normally entertain comments, but for the sake of this exercise I will make an exception.

Let's say you wanted to convert two different models with the same tokenizer name; you'll find that it is not possible. And so I hope you get the point.

@compilade (Collaborator)

I don't normally entertain comments, but for the sake of this exercise I will make an exception.

Sorry if my comment came across as rude; that was not my intention. You obviously want to make this project better (otherwise you would not have opened a PR), so I respect that, and I apologize if my written tone isn't as friendly as it should be.

Let's say you wanted to convert two different models with the same tokenizer name; you'll find that it is not possible. And so I hope you get the point.

You're right that the current way this works on master is not ideal.

There are some cases for which it works well, but some cases for which it doesn't.

For example, if different models have exactly the same tokenizer (e.g. Mamba uses the same tokenizer as GPT-NeoX, and so does OLMo), they will get assigned the same pre-tokenizer name, which is convenient.

However, if they have different tokenizers which result in different hashes, yet they still share the same pre-tokenizer (e.g. Phi-2, GPT-2, and others all use the GPT-2 pre-tokenizer), then they have to use different pre-tokenizer names, even though they could have used the same one.

llama.cpp/src/llama.cpp, lines 5396 to 5403 at 7d0e23d:

tokenizer_pre == "gpt-2"        ||
tokenizer_pre == "phi-2"        ||
tokenizer_pre == "jina-es"      ||
tokenizer_pre == "jina-de"      ||
tokenizer_pre == "jina-v2-es"   ||
tokenizer_pre == "jina-v2-de"   ||
tokenizer_pre == "jina-v2-code") {
    vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_GPT2;

But sometimes the differences picked up by the hash are useful, like for some workarounds with those jina models:

llama.cpp/src/llama.cpp, lines 5712 to 5714 at 7d0e23d:

if (_contains_any(tokenizer_pre, {"jina-v2-de", "jina-v2-es", "jina-v2-code"})) {
    _set_token_attr("<mask>", LLAMA_TOKEN_ATTR_LSTRIP, true);
} else if (_contains_any(model_name, {"phi-3", "phi3"})) {

To be clear, I think making it easier to add new tokenizers to the pre-tokenizer list is useful, because otherwise it can be a bit of a chore. Doing it interactively like you're proposing can be a nice way to do this.
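For illustration, an interactive flow like the one proposed here could look something like the following (entirely hypothetical; the prompts and variable names are made up for this sketch and are not taken from the PR):

# Hypothetical interactive prompt for adding one model entry; reuses the
# models list and TOKENIZER_TYPE enum from the sketch above.
name = input("Pre-tokenizer name (must be handled in src/llama.cpp): ")
tokt = input("Tokenizer type [SPM/BPE/WPM]: ").strip().upper()
repo = input("Hugging Face repo URL: ")
models.append({"name": name, "tokt": TOKENIZER_TYPE[tokt], "repo": repo})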

But the (likely accidental) removal of code in convert_hf_to_gguf.py should be fixed if you want to get this into a mergeable state, because otherwise this removes support for converting Gemma 2 and other popular models.

@mofosyne added the Review Complexity : Low label (trivial changes to code that most beginner devs, or those who want a break, can tackle; e.g. a UI fix) on Jul 13, 2024