Support Mistral-Nemo-Instruct-2407 128K #8577

Open
mirek190 opened this issue Jul 18, 2024 · 54 comments
Labels
enhancement New feature or request

Comments

@mirek190

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Are there any plans to support Mistral-Nemo-Instruct-2407 (128K context)?

https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407

Motivation

enhancement

Possible Implementation

No response

@mirek190 mirek190 added the enhancement New feature or request label Jul 18, 2024
@0wwafa

0wwafa commented Jul 18, 2024

Yes, please. This one is going to be good, and finetunes will soon start to pop up...

@0wwafa

0wwafa commented Jul 18, 2024

And this: https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct

@delphijb

delphijb commented Jul 18, 2024

I second the request. This model is likely to become the reference for the 7-12B segment, and fine-tuned versions will indeed appear rapidly. Thanks in advance.

@stduhpf
Contributor

stduhpf commented Jul 18, 2024

They claim it can be a drop-in replacement for Mistral 7B, so surely it shouldn't be too much work to make it work with ggml, since Mistral 7B works.

@EliEron

EliEron commented Jul 18, 2024

They claim it can be a drop-in replacement for Mistral 7B, so surely it shouldn't be too much work to make it work with ggml, since Mistral 7B works.

The issue is that it uses a custom tokenizer named Tekken. That's not an issue for any program that uses Transformers, since its tokenizer system supports the custom tokenizer, which is why they call it a drop-in replacement.

For llama.cpp, however, the custom tokenizer has to be implemented manually, and implementing new tokenizers correctly is usually not easy. The Gemma-2 and Llama-3 tokenizers, for instance, took quite a while to implement properly, and it took multiple attempts as bugs were found over time.
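
As an aside, a minimal sketch of my own (not from this thread) of what "works out of the box with Transformers" means in practice: the stock AutoTokenizer picks up the Tekken tokenizer from the repo's tokenizer files, so its output can serve as a reference when checking any llama.cpp port. This assumes you can access the model files on Hugging Face.

# Load Nemo's Tekken tokenizer through Transformers and encode a test string.
# The resulting IDs can be diffed against a llama.cpp implementation to spot
# tokenizer bugs.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
text = "What is the capital of South Korea in Hangul?"
ids = tok.encode(text, add_special_tokens=False)

print(ids)                              # reference token IDs
print(tok.convert_ids_to_tokens(ids))   # and the corresponding token strings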

m18coppola added a commit to m18coppola/llama.cpp that referenced this issue Jul 18, 2024
@iamlemec
Collaborator

I actually think the tokenizer might not be too different from others. It's listed as GPT2Tokenizer in the tokenizer_config.json and it has a pre-tokenizer of the usual form. I was able to add it in the standard fashion with pre-tokenizer and the update script.

The other issue is that the tensor shapes relating to attention are not the sizes expected by the current implementation of Mistral (see my other comment here #8576 (comment)). I was able to brute-force hack it into at least running, and I'm getting sensible output, which makes me think the tokenizer is doing okay. For example:

PROMPT: What is the capital of South Korea in Hangul?
RESPONSE: The capital of South Korea in Hangul is 서울 (Seoul).

@netrunnereve
Contributor

If this model works well we should also try to add FP8 support to llama.cpp and make full use of the QAT. Without native FP8 support that will take more compute than Q8_0, but it'll probably end up being memory-bound anyway.

m18coppola added a commit to m18coppola/llama.cpp that referenced this issue Jul 19, 2024
Removed uneeded `vocab.tokenizer_clean_spaces` assignment
@iamlemec
Collaborator

iamlemec commented Jul 19, 2024

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

@mirek190
Author

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

Can't wait to test ;)

@foldl
Contributor

foldl commented Jul 20, 2024

Also, for those who are interested, chatllm.cpp supports this.

@stduhpf
Contributor

stduhpf commented Jul 20, 2024

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

Seems to work perfectly so far. Nice job.

ggerganov added a commit that referenced this issue Jul 20, 2024
* llama : Added support for Tekken pre-tokenizer (#8577)

Removed uneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@muhammadyusuf-kurbonov

#8579 is merged

@legraphista
Contributor

I just quantized Mistral-Nemo-Instruct, and when trying to run it I get the following error:

llm_load_tensors: ggml ctx size =    0.17 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  5120,  5120, got  5120,  4096,     1,     1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './Mistral-Nemo-Instruct-2407.Q8_0.gguf'
main: error: unable to load model

Looks like there's a shape mismatch.

According to the config file, the hidden size should be 5120: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/blob/main/config.json#L10
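
As a sanity check on where the mismatch comes from, here is my own back-of-the-envelope sketch, assuming the values in the linked config.json (hidden_size 5120, num_attention_heads 32, head_dim 128):

# Rough check of the attn_q shape mismatch above, using values assumed from
# the linked config.json.
hidden_size = 5120
num_attention_heads = 32
head_dim = 128  # set explicitly in the config; NOT hidden_size // num_attention_heads (= 160)

# A loader that derives the head dim as hidden_size // n_heads expects a
# 5120 x 5120 query projection...
expected_cols = (hidden_size // num_attention_heads) * num_attention_heads   # 5120
# ...but the checkpoint actually stores num_heads * head_dim output features.
actual_cols = num_attention_heads * head_dim                                  # 4096

print(f"expected {hidden_size} x {expected_cols}, got {hidden_size} x {actual_cols}")

Which lines up with what @iamlemec noted above about the attention tensor shapes: the loader needs to take the head dimension from the model metadata instead of deriving it from the hidden size.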

@mirek190
Author

Where can I find a proper GGUF?

@maziyarpanahi

Hi @legraphista

I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py, but I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

@stduhpf
Contributor

stduhpf commented Jul 20, 2024

Hi @legraphista

I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py, but I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.

@iamlemec
Collaborator

@mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.

@maziyarpanahi

Hi @legraphista
I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py, but I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.

I am actually trying to quantize it at the moment; since I saw it had been done successfully here, I was wondering.

@mirek190
Author

@mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.34 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  5120,  5120, got  5120,  4096,     1,     1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/new3/mistral-nemo-instruct-q5_k.gguf'
main: error: unable to load model

Your version also doesn't work.

@cwillu

cwillu commented Jul 20, 2024

@mirek190 try running a make clean first; the project makefiles don't appear to be 100% reliable.

@EricGrange

Just tried with the latest quants at QuantFactory/Mistral-Nemo-Instruct-2407-GGUF and build 3437, and I'm getting the following error (Xeon E-2176G with 64GB RAM, under Debian). Apparently it's trying to allocate a very large CPU buffer:

ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 167772160032
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model 'Mistral-Nemo-Instruct-2407.Q4_K_S.gguf'
 ERR [              load_model] unable to load model | tid="140094037281280" timestamp=1721650251 model="Mistral-Nemo-Instruct-2407.Q4_K_S.gguf"

@sbelenki

Just tried with the latest quants at QuantFactory/Mistral-Nemo-Instruct-2407-GGUF and build 3437, and I'm getting the following error (Xeon E-2176G with 64GB RAM, under Debian). Apparently it's trying to allocate a very large CPU buffer.

I encountered the same problem while testing the new build; the advice that helped me was to use the -c parameter.

@EricGrange

Thanks, confirmed: passing the context size explicitly does the trick.
It also seems to work correctly when using "-c 131072" (128k, if I'm not mistaken).
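
For anyone wondering where that huge buffer comes from, here is a rough estimate of my own, assuming Nemo's published dimensions (40 layers, 8 KV heads, head dim 128), an fp16 KV cache, and a default context taken from the model's advertised 1,024,000-token maximum when -c is omitted:

# Rough estimate of the KV-cache allocation that failed above. All values are
# assumptions from the published Nemo config, not read from the GGUF.
n_layers, n_kv_heads, head_dim = 40, 8, 128
bytes_per_elem = 2            # fp16 K and V
n_ctx_default = 1_024_000     # context picked up from the model metadata when -c is omitted

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # K + V bytes per token
print(f"default:   {per_token * n_ctx_default / 1e9:.0f} GB")       # ~168 GB, matching the failed allocation
print(f"-c 131072: {per_token * 131072 / 1e9:.1f} GB")              # ~21.5 GB

So passing -c with a context you can actually afford is the expected fix here.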

@jthack

jthack commented Jul 22, 2024

I did an ollama update and I'm still getting the same error.

@MoonRide303

MoonRide303 commented Jul 22, 2024

Initial tests with a small context (-c 8192) look good to me - no issues observed so far. Both conversion and inference with llama.cpp b3438 (using a Q6_K quant).

It doesn't seem to work well with bigger contexts, though. I tried the commands-extraction test (same as for Phi 3, #8262 (comment), using ~20k tokens as a single input), and the answer was nowhere near correct. The online version from NV (https://build.nvidia.com/nv-mistralai/mistral-nemo-12b-instruct) failed with an error on the same test, too.

@mirek190
Author

Tested it and I'm very disappointed...

Tested the 8-bit version:

llama-cli.exe --model models/new3/Mistral-Nemo-Instruct-2407-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 100000 --interactive -ngl 99 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.35 --in-prefix " </s>[INST] " --in-suffix " [/INST] " -p "<s>[INST] You are a helpful assistant.\n\n<" -fa 

prompt "Can you put that text into paragraphs?"

Screenshot 2024-07-22 215646

It stopped literally after a few sentences; I asked it to continue and it stopped again...

Screenshot 2024-07-22 215734

By contrast, Gemma 2 27B had no problem at all finishing the text and did a good job.

Screenshot 2024-07-21 222932
Screenshot 2024-07-21 222957

@0wwafa

0wwafa commented Jul 22, 2024

Tested. Seems to work.

@mirek190
Author

mirek190 commented Jul 22, 2024

How long was the text?
I tested on an 8k-token text and it failed badly every time, with the model's context set to 100k.

P.S.

After a few hours of testing I can say the model is better than Llama 3 8B but worse than Gemma 2 9B.
The advantage of Mistral-Nemo (not counting the bad hallucinations) is the huge context.

@mirek190
Author

Still testing... I think -fa (flash attention) causes problems - degradation and performance loss. Since this model has a huge context, I used -fa.

@foldl
Contributor

foldl commented Jul 23, 2024

I think this model uses SWA. The memory requirement is much less if SWA is used.

I have tested it with 4.7k tokens (using SWA), and it looks ok.
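
To illustrate the memory point, a rough sketch only: the window size below is hypothetical, and the layer/head dimensions are the same Nemo-like assumptions used earlier in the thread.

# Rough illustration of why sliding-window attention (SWA) caps KV-cache memory.
# The 4096-token window is hypothetical; the other dimensions are assumptions.
n_layers, n_kv_heads, head_dim, bytes_per_elem = 40, 8, 128, 2
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # K + V bytes per cached token

full_ctx = 128 * 1024   # full 128k context
window = 4096           # hypothetical sliding window

print(f"full cache: {per_token * full_ctx / 1e9:.1f} GB")               # ~21.5 GB
print(f"SWA cache:  {per_token * min(window, full_ctx) / 1e9:.2f} GB")  # ~0.67 GB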

@mirek190
Author

I think this model uses SWA. The memory requirement is much less if SWA is used.

I have tested it with 4.7k tokens (using SWA), and it looks ok.

4.7k tokens with a 128k model... yes, very productive...

@foldl
Contributor

foldl commented Jul 24, 2024

@mirek190 the point is that with SWA, 128k context length won't blow up your memory.

@ehartford

When I try to run ./convert_hf_to_gguf.py /workspace/axolotl/dolphin-2.9.3-mistral-nemo-hf
I get: NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

@ehartford

I already saw this line in convert_hf_to_gguf_update.py

{"name": "tekken",         "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistralai/Mistral-Nemo-Base-2407", },

I already executed convert_hf_to_gguf_update.py

but it still doesn't work.

As a guess, I tried this:

python ./convert_hf_to_gguf.py /workspace/axolotl/dolphin-2.9.3-mistral-nemo-hf --model-name tekken

But that didn't work either.

@iamlemec
Collaborator

@ehartford They made some changes to tokenizer_config.json a day or two after release. See the commit here: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/commit/dac9c9e98f83322b32e32b48c118f079930772d6. Updating yours similarly should make the checksum match up.
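
For context on why a tokenizer_config.json change matters here, this is roughly how the convert script recognizes pre-tokenizers, as far as I understand get_vocab_base_pre(). It's a simplified sketch; the probe string below is a placeholder, not the script's real one.

# Simplified sketch of llama.cpp's pre-tokenizer detection: encode a fixed probe
# string and hash the resulting token IDs. The probe text here is a placeholder;
# the real script uses its own fixed multilingual string.
from hashlib import sha256
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/workspace/axolotl/dolphin-2.9.3-mistral-nemo-hf")

probe_text = "Hello World! 123 ..."   # placeholder
chkhsh = sha256(str(tokenizer.encode(probe_text)).encode()).hexdigest()
print(chkhsh)

# The hash is compared against a table of known values ("tekken", "llama-bpe", ...).
# If the local tokenizer files differ from the reference repo, the hash won't match
# and conversion fails with "BPE pre-tokenizer was not recognized".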

@ehartford

Thanks, I'll do that.

@rkinas

rkinas commented Jul 25, 2024

Hi @ehartford, have you managed to find a solution for converting a fine-tuned Mistral-Nemo to a 16-bit GGUF? I encountered the same problem you described.

@ehartford

Yes, I got past it by adding my model to convert_hf_to_gguf_update.py, running it, and then passing --model-name.
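
For anyone following along, the kind of entry being described looks roughly like the tekken line quoted earlier. The name and repo below are illustrative, not the exact values used.

# Hypothetical sketch of the workaround: add an entry for your fine-tune to the
# `models` list in convert_hf_to_gguf_update.py, re-run that script, then run
# convert_hf_to_gguf.py with --model-name. Shown as plain data for clarity.
models_entry = {
    "name": "dolphin-nemo",   # illustrative identifier
    "tokt": "BPE",            # TOKENIZER_TYPE.BPE in the real script
    "repo": "https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b",  # illustrative repo
}
print(models_entry)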

arthw pushed a commit to arthw/llama.cpp that referenced this issue Jul 27, 2024
* llama : Added support for Tekken pre-tokenizer (ggerganov#8577)

Removed uneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@AllyourBaseBelongToUs

Yes, I got past it by adding my model to convert_hf_to_gguf_update.py, running it, and then passing --model-name.

Hi Eric, love your Dolphin models.

Are the Hugging Face GGUF models not updated?

This causes the error: pre-tokenizer type 'dolphin12b' not recognized.

The model files also show the dolphin12b tokenizer instead of Tekken :(

Any way we can help you there?

P.S.

Your models work better with your system prompts than with jailbreaks <3

@ehartford

ehartford commented Jul 29, 2024

It works for me, and on Ollama too:
https://ollama.com/CognitiveComputations/dolphin-mistral-nemo

which I created from this quant https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b-gguf

Though I am not a llama.cpp expert.

I have got what I needed from this effort. I am happy to take PRs though.

@AllyourBaseBelongToUs

It works for me, and on Ollama too: https://ollama.com/CognitiveComputations/dolphin-mistral-nemo

It works on LM Studio too when we use a hex editor to change the pre_tokenizer from "dolphin12b" to "Tekken",

though not with llama-cpp-python itself :/

How long would it take to quantize it ourselves?

EDIT: Thank you so much for changing the pre-tokenizer in all your GGUF uploads on HF <3

you're the best!!!

By the way, is there any volunteer work we can do for you?

@ehartford

Yes definitely - I'm totally overwhelmed

@rmusser01

Yes definitely - I'm totally overwhelmed

Would you mind elaborating on what might be helpful for you/how people can help?

@netrunnereve
Contributor

Does anyone know if the FP8 QAT used by Nemo is in E4M3 or E5M2? My guess is E4M3 but I couldn't find info on that anywhere, with Mistral only saying that they used FP8.

@Djip007
Contributor

Djip007 commented Aug 31, 2024

I can't find what Mistral used, or how (it would be nice to know!).
The only thing I can find is:
https://huggingface.co/neuralmagic/Mistral-Nemo-Instruct-2407-FP8/tree/main?show_file_info=model-00001-of-00003.safetensors
which uses E4M3 plus weight_scale and input_scale.

@netrunnereve
Contributor

That seems to be a vLLM/Neural Magic quant format, which they also use for Llama. It's a generic quantization algorithm like our Q8_0 and isn't necessarily the format Mistral trained with.
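
For reference on the two candidate encodings (standard FP8 format definitions, nothing specific to how Mistral actually trained Nemo): E4M3 trades range for precision, E5M2 the reverse. A quick sketch of their extremes:

# Quick reference on the two standard FP8 formats (OCP FP8 spec values; this is
# general background, not a statement about Nemo's training).
# E4M3: 1 sign, 4 exponent, 3 mantissa bits, bias 7. The all-ones exponent is
# not reserved for infinity, so the largest finite value is 1.75 * 2**8 = 448.
e4m3_max = 1.75 * 2 ** 8       # 448.0
e4m3_min_subnormal = 2 ** -9   # ~0.00195

# E5M2: 1 sign, 5 exponent, 2 mantissa bits, bias 15, IEEE-style inf/NaN, so
# the largest finite value is 1.75 * 2**15 = 57344.
e5m2_max = 1.75 * 2 ** 15      # 57344.0
e5m2_min_subnormal = 2 ** -16  # ~1.5e-05

print(f"E4M3: max {e4m3_max}, smallest subnormal {e4m3_min_subnormal}")
print(f"E5M2: max {e5m2_max}, smallest subnormal {e5m2_min_subnormal}")

The weight_scale and input_scale tensors in the Neural Magic checkpoint are presumably the usual per-tensor scales that map values into that limited range.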

@HyunjoonCho

I am not sure whether this would be the proper thread to ask this - truly sorry if not,
is there any future plan to support Mistral-NeMo-Base-2407 officially?

I found some user uploaded versions but could not find one from the official page.

Thanks!

@EliEron

EliEron commented Sep 12, 2024

I am not sure whether this would be the proper thread to ask this - truly sorry if not, is there any future plan to support Mistral-NeMo-Base-2407 officially?

I found some user uploaded versions but could not find one from the official page.

Thanks!

Ollama is an entirely separate project from llama.cpp. While they use llama.cpp for inference, there is no official partnership between the projects, so this is indeed the wrong place to ask.

I'd suggest posting in the Ollama repo instead if you feel strongly about it, but I suspect you won't get a lot of traction. Base models aren't usually considered a high priority, and most people just use third-party uploads.
