Support Mistral-Nemo-Instruct-2407 128K #8577

Open
mirek190 opened this issue Jul 18, 2024 · 54 comments
Labels
enhancement New feature or request

Comments

@mirek190

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Are there any plans to support Mistral-Nemo-Instruct-2407 (128K context)?

https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407

Motivation

enhancement

Possible Implementation

No response

@mirek190 mirek190 added the enhancement New feature or request label Jul 18, 2024
@0wwafa

0wwafa commented Jul 18, 2024

Yes, please. This one is going to be good, and finetunes will soon start to pop up...

@0wwafa

0wwafa commented Jul 18, 2024

And this: https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct

@delphijb

delphijb commented Jul 18, 2024

I second the request. This model is likely to become the reference for the 7-12B segment, and fine-tuned versions will indeed appear rapidly. Thanks in advance.

@stduhpf
Contributor

stduhpf commented Jul 18, 2024

They claim it can be a drop-in replacement for Mistral 7B, so surely it shouldn't be too much work to make it work with ggml, since Mistral 7B works.

@EliEron

EliEron commented Jul 18, 2024

They claim it can be a drop-in replacement for Mistral 7B, so surely it shouldn't be too much work to make it work with ggml, since Mistral 7B works.

The issue is that it uses a custom tokenizer named Tekken. That's not an issue for any program that uses Transformers, since its tokenizer system supports the custom tokenizer, which is why they call it a drop-in replacement.

For llama.cpp, however, the custom tokenizer has to be implemented manually, and implementing new tokenizers correctly is usually not easy. The Gemma-2 and Llama-3 tokenizers, for instance, took quite a while to implement properly, and it took multiple attempts as bugs were found over time.
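
As an aside, a minimal sketch of my own (not from this thread) of what "works out of the box with Transformers" means in practice: the stock AutoTokenizer picks up the Tekken tokenizer from the repo's tokenizer files, so its output can serve as a reference when checking any llama.cpp port. This assumes you can access the model files on Hugging Face.

# Load Nemo's Tekken tokenizer through Transformers and encode a test string.
# The resulting IDs can be diffed against a llama.cpp implementation to spot
# tokenizer bugs.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
text = "What is the capital of South Korea in Hangul?"
ids = tok.encode(text, add_special_tokens=False)

print(ids)                              # reference token IDs
print(tok.convert_ids_to_tokens(ids))   # and the corresponding token strings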

m18coppola added a commit to m18coppola/llama.cpp that referenced this issue Jul 18, 2024
@iamlemec
Collaborator

I actually think the tokenizer might not be too different from others. It's listed as GPT2Tokenizer in the tokenizer_config.json and it has a pre-tokenizer of the usual form. I was able to add it in the standard fashion with pre-tokenizer and the update script.

The other issue is that the tensor shapes relating to attention are not the sizes expected by the current implementation of Mistral (see my other comment here #8576 (comment)). I was able to brute-force hack it into at least running, and I'm getting sensible output, which makes me think the tokenizer is doing okay. For example:

PROMPT: What is the capital of South Korea in Hangul?
RESPONSE: The capital of South Korea in Hangul is 서울 (Seoul).

@netrunnereve
Contributor

If this model works well we should also try to add FP8 support to llama.cpp and make full use of the QAT. Without native FP8 support that will take more compute than Q8_0, but it'll probably end up being memory-bound anyway.

m18coppola added a commit to m18coppola/llama.cpp that referenced this issue Jul 19, 2024
Removed uneeded `vocab.tokenizer_clean_spaces` assignment
@iamlemec
Collaborator

iamlemec commented Jul 19, 2024

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

@mirek190
Author

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

Can't wait to test ;)

@foldl
Contributor

foldl commented Jul 20, 2024

Also, for those who are interested, chatllm.cpp supports this.

@stduhpf
Contributor

stduhpf commented Jul 20, 2024

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

Seems to work perfectly so far. Nice job.

ggerganov added a commit that referenced this issue Jul 20, 2024
* llama : Added support for Tekken pre-tokenizer (#8577)

Removed uneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@muhammadyusuf-kurbonov

#8579 is merged

@legraphista
Contributor

I just quantized Mistral-Nemo-Instruct, and when trying to run it I get the following error:

llm_load_tensors: ggml ctx size =    0.17 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  5120,  5120, got  5120,  4096,     1,     1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './Mistral-Nemo-Instruct-2407.Q8_0.gguf'
main: error: unable to load model

Looks like there's a shape mismatch.

According to the config file, the hidden size should be 5120: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/blob/main/config.json#L10
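
As a sanity check on where the mismatch comes from, here is my own back-of-the-envelope sketch, assuming the values in the linked config.json (hidden_size 5120, num_attention_heads 32, head_dim 128):

# Rough check of the attn_q shape mismatch above, using values assumed from
# the linked config.json.
hidden_size = 5120
num_attention_heads = 32
head_dim = 128  # set explicitly in the config; NOT hidden_size // num_attention_heads (= 160)

# A loader that derives the head dim as hidden_size // n_heads expects a
# 5120 x 5120 query projection...
expected_cols = (hidden_size // num_attention_heads) * num_attention_heads   # 5120
# ...but the checkpoint actually stores num_heads * head_dim output features.
actual_cols = num_attention_heads * head_dim                                  # 4096

print(f"expected {hidden_size} x {expected_cols}, got {hidden_size} x {actual_cols}")

Which lines up with what @iamlemec noted above about the attention tensor shapes: the loader needs to take the head dimension from the model metadata instead of deriving it from the hidden size.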

@mirek190
Author

Where can I find a proper GGUF?

@maziyarpanahi

Hi @legraphista

I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py, but I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

@stduhpf
Contributor

stduhpf commented Jul 20, 2024

Hi @legraphista

I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py, but I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.

@iamlemec
Collaborator

@mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.

@maziyarpanahi

Hi @legraphista
I have a new build from the main branch with the new PR merged, and I am also using convert_hf_to_gguf.py, but I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.

I am actually trying to quantize it at the moment; since I saw it had been done successfully here, I was wondering.

@mirek190
Author

@mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.34 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  5120,  5120, got  5120,  4096,     1,     1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/new3/mistral-nemo-instruct-q5_k.gguf'
main: error: unable to load model

Your version also doesn't work.

@cwillu

cwillu commented Jul 20, 2024

@mirek190 try running a make clean first; the project makefiles don't appear to be 100% reliable.

@EricGrange

Just tried with the latest quants at QuantFactory/Mistral-Nemo-Instruct-2407-GGUF and build 3437, and I'm getting the following error (Xeon E-2176G with 64GB RAM, under Debian). Apparently it's trying to allocate a very large CPU buffer:

ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 167772160032
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
llama_init_from_gpt_params: error: failed to create context with model 'Mistral-Nemo-Instruct-2407.Q4_K_S.gguf'
 ERR [              load_model] unable to load model | tid="140094037281280" timestamp=1721650251 model="Mistral-Nemo-Instruct-2407.Q4_K_S.gguf"

@sbelenki

Just tried with the latest quants at QuantFactory/Mistral-Nemo-Instruct-2407-GGUF and build 3437, and I'm getting the following error (Xeon E-2176G with 64GB RAM, under Debian). Apparently it's trying to allocate a very large CPU buffer.

I encountered the same problem while testing the new build; the advice that helped me was to use the -c parameter.

@EricGrange

Thanks, confirmed: passing the context size explicitly does the trick.
It also seems to work correctly when using "-c 131072" (128k, if I'm not mistaken).
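
For anyone wondering where that huge buffer comes from, here is a rough estimate of my own, assuming Nemo's published dimensions (40 layers, 8 KV heads, head dim 128), an fp16 KV cache, and a default context taken from the model's advertised 1,024,000-token maximum when -c is omitted:

# Rough estimate of the KV-cache allocation that failed above. All values are
# assumptions from the published Nemo config, not read from the GGUF.
n_layers, n_kv_heads, head_dim = 40, 8, 128
bytes_per_elem = 2            # fp16 K and V
n_ctx_default = 1_024_000     # context picked up from the model metadata when -c is omitted

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # K + V bytes per token
print(f"default:   {per_token * n_ctx_default / 1e9:.0f} GB")       # ~168 GB, matching the failed allocation
print(f"-c 131072: {per_token * 131072 / 1e9:.1f} GB")              # ~21.5 GB

So passing -c with a context you can actually afford is the expected fix here.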

@jthack

jthack commented Jul 22, 2024

I did an ollama update and I'm still getting the same error.

@MoonRide303

MoonRide303 commented Jul 22, 2024

Initial tests with a small context (-c 8192) look good to me - no issues observed so far. Both conversion and inference with llama.cpp b3438 (using a Q6_K quant).

It doesn't seem to work well with bigger contexts, though. I tried the commands-extraction test (same as for Phi 3, #8262 (comment), using ~20k tokens as a single input), and the answer was nowhere near correct. The online version from NV (https://build.nvidia.com/nv-mistralai/mistral-nemo-12b-instruct) failed with an error on the same test, too.

@mirek190
Author

Tested it and I'm very disappointed...

Tested the 8-bit version:

llama-cli.exe --model models/new3/Mistral-Nemo-Instruct-2407-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --repeat-penalty 1.1 --ctx-size 100000 --interactive -ngl 99 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap --temp 0.35 --in-prefix " </s>[INST] " --in-suffix " [/INST] " -p "<s>[INST] You are a helpful assistant.\n\n<" -fa 

prompt "Can you put that text into paragraphs?"

Screenshot 2024-07-22 215646

It stopped literally after a few sentences; I asked it to continue and it stopped again...

Screenshot 2024-07-22 215734

By contrast, Gemma 2 27B had no problem at all finishing the text and did a good job.

Screenshot 2024-07-21 222932
Screenshot 2024-07-21 222957

@0wwafa

0wwafa commented Jul 22, 2024

Tested. Seems to work.

@mirek190
Author

mirek190 commented Jul 22, 2024

How long was the text?
I tested on an 8k-token text and it failed badly every time, with the model's context set to 100k.

P.S.

After a few hours of testing I can say the model is better than Llama 3 8B but worse than Gemma 2 9B.
The advantage of Mistral-Nemo (not counting the bad hallucinations) is the huge context.

@mirek190
Author

Still testing... I think -fa (flash attention) causes problems - degradation and performance loss. Since this model has a huge context, I used -fa.

@foldl
Contributor

foldl commented Jul 23, 2024

I think this model uses SWA. The memory requirement is much less if SWA is used.

I have tested it with 4.7k tokens (using SWA), and it looks ok.
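
To illustrate the memory point, a rough sketch only: the window size below is hypothetical, and the layer/head dimensions are the same Nemo-like assumptions used earlier in the thread.

# Rough illustration of why sliding-window attention (SWA) caps KV-cache memory.
# The 4096-token window is hypothetical; the other dimensions are assumptions.
n_layers, n_kv_heads, head_dim, bytes_per_elem = 40, 8, 128, 2
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem   # K + V bytes per cached token

full_ctx = 128 * 1024   # full 128k context
window = 4096           # hypothetical sliding window

print(f"full cache: {per_token * full_ctx / 1e9:.1f} GB")               # ~21.5 GB
print(f"SWA cache:  {per_token * min(window, full_ctx) / 1e9:.2f} GB")  # ~0.67 GB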

@mirek190
Author

I think this model uses SWA. The memory requirement is much less if SWA is used.

I have tested it with 4.7k tokens (using SWA), and it looks ok.

4.7k tokens with a 128k model... yes, very productive...

@foldl
Contributor

foldl commented Jul 24, 2024

@mirek190 the point is that with SWA, 128k context length won't blow up your memory.

@ehartford

When I try to run ./convert_hf_to_gguf.py /workspace/axolotl/dolphin-2.9.3-mistral-nemo-hf
I get: NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

@ehartford

I already saw this line in convert_hf_to_gguf_update.py

{"name": "tekken",         "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistralai/Mistral-Nemo-Base-2407", },

I already executed convert_hf_to_gguf_update.py

but it still doesn't work.

As a guess, I tried this:

python ./convert_hf_to_gguf.py /workspace/axolotl/dolphin-2.9.3-mistral-nemo-hf --model-name tekken

But that didn't work either.

@iamlemec
Collaborator

@ehartford They made some changes to tokenizer_config.json a day or two after release. See the commit here: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/commit/dac9c9e98f83322b32e32b48c118f079930772d6. Updating yours similarly should make the checksum match up.
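
For context on why a tokenizer_config.json change matters here, this is roughly how the convert script recognizes pre-tokenizers, as far as I understand get_vocab_base_pre(). It's a simplified sketch; the probe string below is a placeholder, not the script's real one.

# Simplified sketch of llama.cpp's pre-tokenizer detection: encode a fixed probe
# string and hash the resulting token IDs. The probe text here is a placeholder;
# the real script uses its own fixed multilingual string.
from hashlib import sha256
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/workspace/axolotl/dolphin-2.9.3-mistral-nemo-hf")

probe_text = "Hello World! 123 ..."   # placeholder
chkhsh = sha256(str(tokenizer.encode(probe_text)).encode()).hexdigest()
print(chkhsh)

# The hash is compared against a table of known values ("tekken", "llama-bpe", ...).
# If the local tokenizer files differ from the reference repo, the hash won't match
# and conversion fails with "BPE pre-tokenizer was not recognized".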

@ehartford

Thanks, I'll do that.

@rkinas

rkinas commented Jul 25, 2024

Hi @ehartford, have you managed to find a solution for converting a fine-tuned Mistral-Nemo to a 16-bit GGUF? I encountered the same problem you described.

@ehartford

Yes, I got past it by adding my model to convert_hf_to_gguf_update.py, running it, and then passing --model-name.
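
For anyone following along, the kind of entry being described looks roughly like the tekken line quoted earlier. The name and repo below are illustrative, not the exact values used.

# Hypothetical sketch of the workaround: add an entry for your fine-tune to the
# `models` list in convert_hf_to_gguf_update.py, re-run that script, then run
# convert_hf_to_gguf.py with --model-name. Shown as plain data for clarity.
models_entry = {
    "name": "dolphin-nemo",   # illustrative identifier
    "tokt": "BPE",            # TOKENIZER_TYPE.BPE in the real script
    "repo": "https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b",  # illustrative repo
}
print(models_entry)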

arthw pushed a commit to arthw/llama.cpp that referenced this issue Jul 27, 2024
* llama : Added support for Tekken pre-tokenizer (ggerganov#8577)

Removed uneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@AllyourBaseBelongToUs

Yes, I got past it by adding my model to convert_hf_to_gguf_update.py, running it, and then passing --model-name.

Hi Eric, love your Dolphin models.

Are the Hugging Face GGUF models not updated?

This causes the error: pre-tokenizer type 'dolphin12b' not recognized.

The model files also show the dolphin12b tokenizer instead of Tekken :(

Any way we can help you there?

P.S.

Your models work better with your system prompts than with jailbreaks <3

@ehartford

ehartford commented Jul 29, 2024

It works for me, and on Ollama too:
https://ollama.com/CognitiveComputations/dolphin-mistral-nemo

which I created from this quant https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b-gguf

Though I am not a llama.cpp expert.

I have got what I needed from this effort. I am happy to take PRs though.

@AllyourBaseBelongToUs

It works for me, and on Ollama too: https://ollama.com/CognitiveComputations/dolphin-mistral-nemo

It works on LM Studio too when we use a hex editor to change the pre_tokenizer from "dolphin12b" to "Tekken",

though not with llama-cpp-python itself :/

How long would it take to quantize it ourselves?

EDIT: Thank you so much for changing the pre-tokenizer in all your GGUF uploads on HF <3

you're the best!!!

By the way, is there any volunteer work we can do for you?

@ehartford

Yes definitely - I'm totally overwhelmed

@rmusser01

Yes definitely - I'm totally overwhelmed

Would you mind elaborating on what might be helpful for you/how people can help?

@netrunnereve
Contributor

Does anyone know if the FP8 QAT used by Nemo is in E4M3 or E5M2? My guess is E4M3 but I couldn't find info on that anywhere, with Mistral only saying that they used FP8.

@Djip007
Contributor

Djip007 commented Aug 31, 2024

I can't find what Mistral used, or how (it would be nice to know!).
The only thing I can find is:
https://huggingface.co/neuralmagic/Mistral-Nemo-Instruct-2407-FP8/tree/main?show_file_info=model-00001-of-00003.safetensors
which uses E4M3 plus weight_scale and input_scale.

@netrunnereve
Contributor

That seems to be a vLLM/Neural Magic quant format, which they also use for Llama. It's a generic quantization algorithm like our Q8_0 and isn't necessarily the format Mistral trained with.
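
For reference on the two candidate encodings (standard FP8 format definitions, nothing specific to how Mistral actually trained Nemo): E4M3 trades range for precision, E5M2 the reverse. A quick sketch of their extremes:

# Quick reference on the two standard FP8 formats (OCP FP8 spec values; this is
# general background, not a statement about Nemo's training).
# E4M3: 1 sign, 4 exponent, 3 mantissa bits, bias 7. The all-ones exponent is
# not reserved for infinity, so the largest finite value is 1.75 * 2**8 = 448.
e4m3_max = 1.75 * 2 ** 8       # 448.0
e4m3_min_subnormal = 2 ** -9   # ~0.00195

# E5M2: 1 sign, 5 exponent, 2 mantissa bits, bias 15, IEEE-style inf/NaN, so
# the largest finite value is 1.75 * 2**15 = 57344.
e5m2_max = 1.75 * 2 ** 15      # 57344.0
e5m2_min_subnormal = 2 ** -16  # ~1.5e-05

print(f"E4M3: max {e4m3_max}, smallest subnormal {e4m3_min_subnormal}")
print(f"E5M2: max {e5m2_max}, smallest subnormal {e5m2_min_subnormal}")

The weight_scale and input_scale tensors in the Neural Magic checkpoint are presumably the usual per-tensor scales that map values into that limited range.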

@HyunjoonCho

I am not sure whether this would be the proper thread to ask this - truly sorry if not,
is there any future plan to support Mistral-NeMo-Base-2407 officially?

I found some user uploaded versions but could not find one from the official page.

Thanks!

@EliEron

EliEron commented Sep 12, 2024

I am not sure whether this would be the proper thread to ask this - truly sorry if not, is there any future plan to support Mistral-NeMo-Base-2407 officially?

I found some user uploaded versions but could not find one from the official page.

Thanks!

Ollama is an entirely separate project from llama.cpp. While they use llama.cpp for inference, there is no official partnership between the projects, so this is indeed the wrong place to ask.

I'd suggest posting in the Ollama repo instead if you feel strongly about it, but I suspect you won't get a lot of traction. Base models aren't usually considered a high priority, and most people just use third-party uploads.
