Add Command R Plus support #6491

Merged
merged 15 commits on Apr 9, 2024

Conversation

RefractAI
Contributor

Updated tensor mapping to add Command R Plus support for GGUF conversion.
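
For context, adding GGUF conversion support for a new architecture variant mostly comes down to teaching the converter which Hugging Face tensor names correspond to which GGUF tensor names. A rough sketch of the kind of mapping involved for the new per-head norms, with illustrative names that are assumptions rather than the exact entries this PR adds:

```python
# Hedged sketch: the general shape of a tensor-name mapping for the new
# per-head q/k norms. Names are illustrative, not the PR's literal entries.
COMMAND_R_PLUS_EXTRA_MAPPINGS = {
    # GGUF name template       ->  Hugging Face checkpoint name template
    "blk.{bid}.attn_q_norm": "model.layers.{bid}.self_attn.q_norm",
    "blk.{bid}.attn_k_norm": "model.layers.{bid}.self_attn.k_norm",
}

def map_tensor_name(hf_name: str, bid: int) -> str | None:
    """Return the GGUF tensor name for a given HF tensor name, if mapped."""
    for gguf_tmpl, hf_tmpl in COMMAND_R_PLUS_EXTRA_MAPPINGS.items():
        if hf_name == hf_tmpl.format(bid=bid) + ".weight":
            return gguf_tmpl.format(bid=bid) + ".weight"
    return None
```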

@bartowski1182
Contributor

We probably shouldn't do the model_max_length mapping and should instead require it to be present in config.json. Otherwise, these changes worked for me as well.
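
In other words, read the context length from config.json itself and fail loudly if it is missing, rather than inferring it from tokenizer metadata such as model_max_length. A minimal sketch of that approach, assuming a standard Hugging Face config.json layout (this helper is hypothetical, not code from the PR):

```python
import json
from pathlib import Path

def read_context_length(model_dir: str) -> int:
    """Read the context length from config.json; refuse to guess if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    for key in ("model_max_length", "max_position_embeddings"):
        if key in config:
            return int(config[key])
    raise ValueError("context length missing from config.json; add it there "
                     "instead of falling back to tokenizer metadata")
```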

@N8python

N8python commented Apr 4, 2024

Do the GGUFs produced with this work with the main branch once they are quantized?

@bartowski1182
Contributor

@N8python I'll let you know as soon as I have a finished one. I would be slightly surprised if they worked at inference time without this PR, but I'm not sure how it all works.

My conversion to f16.gguf is almost done; I'll make a Q2 quant immediately after and see whether it runs from master.

@Noeda
Contributor

Noeda commented Apr 4, 2024

I think a bunch of people are rushing to implement this. I have slightly more complete code here (https://github.com/Noeda/llama.cpp/tree/commandr-plus), but something about reading the tensors is off at inference time. I think I'm getting the llama.cpp equivalent of Torch complaining about badly interleaved tensors in the underlying storage. I'm playing with views to see whether I can hack it to work or whether I actually have to understand how the tensors are laid out in memory here. The application of the new q/k norms is failing.

I made Q4_K and Q8_0 quants for myself; those seem fine but inference is not.

If you want, you can pull my code into yours, but it doesn't work yet. My time is limited and I might have to stop hacking until this evening or tomorrow, but I'll try to get it working. I think adding the new layernorms for query and key should be enough; I didn't see other differences in the Transformers code.
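
For context, the difference being described here is a per-head normalization applied to the query and key projections before attention. A rough Python illustration of where those norms sit, assuming the norm weights have shape (n_head, head_dim) as the converted tensors suggest (a sketch of the semantics, not the Transformers or llama.cpp implementation):

```python
import torch

def layer_norm_per_head(t, w, eps=1e-5):
    """LayerNorm over the head dimension, scaled by a per-head weight vector."""
    mean = t.mean(-1, keepdim=True)
    var = t.var(-1, unbiased=False, keepdim=True)
    return (t - mean) / torch.sqrt(var + eps) * w

def project_with_qk_norm(x, wq, wk, q_norm_w, k_norm_w, n_head, n_kv_head, head_dim):
    """Sketch: normalize q and k per head before RoPE and attention.

    q_norm_w: (n_head, head_dim), k_norm_w: (n_kv_head, head_dim)
    """
    b, s, _ = x.shape
    q = (x @ wq).view(b, s, n_head, head_dim)
    k = (x @ wk).view(b, s, n_kv_head, head_dim)
    q = layer_norm_per_head(q, q_norm_w)   # the Command R Plus-specific step
    k = layer_norm_per_head(k, k_norm_w)
    # ... RoPE and scaled-dot-product attention follow, unchanged from Command R.
    return q, k
```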

(I'll comment here if I have to go, so no one is left waiting on me. I'm currently hacking and trying to figure out what's going on with my assert failures. This is another of those models that draws a lot of excited people, myself included, into hacking, but I don't want anyone waiting on me: the times I can work are unpredictable, tend to come in bursts, and I might have to disappear suddenly.)

Link to my diff for easier reading (it's not a lot of code): master...Noeda:llama.cpp:commandr-plus

@bartowski1182
Contributor

As Noeda suspected, this change was not enough to make it work: conversion to f16.gguf "worked", but quantizing to Q2 failed with

"gguf_init_from_file: invalid magic characters ''"

@candre23

candre23 commented Apr 4, 2024

We probably shouldn't do the model_max_length mapping and should instead require it to be present in config.json. Otherwise, these changes worked for me as well.

I pinged Cohere on HF and they added model_max_length to config.json, so there's no longer any need to compensate for that oversight in the code.

@Noeda
Contributor

Noeda commented Apr 4, 2024

I'm no longer crashing on the spot in my branch, but the output is clearly not correct (the prompt was just "hello"; it repeats the digit '8'):

(screenshot: sample output repeating the digit '8')

Will need to review things a bit.

For those curious, the GGUF sizes for this model seem to be:

Q4_K: 58G
Q8_0: 103G
f16: 193G

I can't run the f16 at all on my hardware, so I'm testing with the Q4 one.
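
Those numbers line up with a rough back-of-the-envelope estimate for a ~104B-parameter model (bits per weight times parameter count); the bits-per-weight figures below are approximate assumptions:

```python
PARAMS = 104e9  # Command R Plus is roughly a 104B-parameter model

def approx_size_gib(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 2**30

for name, bpw in [("f16", 16.0), ("Q8_0", 8.5), ("Q4_K", 4.85)]:
    print(f"{name}: ~{approx_size_gib(bpw):.0f} GiB")  # ~194, ~103, ~59 GiB
```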

@N8python

N8python commented Apr 4, 2024

Here's the MLX implementation:

ml-explore/mlx-examples#650

@sammcj

sammcj commented Apr 4, 2024

FYI, converting to fp16 on macOS works with this PR, but quantizing segfaults.

~/git/llama.cpp/convert-hf-to-gguf.py ./CohereForAI_c4ai-command-r-plus --outtype f16 --outfile CohereForAI_c4ai-command-r-plus.fp16.bin
Loading model: CohereForAI_c4ai-command-r-plus
gguf: This GGUF file is for Little Endian only
Set model parameters
gguf: context length = 131072
gguf: embedding length = 12288
gguf: feed forward length = 33792
gguf: head count = 96
gguf: key-value head count = 8
gguf: rope theta = 75000000.0
gguf: layer norm epsilon = 1e-05
gguf: file type = 1
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Adding 253333 merge(s).
gguf: Setting special token type bos to 5
gguf: Setting special token type eos to 255001
gguf: Setting special token type pad to 0
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to False
Exporting model to 'CohereForAI_c4ai-command-r-plus.fp16.bin'
gguf: loading model part 'model-00001-of-00044.safetensors'
token_embd.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00002-of-00044.safetensors'
blk.0.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.0.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.0.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.0.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.1.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00003-of-00044.safetensors'
blk.1.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.1.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.1.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.2.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.2.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.2.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00004-of-00044.safetensors'
blk.3.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.3.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.3.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.3.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.4.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00005-of-00044.safetensors'
blk.4.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.4.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.4.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.5.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.5.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.5.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00006-of-00044.safetensors'
blk.6.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.6.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.6.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.6.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.7.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00007-of-00044.safetensors'
blk.7.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.7.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.7.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.8.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.8.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.8.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00008-of-00044.safetensors'
blk.10.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.9.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.9.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.9.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00009-of-00044.safetensors'
blk.10.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.10.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.10.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.11.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.11.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.11.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00010-of-00044.safetensors'
blk.12.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.12.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.12.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.12.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.13.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00011-of-00044.safetensors'
blk.13.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.13.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.13.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.14.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.14.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.14.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00012-of-00044.safetensors'
blk.15.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.15.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.15.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.15.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.16.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00013-of-00044.safetensors'
blk.16.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.16.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.16.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.17.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.17.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.17.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00014-of-00044.safetensors'
blk.18.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.18.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.18.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.18.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.19.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00015-of-00044.safetensors'
blk.19.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.19.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.19.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.20.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.20.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.20.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00016-of-00044.safetensors'
blk.21.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.21.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.21.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.21.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.22.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00017-of-00044.safetensors'
blk.22.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.22.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.22.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.23.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.23.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.23.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00018-of-00044.safetensors'
blk.24.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.24.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.24.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.24.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.25.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00019-of-00044.safetensors'
blk.25.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.25.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.25.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.26.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.26.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.26.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00020-of-00044.safetensors'
blk.27.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.27.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.27.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.27.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.28.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00021-of-00044.safetensors'
blk.28.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.28.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.28.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.29.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.29.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.29.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00022-of-00044.safetensors'
blk.30.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.30.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.30.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.30.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.31.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00023-of-00044.safetensors'
blk.31.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.31.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.31.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.32.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.32.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.32.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00024-of-00044.safetensors'
blk.33.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.33.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.33.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.33.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.34.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00025-of-00044.safetensors'
blk.34.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.34.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.34.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.35.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.35.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.35.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00026-of-00044.safetensors'
blk.36.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.36.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.36.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.36.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.37.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00027-of-00044.safetensors'
blk.37.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.37.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.37.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.38.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.38.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.38.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00028-of-00044.safetensors'
blk.39.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.39.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.39.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.39.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.40.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00029-of-00044.safetensors'
blk.40.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.40.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.40.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.41.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.41.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.41.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00030-of-00044.safetensors'
blk.42.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.42.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.42.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.42.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.43.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00031-of-00044.safetensors'
blk.43.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.43.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.43.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.44.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.44.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.44.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00032-of-00044.safetensors'
blk.45.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.45.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.45.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.45.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.46.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00033-of-00044.safetensors'
blk.46.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.46.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.46.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.47.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.47.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.47.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00034-of-00044.safetensors'
blk.48.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.48.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.48.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.48.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.49.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00035-of-00044.safetensors'
blk.49.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.49.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.49.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.50.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.50.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.50.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00036-of-00044.safetensors'
blk.51.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.51.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.51.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.51.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.52.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00037-of-00044.safetensors'
blk.52.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.52.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.52.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.53.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.53.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.53.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00038-of-00044.safetensors'
blk.54.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.54.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.54.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.54.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.55.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00039-of-00044.safetensors'
blk.55.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.55.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.55.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.56.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.56.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.56.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00040-of-00044.safetensors'
blk.57.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.57.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.57.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.57.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.58.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00041-of-00044.safetensors'
blk.58.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.58.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.58.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.59.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.59.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.59.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00042-of-00044.safetensors'
blk.60.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.60.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.60.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.60.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.61.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00043-of-00044.safetensors'
blk.61.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.61.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.61.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.62.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.62.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.62.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00044-of-00044.safetensors'
blk.63.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.63.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.63.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.63.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_v.weight, n_dims = 2, torch.float16 --> float16
output_norm.weight, n_dims = 1, torch.float16 --> float32
Model successfully exported to 'CohereForAI_c4ai-command-r-plus.fp16.bin'
quantize CohereForAI_c4ai-command-r-plus.fp16.bin CohereForAI_c4ai-command-r-plus-Q4_K_M.gguf Q4_K_M
ggml_opencl: selecting platform: 'Apple'
ggml_opencl: selecting device: 'Apple M2 Max'
main: build = 1213 (a307375c)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
main: quantizing 'CohereForAI_c4ai-command-r-plus.fp16.bin' to 'CohereForAI_c4ai-command-r-plus-Q4_K_M.gguf' as Q4_K_M
zsh: segmentation fault  quantize CohereForAI_c4ai-command-r-plus.fp16.bin  Q4_K_M

@RefractAI
Contributor Author

RefractAI commented Apr 4, 2024

Quantizing to Q5_0 works, but the llm_build_norm() function doesn't accept a 2D layer-norm weight for the new q_norm and k_norm tensors.

The tensor is 12288 elements wide, but it appears it should be evaluated as 128x96 by the layer norm.
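
In other words, the flat 12288-wide projection needs to be viewed as n_head x head_dim (96 x 128) so the 2-D norm weight can be applied per head, instead of one 12288-element layer norm. This is the same per-head norm sketched earlier, spelled out at the shape level in Python (the actual fix has to live in the C++ graph-building code, so this only illustrates the intended semantics):

```python
import torch

n_head, head_dim = 96, 128                     # 96 * 128 = 12288

q_flat = torch.randn(1, 8, n_head * head_dim)  # (batch, seq, 12288) q projection
q_norm_w = torch.randn(n_head, head_dim)       # 2-D norm weight as stored in the GGUF

# Reinterpret the flat 12288 dimension as (n_head, head_dim) so the norm is
# computed over head_dim for each of the 96 heads, not over all 12288 values.
q_heads = q_flat.view(1, 8, n_head, head_dim)
q_normed = torch.nn.functional.layer_norm(q_heads, (head_dim,)) * q_norm_w
```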

@sammcj

sammcj commented Apr 4, 2024

Your latest push seems to have fixed Q3_K_M and Q4_K_M creation:

48G Apr  4 22:01 CohereForAI_c4ai-command-r-plus-Q3_K_M.gguf
59G Apr  4 21:55 CohereForAI_c4ai-command-r-plus-Q4_K_M.gguf

Here's the Q3_K_M quantization log if it's interesting:

quantize CohereForAI_c4ai-command-r-plus.fp16.bin CohereForAI_c4ai-command-r-plus-Q3_K_M.gguf Q3_K_M
main: build = 2612 (e4b2e2d)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'CohereForAI_c4ai-command-r-plus.fp16.bin' to 'CohereForAI_c4ai-command-r-plus-Q3_K_M.gguf' as Q3_K_M
llama_model_loader: loaded meta data with 22 key-value pairs and 642 tensors from CohereForAI_c4ai-command-r-plus.fp16.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = command-r
llama_model_loader: - kv   1:                               general.name str              = CohereForAI_c4ai-command-r-plus
llama_model_loader: - kv   2:                      command-r.block_count u32              = 64
llama_model_loader: - kv   3:                   command-r.context_length u32              = 131072
llama_model_loader: - kv   4:                 command-r.embedding_length u32              = 12288
llama_model_loader: - kv   5:              command-r.feed_forward_length u32              = 33792
llama_model_loader: - kv   6:             command-r.attention.head_count u32              = 96
llama_model_loader: - kv   7:          command-r.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                   command-r.rope.freq_base f32              = 75000000.000000
llama_model_loader: - kv   9:     command-r.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                      command-r.logit_scale f32              = 0.833333
llama_model_loader: - kv  12:                command-r.rope.scaling.type str              = none
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  577 tensors
llama_model_quantize_internal: meta size = 10904864 bytes
[   1/ 642]                    token_embd.weight - [12288, 256000,     1,     1], type =    f16, converting to q6_K .. size =  6000.00 MiB ->  2460.94 MiB
[   2/ 642]               blk.0.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[   3/ 642]                blk.0.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q5_K .. size =   792.00 MiB ->   272.25 MiB
[   4/ 642]                blk.0.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[   5/ 642]                  blk.0.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[   6/ 642]             blk.0.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[   7/ 642]                  blk.0.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[   8/ 642]             blk.0.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[   9/ 642]             blk.0.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  10/ 642]                  blk.0.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  11/ 642]                  blk.0.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q5_K .. size =    24.00 MiB ->     8.25 MiB
[  12/ 642]                blk.1.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  13/ 642]             blk.1.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  14/ 642]                  blk.1.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  15/ 642]             blk.1.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  16/ 642]             blk.1.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  17/ 642]                  blk.1.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  18/ 642]                  blk.1.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q5_K .. size =    24.00 MiB ->     8.25 MiB
[  19/ 642]               blk.1.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  20/ 642]                blk.1.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q5_K .. size =   792.00 MiB ->   272.25 MiB
[  21/ 642]                  blk.1.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  22/ 642]               blk.2.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  23/ 642]                blk.2.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q5_K .. size =   792.00 MiB ->   272.25 MiB
[  24/ 642]                blk.2.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  25/ 642]                  blk.2.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  26/ 642]             blk.2.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  27/ 642]                  blk.2.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  28/ 642]             blk.2.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  29/ 642]             blk.2.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  30/ 642]                  blk.2.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  31/ 642]                  blk.2.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  32/ 642]             blk.3.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  33/ 642]             blk.3.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  34/ 642]               blk.3.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  35/ 642]                blk.3.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q5_K .. size =   792.00 MiB ->   272.25 MiB
[  36/ 642]                blk.3.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  37/ 642]                  blk.3.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  38/ 642]                  blk.3.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  39/ 642]             blk.3.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  40/ 642]                  blk.3.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  41/ 642]                  blk.3.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  42/ 642]                blk.4.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  43/ 642]             blk.4.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  44/ 642]                  blk.4.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  45/ 642]             blk.4.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  46/ 642]             blk.4.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  47/ 642]                  blk.4.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  48/ 642]                  blk.4.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  49/ 642]               blk.4.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  50/ 642]                blk.4.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[  51/ 642]                  blk.4.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  52/ 642]               blk.5.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  53/ 642]                blk.5.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[  54/ 642]                blk.5.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  55/ 642]                  blk.5.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  56/ 642]             blk.5.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  57/ 642]                  blk.5.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  58/ 642]             blk.5.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  59/ 642]             blk.5.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  60/ 642]                  blk.5.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  61/ 642]                  blk.5.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  62/ 642]             blk.6.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  63/ 642]             blk.6.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  64/ 642]               blk.6.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  65/ 642]                blk.6.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[  66/ 642]                blk.6.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  67/ 642]                  blk.6.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  68/ 642]                  blk.6.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  69/ 642]             blk.6.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  70/ 642]                  blk.6.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  71/ 642]                  blk.6.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  72/ 642]                blk.7.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  73/ 642]             blk.7.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  74/ 642]                  blk.7.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  75/ 642]             blk.7.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  76/ 642]             blk.7.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  77/ 642]                  blk.7.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  78/ 642]                  blk.7.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  79/ 642]               blk.7.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  80/ 642]                blk.7.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[  81/ 642]                  blk.7.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  82/ 642]               blk.8.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  83/ 642]                blk.8.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[  84/ 642]                blk.8.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  85/ 642]                  blk.8.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  86/ 642]             blk.8.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  87/ 642]                  blk.8.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  88/ 642]             blk.8.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  89/ 642]             blk.8.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  90/ 642]                  blk.8.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  91/ 642]                  blk.8.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  92/ 642]             blk.9.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  93/ 642]             blk.9.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  94/ 642]               blk.10.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  95/ 642]            blk.10.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  96/ 642]                 blk.10.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  97/ 642]            blk.10.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  98/ 642]            blk.10.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  99/ 642]                 blk.10.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 100/ 642]                 blk.10.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 101/ 642]               blk.9.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 102/ 642]                blk.9.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 103/ 642]                blk.9.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 104/ 642]                  blk.9.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 105/ 642]                  blk.9.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 106/ 642]             blk.9.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 107/ 642]                  blk.9.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 108/ 642]                  blk.9.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 109/ 642]              blk.10.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 110/ 642]               blk.10.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 111/ 642]                 blk.10.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 112/ 642]              blk.11.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 113/ 642]               blk.11.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 114/ 642]               blk.11.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 115/ 642]                 blk.11.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 116/ 642]            blk.11.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 117/ 642]                 blk.11.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 118/ 642]            blk.11.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 119/ 642]            blk.11.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 120/ 642]                 blk.11.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 121/ 642]                 blk.11.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 122/ 642]            blk.12.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 123/ 642]            blk.12.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 124/ 642]              blk.12.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 125/ 642]               blk.12.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 126/ 642]               blk.12.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 127/ 642]                 blk.12.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 128/ 642]                 blk.12.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 129/ 642]            blk.12.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 130/ 642]                 blk.12.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 131/ 642]                 blk.12.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 132/ 642]               blk.13.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 133/ 642]            blk.13.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 134/ 642]                 blk.13.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 135/ 642]            blk.13.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 136/ 642]            blk.13.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 137/ 642]                 blk.13.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 138/ 642]                 blk.13.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 139/ 642]              blk.13.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 140/ 642]               blk.13.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 141/ 642]                 blk.13.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 142/ 642]              blk.14.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 143/ 642]               blk.14.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 144/ 642]               blk.14.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 145/ 642]                 blk.14.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 146/ 642]            blk.14.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 147/ 642]                 blk.14.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 148/ 642]            blk.14.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 149/ 642]            blk.14.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 150/ 642]                 blk.14.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 151/ 642]                 blk.14.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 152/ 642]            blk.15.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 153/ 642]            blk.15.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 154/ 642]              blk.15.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 155/ 642]               blk.15.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 156/ 642]               blk.15.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 157/ 642]                 blk.15.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 158/ 642]                 blk.15.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 159/ 642]            blk.15.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 160/ 642]                 blk.15.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 161/ 642]                 blk.15.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 162/ 642]               blk.16.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 163/ 642]            blk.16.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 164/ 642]                 blk.16.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 165/ 642]            blk.16.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 166/ 642]            blk.16.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 167/ 642]                 blk.16.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 168/ 642]                 blk.16.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 169/ 642]              blk.16.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 170/ 642]               blk.16.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 171/ 642]                 blk.16.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 172/ 642]              blk.17.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 173/ 642]               blk.17.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 174/ 642]               blk.17.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 175/ 642]                 blk.17.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 176/ 642]            blk.17.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 177/ 642]                 blk.17.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 178/ 642]            blk.17.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 179/ 642]            blk.17.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 180/ 642]                 blk.17.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 181/ 642]                 blk.17.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 182/ 642]            blk.18.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 183/ 642]            blk.18.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 184/ 642]              blk.18.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 185/ 642]               blk.18.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 186/ 642]               blk.18.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 187/ 642]                 blk.18.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 188/ 642]                 blk.18.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 189/ 642]            blk.18.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 190/ 642]                 blk.18.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 191/ 642]                 blk.18.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 192/ 642]               blk.19.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 193/ 642]            blk.19.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 194/ 642]                 blk.19.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 195/ 642]            blk.19.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 196/ 642]            blk.19.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 197/ 642]                 blk.19.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 198/ 642]                 blk.19.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 199/ 642]              blk.19.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 200/ 642]               blk.19.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 201/ 642]                 blk.19.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 202/ 642]              blk.20.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 203/ 642]               blk.20.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 204/ 642]               blk.20.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 205/ 642]                 blk.20.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 206/ 642]            blk.20.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 207/ 642]                 blk.20.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 208/ 642]            blk.20.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 209/ 642]            blk.20.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 210/ 642]                 blk.20.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 211/ 642]                 blk.20.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 212/ 642]            blk.21.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 213/ 642]            blk.21.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 214/ 642]              blk.21.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 215/ 642]               blk.21.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 216/ 642]               blk.21.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 217/ 642]                 blk.21.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 218/ 642]                 blk.21.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 219/ 642]            blk.21.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 220/ 642]                 blk.21.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 221/ 642]                 blk.21.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 222/ 642]               blk.22.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 223/ 642]            blk.22.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 224/ 642]                 blk.22.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 225/ 642]            blk.22.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 226/ 642]            blk.22.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 227/ 642]                 blk.22.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 228/ 642]                 blk.22.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 229/ 642]              blk.22.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 230/ 642]               blk.22.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 231/ 642]                 blk.22.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 232/ 642]              blk.23.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 233/ 642]               blk.23.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 234/ 642]               blk.23.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 235/ 642]                 blk.23.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 236/ 642]            blk.23.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 237/ 642]                 blk.23.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 238/ 642]            blk.23.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 239/ 642]            blk.23.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 240/ 642]                 blk.23.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 241/ 642]                 blk.23.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 242/ 642]            blk.24.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 243/ 642]            blk.24.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 244/ 642]              blk.24.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 245/ 642]               blk.24.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 246/ 642]               blk.24.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 247/ 642]                 blk.24.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 248/ 642]                 blk.24.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 249/ 642]            blk.24.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 250/ 642]                 blk.24.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 251/ 642]                 blk.24.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 252/ 642]               blk.25.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 253/ 642]            blk.25.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 254/ 642]                 blk.25.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 255/ 642]            blk.25.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 256/ 642]            blk.25.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 257/ 642]                 blk.25.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 258/ 642]                 blk.25.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 259/ 642]              blk.25.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 260/ 642]               blk.25.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 261/ 642]                 blk.25.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 262/ 642]              blk.26.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 263/ 642]               blk.26.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 264/ 642]               blk.26.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 265/ 642]                 blk.26.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 266/ 642]            blk.26.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 267/ 642]                 blk.26.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 268/ 642]            blk.26.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 269/ 642]            blk.26.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 270/ 642]                 blk.26.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 271/ 642]                 blk.26.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 272/ 642]            blk.27.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 273/ 642]            blk.27.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 274/ 642]              blk.27.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 275/ 642]               blk.27.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 276/ 642]               blk.27.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 277/ 642]                 blk.27.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 278/ 642]                 blk.27.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 279/ 642]            blk.27.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 280/ 642]                 blk.27.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 281/ 642]                 blk.27.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 282/ 642]               blk.28.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 283/ 642]            blk.28.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 284/ 642]                 blk.28.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 285/ 642]            blk.28.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 286/ 642]            blk.28.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 287/ 642]                 blk.28.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 288/ 642]                 blk.28.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 289/ 642]              blk.28.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 290/ 642]               blk.28.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 291/ 642]                 blk.28.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 292/ 642]              blk.29.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 293/ 642]               blk.29.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 294/ 642]               blk.29.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 295/ 642]                 blk.29.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 296/ 642]            blk.29.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 297/ 642]                 blk.29.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 298/ 642]            blk.29.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 299/ 642]            blk.29.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 300/ 642]                 blk.29.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 301/ 642]                 blk.29.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 302/ 642]            blk.30.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 303/ 642]            blk.30.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 304/ 642]              blk.30.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 305/ 642]               blk.30.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 306/ 642]               blk.30.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 307/ 642]                 blk.30.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 308/ 642]                 blk.30.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 309/ 642]            blk.30.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 310/ 642]                 blk.30.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 311/ 642]                 blk.30.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 312/ 642]               blk.31.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 313/ 642]            blk.31.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 314/ 642]                 blk.31.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 315/ 642]            blk.31.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 316/ 642]            blk.31.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 317/ 642]                 blk.31.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 318/ 642]                 blk.31.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 319/ 642]              blk.31.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 320/ 642]               blk.31.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 321/ 642]                 blk.31.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 322/ 642]              blk.32.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 323/ 642]               blk.32.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 324/ 642]               blk.32.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 325/ 642]                 blk.32.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 326/ 642]            blk.32.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 327/ 642]                 blk.32.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 328/ 642]            blk.32.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 329/ 642]            blk.32.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 330/ 642]                 blk.32.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 331/ 642]                 blk.32.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 332/ 642]            blk.33.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 333/ 642]            blk.33.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 334/ 642]              blk.33.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 335/ 642]               blk.33.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 336/ 642]               blk.33.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 337/ 642]                 blk.33.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 338/ 642]                 blk.33.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 339/ 642]            blk.33.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 340/ 642]                 blk.33.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 341/ 642]                 blk.33.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 342/ 642]               blk.34.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 343/ 642]            blk.34.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 344/ 642]                 blk.34.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 345/ 642]            blk.34.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 346/ 642]            blk.34.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 347/ 642]                 blk.34.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 348/ 642]                 blk.34.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 349/ 642]              blk.34.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 350/ 642]               blk.34.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 351/ 642]                 blk.34.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 352/ 642]              blk.35.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 353/ 642]               blk.35.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 354/ 642]               blk.35.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 355/ 642]                 blk.35.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 356/ 642]            blk.35.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 357/ 642]                 blk.35.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 358/ 642]            blk.35.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 359/ 642]            blk.35.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 360/ 642]                 blk.35.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 361/ 642]                 blk.35.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 362/ 642]            blk.36.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 363/ 642]            blk.36.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 364/ 642]              blk.36.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 365/ 642]               blk.36.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 366/ 642]               blk.36.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 367/ 642]                 blk.36.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 368/ 642]                 blk.36.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 369/ 642]            blk.36.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 370/ 642]                 blk.36.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 371/ 642]                 blk.36.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 372/ 642]               blk.37.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 373/ 642]            blk.37.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 374/ 642]                 blk.37.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 375/ 642]            blk.37.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 376/ 642]            blk.37.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 377/ 642]                 blk.37.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 378/ 642]                 blk.37.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 379/ 642]              blk.37.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 380/ 642]               blk.37.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 381/ 642]                 blk.37.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 382/ 642]              blk.38.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 383/ 642]               blk.38.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 384/ 642]               blk.38.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 385/ 642]                 blk.38.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 386/ 642]            blk.38.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 387/ 642]                 blk.38.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 388/ 642]            blk.38.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 389/ 642]            blk.38.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 390/ 642]                 blk.38.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 391/ 642]                 blk.38.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 392/ 642]            blk.39.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 393/ 642]            blk.39.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 394/ 642]              blk.39.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 395/ 642]               blk.39.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 396/ 642]               blk.39.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 397/ 642]                 blk.39.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 398/ 642]                 blk.39.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 399/ 642]            blk.39.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 400/ 642]                 blk.39.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 401/ 642]                 blk.39.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 402/ 642]               blk.40.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 403/ 642]            blk.40.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 404/ 642]                 blk.40.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 405/ 642]            blk.40.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 406/ 642]            blk.40.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 407/ 642]                 blk.40.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 408/ 642]                 blk.40.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 409/ 642]              blk.40.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 410/ 642]               blk.40.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 411/ 642]                 blk.40.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 412/ 642]              blk.41.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 413/ 642]               blk.41.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 414/ 642]               blk.41.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 415/ 642]                 blk.41.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 416/ 642]            blk.41.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 417/ 642]                 blk.41.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 418/ 642]            blk.41.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 419/ 642]            blk.41.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 420/ 642]                 blk.41.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 421/ 642]                 blk.41.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 422/ 642]            blk.42.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 423/ 642]            blk.42.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 424/ 642]              blk.42.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 425/ 642]               blk.42.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 426/ 642]               blk.42.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 427/ 642]                 blk.42.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 428/ 642]                 blk.42.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 429/ 642]            blk.42.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 430/ 642]                 blk.42.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 431/ 642]                 blk.42.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 432/ 642]               blk.43.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 433/ 642]            blk.43.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 434/ 642]                 blk.43.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 435/ 642]            blk.43.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 436/ 642]            blk.43.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 437/ 642]                 blk.43.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 438/ 642]                 blk.43.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 439/ 642]              blk.43.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 440/ 642]               blk.43.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 441/ 642]                 blk.43.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 442/ 642]              blk.44.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 443/ 642]               blk.44.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 444/ 642]               blk.44.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 445/ 642]                 blk.44.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 446/ 642]            blk.44.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 447/ 642]                 blk.44.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 448/ 642]            blk.44.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 449/ 642]            blk.44.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 450/ 642]                 blk.44.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 451/ 642]                 blk.44.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 452/ 642]            blk.45.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 453/ 642]            blk.45.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 454/ 642]              blk.45.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 455/ 642]               blk.45.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 456/ 642]               blk.45.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 457/ 642]                 blk.45.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 458/ 642]                 blk.45.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 459/ 642]            blk.45.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 460/ 642]                 blk.45.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 461/ 642]                 blk.45.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 462/ 642]               blk.46.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 463/ 642]            blk.46.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 464/ 642]                 blk.46.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 465/ 642]            blk.46.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 466/ 642]            blk.46.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 467/ 642]                 blk.46.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 468/ 642]                 blk.46.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 469/ 642]              blk.46.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 470/ 642]               blk.46.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 471/ 642]                 blk.46.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 472/ 642]              blk.47.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 473/ 642]               blk.47.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 474/ 642]               blk.47.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 475/ 642]                 blk.47.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 476/ 642]            blk.47.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 477/ 642]                 blk.47.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 478/ 642]            blk.47.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 479/ 642]            blk.47.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 480/ 642]                 blk.47.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 481/ 642]                 blk.47.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 482/ 642]            blk.48.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 483/ 642]            blk.48.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 484/ 642]              blk.48.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 485/ 642]               blk.48.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 486/ 642]               blk.48.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 487/ 642]                 blk.48.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 488/ 642]                 blk.48.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 489/ 642]            blk.48.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 490/ 642]                 blk.48.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 491/ 642]                 blk.48.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 492/ 642]               blk.49.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 493/ 642]            blk.49.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 494/ 642]                 blk.49.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 495/ 642]            blk.49.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 496/ 642]            blk.49.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 497/ 642]                 blk.49.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 498/ 642]                 blk.49.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 499/ 642]              blk.49.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 500/ 642]               blk.49.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 501/ 642]                 blk.49.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 502/ 642]              blk.50.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 503/ 642]               blk.50.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 504/ 642]               blk.50.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 505/ 642]                 blk.50.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 506/ 642]            blk.50.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 507/ 642]                 blk.50.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 508/ 642]            blk.50.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 509/ 642]            blk.50.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 510/ 642]                 blk.50.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 511/ 642]                 blk.50.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 512/ 642]            blk.51.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 513/ 642]            blk.51.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 514/ 642]              blk.51.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 515/ 642]               blk.51.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 516/ 642]               blk.51.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 517/ 642]                 blk.51.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 518/ 642]                 blk.51.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 519/ 642]            blk.51.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 520/ 642]                 blk.51.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 521/ 642]                 blk.51.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 522/ 642]               blk.52.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 523/ 642]            blk.52.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 524/ 642]                 blk.52.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 525/ 642]            blk.52.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 526/ 642]            blk.52.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 527/ 642]                 blk.52.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 528/ 642]                 blk.52.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 529/ 642]              blk.52.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 530/ 642]               blk.52.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 531/ 642]                 blk.52.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 532/ 642]              blk.53.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 533/ 642]               blk.53.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 534/ 642]               blk.53.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 535/ 642]                 blk.53.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 536/ 642]            blk.53.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 537/ 642]                 blk.53.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 538/ 642]            blk.53.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 539/ 642]            blk.53.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 540/ 642]                 blk.53.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 541/ 642]                 blk.53.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 542/ 642]            blk.54.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 543/ 642]            blk.54.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 544/ 642]              blk.54.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 545/ 642]               blk.54.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 546/ 642]               blk.54.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 547/ 642]                 blk.54.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 548/ 642]                 blk.54.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 549/ 642]            blk.54.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 550/ 642]                 blk.54.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 551/ 642]                 blk.54.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 552/ 642]               blk.55.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 553/ 642]            blk.55.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 554/ 642]                 blk.55.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 555/ 642]            blk.55.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 556/ 642]            blk.55.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 557/ 642]                 blk.55.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 558/ 642]                 blk.55.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 559/ 642]              blk.55.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 560/ 642]               blk.55.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 561/ 642]                 blk.55.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 562/ 642]              blk.56.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 563/ 642]               blk.56.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 564/ 642]               blk.56.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 565/ 642]                 blk.56.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 566/ 642]            blk.56.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 567/ 642]                 blk.56.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 568/ 642]            blk.56.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 569/ 642]            blk.56.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 570/ 642]                 blk.56.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 571/ 642]                 blk.56.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 572/ 642]            blk.57.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 573/ 642]            blk.57.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 574/ 642]              blk.57.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 575/ 642]               blk.57.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 576/ 642]               blk.57.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 577/ 642]                 blk.57.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 578/ 642]                 blk.57.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 579/ 642]            blk.57.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 580/ 642]                 blk.57.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 581/ 642]                 blk.57.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 582/ 642]               blk.58.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 583/ 642]            blk.58.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 584/ 642]                 blk.58.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 585/ 642]            blk.58.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 586/ 642]            blk.58.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 587/ 642]                 blk.58.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 588/ 642]                 blk.58.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 589/ 642]              blk.58.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 590/ 642]               blk.58.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 591/ 642]                 blk.58.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 592/ 642]              blk.59.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 593/ 642]               blk.59.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 594/ 642]               blk.59.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 595/ 642]                 blk.59.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 596/ 642]            blk.59.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 597/ 642]                 blk.59.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 598/ 642]            blk.59.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 599/ 642]            blk.59.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 600/ 642]                 blk.59.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 601/ 642]                 blk.59.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 602/ 642]            blk.60.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 603/ 642]            blk.60.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 604/ 642]              blk.60.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 605/ 642]               blk.60.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 606/ 642]               blk.60.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 607/ 642]                 blk.60.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 608/ 642]                 blk.60.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 609/ 642]            blk.60.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 610/ 642]                 blk.60.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 611/ 642]                 blk.60.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 612/ 642]               blk.61.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 613/ 642]            blk.61.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 614/ 642]                 blk.61.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 615/ 642]            blk.61.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 616/ 642]            blk.61.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 617/ 642]                 blk.61.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 618/ 642]                 blk.61.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 619/ 642]              blk.61.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 620/ 642]               blk.61.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 621/ 642]                 blk.61.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 622/ 642]              blk.62.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 623/ 642]               blk.62.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 624/ 642]               blk.62.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 625/ 642]                 blk.62.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 626/ 642]            blk.62.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 627/ 642]                 blk.62.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 628/ 642]            blk.62.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 629/ 642]            blk.62.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 630/ 642]                 blk.62.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 631/ 642]                 blk.62.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 632/ 642]            blk.63.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 633/ 642]            blk.63.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 634/ 642]              blk.63.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 635/ 642]               blk.63.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 636/ 642]               blk.63.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 637/ 642]                 blk.63.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 638/ 642]                 blk.63.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 639/ 642]            blk.63.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 640/ 642]                 blk.63.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 641/ 642]                 blk.63.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 642/ 642]                   output_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
llama_model_quantize_internal: model size  = 198004.67 MB
llama_model_quantize_internal: quant size  = 48607.44 MB
llama_model_quantize_internal: WARNING: 128 of 577 tensor(s) required fallback quantization

main: quantize time = 338672.88 ms
main:    total time = 338672.88 ms
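For context on the fallback messages above: the 128 tensors that required fallback quantization are exactly the new attn_q_norm and attn_k_norm weights (2 per layer × 64 layers). K-quants such as q3_K pack weights in blocks of 256 values per row, and these norm tensors have a row size of only 128, so the quantizer switches them to iq4_nl, which uses a smaller block size; every other tensor has a row size of 12288 or 33792, both multiples of 256, so they are unaffected. The overall result (48607 MB from a 198005 MB f16 model, roughly a quarter of the size, on the order of 4 bits per weight) barely changes because the norm tensors are tiny. Below is a minimal sketch of the divisibility check, using a hypothetical pick_quant_type helper rather than the real llama_tensor_get_type logic:

    // Sketch only (not the actual llama.cpp code): why the 128-column norm
    // tensors fall back while the 12288-column weights keep q3_K.
    #include <cstdint>
    #include <cstdio>

    static const int64_t QK_K = 256;  // block size required by the k-quants

    // hypothetical helper mirroring the fallback seen in the log above
    static const char * pick_quant_type(int64_t n_cols, const char * requested) {
        if (n_cols % QK_K != 0) {
            return "iq4_nl";          // fallback type chosen by the quantizer
        }
        return requested;
    }

    int main() {
        printf("attn_q_norm [  128 x    96] -> %s\n", pick_quant_type(  128, "q3_K"));
        printf("attn_q      [12288 x 12288] -> %s\n", pick_quant_type(12288, "q3_K"));
        return 0;
    }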

@N8python

N8python commented Apr 4, 2024

So they work now?


github-actions bot commented Apr 4, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 499 iterations 🚀

Expand details (performance-related PRs only)
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=9414.96ms p(90)=26375.41ms fails=0, finish reason: stop=499 truncated=0
  • Prompt processing (pp): avg=243.1tk/s p(90)=734.8tk/s total=198.12tk/s
  • Token generation (tg): avg=97.59tk/s p(90)=260.19tk/s total=130.44tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=ec613b856c91d219b6d6efb7852e286b862fe797

prompt_tokens_seconds

[chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 499 iterations; y-axis: llamacpp:prompt_tokens_seconds]

predicted_tokens_seconds

[chart: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 499 iterations; y-axis: llamacpp:predicted_tokens_seconds]

Details

kv_cache_usage_ratio

More
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 499 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1712318663 --> 1712319289
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.07, 0.07, 0.07, 0.07, 0.07, 0.28, 0.28, 0.28, 0.28, 0.28, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.18, 0.18, 0.18, 0.18, 0.18, 0.1, 0.1, 0.1, 0.1, 0.1, 0.09, 0.09, 0.09, 0.09, 0.09, 0.13, 0.13, 0.13, 0.13, 0.13, 0.23, 0.23, 0.23, 0.23, 0.23, 0.17, 0.17, 0.17, 0.17, 0.17, 0.33, 0.33, 0.33, 0.33, 0.33, 0.19, 0.19, 0.19, 0.19, 0.19, 0.15, 0.15, 0.15, 0.15, 0.15, 0.23, 0.23, 0.23, 0.23, 0.23, 0.2, 0.2, 0.2, 0.2, 0.2, 0.13, 0.13, 0.13, 0.13, 0.13, 0.16, 0.16, 0.16, 0.16, 0.16, 0.13, 0.13, 0.13, 0.13, 0.13, 0.27, 0.27, 0.27, 0.27, 0.27, 0.23, 0.23, 0.23, 0.23, 0.23, 0.26, 0.26, 0.26, 0.26, 0.26, 0.29, 0.29, 0.29, 0.29, 0.29, 0.16, 0.16, 0.16, 0.16, 0.16, 0.15, 0.15, 0.15, 0.15, 0.15, 0.26, 0.26, 0.26, 0.26, 0.26, 0.11, 0.11, 0.11, 0.11, 0.11, 0.15, 0.15, 0.15, 0.15, 0.15, 0.14, 0.14, 0.14, 0.14, 0.14, 0.27, 0.27, 0.27, 0.27, 0.27, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.13, 0.13, 0.13, 0.13, 0.13, 0.12, 0.12, 0.12, 0.12, 0.12, 0.13, 0.13, 0.13, 0.13, 0.13, 0.15, 0.15, 0.15, 0.15, 0.15, 0.16, 0.16, 0.16, 0.16, 0.16, 0.18, 0.18, 0.18, 0.18, 0.18, 0.1, 0.1, 0.1, 0.1, 0.1, 0.25, 0.25, 0.25, 0.25, 0.25, 0.22, 0.22, 0.22, 0.22, 0.22, 0.23, 0.23, 0.23, 0.23, 0.23, 0.2, 0.2, 0.2, 0.2, 0.2, 0.12, 0.12, 0.12, 0.12, 0.12, 0.1, 0.1, 0.1, 0.1, 0.1, 0.14, 0.14, 0.14, 0.14, 0.14, 0.27, 0.27, 0.27, 0.27, 0.27, 0.42, 0.42, 0.42, 0.42, 0.42, 0.53, 0.53, 0.53, 0.53, 0.53, 0.39, 0.39, 0.39, 0.39, 0.39, 0.36, 0.36, 0.36, 0.36, 0.36, 0.31, 0.31, 0.31, 0.31, 0.31, 0.14, 0.14, 0.14, 0.14, 0.14, 0.11, 0.11, 0.11, 0.11, 0.11, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2, 0.32, 0.32, 0.32, 0.32, 0.32, 0.16, 0.16, 0.16, 0.16, 0.16, 0.21, 0.21, 0.21, 0.21, 0.21, 0.13, 0.13, 0.13, 0.13, 0.13, 0.14, 0.14, 0.14, 0.14]
                    
Loading
requests_processing
More
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 499 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1712318663 --> 1712319289
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0]
                    
Loading

@slaren
Copy link
Collaborator

slaren commented Apr 4, 2024

The norm layer is probably being quantized if it is exported as a 2d tensor, but it needs to be f32. Exporting it as 1d (reshaped) may work.
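In the conversion script, that could look roughly like the numpy sketch below. It assumes the per-head q/k norm weights arrive as 2D [n_head, head_dim] tensors and that their names end in q_norm.weight / k_norm.weight; this is an illustration, not the exact change in this PR.

import numpy as np

def export_norm_as_1d_f32(name: str, data: np.ndarray) -> np.ndarray:
    # If this looks like one of the new per-head q/k norm weights (assumed to
    # arrive as a 2D [n_head, head_dim] tensor), flatten it to 1D and keep it
    # in float32 so the later quantization pass leaves it untouched.
    if name.endswith(("q_norm.weight", "k_norm.weight")) and data.ndim == 2:
        return data.reshape(-1).astype(np.float32)
    return data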

@N8python
Copy link

N8python commented Apr 4, 2024

Does the inference appear to be sane?

@Noeda
Copy link
Contributor

Noeda commented Apr 4, 2024

Does the inference appear to be sane?

No. I don't get crashes but output looks random. Something is off. I'm using functionally identical code to the PR here.

I tried transposing the tensors; I also worked around the norm datatype problem by hacking quantization to keep them at f16 or f32. Neither seems to help.

I started studying the transformers code again and comparing it with llama.cpp. I think before the plus model, CohereLayerNorm was never used with 2D tensors, and I think slapping on the basic llm_build_norm won't work out of the box. Need to figure out how it is diverging.
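For anyone following along, here is my rough reading of what the 2D CohereLayerNorm does, as a numpy sketch. It assumes the weight has shape [n_head, head_dim] and that normalization happens over head_dim separately for each head; treat the shapes and semantics as my assumptions, not verified facts.

import numpy as np

def per_head_layernorm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # x: [n_tokens, n_head, head_dim], weight: [n_head, head_dim] (assumed shapes).
    # Normalize each head's slice over head_dim, then scale by that head's weight.
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * weight

# toy usage
x = np.random.randn(4, 8, 16).astype(np.float32)  # 4 tokens, 8 heads, head_dim 16
w = np.ones((8, 16), dtype=np.float32)
print(per_head_layernorm(x, w).shape)  # (4, 8, 16)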

"hello world" prompt:

Screenshot 2024-04-04 at 3 20 07 PM

@acanis acanis mentioned this pull request Apr 4, 2024
@RefractAI
Copy link
Contributor Author

RefractAI commented Apr 4, 2024

The norm layer is probably being quantized if it is exported as a 2d tensor, but it needs to be f32. Exporting it as 1d (reshaped) may work.

I am now exporting it as 1D f32 in the latest commit, and the issue remains: nonsense output, because the LayerNorm should be 2D, not 1D.

@RefractAI RefractAI marked this pull request as draft April 5, 2024 00:48
@Noeda
Copy link
Contributor

Noeda commented Apr 5, 2024

In the current implementation it seems like most of the values in the computation graph are zero. (Also, I learned how to track intermediate computation values more systematically.) It's very different from the old Command-R model, even before it hits the code path that uses the new norms.

Command-R+ (new model; 5 first and 5 last values from the first intermediate computed values)

tensor: 0x108178300 (inp_embd)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x108178300 (inp_embd)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x1081787b0 (norm-0)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x1081787b0 (norm-0)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x108178940 (attn_norm-0)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x108178940 (attn_norm-0)
0 0 0 0 0  ... 0 0 0 0 0

Command-R (old model, same tensors)

tensor: 0x108148300 (inp_embd)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x108148300 (inp_embd)
-0.000882447 0.00211787 0.000705957 -0.00141191 0.00123543  ... -0.0255127 -0.00927734 0.0742188 -0.00695801 0.00927734
tensor: 0x1081487b0 (norm-0)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x1081487b0 (norm-0)
-0.0881477 0.0831225 0.0025248 -0.118372 0.032749  ... -1.10396 -0.398331 3.23062 -0.297527 0.408103
tensor: 0x108148940 (attn_norm-0)
-0.0881477 0.0831225 0.0025248 -0.118372 0.032749  ... -1.10396 -0.398331 3.23062 -0.297527 0.408103
tensor: 0x108148940 (attn_norm-0)
-0.00680045 0.00490344 0.000171977 -0.00778116 0.00199284  ... -0.0504679 -0.0273999 0.177759 -0.0168158 0.0285453

It's not zero across the board, but it looks fairly broken. A bit surprising, since it isn't that different a model.

The quants I'm working with have suspiciously good compression rates with zstd. They don't look like entire zeroes in hexedit, but the zstd compression ratio looks pretty anomalous for the first few gigabytes (only a few percent, then jumping up). Old Cohere models and other similar quants get barely any compression no matter where you are in the file.

Maybe worth checking if the GGUF converter isn't throwing away data somehow.

Although possibly I have corrupted files. Checksums from HF seem to match....hrm. It would be annoying if I've had trouble only because of corrupted files.

Edit: I don't get the zeroes in intermediate computations with f16. It's just so big the test workflow takes forever. I wonder if there might be another quant bug with tensors being larger than 2**31-1 like we found with the previous model, but more subtle this time.

@Noeda
Copy link
Contributor

Noeda commented Apr 5, 2024

I verified with ddrescue that there are large blocks of zeroes in the quantized files, including the Q4 and Q8. I now suspect there is possibly an overflow bug of some kind in quantizing somewhere, but of a different nature than last time.

F16 also looks broken but broken in a different way (the Q4/Q8 were random with random symbols; this at least repeats words):

Screenshot 2024-04-04 at 6 55 48 PM

@Carolinabanana are you able to verify if your code branch has the same issue? That is, any quantized files have big blocks of zeroes. (Or anyone else who has .ggufs for that matter).

Some ways to test: 1) you can run zstd --compress --keep <path to .gguf> and check whether the compressed size ends up below ~95% of the original. Or 2) you can run:

ddrescue -b 1000000 --generate-mode /dev/zero <path to .gguf> report.txt

This command detects long runs of zeroes in a file. You can interpret the output:

$ cat report.txt
# Mapfile. Created by GNU ddrescue version 1.28
# Command line: ddrescue -b 1000000 --generate-mode /dev/zero ./commandr_plus_Q4_K.gguf mapfile
# Start time:   2024-04-04 18:44:43
# Current time: 2024-04-04 18:44:48
# Finished
# current_pos  current_status  current_pass
0xE9C033CC0     +               1
#      pos        size  status
0x00000000  0x00A66600  +
0x00A66600  0xC737FE00  ?    <-------- this here says there's a block of zeroes around 0xC737FE00 bytes long (~3 gigs).
0xC7DE6400  0xDD430B140  +

A good file from ddrescue has just one line and it didn't "rescue" anything.
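If you don't have ddrescue handy, a small Python script can do a similar check by counting blocks that are entirely zero. A minimal sketch, just a rough indicator rather than a ddrescue equivalent:

import sys

def count_zero_blocks(path: str, block_size: int = 1 << 20) -> None:
    # Count fixed-size blocks that are entirely zero bytes; a healthy quant
    # should have close to none of them.
    zero_block = bytes(block_size)
    total = zeros = 0
    with open(path, "rb") as f:
        while block := f.read(block_size):
            total += 1
            if block == zero_block[: len(block)]:
                zeros += 1
    print(f"{zeros}/{total} blocks of {block_size} bytes are all zeroes")

if __name__ == "__main__":
    count_zero_blocks(sys.argv[1])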

Seems to be at the beginning of the file. The embedding tensors are broken? Both Q4 and Q8 have a big block of zeroes there. That was the issue for the plain Command-R but it broke much more visibly that time. Hmm.

Edit: Found a very suspicious const int n = nrows * n_per_row; in ggml.c that would overflow with these tensors. Will open a separate PR if I can confirm.
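For the record, this is the kind of wraparound I mean: the token embedding here is 12288 x 256000 = 3145728000 elements, which does not fit in a signed 32-bit int. A small numpy sketch mimicking the C arithmetic, not the actual ggml code:

import numpy as np

# np.int32 mimics C's 32-bit int: the product wraps around to a negative value,
# while doing the same multiplication in 64 bits gives the real element count.
nrows = np.array([256000], dtype=np.int32)
n_per_row = np.array([12288], dtype=np.int32)
print(int((nrows * n_per_row)[0]))                   # wrapped: -1149239296
print(int((nrows.astype(np.int64) * n_per_row)[0]))  # correct: 3145728000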

@dranger003
Copy link
Contributor

# Mapfile. Created by GNU ddrescue version 1.28
# Command line: ddrescue -b 1000000 --generate-mode /dev/zero /md0/models/CohereForAI/c4ai-command-r-plus/ggml-c4ai-command-r-plus-q8_0.gguf mapfile
# Start time:   2024-04-04 22:02:17
# Current time: 2024-04-04 22:02:42
# Finished
# current_pos  current_status  current_pass
0x19AF3A0E80     +               1
#      pos        size  status
0x00000000  0x00A7D8C0  +
0x00A7D8C0  0xC732DF80  ?
0xC7DAB840  0x18E76868E0  +

@Noeda
Copy link
Contributor

Noeda commented Apr 5, 2024

@dranger003 Ah thanks! Yeah, that indicates that you also have a zero hole in your file. Okay good, so it's not just me. I think I may have found the part that overflows. It's the same tensor as in the last Command-R model, which also had an overflow, but it overflows in a different place this time. Maybe the tensor is even larger this time.

@Noeda
Copy link
Contributor

Noeda commented Apr 5, 2024

I'm getting coherent text now from Q8 after overflow fixes and some clean ups (non-colored text is AI output, and the stuff before is my prompt).

Screenshot 2024-04-04 at 8 24 23 PM

I'm doing a few cleanups, then closing for the day and writing some notes in case anyone wants to use my code. The code is not really something you'd want to merge because it has my graveyard of debugging code and other crap.

Not very fast. ~1 second per token on a Mac Studio. It's big, so that makes sense. Haven't tried the other quants.

Screenshot 2024-04-04 at 8 25 38 PM

@N8python
Copy link

N8python commented Apr 5, 2024

Congrats on getting it working :D :D :D. My ballpark for an M3 Max (my device) at Q8 would be 2-3 tok/sec (by the same logic, 3-5 tok/sec at Q3 for a 120B)... what Mac Studio do you have? Maybe there's a slower part of the inference code?

@Noeda
Copy link
Contributor

Noeda commented Apr 8, 2024

Would you care uploading the Q4_K to huggingface?

Thnx for your work btw

I'll be honest; I straight up might not have time. I have some possibly high-stakes interviews this week and will spend time prepping for those instead, starting right about when I finish typing this comment :D I'm not sure, but @dranger003 may have a working Q4_K_M uploaded in an older revision of their HF repo if you look through the past commits.

I'm testing on Metal and Q4_0 is broken - produces garbage and nan during perplexity:

# garbage
./quantize ./models/command-r-plus/ggml-model-f16.gguf ./models/command-r-plus/ggml-model-q4_0.gguf q4_0

# works (requires #6541)
./quantize --token-embedding-type q8_0 ./models/command-r-plus/ggml-model-f16.gguf ./models/command-r-plus/ggml-model-q4_0.gguf q4_0

Similar observations as @Noeda noted earlier, though Q4_0 + Q6 token_embd tensor does not work for me. Probably more integer overflow problems in the Metal backend. Looking into this

Just rechecked my setup:

quantize, gguf-dump.py, main and perplexity from Q4_0
# From history:
./bin/quantize /Volumes/T9/rplus_banana_f16.gguf /Volumes/T9/rplus_banana_Q4_0.gguf Q4_0

# token embd
(venv) mikkojuola@Mikkos-Mac-Studio ~/realllama/banana/llama.cpp> python gguf-py/scripts/gguf-dump.py /Volumes/T9/rplus_banana_Q4_0.gguf|grep token
     17: STRING     |        1 | tokenizer.ggml.model = 'gpt2'
     18: [STRING]   |   256000 | tokenizer.ggml.tokens
     19: [INT32]    |   256000 | tokenizer.ggml.token_type
     20: [STRING]   |   253333 | tokenizer.ggml.merges
     21: UINT32     |        1 | tokenizer.ggml.bos_token_id = 5
     22: UINT32     |        1 | tokenizer.ggml.eos_token_id = 255001
     23: UINT32     |        1 | tokenizer.ggml.padding_token_id = 0
     24: BOOL       |        1 | tokenizer.ggml.add_bos_token = True
     25: BOOL       |        1 | tokenizer.ggml.add_eos_token = False
      1: 3145728000 | 12288, 256000,     1,     1 | Q6_K    | token_embd.weight

# main test
./bin/main --model /Volumes/T9/rplus_banana_Q4_0.gguf --prompt "hello world! my name is" --top-k 1
<omitted log output>
hello world! my name is kate and i am a 20 year old college student. i am a huge fan of the twilight series and i am currently working on my first fanfic. i am a huge fan of the twilight series and i am currently working on my

# perplexity
$ mikkojuola@Mikkos-Mac-Studio ~/realllama/banana/llama.cpp/build> ./bin/perplexity --model /Volumes/T9/rplus_banana_Q4_0.gguf -f ~/llama.cpp/ci/wikitext-2-raw/wiki.test.raw -ngl 256
<omitted log output>
perplexity: calculating perplexity over 560 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 21.87 seconds per pass - ETA 51.02 minutes
[1]3.4319,[2]4.4709,[3]3.8271,[4]3.9028,[5]3.7290,[6]3.8120,[7]3.9015,[8]4.1230,^C

This is a case of "works for me". Wondering what could be different.

MacOS version and snippet from `llama.cpp` loading itself when using Q4_0.
$ sw_vers -productVersion
14.4.1
$ xcodebuild -version
Xcode 15.3

$
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 75000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: loading '/Users/mikkojuola/realllama/banana/llama.cpp/build/bin/default.metallib'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 193986.56 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   512.00 MiB, (56981.12 / 185000.00)
llama_kv_cache_init:      Metal KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     3.91 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   524.00 MiB, (57505.12 / 185000.00)
llama_new_context_with_model:      Metal compute buffer size =   524.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    28.01 MiB
llama_new_context_with_model: graph nodes  = 2312
llama_new_context_with_model: graph splits = 2

@dranger003
Copy link
Contributor

@teis-e Why not use IQ4_XS?

@teis-e
Copy link

teis-e commented Apr 8, 2024

@teis-e Why not use IQ4_XS?

I saw your HuggingFace, but I don't understand, there are 2 files for IQ4_XS. Which one do I download?

@dranger003
Copy link
Contributor

dranger003 commented Apr 8, 2024

@teis-e Why not use IQ4_XS?

I saw your HuggingFace, but I don't understand, there are 2 files for IQ4_XS. Which one do I download?

iq4_xs-00001-of-00002 and iq4_xs-00002-of-00002 because HF only allows for max 50G per file.

@teis-e
Copy link

teis-e commented Apr 8, 2024

@teis-e Why not use IQ4_XS?

I saw your HuggingFace, but I don't understand, there are 2 files for IQ4_XS. Which one do I download?

iq4_xs-00001-of-00002 and iq4_xs-00002-of-00002 because HF only allows for max 50G per file.

And then I merge them? ./gguf-split --merge

How exactly?

@candre23
Copy link

candre23 commented Apr 8, 2024

@teis-e Why not use IQ4_XS?

I saw your HuggingFace, but I don't understand, there are 2 files for IQ4_XS. Which one do I download?

iq4_xs-00001-of-00002 and iq4_xs-00002-of-00002 because HF only allows for max 50G per file.

And then I merge them? ./gguf-split --merge

How exactly?

You don't need to merge them at all. Download both files and just point LCPP at the first one. It will load both parts properly.

If you want to merge them though, it's

gguf-split --merge /location/of/first-part.gguf /location/of/merged.gguf

@dranger003
Copy link
Contributor

copy /b file_name1 + file_name2 file_name_final

This is incorrect, you need to use gguf-split if you want to merge (which isn't needed to run the model).

@candre23
Copy link

candre23 commented Apr 8, 2024

copy /b file_name1 + file_name2 file_name_final

Models that have been split with the now-built-in splitting utility can't simply be concatenated. You can either leave them in multiple pieces and LCPP will load them as-is, or you can use the utility to recombine the pieces into a single large GGUF.

@teis-e
Copy link

teis-e commented Apr 8, 2024

I got it now. Thnx for all the answers.

I'm now moving to the next step using it on LocalAI

Has anybody got that working?

@dranger003
Copy link
Contributor

dranger003 commented Apr 8, 2024

Anyone able to run perplexity using CUDA? ./build/bin/perplexity -b 512 -ngl 65 -f /sdc1/models/wikitext-2-raw/wiki.test.raw -m /md0/models/CohereForAI/c4ai-command-r-plus/ggml-c4ai-command-r-plus-104b-iq4_xs.gguf

I have perplexity working again for this model using CUDA. I pushed the changes here dranger003@0bcfc87 and here dranger003@835d702

| Quantization | Model size (GiB) | Perplexity | Delta (FP16) |
| --- | --- | --- | --- |
| IQ1_S | 21.59 | 8.2530 +/- 0.05234 | 88.23% |
| IQ1_M | 23.49 | 7.4267 +/- 0.04646 | 69.39% |
| IQ2_XXS | 26.65 | 6.1138 +/- 0.03683 | 39.44% |
| IQ2_XS | 29.46 | 5.6489 +/- 0.03309 | 28.84% |
| IQ2_S | 31.04 | 5.5187 +/- 0.03210 | 25.87% |
| IQ2_M | 33.56 | 5.1930 +/- 0.02989 | 18.44% |
| IQ3_XXS | 37.87 | 4.8258 +/- 0.02764 | 10.07% |
| IQ3_XS | 40.61 | 4.7263 +/- 0.02665 | 7.80% |
| IQ3_S | 42.80 | 4.6321 +/- 0.02600 | 5.65% |
| IQ3_M | 44.41 | 4.6202 +/- 0.02585 | 5.38% |
| Q3_K_M | 47.48 | 4.5770 +/- 0.02609 | 4.39% |
| Q3_K_L | 51.60 | 4.5568 +/- 0.02594 | 3.93% |
| IQ4_XS | 52.34 | 4.4428 +/- 0.02508 | 1.33% |
| Q5_K_S | 66.87 | 4.3833 +/- 0.02466 | -0.03% |
| Q6_K | 79.32 | 4.3672 +/- 0.02455 | -0.39% |
| Q8_0 | 102.74 | 4.3858 +/- 0.02469 | 0.03% |
| FP16 | 193.38 | 4.3845 +/- 0.02468 | - |
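For reference, the Delta (FP16) column lines up with the relative perplexity change versus the FP16 baseline; a quick sketch using the IQ4_XS row:

ppl_fp16 = 4.3845
ppl_iq4_xs = 4.4428
print(f"{(ppl_iq4_xs - ppl_fp16) / ppl_fp16 * 100:.2f}%")  # prints 1.33%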

ggml-c4ai-command-r-plus-104b-ppl

EDIT: @Carolinabanana I'm running PPL on all the quants to test the code; it looks like we'll need more updates, and I'll continue to commit fixes as I find them.

@ghchris2021
Copy link

copy /b file_name1 + file_name2 file_name_final

Models that have been split with the now-built-in splitting utility can't simply be concatenated. You can either leave them in multiple pieces and LCPP will load them as-is, or you can use the utility to recombine the pieces into a single large GGUF.

The above comment caught my eye. Forgive a simplistic question but I've long seen "split" files which are for GGUF or any other model formats on, e.g. hf, and usually in the file name there's not a lot of information / format consistency other than typically something along the lines of ggml-c4ai-command-r-plus-104b-f16-00005-of-00005.gguf or pytorch_model-00001-of-00004.bin.

So if there's no reliably consistent nomenclature as to file naming / "extension" to indicate "specially split" files, and many files are just "ordinarily split" by sharding without other transformation, how should a user tell whether the gguf file can be trivially concatenated or whether it's been somehow altered / wrapped with headers / trailers / whatever and may / must be gguf-split processed or left sharded and loaded that way?

I assume there's some kind of identifiable header / magic flag ("/usr/bin/file") or a command-line option to gguf-using utilities that can check the format / integrity / usability of sharded / non-sharded files?

@phymbert
Copy link
Collaborator

phymbert commented Apr 9, 2024

how should a user tell whether the gguf file can be trivially concatenated or whether it's been somehow altered / wrapped with headers / trailers / whatever and may / must be gguf-split processed or left sharded and loaded that way?

The format is specified in llama_split_path in llama.h:

llama.cpp/llama.h

Lines 1038 to 1041 in cc4a954

/// @details Build a split GGUF final path for this chunk.
/// llama_split_path(split_path, sizeof(split_path), "/models/ggml-model-q4_0", 2, 4) => split_path = "/models/ggml-model-q4_0-00002-of-00004.gguf"
// Returns the split_path length.
LLAMA_API int llama_split_path(char * split_path, size_t maxlen, const char * path_prefix, int split_no, int split_count);
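So the shards follow a fixed "-%05d-of-%05d.gguf" suffix. A small Python helper mirroring that naming convention, purely as an illustration (the C API above is the authoritative implementation):

def split_path(path_prefix: str, split_no: int, split_count: int) -> str:
    # Mirrors llama_split_path's naming, e.g.
    # ("/models/ggml-model-q4_0", 2, 4) -> "/models/ggml-model-q4_0-00002-of-00004.gguf"
    return f"{path_prefix}-{split_no:05d}-of-{split_count:05d}.gguf"

print(split_path("/models/ggml-model-q4_0", 2, 4))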

I have tried to summarize all here:

Feel free to add an improvement request with the split label:
https://github.com/ggerganov/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3Asplit

@ggerganov
Copy link
Owner

ggerganov commented Apr 9, 2024

I'm testing on Metal and Q4_0 is broken - produces garbage and nan during perplexity:

# garbage
./quantize ./models/command-r-plus/ggml-model-f16.gguf ./models/command-r-plus/ggml-model-q4_0.gguf q4_0

# works (requires #6541)
./quantize --token-embedding-type q8_0 ./models/command-r-plus/ggml-model-f16.gguf ./models/command-r-plus/ggml-model-q4_0.gguf q4_0

Similar observations as @Noeda noted earlier, though Q4_0 + Q6 token_embd tensor does not work for me. Probably more integer overflow problems in the Metal backend. Looking into this

The problem was on my end - somehow I had the QK normalization tensors quantized to Q4_0 instead of keeping them in F32. I redid the conversion and quantization using the latest branch and I no longer observe issues with Metal. Q4_0 + F16 token_embd tensor also works correctly

I wouldn't be surprised if we have integer overflows in the Metal kernels (I'm actually more surprised that we don't 😄 ). We'll fix those as they occur

@ggerganov ggerganov merged commit 5dc9dd7 into ggerganov:master Apr 9, 2024
53 of 59 checks passed
@@ -160,7 +160,7 @@ def write_tensors(self):
data = data.astype(np.float32)

# TODO: Why cant we use these float16 as-is? There should be not reason to store float16 as float32
if self.ftype == 1 and data_dtype == np.float16 and n_dims == 1:
if self.ftype == 1 and data_dtype == np.float16 and (n_dims == 1 or new_name.endswith("_norm.weight")):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to update the comment

@@ -1225,7 +1225,7 @@ static void ggml_cuda_op_mul_mat_cublas(

// the main device has a larger memory buffer to hold the results from all GPUs
// ldc == nrows of the matrix that cuBLAS writes into
int ldc = id == ctx.device ? ne0 : row_diff;
int64_t ldc = id == ctx.device ? ne0 : row_diff;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to update the PR description to summarize why we upcasted all int params to int64 in this context.

Copy link
Contributor

@dranger003 dranger003 Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phymbert I reverted that one in dranger003@835d702 but it looks like the PR got merged before it could be pulled. My level of knowledge of these is nowhere near on par with those who created the code, so I definitely rely on your reviews. I looked at some of the values through the debugger, but since we have so many overflowing I had to change them in batches, which means I most likely changed some that don't need to be changed. Hopefully this makes some sense. I can submit another PR to master with that last commit; otherwise, perplexity was still broken using CUDA for this model without that commit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, please raise it with @ggerganov as I am not up to speed on the CommandR+ specifics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I opened a PR as a follow-up (#6563)

@kalomaze
Copy link
Contributor

kalomaze commented Apr 9, 2024

@dranger003 is there a chance you could upload an imatrix q4_K_S quant? (and/or imatrix q3_K_L)
CPU decoding is extremely slow on IQ quants, and it might be keeping partial offloading from being feasible speed-wise on 2x3090 setups :/
I can get around ~90% of the layers, but those last few are bringing down my t/s pretty dramatically.
Also heard someone claim the 5 bit model was 2x faster compared to IQ3_M.

EDIT: Apparently IQ3 is quite slow, while q4_K_S is equivalent to IQ4_XS.
It may be worth it to add q3_K_L and q3_K_M.

@dranger003
Copy link
Contributor

dranger003 commented Apr 9, 2024

EDIT: Apparently IQ3 is quite slow, while q4_K_S is equivalent to IQ4_XS.
It may be worth it to add q3_K_L and q3_K_M.

@kalomaze Sure, I'll see what I can do.

EDIT: I uploaded the new quants, I'll update the perplexity table shortly.

@kalomaze
Copy link
Contributor

Thanks a lot. For now I'm using IQ3_XXS and it seems fairly serviceable

@yamikumo-DSD
Copy link

yamikumo-DSD commented Apr 10, 2024

I've tried the IQ3_M variant for several hours on my Apple silicon.
Basically I felt it has high fluency in daily language, ignoring the time before it spits out the first token.
However, what I often observed was lower capability on coding tasks (missing indentation, unclosed brackets, etc.) compared to the same-size Command-R.
I'm not sure about imatrix quantization, but I read that it extracts the importance matrix using wikitext.
So I suspect it's due to that bias and will try the normal Q3 variants just uploaded.
Has anybody else observed a phenomenon like this?

@christianwengert
Copy link

I've tried the IQ3_M variant for several hours on my Apple silicon. Basically I felt it has high fluency in daily language, ignoring the time before it spits out the first token. However, what I often observed was lower capability on coding tasks (missing indentation, unclosed brackets, etc.) compared to the same-size Command-R. I'm not sure about imatrix quantization, but I read that it extracts the importance matrix using wikitext. So I suspect it's due to that bias and will try the normal Q3 variants just uploaded. Has anybody else observed a phenomenon like this?

Can you show me your command line for that? When I use Q1_M I can use command-r on apple silicon M1 (64GB), but when I use a Q3_M I only get garbage and the logs (I use the server) show the following for each token:

ggml_metal_graph_compute: command buffer 3 failed with status 5

@yamikumo-DSD
Copy link

I've tried the IQ3_M variant for several hours on my Apple silicon. Basically I felt it has high fluency in daily language, ignoring the time before it spits out the first token. However, what I often observed was lower capability on coding tasks (missing indentation, unclosed brackets, etc.) compared to the same-size Command-R. I'm not sure about imatrix quantization, but I read that it extracts the importance matrix using wikitext. So I suspect it's due to that bias and will try the normal Q3 variants just uploaded. Has anybody else observed a phenomenon like this?

Can you show me your command line for that? When I use Q1_M I can use command-r on apple silicon M1 (64GB), but when I use a Q3_M I only get garbage and the logs (I use the server) show the following for each token:

ggml_metal_graph_compute: command buffer 3 failed with status 5

I'm currently using llama-cpp-python backed by the latest llama.cpp, so I'm not using the CLI right now.
But I think I had the same errors when I forgot to run 'sysctl iogpu.wired_limit_mb=foobar' on my Mac with an older version of the repository. Not sure I can reproduce the same problem currently, so I'm going to test it when I have time.

@4cecoder
Copy link

4cecoder commented Apr 12, 2024

These weights are split by gguf-split, so you cannot use cat to merge them. There's no need to merge them manually. Simply pass the first split and llama.cpp will automatically load all splits. If, for any reason, you want to merge splits, you must use the gguf-split with --merge option:

./gguf-split --merge ggml-c4ai-command-r-plus-104b-iq4_xs-00001-of-00002.gguf ggml-c4ai-command-r-plus-104b-iq4_xs.gguf

That is correct, I tested all quants successfully using just the first chunk. And I also have this information on the model page in the bullet list.

Please help with building gguf-split on Windows

#6404 (reply in thread)

@Sintayew4
Copy link

#6491

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
* Add Command R Plus GGUF

* Add Command R Plus GGUF

* Loading works up to LayerNorm2D

* Export new tensors in 1D so they are not quantized.

* Fix embedding layer based on Noeda's example

* Whitespace

* Add line

* Fix unexpected tokens on MPS. Re-add F16 fix. (Noeda)

* dranger003: Fix block index overflow in CUDA dequantizing.

* Reverted blocked multiplication code as it still has issues and could affect other Llama arches

* export norms as f32

* fix overflow issues during quant and other cleanup

* Type convention

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* dranger003: Fix more int overflow during quant.

---------

Co-authored-by: S <seast@Ss-Mac-Studio.local>
Co-authored-by: S <s@example.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>