
llama: extend for small granite models #7481

Merged: 3 commits merged into ggerganov:master from fix-granite-3b on May 28, 2024

Conversation

giuseppe
Contributor

@giuseppe giuseppe commented May 22, 2024

It works only for the small models (3b and 8b). The bigger models work fine with the existing GPTBigCodeForCausalLM architecture.

For the small models there are enough differences from the base llama arch that it is worth defining a new architecture.

To create the .gguf files, it is necessary to specify GraniteSmallForCausalLM in the architectures of the HF model.

Closes: #7116

Signed-off-by: Giuseppe Scrivano gscrivan@redhat.com
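
As a rough illustration of the conversion step described above (an editor's sketch, not part of the PR; the model directory name is hypothetical, and this explicit architecture override was later dropped in favor of detecting Granite inside the existing LlamaModel class, see the discussion below):

import json

# Sketch: point the HF config's "architectures" field at GraniteSmallForCausalLM
# before running convert-hf-to-gguf.py, as the PR description suggests.
cfg_path = "granite-3b-code-base/config.json"  # hypothetical local checkout of the HF model

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["architectures"] = ["GraniteSmallForCausalLM"]

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)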

@github-actions bot added the "python" label (python script changes) on May 22, 2024
Contributor

github-actions bot commented May 23, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 555 iterations 🚀

Details (for performance-related PRs only)
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8415.38ms p(95)=21422.74ms fails=, finish reason: stop=508 truncated=47
  • Prompt processing (pp): avg=102.34tk/s p(95)=502.31tk/s
  • Token generation (tg): avg=34.53tk/s p(95)=48.6tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=fix-granite-3b commit=b974e9fcfbdafa22888bf535bd6c986a43e9e387

[Charts: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing; each plotted for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 555 iterations]

@mofosyne added the "Review Complexity : Medium" (generally requires more time to grok but manageable by beginner to medium expertise level) and "model" (model specific) labels on May 23, 2024
llama.cpp (resolved)
llama.cpp (Outdated)
Comment on lines 4440 to 4505
if (model.arch == LLM_ARCH_LLAMA) {
    vocab.add_space_prefix = false;
}
Owner

Is this needed - looks wrong?

Contributor Author

sorry, it should be LLM_ARCH_GRANITE_SMALL

@giuseppe changed the title from "llama: define architecture for granite models" to "llama: define architecture for small granite models" on May 23, 2024
@giuseppe force-pushed the fix-granite-3b branch 2 times, most recently from 1fb9186 to cd8d590, on May 23, 2024 13:06
convert-hf-to-gguf.py (3 outdated review threads, resolved)
@giuseppe
Contributor Author

@compilade thanks, addressed the issues and pushed a new version

@ggerganov
Owner

Adding the --architecture argument should be avoided. Instead of adding a new GraniteSmallForCausalLM architecture, try to update the existing LlamaModel class to handle this model. Since it is a BPE tokenizer, you might have to update the convert-hf-to-gguf-update.py script as described in #6920

Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@giuseppe
Contributor Author

Adding the --architecture argument should be avoided. Instead of adding a new GraniteSmallForCausalLM architecture, try to update the existing LlamaModel class to handle this model. Since it is a BPE tokenizer, you might have to update the convert-hf-to-gguf-update.py script as described in #6920

I've simplified the implementation so that it uses the existing Llama model. I've added a way to override the default rope type. Now the only Granite-specific code in llama.cpp is the detection of model.type.

Comment on lines 1345 to 1350
        # Skip for granite models
        if self.hparams.get("vocab_size", 32000) != 49152:
            if name.endswith("q_proj.weight"):
                data_torch = LlamaModel.permute(data_torch, n_head, n_head)
            if name.endswith("k_proj.weight"):
                data_torch = LlamaModel.permute(data_torch, n_head, n_kv_head)
Owner

I think we can avoid adding the rope type parameter altogether by permuting the Q, K attention tensors in the correct way here. I don't have example code unfortunately, so we need to figure out how to do it. The only difference between RoPE NORM and NEOX is that in the former we rotate the pairs (x[2*i + 0], x[2*i + 1]), while in the latter we rotate (x[i], x[i + n_rot/2]). So it's a matter of reordering the rows in each head in the correct way to make the RoPE type NORM, as in all other LLaMA-based models.
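
To make the pairing difference concrete, here is a minimal editor's sketch (not from the PR) of the two schemes for a single head, assuming theta is a precomputed array of per-pair rotation angles:

import numpy as np

def rope_norm(x, theta):
    # RoPE NORM: rotate adjacent pairs (x[2*i], x[2*i + 1])
    out = x.copy()
    for i in range(len(x) // 2):
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i]     = x0 * np.cos(theta[i]) - x1 * np.sin(theta[i])
        out[2 * i + 1] = x0 * np.sin(theta[i]) + x1 * np.cos(theta[i])
    return out

def rope_neox(x, theta):
    # RoPE NEOX: rotate split pairs (x[i], x[i + n_rot/2])
    out = x.copy()
    half = len(x) // 2
    for i in range(half):
        x0, x1 = x[i], x[i + half]
        out[i]        = x0 * np.cos(theta[i]) - x1 * np.sin(theta[i])
        out[i + half] = x0 * np.sin(theta[i]) + x1 * np.cos(theta[i])
    return out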

Contributor Author

Thanks for the suggestion. I had a look at it, and I am not sure it is possible to do this just by rearranging the Q, K weights without changing their values too.

If I understand it correctly, given:

const float * const src = (float *)((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01 + i0*nb00);
           float * dst_data  = (float *)((char *)  dst->data +  i3*nb3 + i2*nb2  + i1*nb1  + i0*nb0);

we would like to shuffle the positions of x0 and x1 around, so that (RoPE NORM):

const float x0 = src[0];
const float x1 = src[1];

dst_data[0] = x0*cos_theta*zeta - x1*sin_theta;
dst_data[1] = x0*sin_theta*zeta + x1*cos_theta;

can be used instead of (RoPE NEOX):

const float x0 = src[0];
const float x1 = src[n_dims/2];

dst_data[0]        = x0*cos_theta - x1*sin_theta;
dst_data[n_dims/2] = x0*sin_theta + x1*cos_theta;

So not only do we want to re-arrange the elements in a way that RoPE NORM can find them (this would probably be easy), but we also need to ensure that after the RoPE operation the output is written with the same layout that RoPE NEOX would produce, since the rest of the model expects that output.

Am I missing something?

Owner

Even though the output of Q = rope(q) and K = rope(k) would not be in the same order, it should still work, because we compute KQ = K @ Q, which is invariant to how the data within the heads is reordered, as long as it is reordered in the same way in both K and Q.

I could be missing something though; I am not 100% confident in this. If you think it won't work, we can probably do the rope type thing, but I would really prefer to find a way to avoid it.
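
A quick numerical check of this invariance argument (an editor's sketch, not part of the PR): applying the same permutation of the head dimension to both K and Q leaves the attention scores K @ Q^T unchanged.

import numpy as np

rng = np.random.default_rng(0)
head_dim, n_tokens = 8, 5
Q = rng.standard_normal((n_tokens, head_dim))
K = rng.standard_normal((n_tokens, head_dim))

perm = rng.permutation(head_dim)           # any reordering of the head dimension
scores      = K @ Q.T                      # original attention scores
scores_perm = K[:, perm] @ Q[:, perm].T    # same reordering applied to both K and Q

assert np.allclose(scores, scores_perm)    # the scores are identical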

Contributor Author

Is it something that could be changed later?

I am not confident either that it is impossible; I've spent a few hours on it and have not been successful so far.

Collaborator

I gave this a try. Only the first n_dims elements of each row should be re-ordered.

llama.cpp/ggml.c, line 14418 at 95f84d5:

if (ic < n_dims) {

    @staticmethod
    def permute_neox_rope(weights: Tensor, rot_dim: int) -> Tensor:
        orig_shape = weights.shape
        assert orig_shape[-1] % rot_dim == 0
        # reorder the first rot_dim elements of each row
        weights = weights.reshape((-1 , weights.shape[-1] // rot_dim, rot_dim))
        weights[:, 0, :] = weights[:, 0, :].reshape((-1, 2, rot_dim // 2)).mT.contiguous().reshape((-1, rot_dim))
        return weights.reshape((orig_shape))

It seems to partially work, but the output is still wrong, because in RoPE NEOX, it's only the first rot_dim elements per row that are roped, while in RoPE NORM, all of them are.

So it's not simply a re-ordering of elements that is necessary, unfortunately. The rope type is needed, it seems.
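
As a quick illustration of what that permutation does (an editor's sketch, not from the PR; it assumes the permute_neox_rope function above is available as a plain module-level function, with the @staticmethod decorator dropped), calling it on a toy row with rot_dim = 8 interleaves the two halves so that the NEOX pairs (x[i], x[i + rot_dim/2]) become adjacent, which is where RoPE NORM reads its pairs:

import torch

w = torch.arange(8, dtype=torch.float32).reshape(1, 8)  # one row of one head
print(permute_neox_rope(w.clone(), rot_dim=8))
# tensor([[0., 4., 1., 5., 2., 6., 3., 7.]])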

Contributor Author

Cool!

Could you put this diff into a patch I can cherry-pick, so I can update my PR?

Collaborator

@giuseppe Put this in a file (say, permute-bias.patch), then use git apply permute-bias.patch from the repo's top directory.

Patch content:
diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 99c1fdb4..63d50f8f 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -1325,8 +1325,6 @@ class LlamaModel(Model):
         # Apply to granite small models only
         if self.hparams.get("vocab_size", 32000) == 49152:
             self.gguf_writer.add_add_bos_token(False)
-            self.gguf_writer.add_rope_type(gguf.RopeType.NEOX)
-            self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.NONE)
 
     @staticmethod
     def permute(weights: Tensor, n_head: int, n_head_kv: int | None):
@@ -1342,12 +1340,10 @@ class LlamaModel(Model):
         n_head = self.hparams["num_attention_heads"]
         n_kv_head = self.hparams.get("num_key_value_heads")
 
-        # Skip for granite models
-        if self.hparams.get("vocab_size", 32000) != 49152:
-            if name.endswith("q_proj.weight"):
-                data_torch = LlamaModel.permute(data_torch, n_head, n_head)
-            if name.endswith("k_proj.weight"):
-                data_torch = LlamaModel.permute(data_torch, n_head, n_kv_head)
+        if name.endswith(("q_proj.weight", "q_proj.bias")):
+            data_torch = LlamaModel.permute(data_torch, n_head, n_head)
+        if name.endswith(("k_proj.weight", "k_proj.bias")):
+            data_torch = LlamaModel.permute(data_torch, n_head, n_kv_head)
 
         # process the experts separately
         if name.find("block_sparse_moe.experts") != -1:
diff --git a/gguf-py/gguf/constants.py b/gguf-py/gguf/constants.py
index d5c3d7b5..c9ae259e 100644
--- a/gguf-py/gguf/constants.py
+++ b/gguf-py/gguf/constants.py
@@ -57,7 +57,6 @@ class Keys:
         CAUSAL            = "{arch}.attention.causal"
 
     class Rope:
-        TYPE                    = "{arch}.rope.type"
         DIMENSION_COUNT         = "{arch}.rope.dimension_count"
         FREQ_BASE               = "{arch}.rope.freq_base"
         SCALING_TYPE            = "{arch}.rope.scaling.type"
@@ -807,13 +806,6 @@ class TokenType(IntEnum):
     BYTE         = 6
 
 
-class RopeType(Enum):
-    NONE = 'none'
-    NORM = 'norm'
-    NEOX = 'neox'
-    GLM  = 'glm'
-
-
 class RopeScalingType(Enum):
     NONE   = 'none'
     LINEAR = 'linear'
@@ -1006,7 +998,6 @@ KEY_ATTENTION_LAYERNORM_EPS     = Keys.Attention.LAYERNORM_EPS
 KEY_ATTENTION_LAYERNORM_RMS_EPS = Keys.Attention.LAYERNORM_RMS_EPS
 
 # RoPE
-KEY_ROPE_TYPE                 = Keys.Rope.TYPE
 KEY_ROPE_DIMENSION_COUNT      = Keys.Rope.DIMENSION_COUNT
 KEY_ROPE_FREQ_BASE            = Keys.Rope.FREQ_BASE
 KEY_ROPE_SCALING_TYPE         = Keys.Rope.SCALING_TYPE
diff --git a/gguf-py/gguf/gguf_writer.py b/gguf-py/gguf/gguf_writer.py
index ebfd15fd..8b41b54e 100644
--- a/gguf-py/gguf/gguf_writer.py
+++ b/gguf-py/gguf/gguf_writer.py
@@ -427,9 +427,6 @@ class GGUFWriter:
     def add_rope_freq_base(self, value: float) -> None:
         self.add_float32(Keys.Rope.FREQ_BASE.format(arch=self.arch), value)
 
-    def add_rope_type(self, value: RopeType) -> None:
-        self.add_string(Keys.Rope.TYPE.format(arch=self.arch), value.value)
-
     def add_rope_scaling_type(self, value: RopeScalingType) -> None:
         self.add_string(Keys.Rope.SCALING_TYPE.format(arch=self.arch), value.value)
 
diff --git a/llama.cpp b/llama.cpp
index 16c11d43..f970c175 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -297,7 +297,6 @@ enum llm_kv {
     LLM_KV_ATTENTION_LAYERNORM_RMS_EPS,
     LLM_KV_ATTENTION_CAUSAL,
 
-    LLM_KV_ROPE_TYPE,
     LLM_KV_ROPE_DIMENSION_COUNT,
     LLM_KV_ROPE_FREQ_BASE,
     LLM_KV_ROPE_SCALE_LINEAR,
@@ -376,7 +375,6 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
     { LLM_KV_ATTENTION_LAYERNORM_RMS_EPS,   "%s.attention.layer_norm_rms_epsilon" },
     { LLM_KV_ATTENTION_CAUSAL,              "%s.attention.causal"                 },
 
-    { LLM_KV_ROPE_TYPE,                     "%s.rope.type"                            },
     { LLM_KV_ROPE_DIMENSION_COUNT,          "%s.rope.dimension_count"                 },
     { LLM_KV_ROPE_FREQ_BASE,                "%s.rope.freq_base"                       },
     { LLM_KV_ROPE_SCALE_LINEAR,             "%s.rope.scale_linear"                    },
@@ -1131,29 +1129,12 @@ struct LLM_TN {
 // gguf helpers
 //
 
-static const std::map<enum llama_rope_type, const char *> LLAMA_ROPE_TYPES = {
-    { LLAMA_ROPE_TYPE_NONE, "none" },
-    { LLAMA_ROPE_TYPE_NORM, "norm" },
-    { LLAMA_ROPE_TYPE_NEOX, "neox" },
-    { LLAMA_ROPE_TYPE_GLM,  "glm"  },
-};
-
 static const std::map<llama_rope_scaling_type, const char *> LLAMA_ROPE_SCALING_TYPES = {
     { LLAMA_ROPE_SCALING_TYPE_NONE,   "none"   },
     { LLAMA_ROPE_SCALING_TYPE_LINEAR, "linear" },
     { LLAMA_ROPE_SCALING_TYPE_YARN,   "yarn"   },
 };
 
-static enum llama_rope_type llama_rope_type_from_string(const std::string & name) {
-    for (const auto & kv : LLAMA_ROPE_TYPES) {
-        if (kv.second == name) {
-            return (enum llama_rope_type) kv.first;
-        }
-    }
-
-    return LLAMA_ROPE_TYPE_NONE;
-}
-
 static llama_rope_scaling_type llama_rope_scaling_type_from_string(const std::string & name) {
     for (const auto & kv : LLAMA_ROPE_SCALING_TYPES) {
         if (kv.second == name) {
@@ -4417,15 +4398,7 @@ static void llm_load_hparams(
         hparams.use_alibi = true;
     }
 
-    hparams.rope_type = llama_default_rope_type(&model);
-
-    const auto kv = LLM_KV(model.arch);
-    const int rope_type_keyidx = gguf_find_key(ctx, kv(LLM_KV_ROPE_TYPE).c_str());
-    if (rope_type_keyidx != -1) {
-        std::string rope_type("none");
-        ml.get_key(LLM_KV_ROPE_TYPE, rope_type);
-        hparams.rope_type =  llama_rope_type_from_string(rope_type);
-    }
+    hparams.rope_type = llama_rope_type(&model);
 }
 
 // TODO: This should probably be in llama.h
@@ -16252,7 +16225,7 @@ enum llama_vocab_type llama_vocab_type(const struct llama_model * model) {
     return model->vocab.type;
 }
 
-enum llama_rope_type llama_default_rope_type(const struct llama_model * model) {
+enum llama_rope_type llama_rope_type(const struct llama_model * model) {
     switch (model->arch) {
         // these models do not use RoPE
         case LLM_ARCH_GPT2:
diff --git a/llama.h b/llama.h
index 632136ca..16cece5d 100644
--- a/llama.h
+++ b/llama.h
@@ -422,7 +422,7 @@ extern "C" {
     LLAMA_API enum llama_pooling_type llama_pooling_type(const struct llama_context * ctx);
 
     LLAMA_API enum llama_vocab_type   llama_vocab_type  (const struct llama_model   * model);
-    LLAMA_API enum llama_rope_type    llama_default_rope_type   (const struct llama_model   * model);
+    LLAMA_API enum llama_rope_type    llama_rope_type   (const struct llama_model   * model);
 
     LLAMA_API int32_t llama_n_vocab    (const struct llama_model * model);
     LLAMA_API int32_t llama_n_ctx_train(const struct llama_model * model);

If by "patch" you meant a commit, then... I think I can directly push it here if "Maintainers are allowed to edit this pull request." works as I think it does? (I never tried pushing on someone else's fork, though)

Contributor Author

Ah, if you are fine with me applying it directly on top of my patch, then I can do that.

I was thinking of you keeping ownership of the commit, since you came up with the code.

Contributor Author

PR updated

Owner

Nice work, thanks for looking into this!

@giuseppe changed the title from "llama: define architecture for small granite models" to "llama: extend for small granite models" on May 27, 2024
propagate the add_space_prefix configuration from the HF model
configuration to the gguf file and honor it with the gpt2 tokenizer.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
it works only for the small models 3b and 8b.

The convert-hf-to-gguf.py script uses the vocabulary size of the
granite models to detect granite and set the correct configuration.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
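
Condensing the patch earlier in the thread, the Granite-specific handling in convert-hf-to-gguf.py comes down to two checks. The following is an editor's sketch expressed as standalone helpers (the constants and tensor-name suffixes are taken from the diff; the function names and the example tensor name are illustrative, not from the script):

GRANITE_SMALL_VOCAB_SIZE = 49152  # granite small checkpoints are recognized by vocab size alone

def is_granite_small(hparams: dict) -> bool:
    # Detection relies purely on the vocabulary size, per the commit message above.
    return hparams.get("vocab_size", 32000) == GRANITE_SMALL_VOCAB_SIZE

def needs_qk_permute(tensor_name: str) -> bool:
    # After the patch, Q/K projections (weights and biases) are always permuted,
    # so Granite ends up with the standard RoPE NORM layout like other llama models.
    return tensor_name.endswith(("q_proj.weight", "q_proj.bias",
                                 "k_proj.weight", "k_proj.bias"))

print(is_granite_small({"vocab_size": 49152}))                    # True
print(needs_qk_permute("model.layers.0.self_attn.q_proj.bias"))   # True
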
@ggerganov merged commit 5442939 into ggerganov:master on May 28, 2024
67 of 71 checks passed
Labels: model (Model specific), python (python script changes), Review Complexity : Medium

Successfully merging this pull request may close these issues: Add Support for IBM Granite

5 participants