
llama: extend for small granite models #7481

Merged: 3 commits merged into ggerganov:master from fix-granite-3b on May 28, 2024

Conversation

giuseppe
Contributor

@giuseppe giuseppe commented May 22, 2024

It works only for the small models (3b and 8b). The bigger models work fine with the existing GPTBigCodeForCausalLM architecture.

For the small models there are enough differences from the base llama arch that it is worth defining a new architecture.

To create the .gguf files, it is necessary to specify GraniteSmallForCausalLM in the architectures of the HF model.

Closes: #7116

Signed-off-by: Giuseppe Scrivano gscrivan@redhat.com
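
As a rough illustration of the conversion step described above (an editor's sketch, not part of the PR; the model directory name is hypothetical, and this explicit architecture override was later dropped in favor of detecting Granite inside the existing LlamaModel class, see the discussion below):

import json

# Sketch: point the HF config's "architectures" field at GraniteSmallForCausalLM
# before running convert-hf-to-gguf.py, as the PR description suggests.
cfg_path = "granite-3b-code-base/config.json"  # hypothetical local checkout of the HF model

with open(cfg_path) as f:
    cfg = json.load(f)

cfg["architectures"] = ["GraniteSmallForCausalLM"]

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)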

@github-actions bot added the "python" label (python script changes) on May 22, 2024
Contributor

github-actions bot commented May 23, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 555 iterations 🚀

Details (for performance-related PRs only)
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8415.38ms p(95)=21422.74ms fails=, finish reason: stop=508 truncated=47
  • Prompt processing (pp): avg=102.34tk/s p(95)=502.31tk/s
  • Token generation (tg): avg=34.53tk/s p(95)=48.6tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=fix-granite-3b commit=b974e9fcfbdafa22888bf535bd6c986a43e9e387

[Charts: llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, llamacpp:requests_processing; each plotted for llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 555 iterations]

@mofosyne added the "Review Complexity : Medium" (generally requires more time to grok but manageable by beginner to medium expertise level) and "model" (model specific) labels on May 23, 2024
llama.cpp (resolved)
llama.cpp (Outdated)
Comment on lines 4440 to 4505
if (model.arch == LLM_ARCH_LLAMA) {
    vocab.add_space_prefix = false;
}
Owner

Is this needed - looks wrong?

Contributor Author

sorry, it should be LLM_ARCH_GRANITE_SMALL

@giuseppe changed the title from "llama: define architecture for granite models" to "llama: define architecture for small granite models" on May 23, 2024
@giuseppe force-pushed the fix-granite-3b branch 2 times, most recently from 1fb9186 to cd8d590, on May 23, 2024 13:06
convert-hf-to-gguf.py (3 outdated review threads, resolved)
@giuseppe
Contributor Author

@compilade thanks, addressed the issues and pushed a new version

@ggerganov
Owner

Adding the --architecture argument should be avoided. Instead of adding a new GraniteSmallForCausalLM architecture, try to update the existing LlamaModel class to handle this model. Since it is a BPE tokenizer, you might have to update the convert-hf-to-gguf-update.py script as described in #6920

Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/issues/7116
Still needs some more changes to properly support Granite.
@giuseppe
Contributor Author

Adding the --architecture argument should be avoided. Instead of adding a new GraniteSmallForCausalLM architecture, try to update the existing LlamaModel class to handle this model. Since it is a BPE tokenizer, you might have to update the convert-hf-to-gguf-update.py script as described in #6920

I've simplified the implementation so that it uses the existing Llama model. I've added a way to override the default rope type. Now the only Granite-specific code in llama.cpp is the detection of model.type.

Comment on lines 1345 to 1350
        # Skip for granite models
        if self.hparams.get("vocab_size", 32000) != 49152:
            if name.endswith("q_proj.weight"):
                data_torch = LlamaModel.permute(data_torch, n_head, n_head)
            if name.endswith("k_proj.weight"):
                data_torch = LlamaModel.permute(data_torch, n_head, n_kv_head)
Owner

I think we can avoid adding the rope type parameter altogether by permuting the Q, K attention tensors in the correct way here. I don't have example code unfortunately, so we need to figure out how to do it. The only difference between RoPE NORM and NEOX is that in the former we rotate the pairs (x[2*i + 0], x[2*i + 1]), while in the latter we rotate (x[i], x[i + n_rot/2]). So it's a matter of reordering the rows in each head in the correct way to make the RoPE type NORM, as in all other LLaMA-based models.
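
To make the pairing difference concrete, here is a minimal editor's sketch (not from the PR) of the two schemes for a single head, assuming theta is a precomputed array of per-pair rotation angles:

import numpy as np

def rope_norm(x, theta):
    # RoPE NORM: rotate adjacent pairs (x[2*i], x[2*i + 1])
    out = x.copy()
    for i in range(len(x) // 2):
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i]     = x0 * np.cos(theta[i]) - x1 * np.sin(theta[i])
        out[2 * i + 1] = x0 * np.sin(theta[i]) + x1 * np.cos(theta[i])
    return out

def rope_neox(x, theta):
    # RoPE NEOX: rotate split pairs (x[i], x[i + n_rot/2])
    out = x.copy()
    half = len(x) // 2
    for i in range(half):
        x0, x1 = x[i], x[i + half]
        out[i]        = x0 * np.cos(theta[i]) - x1 * np.sin(theta[i])
        out[i + half] = x0 * np.sin(theta[i]) + x1 * np.cos(theta[i])
    return out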

Contributor Author

Thanks for the suggestion. I had a look at it, and I am not sure it is possible to do this just by rearranging the Q, K weights without changing their values too.

If I understand it correctly, given:

const float * const src = (float *)((char *) src0->data + i3*nb03 + i2*nb02 + i1*nb01 + i0*nb00);
           float * dst_data  = (float *)((char *)  dst->data +  i3*nb3 + i2*nb2  + i1*nb1  + i0*nb0);

we would like to shuffle the positions of x0 and x1 around, so that (RoPE NORM):

const float x0 = src[0];
const float x1 = src[1];

dst_data[0] = x0*cos_theta*zeta - x1*sin_theta;
dst_data[1] = x0*sin_theta*zeta + x1*cos_theta;

can be used instead of (RoPE NEOX):

const float x0 = src[0];
const float x1 = src[n_dims/2];

dst_data[0]        = x0*cos_theta - x1*sin_theta;
dst_data[n_dims/2] = x0*sin_theta + x1*cos_theta;

So not only do we want to re-arrange the elements in a way that RoPE NORM can find them (this would probably be easy), but we also need to ensure that after the RoPE operation the output is written with the same layout that RoPE NEOX would produce, since the rest of the model expects that output.

Am I missing something?

Owner

Even though the output of Q = rope(q) and K = rope(k) would not be in the same order, it should still work, because we compute KQ = K @ Q, which is invariant to how the data within the heads is reordered, as long as it is reordered in the same way in both K and Q.

I could be missing something though; I am not 100% confident in this. If you think it won't work, we can probably do the rope type thing, but I would really prefer to find a way to avoid it.
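
A quick numerical check of this invariance argument (an editor's sketch, not part of the PR): applying the same permutation of the head dimension to both K and Q leaves the attention scores K @ Q^T unchanged.

import numpy as np

rng = np.random.default_rng(0)
head_dim, n_tokens = 8, 5
Q = rng.standard_normal((n_tokens, head_dim))
K = rng.standard_normal((n_tokens, head_dim))

perm = rng.permutation(head_dim)           # any reordering of the head dimension
scores      = K @ Q.T                      # original attention scores
scores_perm = K[:, perm] @ Q[:, perm].T    # same reordering applied to both K and Q

assert np.allclose(scores, scores_perm)    # the scores are identical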

Contributor Author

Is it something that could be changed later?

I am not confident either that it is impossible; I've spent a few hours on it and have not been successful so far.

Collaborator

I gave this a try. Only the first n_dims elements of each row should be re-ordered.

llama.cpp/ggml.c, line 14418 at 95f84d5:

if (ic < n_dims) {

    @staticmethod
    def permute_neox_rope(weights: Tensor, rot_dim: int) -> Tensor:
        orig_shape = weights.shape
        assert orig_shape[-1] % rot_dim == 0
        # reorder the first rot_dim elements of each row
        weights = weights.reshape((-1 , weights.shape[-1] // rot_dim, rot_dim))
        weights[:, 0, :] = weights[:, 0, :].reshape((-1, 2, rot_dim // 2)).mT.contiguous().reshape((-1, rot_dim))
        return weights.reshape((orig_shape))

It seems to partially work, but the output is still wrong, because in RoPE NEOX, it's only the first rot_dim elements per row that are roped, while in RoPE NORM, all of them are.

So it's not simply a re-ordering of elements that is necessary, unfortunately. The rope type is needed, it seems.
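
As a quick illustration of what that permutation does (an editor's sketch, not from the PR; it assumes the permute_neox_rope function above is available as a plain module-level function, with the @staticmethod decorator dropped), calling it on a toy row with rot_dim = 8 interleaves the two halves so that the NEOX pairs (x[i], x[i + rot_dim/2]) become adjacent, which is where RoPE NORM reads its pairs:

import torch

w = torch.arange(8, dtype=torch.float32).reshape(1, 8)  # one row of one head
print(permute_neox_rope(w.clone(), rot_dim=8))
# tensor([[0., 4., 1., 5., 2., 6., 3., 7.]])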

Contributor Author

Cool!

Could you put this diff into a patch I can cherry-pick, so I can update my PR?

Collaborator

@giuseppe Put this in a file (say, permute-bias.patch), then use git apply permute-bias.patch from the repo's top directory.

Patch content:
diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 99c1fdb4..63d50f8f 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -1325,8 +1325,6 @@ class LlamaModel(Model):
         # Apply to granite small models only
         if self.hparams.get("vocab_size", 32000) == 49152:
             self.gguf_writer.add_add_bos_token(False)
-            self.gguf_writer.add_rope_type(gguf.RopeType.NEOX)
-            self.gguf_writer.add_rope_scaling_type(gguf.RopeScalingType.NONE)
 
     @staticmethod
     def permute(weights: Tensor, n_head: int, n_head_kv: int | None):
@@ -1342,12 +1340,10 @@ class LlamaModel(Model):
         n_head = self.hparams["num_attention_heads"]
         n_kv_head = self.hparams.get("num_key_value_heads")
 
-        # Skip for granite models
-        if self.hparams.get("vocab_size", 32000) != 49152:
-            if name.endswith("q_proj.weight"):
-                data_torch = LlamaModel.permute(data_torch, n_head, n_head)
-            if name.endswith("k_proj.weight"):
-                data_torch = LlamaModel.permute(data_torch, n_head, n_kv_head)
+        if name.endswith(("q_proj.weight", "q_proj.bias")):
+            data_torch = LlamaModel.permute(data_torch, n_head, n_head)
+        if name.endswith(("k_proj.weight", "k_proj.bias")):
+            data_torch = LlamaModel.permute(data_torch, n_head, n_kv_head)
 
         # process the experts separately
         if name.find("block_sparse_moe.experts") != -1:
diff --git a/gguf-py/gguf/constants.py b/gguf-py/gguf/constants.py
index d5c3d7b5..c9ae259e 100644
--- a/gguf-py/gguf/constants.py
+++ b/gguf-py/gguf/constants.py
@@ -57,7 +57,6 @@ class Keys:
         CAUSAL            = "{arch}.attention.causal"
 
     class Rope:
-        TYPE                    = "{arch}.rope.type"
         DIMENSION_COUNT         = "{arch}.rope.dimension_count"
         FREQ_BASE               = "{arch}.rope.freq_base"
         SCALING_TYPE            = "{arch}.rope.scaling.type"
@@ -807,13 +806,6 @@ class TokenType(IntEnum):
     BYTE         = 6
 
 
-class RopeType(Enum):
-    NONE = 'none'
-    NORM = 'norm'
-    NEOX = 'neox'
-    GLM  = 'glm'
-
-
 class RopeScalingType(Enum):
     NONE   = 'none'
     LINEAR = 'linear'
@@ -1006,7 +998,6 @@ KEY_ATTENTION_LAYERNORM_EPS     = Keys.Attention.LAYERNORM_EPS
 KEY_ATTENTION_LAYERNORM_RMS_EPS = Keys.Attention.LAYERNORM_RMS_EPS
 
 # RoPE
-KEY_ROPE_TYPE                 = Keys.Rope.TYPE
 KEY_ROPE_DIMENSION_COUNT      = Keys.Rope.DIMENSION_COUNT
 KEY_ROPE_FREQ_BASE            = Keys.Rope.FREQ_BASE
 KEY_ROPE_SCALING_TYPE         = Keys.Rope.SCALING_TYPE
diff --git a/gguf-py/gguf/gguf_writer.py b/gguf-py/gguf/gguf_writer.py
index ebfd15fd..8b41b54e 100644
--- a/gguf-py/gguf/gguf_writer.py
+++ b/gguf-py/gguf/gguf_writer.py
@@ -427,9 +427,6 @@ class GGUFWriter:
     def add_rope_freq_base(self, value: float) -> None:
         self.add_float32(Keys.Rope.FREQ_BASE.format(arch=self.arch), value)
 
-    def add_rope_type(self, value: RopeType) -> None:
-        self.add_string(Keys.Rope.TYPE.format(arch=self.arch), value.value)
-
     def add_rope_scaling_type(self, value: RopeScalingType) -> None:
         self.add_string(Keys.Rope.SCALING_TYPE.format(arch=self.arch), value.value)
 
diff --git a/llama.cpp b/llama.cpp
index 16c11d43..f970c175 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -297,7 +297,6 @@ enum llm_kv {
     LLM_KV_ATTENTION_LAYERNORM_RMS_EPS,
     LLM_KV_ATTENTION_CAUSAL,
 
-    LLM_KV_ROPE_TYPE,
     LLM_KV_ROPE_DIMENSION_COUNT,
     LLM_KV_ROPE_FREQ_BASE,
     LLM_KV_ROPE_SCALE_LINEAR,
@@ -376,7 +375,6 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
     { LLM_KV_ATTENTION_LAYERNORM_RMS_EPS,   "%s.attention.layer_norm_rms_epsilon" },
     { LLM_KV_ATTENTION_CAUSAL,              "%s.attention.causal"                 },
 
-    { LLM_KV_ROPE_TYPE,                     "%s.rope.type"                            },
     { LLM_KV_ROPE_DIMENSION_COUNT,          "%s.rope.dimension_count"                 },
     { LLM_KV_ROPE_FREQ_BASE,                "%s.rope.freq_base"                       },
     { LLM_KV_ROPE_SCALE_LINEAR,             "%s.rope.scale_linear"                    },
@@ -1131,29 +1129,12 @@ struct LLM_TN {
 // gguf helpers
 //
 
-static const std::map<enum llama_rope_type, const char *> LLAMA_ROPE_TYPES = {
-    { LLAMA_ROPE_TYPE_NONE, "none" },
-    { LLAMA_ROPE_TYPE_NORM, "norm" },
-    { LLAMA_ROPE_TYPE_NEOX, "neox" },
-    { LLAMA_ROPE_TYPE_GLM,  "glm"  },
-};
-
 static const std::map<llama_rope_scaling_type, const char *> LLAMA_ROPE_SCALING_TYPES = {
     { LLAMA_ROPE_SCALING_TYPE_NONE,   "none"   },
     { LLAMA_ROPE_SCALING_TYPE_LINEAR, "linear" },
     { LLAMA_ROPE_SCALING_TYPE_YARN,   "yarn"   },
 };
 
-static enum llama_rope_type llama_rope_type_from_string(const std::string & name) {
-    for (const auto & kv : LLAMA_ROPE_TYPES) {
-        if (kv.second == name) {
-            return (enum llama_rope_type) kv.first;
-        }
-    }
-
-    return LLAMA_ROPE_TYPE_NONE;
-}
-
 static llama_rope_scaling_type llama_rope_scaling_type_from_string(const std::string & name) {
     for (const auto & kv : LLAMA_ROPE_SCALING_TYPES) {
         if (kv.second == name) {
@@ -4417,15 +4398,7 @@ static void llm_load_hparams(
         hparams.use_alibi = true;
     }
 
-    hparams.rope_type = llama_default_rope_type(&model);
-
-    const auto kv = LLM_KV(model.arch);
-    const int rope_type_keyidx = gguf_find_key(ctx, kv(LLM_KV_ROPE_TYPE).c_str());
-    if (rope_type_keyidx != -1) {
-        std::string rope_type("none");
-        ml.get_key(LLM_KV_ROPE_TYPE, rope_type);
-        hparams.rope_type =  llama_rope_type_from_string(rope_type);
-    }
+    hparams.rope_type = llama_rope_type(&model);
 }
 
 // TODO: This should probably be in llama.h
@@ -16252,7 +16225,7 @@ enum llama_vocab_type llama_vocab_type(const struct llama_model * model) {
     return model->vocab.type;
 }
 
-enum llama_rope_type llama_default_rope_type(const struct llama_model * model) {
+enum llama_rope_type llama_rope_type(const struct llama_model * model) {
     switch (model->arch) {
         // these models do not use RoPE
         case LLM_ARCH_GPT2:
diff --git a/llama.h b/llama.h
index 632136ca..16cece5d 100644
--- a/llama.h
+++ b/llama.h
@@ -422,7 +422,7 @@ extern "C" {
     LLAMA_API enum llama_pooling_type llama_pooling_type(const struct llama_context * ctx);
 
     LLAMA_API enum llama_vocab_type   llama_vocab_type  (const struct llama_model   * model);
-    LLAMA_API enum llama_rope_type    llama_default_rope_type   (const struct llama_model   * model);
+    LLAMA_API enum llama_rope_type    llama_rope_type   (const struct llama_model   * model);
 
     LLAMA_API int32_t llama_n_vocab    (const struct llama_model * model);
     LLAMA_API int32_t llama_n_ctx_train(const struct llama_model * model);

If by "patch" you meant a commit, then... I think I can directly push it here if "Maintainers are allowed to edit this pull request." works as I think it does? (I never tried pushing on someone else's fork, though)

Contributor Author

Ah, if you are fine with me applying it directly on top of my patch, then I can do that.

I was thinking of you keeping ownership of the commit, since you came up with the code.

Contributor Author

PR updated

Owner

Nice work, thanks for looking into this!

@giuseppe changed the title from "llama: define architecture for small granite models" to "llama: extend for small granite models" on May 27, 2024
propagate the add_space_prefix configuration from the HF model
configuration to the gguf file and honor it with the gpt2 tokenizer.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
it works only for the small models 3b and 8b.

The convert-hf-to-gguf.py script uses the vocabulary size of the
granite models to detect granite and set the correct configuration.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
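
Condensing the patch earlier in the thread, the Granite-specific handling in convert-hf-to-gguf.py comes down to two checks. The following is an editor's sketch expressed as standalone helpers (the constants and tensor-name suffixes are taken from the diff; the function names and the example tensor name are illustrative, not from the script):

GRANITE_SMALL_VOCAB_SIZE = 49152  # granite small checkpoints are recognized by vocab size alone

def is_granite_small(hparams: dict) -> bool:
    # Detection relies purely on the vocabulary size, per the commit message above.
    return hparams.get("vocab_size", 32000) == GRANITE_SMALL_VOCAB_SIZE

def needs_qk_permute(tensor_name: str) -> bool:
    # After the patch, Q/K projections (weights and biases) are always permuted,
    # so Granite ends up with the standard RoPE NORM layout like other llama models.
    return tensor_name.endswith(("q_proj.weight", "q_proj.bias",
                                 "k_proj.weight", "k_proj.bias"))

print(is_granite_small({"vocab_size": 49152}))                    # True
print(needs_qk_permute("model.layers.0.self_attn.q_proj.bias"))   # True
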
@ggerganov merged commit 5442939 into ggerganov:master on May 28, 2024
67 of 71 checks passed
Labels: model (Model specific), python (python script changes), Review Complexity : Medium

Successfully merging this pull request may close these issues: Add Support for IBM Granite

5 participants