
Enable llama.cpp on s390x big endian platform #3552

Merged: 10 commits into ggerganov:master on Oct 20, 2023

Conversation

@chenqiny (Contributor) commented Oct 9, 2023

This pull request aims to enable the execution of llama.cpp on the s390x big endian platform. It includes the following changes:

  • Added a conditional check to determine if the platform is s390x. If this condition is met, the immintrin.h header file will not be imported, as it is not compatible with the s390x architecture.
  • Introduced the --bigendian option to the conversion script for s390x, ensuring gguf model compatibility with the big endian byte order of the platform.

Verification:
To validate the changes in this pull request, the following verifications were performed:

  • Tested baichuan7b with float16.
  • Tested chinese-alpaca-2-13b with float16.

Please review this pull request and consider merging it into the main repository. Thank you!

Fixes #3298

@ggerganov (Owner) left a comment

I think we should merge this. Any thoughts?

Obviously, this produces a different set of models, but I guess it wouldn't be a big issue, as little-endian is the default and I don't expect anyone to start distributing big-endian models

@ggerganov added the "need feedback" (Testing and feedback with results are needed) label on Oct 10, 2023
@chenqiny (Contributor Author)

I think we should merge this. Any thoughts?

Obviously, this produces a different set of models, but I guess it wouldn't be a big issue, as little-endian is the default and I don't expect anyone to start distributing big-endian models

@ggerganov

Yes. I am glad this is approved. Thank you.

This gives users an option to generate a big-endian gguf model from the raw model themselves.

As a next step, I am going to study the ggml code and think about how to enable s390x SIMD and the AI accelerator.

@ggerganov (Owner)

Ok, just to make sure I understand - with the big endian model and without s390x SIMD, the inference works correctly, right?

@chenqiny (Contributor Author)

Ok, just to make sure I understand - with the big endian model and without s390x SIMD, the inference works correctly, right?

Yes. It works very well on the s390x big-endian system without SIMD and AI accelerator code support. It is faster than I expected.

@chenqiny (Contributor Author)

@ggerganov is there any other information I should provide? I saw the need_feedback tag.
 

make test results:
On x86, all cases pass:

Test tests/test-tokenizer-0-falcon passed.
All tests passed.

On s390x, 2 tests failed, which is expected: these two vocab models are little endian.

main : reading vocab from: '/aivol/cqy/llama.cpp/models/ggml-vocab-llama.gguf'
gguf_init_from_file: invalid magic number 47475546
error loading model: llama_model_loader: failed to load model from /aivol/cqy/llama.cpp/models/ggml-vocab-llama.gguf

llama_load_model_from_file: failed to load model
main: error: failed to load vocab '/aivol/cqy/llama.cpp/models/ggml-vocab-llama.gguf'
Test $test_target FAILED!

main : reading vocab from: '/aivol/cqy/llama.cpp/models/ggml-vocab-falcon.gguf'
gguf_init_from_file: invalid magic number 47475546
error loading model: llama_model_loader: failed to load model from /aivol/cqy/llama.cpp/models/ggml-vocab-falcon.gguf

llama_load_model_from_file: failed to load model
main: error: failed to load vocab '/aivol/cqy/llama.cpp/models/ggml-vocab-falcon.gguf'
Test $test_target FAILED!


2 tests failed.

Baichuan2 7B result on s390x
 

$ ./main -m /aivol/cqy/gguf-s390/Baichuan2-7B-Chat-f16-convert.gguf -p "How to Build Web Site With 10 steps? First step" -n 100
Log start
main: build = 1359 (e4efbdb)
main: built with cc (GCC) 10.2.1 20201112 (Red Hat 10.2.1-8) for s390x-redhat-linux
main: seed  = 1696946092
llama_model_loader: loaded meta data with 18 key-value pairs and 291 tensors from /aivol/cqy/gguf-s390/Baichuan2-7B-Chat-f16-convert.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  4096, 125696,     1,     1 ]
llama_model_loader: - tensor    1:         blk.0.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    7:         blk.1.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    8:            blk.1.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    9:            blk.1.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   11:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   12:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   13:         blk.2.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   14:            blk.2.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   15:            blk.2.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   16:              blk.2.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   17:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   19:         blk.3.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   20:            blk.3.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   21:            blk.3.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   22:              blk.3.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   23:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   24:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   25:         blk.4.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   26:            blk.4.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   27:            blk.4.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   28:              blk.4.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   29:           blk.4.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   30:            blk.4.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   31:         blk.5.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   32:            blk.5.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   33:            blk.5.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   34:              blk.5.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   35:           blk.5.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   36:            blk.5.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   37:         blk.6.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   38:            blk.6.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   39:            blk.6.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   40:              blk.6.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   41:           blk.6.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   42:            blk.6.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   43:         blk.7.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   44:            blk.7.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   45:            blk.7.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   46:              blk.7.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   47:           blk.7.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   48:            blk.7.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   49:         blk.8.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   50:            blk.8.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   51:            blk.8.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   52:              blk.8.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   53:           blk.8.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   54:            blk.8.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   55:         blk.9.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   56:            blk.9.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   57:            blk.9.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   58:              blk.9.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   59:           blk.9.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   60:            blk.9.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   61:        blk.10.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   62:           blk.10.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   63:           blk.10.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   64:             blk.10.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   65:          blk.10.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   66:           blk.10.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   67:        blk.11.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   68:           blk.11.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   69:           blk.11.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   70:             blk.11.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   71:          blk.11.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   72:           blk.11.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   73:        blk.12.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   74:           blk.12.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   75:           blk.12.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   76:             blk.12.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   77:          blk.12.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   78:           blk.12.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   79:        blk.13.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   80:           blk.13.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   81:           blk.13.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   82:             blk.13.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   83:          blk.13.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   84:           blk.13.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   85:        blk.14.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   86:           blk.14.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   87:           blk.14.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   88:             blk.14.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   89:          blk.14.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   90:           blk.14.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   91:        blk.15.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   92:           blk.15.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   93:           blk.15.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   94:             blk.15.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   95:          blk.15.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   96:           blk.15.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   97:        blk.16.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   98:           blk.16.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   99:           blk.16.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  100:             blk.16.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  101:          blk.16.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  102:           blk.16.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  103:        blk.17.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  104:           blk.17.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  105:           blk.17.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  106:             blk.17.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  107:          blk.17.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  108:           blk.17.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  109:        blk.18.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  110:           blk.18.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  111:           blk.18.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  112:             blk.18.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  113:          blk.18.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  114:           blk.18.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  115:        blk.19.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  116:           blk.19.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  117:           blk.19.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  118:             blk.19.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  119:          blk.19.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  120:           blk.19.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  121:        blk.20.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  122:           blk.20.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  123:           blk.20.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  124:             blk.20.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  125:          blk.20.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  126:           blk.20.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  127:        blk.21.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  128:           blk.21.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  129:           blk.21.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  130:             blk.21.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  131:          blk.21.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  132:           blk.21.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  133:        blk.22.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  134:           blk.22.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  135:           blk.22.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  136:             blk.22.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  137:          blk.22.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  138:           blk.22.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  139:        blk.23.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  140:           blk.23.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  141:           blk.23.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  142:             blk.23.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  143:          blk.23.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  144:           blk.23.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  145:        blk.24.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  146:           blk.24.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  147:           blk.24.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  148:             blk.24.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  149:          blk.24.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  150:           blk.24.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  151:        blk.25.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  152:           blk.25.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  153:           blk.25.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  154:             blk.25.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  155:          blk.25.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  156:           blk.25.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  157:        blk.26.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  158:           blk.26.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  159:           blk.26.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  160:             blk.26.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  161:          blk.26.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  162:           blk.26.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  163:        blk.27.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  164:           blk.27.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  165:           blk.27.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  166:             blk.27.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  167:          blk.27.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  168:           blk.27.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  169:        blk.28.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  170:           blk.28.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  171:           blk.28.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  172:             blk.28.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  173:          blk.28.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  174:           blk.28.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  175:        blk.29.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  176:           blk.29.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  177:           blk.29.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  178:             blk.29.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  179:          blk.29.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  180:           blk.29.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  181:        blk.30.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  182:           blk.30.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  183:           blk.30.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  184:             blk.30.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  185:          blk.30.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  186:           blk.30.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  187:        blk.31.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  188:           blk.31.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  189:           blk.31.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor  190:             blk.31.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor  191:          blk.31.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  192:           blk.31.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  193:               output_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor  194:                    output.weight f16      [  4096, 125696,     1,     1 ]
llama_model_loader: - tensor  195:              blk.0.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  196:              blk.0.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  197:              blk.0.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  198:              blk.1.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  199:              blk.1.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  200:              blk.1.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  201:              blk.2.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  202:              blk.2.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  203:              blk.2.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  204:              blk.3.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  205:              blk.3.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  206:              blk.3.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  207:              blk.4.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  208:              blk.4.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  209:              blk.4.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  210:              blk.5.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  211:              blk.5.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  212:              blk.5.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  213:              blk.6.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  214:              blk.6.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  215:              blk.6.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  216:              blk.7.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  217:              blk.7.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  218:              blk.7.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  219:              blk.8.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  220:              blk.8.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  221:              blk.8.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  222:              blk.9.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  223:              blk.9.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  224:              blk.9.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  225:             blk.10.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  226:             blk.10.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  227:             blk.10.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  228:             blk.11.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  229:             blk.11.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  230:             blk.11.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  231:             blk.12.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  232:             blk.12.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  233:             blk.12.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  234:             blk.13.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  235:             blk.13.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  236:             blk.13.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  237:             blk.14.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  238:             blk.14.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  239:             blk.14.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  240:             blk.15.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  241:             blk.15.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  242:             blk.15.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  243:             blk.16.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  244:             blk.16.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  245:             blk.16.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  246:             blk.17.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  247:             blk.17.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  248:             blk.17.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  249:             blk.18.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  250:             blk.18.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  251:             blk.18.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  252:             blk.19.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  253:             blk.19.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  254:             blk.19.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  255:             blk.20.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  256:             blk.20.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  257:             blk.20.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  258:             blk.21.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  259:             blk.21.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  260:             blk.21.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  261:             blk.22.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  262:             blk.22.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  263:             blk.22.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  264:             blk.23.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  265:             blk.23.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  266:             blk.23.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  267:             blk.24.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  268:             blk.24.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  269:             blk.24.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  270:             blk.25.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  271:             blk.25.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  272:             blk.25.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  273:             blk.26.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  274:             blk.26.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  275:             blk.26.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  276:             blk.27.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  277:             blk.27.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  278:             blk.27.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  279:             blk.28.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  280:             blk.28.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  281:             blk.28.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  282:             blk.29.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  283:             blk.29.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  284:             blk.29.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  285:             blk.30.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  286:             blk.30.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  287:             blk.30.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  288:             blk.31.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  289:             blk.31.attn_k.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  290:             blk.31.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                baichuan.tensor_data_layout str
llama_model_loader: - kv   3:                    baichuan.context_length u32
llama_model_loader: - kv   4:                  baichuan.embedding_length u32
llama_model_loader: - kv   5:                       baichuan.block_count u32
llama_model_loader: - kv   6:               baichuan.feed_forward_length u32
llama_model_loader: - kv   7:              baichuan.rope.dimension_count u32
llama_model_loader: - kv   8:              baichuan.attention.head_count u32
llama_model_loader: - kv   9:           baichuan.attention.head_count_kv u32
llama_model_loader: - kv  10:  baichuan.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = baichuan
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 125696
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly F16 (guessed)
llm_load_print_meta: model params     = 7.51 B
llm_load_print_meta: model size       = 13.98 GiB (16.00 BPW)
llm_load_print_meta: general.name   = Baichuan2-7B-Chat
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.10 MB
llm_load_tensors: mem required  = 14317.11 MB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size = 259.63 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 100, n_keep = 0


 How to Build Web Site With 10 steps? First step: Choose a Domain Name _ Last Step: Launch Your Website Online _ Continue reading "How To Build A Web Site In Ten Steps" The first step towards building an effective website is selecting domain name appropriately related _ Continued Read More _ Building websites yourself can save money compared to hiring someone else _ especially if you_re doing it part time or learning skills along the way _ but there are still steps involved_ Here they
_ Continue reading "Build Your Own Website: Steps To Success" Selecting a
llama_print_timings:        load time =    5102.90 ms
llama_print_timings:      sample time =     174.25 ms /   100 runs   (    1.74 ms per token,   573.90 tokens per second)
llama_print_timings: prompt eval time =   34672.27 ms /    14 tokens ( 2476.59 ms per token,     0.40 tokens per second)
llama_print_timings:        eval time =  248795.47 ms /    99 runs   ( 2513.09 ms per token,     0.40 tokens per second)
llama_print_timings:       total time =  283723.37 ms
Log end

@ggerganov (Owner) commented Oct 10, 2023

@ggerganov is there any other information I should provide?

No, just want to give it some time to see if other people have an opinion on the changes. Will merge this tomorrow

@ggerganov requested a review from monatis on October 11, 2023 06:55
@monatis (Collaborator) left a comment

I think this requires bumping the GGUF version, because the current spec is explicit about little-endianness. The spec should also be updated to reflect this change. We cannot simply trust that people will not distribute big-endian files.

And of course bump the package version in pyproject.toml

gguf-py/gguf/gguf.py (outdated review comment)
self.fout.write(struct.pack("<I", GGUF_VERSION))
self.fout.write(struct.pack("<Q", self.ti_data_count))
self.fout.write(struct.pack("<Q", self.kv_data_count))
self.fout.write(struct.pack(f"{self.get_pack_prefix()}I", GGUF_MAGIC))
@cebtenzzre (Collaborator) commented:

The magic is meant to be exactly the ascii bytes G G U F in the file, regardless of the system endianness.

@chenqiny (Contributor Author) replied:

I think this requires bumping the GGUF version, because the current spec is explicit about little-endianness. The spec should also be updated to reflect this change. We cannot simply trust that people will not distribute big-endian files.

And of course bump the package version in pyproject.toml

I suggest checking the magic code instead. If the endianness does not match, the magic code reads as 0x47475546; then we can warn the user: "Endianess of the GGUF file and platform do not match".

Suggested change
self.fout.write(struct.pack(f"{self.get_pack_prefix()}I", GGUF_MAGIC))
diff --git a/ggml.c b/ggml.c
index 6d1776c..04b88c9 100644
--- a/ggml.c
+++ b/ggml.c
@@ -20916,7 +20916,13 @@ struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_p
gguf_fread_el(file, &magic, sizeof(magic), &offset);
if (magic != GGUF_MAGIC) {
- fprintf(stderr, "%s: invalid magic number %08x\n", __func__, magic);
+ if (magic == GGUF_WRONG_ENIAN_MAGIC)
+ {
+ fprintf(stderr, "Endianess of the GGUF file and platform do not match.%s: invalid magic number %08x.\n", __func__, magic);
+ }
+ else {
+ fprintf(stderr, "%s: invalid magic number %08x\n", __func__, magic);
+ }
fclose(file);
return NULL;
}
diff --git a/ggml.h b/ggml.h
index 3eddc44..2ecf893 100644
--- a/ggml.h
+++ b/ggml.h
@@ -232,6 +232,7 @@
#define GGML_EXIT_ABORTED 1
#define GGUF_MAGIC 0x46554747 // "GGUF"
+#define GGUF_WRONG_ENIAN_MAGIC 0x47475546
#define GGUF_VERSION 2
#define GGUF_DEFAULT_ALIGNMENT 32

Result after applying the fix:

~/code/aiu/work/llama.cpp> ./main -m ~/gguf-s390/Baichuan-7B-f16.gguf
Log start
main: build = 1360 (51e9d39)
main: built with cc (SUSE Linux) 7.5.0 for x86_64-suse-linux
main: seed  = 1697040195
Endianess of the GGUF file and platform do not match.gguf_init_from_file: invalid magic number 47475546.
error loading model: llama_model_loader: failed to load model from /home/cqy/gguf-s390/Baichuan-7B-f16.gguf
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/cqy/gguf-s390/Baichuan-7B-f16.gguf'
main: error: unable to load model

@cebtenzzre (Collaborator) commented:

If you do want to start loading and saving files that start with F U G G (look in a hex editor), you will have to request a spec change, because that's no longer a GGUF file by its current definition.

@chenqiny (Contributor Author) replied:

@cebtenzzre 

I added an endianness check.

@chenqiny (Contributor Author)

If you do want to start loading and saving files that start with F U G G (look in a hex editor), you will have to request a spec change, because that's no longer a GGUF file by its current definition.

@cebtenzzre This depends on whether we think the magic number is a number or a string.

ggml.c reads the magic number as a uint32_t, which is endianness-sensitive. If we treat the magic as an integer, I think my update is compatible with the spec. But if we treat it as a string, then we need to update ggml.h, ggml.c, and gguf.py.

@ggerganov what's your opinion?

struct gguf_header {
    uint32_t magic;
    uint32_t version;
    uint64_t n_tensors; // GGUFv2
    uint64_t n_kv;      // GGUFv2
};
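For illustration (not part of the patch), a quick Python check shows why the same four on-disk bytes come back as two different integers depending on the byte order used to interpret them:

import struct

magic_bytes = b"GGUF"  # the four bytes at the start of every GGUF file
print(hex(struct.unpack("<I", magic_bytes)[0]))  # 0x46554747 when read as little-endian
print(hex(struct.unpack(">I", magic_bytes)[0]))  # 0x47475546 when read as big-endian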

@ggerganov (Owner)

Isn't it better to fix ggml.c to read and write the magic byte-per-byte to match the spec?
Currently, technically, it does not match the spec.

@chenqiny (Contributor Author)

Isn't it better to fix ggml.c to read and write the magic byte-per-byte to match the spec? Currently, technically it does not match the spec

@ggerganov @cebtenzzre

I appreciate your comments.

Yes. Let me clarify my update. I changed ggml.h to use a different integer magic value depending on endianness, so that it always represents the "GGUF" characters in the file. The file is now always compatible with the spec: a big-endian GGUF file starts with "GGUF" just as a little-endian GGUF file does.

See the hexdump of a llama2 gguf file on s390x (big endian):

[aiu gguf-s390]$  hexdump -C gguf-s390/llama-2-7b-f16-new.gguf|head -n 20
00000000  47 47 55 46 00 00 00 03  00 00 00 00 00 00 01 23  |GGUF...........#|
00000010  00 00 00 00 00 00 00 0f  00 00 00 00 00 00 00 14  |................|
00000020  67 65 6e 65 72 61 6c 2e  61 72 63 68 69 74 65 63  |general.architec|
00000030  74 75 72 65 00 00 00 08  00 00 00 00 00 00 00 05  |ture............|
00000040  6c 6c 61 6d 61 00 00 00  00 00 00 00 0c 67 65 6e  |llama........gen|
00000050  65 72 61 6c 2e 6e 61 6d  65 00 00 00 08 00 00 00  |eral.name.......|
00000060  00 00 00 00 08 4c 4c 61  4d 41 20 76 32 00 00 00  |.....LLaMA v2...|
00000070  00 00 00 00 14 6c 6c 61  6d 61 2e 63 6f 6e 74 65  |.....llama.conte|
00000080  78 74 5f 6c 65 6e 67 74  68 00 00 00 04 00 00 10  |xt_length.......|
00000090  00 00 00 00 00 00 00 00  16 6c 6c 61 6d 61 2e 65  |.........llama.e|
000000a0  6d 62 65 64 64 69 6e 67  5f 6c 65 6e 67 74 68 00  |mbedding_length.|
000000b0  00 00 04 00 00 10 00 00  00 00 00 00 00 00 11 6c  |...............l|
000000c0  6c 61 6d 61 2e 62 6c 6f  63 6b 5f 63 6f 75 6e 74  |lama.block_count|
000000d0  00 00 00 04 00 00 00 20  00 00 00 00 00 00 00 19  |....... ........|
000000e0  6c 6c 61 6d 61 2e 66 65  65 64 5f 66 6f 72 77 61  |llama.feed_forwa|
000000f0  72 64 5f 6c 65 6e 67 74  68 00 00 00 04 00 00 2b  |rd_length......+|
00000100  00 00 00 00 00 00 00 00  1a 6c 6c 61 6d 61 2e 72  |.........llama.r|
00000110  6f 70 65 2e 64 69 6d 65  6e 73 69 6f 6e 5f 63 6f  |ope.dimension_co|
00000120  75 6e 74 00 00 00 04 00  00 00 80 00 00 00 00 00  |unt.............|
00000130  00 00 1a 6c 6c 61 6d 61  2e 61 74 74 65 6e 74 69  |...llama.attenti|

And I rolled back the line that writes GGUF_MAGIC in gguf.py, so the magic is always written as the same byte sequence.

    def write_header_to_file(self):
        self.fout.write(struct.pack("<I", GGUF_MAGIC))
        self.fout.write(struct.pack(f"{self.pack_prefix}I", GGUF_VERSION))
        self.fout.write(struct.pack(f"{self.pack_prefix}Q", self.ti_data_count))
        self.fout.write(struct.pack(f"{self.pack_prefix}Q", self.kv_data_count))
        self.flush()
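For reference, a trimmed-down, hypothetical sketch of how the writer side could fit together; the GGUFEndian enum and the pack_prefix property are assumptions based on the snippets quoted in this thread, and the real gguf.py may differ in details. Writing the magic as the raw bytes b"GGUF" is equivalent to the struct.pack("<I", 0x46554747) call above.

import struct
from enum import IntEnum

class GGUFEndian(IntEnum):  # assumed shape of the enum referenced elsewhere in this thread
    LITTLE = 0
    BIG = 1

class MiniGGUFWriter:
    """Hypothetical, minimal writer used only to illustrate the header logic."""

    def __init__(self, fout, endianess: GGUFEndian = GGUFEndian.LITTLE):
        self.fout = fout
        self.endianess = endianess
        self.ti_data_count = 0
        self.kv_data_count = 0

    @property
    def pack_prefix(self) -> str:
        # struct format prefix: "<" packs little-endian, ">" packs big-endian
        return "<" if self.endianess == GGUFEndian.LITTLE else ">"

    def write_header_to_file(self) -> None:
        self.fout.write(b"GGUF")                                  # magic: always the same raw bytes
        self.fout.write(struct.pack(f"{self.pack_prefix}I", 3))   # GGUF version (v3 after this PR)
        self.fout.write(struct.pack(f"{self.pack_prefix}Q", self.ti_data_count))
        self.fout.write(struct.pack(f"{self.pack_prefix}Q", self.kv_data_count))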

@ggerganov (Owner)

Yes, this works, but I wish to avoid the ifdef in the header and the inclusion of extra headers (endian.h).
We should implement the multi-character constant alternative as proposed by @cebtenzzre, and instead of reading/writing the uint32_t at once, we should read/write byte-by-byte and compare against the multi-byte constant.
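The real change lands in C inside ggml.c, but the idea can be sketched in a few lines of Python, assuming a file path rather than an already-open handle:

def gguf_magic_ok(path: str) -> bool:
    # Read the first four bytes raw and compare them to the character
    # constant "GGUF": no integer interpretation, hence no endianness issue.
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"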

@chenqiny (Contributor Author)

@ggerganov @monatis 

I like this choice. Previously I thought this change might be too big.

I will also need to change the magic in struct gguf_header to a char array.

If you agree, I will update it according to your comments.

struct gguf_header {
    uint32_t magic; => char magic[4];
    uint32_t version;
    uint64_t n_tensors; // GGUFv2
    uint64_t n_kv;      // GGUFv2
};

@chenqiny (Contributor Author)

@ggerganov I updated it according to your comments:

1. check in ggml.c if endianess is not match
2. update GGUF version
3. change get_pack_prefix to property
4. update information log
@chenqiny requested a review from monatis on October 12, 2023 16:28
@chenqiny (Contributor Author)

@monatis is it possible to review the changes made according to your comments?

Thank you.

@monatis (Collaborator) left a comment

I also want to bring this PR to the community's attention in ggerganov/ggml#302.

convert.py (outdated review comment)
@@ -932,6 +932,8 @@ def write_all(fname_out: Path, ftype: GGMLFileType, params: Params, model: LazyM
elapsed = time.time() - start
size = ' x '.join(f"{dim:6d}" for dim in lazy_tensor.shape)
padi = len(str(len(model)))
if endianess==gguf.GGUFEndian.BIG:
ndarray.byteswap(inplace=True)
@monatis (Collaborator) commented:

This should be handled in GGUFWriter.write_tensor_data, just like you do in add_tensor. The conversion script should have no responsibility for handling endianness other than setting it in the constructor.
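For illustration, the suggested split of responsibility amounts to something like the sketch below; the GGUFEndian name mirrors the convert.py diff above, and the actual gguf.py method may differ:

from enum import IntEnum
import numpy as np

class GGUFEndian(IntEnum):  # mirrors the enum used in the convert.py diff above
    LITTLE = 0
    BIG = 1

def write_tensor_data(fout, ndarray: np.ndarray, endianess: GGUFEndian) -> None:
    # The writer, not the conversion script, swaps multi-byte elements
    # (f16/f32 weights) so the bytes on disk match the target byte order.
    if endianess == GGUFEndian.BIG:
        ndarray.byteswap(inplace=True)
    ndarray.tofile(fout)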

@chenqiny (Contributor Author) replied:

@monatis updated per your comments.

ggml.h (outdated review comment)
#if BYTE_ORDER == LITTLE_ENDIAN
#define GGUF_MAGIC 0x46554747
#elif BYTE_ORDER == BIG_ENDIAN
#define GGUF_MAGIC 0x47475546
A Collaborator commented:

I think we should either have a comment here to explain it's the same byte sequence in the file or (maybe even better) read raw bytes as Georgi suggested.

@chenqiny (Contributor Author) replied:

Now I have changed it to a char string.

1.  Set GGUF_MAGIC to "GGUF" string instead of int value
2. Compare "GGUF" char by char to ensure its byte order
3. Move bytes swap code from convert.py to gguf.py write_tensor_data
@chenqiny (Contributor Author) commented Oct 20, 2023

@ggerganov @monatis 

Thank you for your kind review comments.

I made the following updates based on this discussion:

  1. Set GGUF_MAGIC to the "GGUF" string instead of an int value.
  2. Compare "GGUF" char by char to ensure the header magic has the correct byte order.
  3. Move the byte-swap code from convert.py to the gguf.py write_tensor_data function.

Is it possible to review them again? Thanks.

@ggerganov requested a review from monatis on October 20, 2023 11:05
@ggerganov (Owner) left a comment

ggml changes are OK. Waiting for @monatis's review, and then we can merge.

@monatis (Collaborator) left a comment

LGTM. Thanks, this is good to merge now.

@ggerganov merged commit 8cf19d6 into ggerganov:master on Oct 20, 2023
31 checks passed
cebtenzzre pushed a commit to nomic-ai/llama.cpp that referenced this pull request Oct 27, 2023
* check whether platform is 390x if yes->do not import immintrin.h

* support s390x big endian

* support --bigendian option for s390x
1. verified with baichuan7b-chat with float 16 on s390x
2. verified with baichuan7b-chat
3. verified with chinese-alpaca-2-13b-f16

* update format based on editor-config checker result

* Update convert-baichuan-hf-to-gguf.py

* 1. check in ggml.c if endianess is not match
2. update GGUF version
3. change get_pack_prefix to property
4. update information log

* always use "GGUF" as beginng of GGUF file

* Compare "GGUF" with file header char by char
1.  Set GGUF_MAGIC to "GGUF" string instead of int value
2. Compare "GGUF" char by char to ensure its byte order
3. Move bytes swap code from convert.py to gguf.py write_tensor_data

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@cebtenzzre (Collaborator) commented Oct 31, 2023

From ggerganov/ggml#302 (comment):

As far as I can tell, the only functional difference is the version number has changed, which is... not ideal? How do you tell apart a little-endian and big-endian file? I've updated to v3 nonetheless and written a little about it, but this seems like something we should rectify ASAP.

This is something I noticed while implementing support downstream - all I had to do was accept GGUF files with version number 3, which simply acknowledges that the program may crash now if someone tries to load a big-endian file.
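To illustrate the problem: since the magic bytes are now identical in both variants, only the byte-swapped integer fields reveal the file's byte order. A downstream loader could, in principle, fall back on a heuristic like the following sketch (not something this PR or the spec defines):

import struct

def guess_gguf_byte_order(path: str) -> str:
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        raw_version = f.read(4)
    little = struct.unpack("<I", raw_version)[0]
    big = struct.unpack(">I", raw_version)[0]
    # A plausible version number is a small integer (e.g. 3); read with the
    # wrong byte order it becomes a huge value such as 0x03000000.
    return "little-endian" if little <= big else "big-endian"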

Labels: need feedback (Testing and feedback with results are needed)
Successfully merging this pull request may close these issues: [User] Failed to execute any models on s390x (#3298)
4 participants