Enable llama.cpp on s390x big endian platform #3552
Conversation
1. verified with baichuan7b-chat with float 16 on s390x
2. verified with baichuan7b-chat
3. verified with chinese-alpaca-2-13b-f16
I think we should merge this. Any thoughts?
Obviously, this produces a different set of models, but I guess it wouldn't be a big issue as little-endian is the default and I don't expect anyone to start distributing big-endian models.
Yes. I am glad this is approved. Thank you. This gives users an option to generate a big-endian GGUF model from the raw model themselves. As a next step, I am going to study the ggml code and think about how to enable s390x SIMD and the AI accelerator.
Ok, just to make sure I understand - with the big endian model and without s390x SIMD, the inference works correctly, right?
Yes. It works very well on the s390x big endian system without SIMD or AI accelerator code support. It is faster than I expected.
@ggerganov is there any other information I should provide? I saw the need_feedback tag. Here is the make test result:
On s390x, 2 tests failed, which is expected: those two test models are little endian.
Baichuan2 7B result on s390x
@ggerganov is there any other information I should provide?
No, just want to give it some time to see if other people have an opinion on the changes. Will merge this tomorrow.
I think this requires bumping the GGUF version because the current spec is explicit about little endianness. The spec should also be updated to reflect this change. We cannot simply trust that people do not distribute big-endian files.
And of course bump the package version in pyproject.toml
gguf-py/gguf/gguf.py
Outdated
self.fout.write(struct.pack("<I", GGUF_VERSION))
self.fout.write(struct.pack("<Q", self.ti_data_count))
self.fout.write(struct.pack("<Q", self.kv_data_count))
self.fout.write(struct.pack(f"{self.get_pack_prefix()}I", GGUF_MAGIC))
The magic is meant to be exactly the ASCII bytes G G U F in the file, regardless of the system endianness.
I think this requires bumping the GGUF version because the current spec is explicit about little endianness. The spec should also be updated to reflect this change. We cannot simply trust that people do not distribute big-endian files.
And of course bump the package version in pyproject.toml
I suggest checking the magic code instead. If the endianness does not match, the magic code reads as 0x47475546, and we can warn the user: "Endianess of the GGUF file and platform do not match".
diff --git a/ggml.c b/ggml.c
index 6d1776c..04b88c9 100644
--- a/ggml.c
+++ b/ggml.c
@@ -20916,7 +20916,13 @@ struct gguf_context * gguf_init_from_file(const char * fname, struct gguf_init_p
     gguf_fread_el(file, &magic, sizeof(magic), &offset);
     if (magic != GGUF_MAGIC) {
-        fprintf(stderr, "%s: invalid magic number %08x\n", __func__, magic);
+        if (magic == GGUF_WRONG_ENIAN_MAGIC)
+        {
+            fprintf(stderr, "Endianess of the GGUF file and platform do not match.%s: invalid magic number %08x.\n", __func__, magic);
+        }
+        else {
+            fprintf(stderr, "%s: invalid magic number %08x\n", __func__, magic);
+        }
         fclose(file);
         return NULL;
     }
diff --git a/ggml.h b/ggml.h
index 3eddc44..2ecf893 100644
--- a/ggml.h
+++ b/ggml.h
@@ -232,6 +232,7 @@
 #define GGML_EXIT_ABORTED 1
 #define GGUF_MAGIC 0x46554747 // "GGUF"
+#define GGUF_WRONG_ENIAN_MAGIC 0x47475546
 #define GGUF_VERSION 2
 #define GGUF_DEFAULT_ALIGNMENT 32
Result after applying the fix:
~/code/aiu/work/llama.cpp> ./main -m ~/gguf-s390/Baichuan-7B-f16.gguf
Log start
main: build = 1360 (51e9d39)
main: built with cc (SUSE Linux) 7.5.0 for x86_64-suse-linux
main: seed = 1697040195
Endianess of the GGUF file and platform do not match.gguf_init_from_file: invalid magic number 47475546.
error loading model: llama_model_loader: failed to load model from /home/cqy/gguf-s390/Baichuan-7B-f16.ggufllama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/cqy/gguf-s390/Baichuan-7B-f16.gguf'
main: error: unable to load model
If you do want to start loading and saving files that start with F U G G (look in a hex editor), you will have to request a spec change, because that's no longer a GGUF file by its current definition.
I added an endianness check.
If you do want to start loading and saving files that start with F U G G (look in a hex editor), you will have to request a spec change, because that's no longer a GGUF file by its current definition.
@cebtenzzre This depends on whether we think the magic number is a number or a string.
ggml.c reads the magic number as a uint32_t, which is endianness sensitive. If we treat the magic as an integer type, I think my update is compatible with the spec. But if we treat it as a string type, then we need to update ggml.h, ggml.c, and gguf.py (see the current header struct and the sketch below).
@ggerganov what's your opinion?
struct gguf_header {
    uint32_t magic;
    uint32_t version;
    uint64_t n_tensors; // GGUFv2
    uint64_t n_kv;      // GGUFv2
};
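For illustration, a standalone Python sketch (not code from this PR) of why the uint32_t read is endianness sensitive: the same four "GGUF" bytes decode to two different integers depending on the byte order used to read them.

import struct

magic_bytes = b"GGUF"  # the four ASCII bytes at the start of a GGUF file

# Read as a little-endian uint32 the bytes give 0x46554747 (the current GGUF_MAGIC);
# read as a big-endian uint32 the very same bytes give 0x47475546.
le_value = struct.unpack("<I", magic_bytes)[0]
be_value = struct.unpack(">I", magic_bytes)[0]

print(hex(le_value))  # 0x46554747
print(hex(be_value))  # 0x47475546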
Isn't it better to fix ggml.c to read and write the magic byte-per-byte to match the spec? Currently, technically it does not match the spec.
Isn't it better to fix ggml.c to read and write the magic byte-per-byte to match the spec? Currently, technically it does not match the spec.
Appreciate your comments.
Yes. Let me clarify my update. I changed ggml.h to use a different integer magic value depending on endianness, so that the value always represents the "GGUF" characters in the file. Now the file is always compatible with the spec: a big-endian GGUF file starts with "GGUF" just like a little-endian GGUF file does.
See the hexdump of a llama2 GGUF file on s390x big endian:
[aiu gguf-s390]$ hexdump -C gguf-s390/llama-2-7b-f16-new.gguf|head -n 20
00000000 47 47 55 46 00 00 00 03 00 00 00 00 00 00 01 23 |GGUF...........#|
00000010 00 00 00 00 00 00 00 0f 00 00 00 00 00 00 00 14 |................|
00000020 67 65 6e 65 72 61 6c 2e 61 72 63 68 69 74 65 63 |general.architec|
00000030 74 75 72 65 00 00 00 08 00 00 00 00 00 00 00 05 |ture............|
00000040 6c 6c 61 6d 61 00 00 00 00 00 00 00 0c 67 65 6e |llama........gen|
00000050 65 72 61 6c 2e 6e 61 6d 65 00 00 00 08 00 00 00 |eral.name.......|
00000060 00 00 00 00 08 4c 4c 61 4d 41 20 76 32 00 00 00 |.....LLaMA v2...|
00000070 00 00 00 00 14 6c 6c 61 6d 61 2e 63 6f 6e 74 65 |.....llama.conte|
00000080 78 74 5f 6c 65 6e 67 74 68 00 00 00 04 00 00 10 |xt_length.......|
00000090 00 00 00 00 00 00 00 00 16 6c 6c 61 6d 61 2e 65 |.........llama.e|
000000a0 6d 62 65 64 64 69 6e 67 5f 6c 65 6e 67 74 68 00 |mbedding_length.|
000000b0 00 00 04 00 00 10 00 00 00 00 00 00 00 00 11 6c |...............l|
000000c0 6c 61 6d 61 2e 62 6c 6f 63 6b 5f 63 6f 75 6e 74 |lama.block_count|
000000d0 00 00 00 04 00 00 00 20 00 00 00 00 00 00 00 19 |....... ........|
000000e0 6c 6c 61 6d 61 2e 66 65 65 64 5f 66 6f 72 77 61 |llama.feed_forwa|
000000f0 72 64 5f 6c 65 6e 67 74 68 00 00 00 04 00 00 2b |rd_length......+|
00000100 00 00 00 00 00 00 00 00 1a 6c 6c 61 6d 61 2e 72 |.........llama.r|
00000110 6f 70 65 2e 64 69 6d 65 6e 73 69 6f 6e 5f 63 6f |ope.dimension_co|
00000120 75 6e 74 00 00 00 04 00 00 00 80 00 00 00 00 00 |unt.............|
00000130 00 00 1a 6c 6c 61 6d 61 2e 61 74 74 65 6e 74 69 |...llama.attenti|
And I rolled back the line that writes GGUF_MAGIC in gguf.py, so the magic is always written with a fixed byte order:
def write_header_to_file(self):
    self.fout.write(struct.pack("<I", GGUF_MAGIC))
    self.fout.write(struct.pack(f"{self.pack_prefix}I", GGUF_VERSION))
    self.fout.write(struct.pack(f"{self.pack_prefix}Q", self.ti_data_count))
    self.fout.write(struct.pack(f"{self.pack_prefix}Q", self.kv_data_count))
    self.flush()
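As a quick sanity check (standalone Python, not part of gguf.py), packing the integer constant with the fixed "<I" format produces the literal "GGUF" byte sequence on any host, little or big endian:

import struct

GGUF_MAGIC = 0x46554747

# "<I" pins the byte order to little endian regardless of the host CPU,
# so the file always begins with the bytes 0x47 0x47 0x55 0x46, i.e. "GGUF".
assert struct.pack("<I", GGUF_MAGIC) == b"GGUF"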
Yes, this works, but I wish to avoid the ifdef in the header and the inclusion of extra headers (endian.h). We should implement the multi-character constant alternative as proposed by @cebtenzzre, and instead of reading / writing the uint32_t at once, we should read / write byte-by-byte and compare against the multi-character constant.
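A minimal sketch of that byte-by-byte idea, written here in Python for brevity (the actual change lands in C in ggml.c; the helper name is illustrative): read the first four raw bytes and compare them directly against the ASCII sequence, so no endian.h include or ifdef is needed.

def has_gguf_magic(path: str) -> bool:
    # Raw bytes have no endianness, so comparing them to b"GGUF" behaves
    # identically on little- and big-endian hosts.
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"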
I like this choice. Previously I thought this change might be too big.
I will also need to change the magic in struct gguf_header to a char array.
If you agree, I will update according to your comments.
struct gguf_header {
    uint32_t magic;     => char magic[4];
    uint32_t version;
    uint64_t n_tensors; // GGUFv2
    uint64_t n_kv;      // GGUFv2
};
@ggerganov updated according to your comments
1. check in ggml.c if endianess is not match
2. update GGUF version
3. change get_pack_prefix to property
4. update information log
@monatis is it possible to review the changes according to your comments? Thank you.
I also want to bring this PR to the community's attention in ggerganov/ggml#302
convert.py
Outdated
@@ -932,6 +932,8 @@ def write_all(fname_out: Path, ftype: GGMLFileType, params: Params, model: LazyM
     elapsed = time.time() - start
     size = ' x '.join(f"{dim:6d}" for dim in lazy_tensor.shape)
     padi = len(str(len(model)))
+    if endianess == gguf.GGUFEndian.BIG:
+        ndarray.byteswap(inplace=True)
This should be handled in GGUFWriter.write_tensor_data, just like you do in add_tensor. The conversion script should have no responsibility for handling endianness other than setting it in the constructor.
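A rough sketch of that separation of concerns (hypothetical class and defaults, assuming numpy tensors and a GGUFEndian-style enum as introduced in this PR; this is not the real gguf.GGUFWriter):

import enum
import numpy as np

class GGUFEndian(enum.IntEnum):
    # illustrative stand-in for the enum added by this PR
    LITTLE = 0
    BIG = 1

class WriterSketch:
    def __init__(self, fout, endianess: GGUFEndian = GGUFEndian.LITTLE):
        self.fout = fout
        self.endianess = endianess

    def write_tensor_data(self, ndarray: np.ndarray) -> None:
        # Byteswap only when targeting a big-endian file, so the conversion
        # script never has to deal with byte order beyond picking the target.
        if self.endianess == GGUFEndian.BIG:
            ndarray.byteswap(inplace=True)
        self.fout.write(ndarray.tobytes())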
@monatis updated according to your comments.
ggml.h
Outdated
#if BYTE_ORDER == LITTLE_ENDIAN
#define GGUF_MAGIC 0x46554747
#elif BYTE_ORDER == BIG_ENDIAN
#define GGUF_MAGIC 0x47475546
I think we should either have a comment here to explain it's the same byte sequence in the file or (maybe even better) read raw bytes as Georgi suggested.
Now I changed it to a char string.
1. Set GGUF_MAGIC to "GGUF" string instead of int value
2. Compare "GGUF" char by char to ensure its byte order
3. Move bytes swap code from convert.py to gguf.py write_tensor_data
Thank you for your kind review comments. I made these updates according to the discussion.
Is it possible to review them again? Thanks.
The ggml changes are OK. Waiting for @monatis review and we can merge.
LGTM. Thanks, this is good to merge now.
* check whether platform is 390x if yes->do not import immintrin.h
* support s390x big endian
* support --bigendian option for s390x
  1. verified with baichuan7b-chat with float 16 on s390x
  2. verified with baichuan7b-chat
  3. verified with chinese-alpaca-2-13b-f16
* update format based on editor-config checker result
* Update convert-baichuan-hf-to-gguf.py
* 1. check in ggml.c if endianess is not match
  2. update GGUF version
  3. change get_pack_prefix to property
  4. update information log
* always use "GGUF" as beginng of GGUF file
* Compare "GGUF" with file header char by char
  1. Set GGUF_MAGIC to "GGUF" string instead of int value
  2. Compare "GGUF" char by char to ensure its byte order
  3. Move bytes swap code from convert.py to gguf.py write_tensor_data
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
From ggerganov/ggml#302 (comment):
This is something I noticed while implementing support downstream - all I had to do was accept GGUF files with version number 3, which simply acknowledges that the program may crash now if someone tries to load a big-endian file.
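A tiny illustrative Python sketch (hypothetical downstream loader check, not any project's real code) of what "accept version 3" amounts to:

def check_gguf_version(version: int) -> None:
    # GGUF v3 only signals that the file may be big endian; a little-endian-only
    # loader can accept it and will simply fail later on an actual big-endian file.
    if version not in (1, 2, 3):
        raise ValueError(f"unsupported GGUF version: {version}")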
This pull request aims to enable the execution of llama.cpp on the s390x big endian platform. It includes the following changes:
- check whether the platform is s390x and, if so, do not import immintrin.h
- support s390x big endian GGUF models in ggml and gguf-py
- support a --bigendian option for s390x in the conversion scripts
Verification:
To validate the changes made in this pull request, the following verifications have been performed:
- verified with baichuan7b-chat with float 16 on s390x
- verified with baichuan7b-chat
- verified with chinese-alpaca-2-13b-f16
Please review this pull request and consider merging it into the main repository. Thank you!
Fixes #3298