[User] Failed to execute any models on s390x #3298
Comments
Cool, people are having IBM mainframes at home now :) The model magic is reversed, so that is indeed an endianness problem. This is either gonna be really easy, or particularly painful, to solve. You could try downloading a raw model and converting/quantizing it directly on that particular machine.

@KerfuffleV2 Do you happen to know if the convert scripts are endian-aware? Supposedly the pytorch library is, so in case convert can't do that, maybe "reconverting" the pytorch model would solve this? What do you think?
Pretty sure they are; they use numpy and Python's struct module. What I'd wonder about is actually loading the model. llama.cpp just mmaps the file, which is going to be full of little-endian data. I don't think there's any conversion or special handling to convert to big endian after loading, so... I'd expect that won't work so great, even if you could get past the file magic being wrong. You can try just hex-editing it to what is expected (or hacking the code to ignore the error and proceed). I doubt it'll work though; I think all the quantizations use at least some multi-byte values.
Basically, I found this: pytorch/pytorch#65300. In the comments they explain that pytorch can open models created on a machine with different endianness, which made me think: perhaps using pytorch on the target machine to load a model and write it back out to another file would "fix" the endianness, but then the pytorch model would have to be converted to gguf? I found #707 mentioning use of the old convert script for pytorch to ggml, which could then be converted to gguf.

As long as the model is in the correct endianness for the host, it shouldn't matter what endianness the host uses; all file and memory reads/writes will be compatibly symmetric. Even where the code reads values directly, the only thing which could break things is union-like downcasting, but I have no idea if it's used anywhere.

Edit: Ok, this will be a problem: Lines 374 to 390 in 36b904e
Unfortunately, that still wouldn't help you. I think probably the easiest "fix", if one can convert the file themselves, is to edit the struct format strings in gguf.py. Not too hard to replace, since they all start with "<". The current situation undoubtedly isn't ideal though; this is just a hack to get it working by any means.
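A minimal sketch of what that leading byte-order character controls: the same value gets mirrored bytes on disk depending on the prefix.

```
import struct

# Same integer, mirrored byte order on disk depending on the "<" / ">" prefix.
print(struct.pack("<Q", 0x0102030405060708).hex())  # '0807060504030201'
print(struct.pack(">Q", 0x0102030405060708).hex())  # '0102030405060708'
```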
To be honest, I really wanted this to work :) I'm absolutely in love with the concept of putting an LLM on museum-grade hardware. I've been hunting for a green-phosphor CRT on eBay for months for that exact reason :) Fallout-style terminal and stuff :)
I think what I suggested should get you there as far as the metadata-type stuff goes. When loading with Torch, it should deal with the endian stuff properly, since I assume that's handled in the pickle format. BUT writing it out just uses numpy's tofile: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.tofile.html - it doesn't care about endianness. I think only the conversion scripts need to change to make this work, and the conversion doesn't even have to happen on the big-endian machine. So getting it working should just be a matter of patching the conversion script and reconverting.
I'm actually not really sure what the best way to handle this outside of temporary hacks is, though. I guess add support in the C code for loading LE GGUF files on BE machines, at least enough to parse the GGUF metadata stuff. Could also add an endianness flag to the metadata and refuse to run GGUF files that are the wrong endianness. It wouldn't be impossible to convert the actual tensor data, but it would be a pain because the quantized formats have individual bytes interspersed with multi-byte fields like the f16 scales.
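To make that pain point concrete, here is a small sketch (the block layout below is only loosely modelled on a GGML quantized block, it is not taken from the code): only the multi-byte scale needs swapping, so a naive bulk byteswap of the whole tensor buffer would also scramble the single-byte quants.

```
import numpy as np

# Hypothetical quantized block: one 16-bit float scale plus 32 signed 8-bit quants.
block = np.dtype([("d", "<f2"), ("qs", "i1", (32,))])
data = np.zeros(2, dtype=block)
data["d"] = 0.5

print(data.tobytes()[:2].hex())   # '0038' -> little-endian fp16 bits of 0.5
data["d"] = data["d"].byteswap()  # swap only the scale field, leave the quants alone
print(data.tobytes()[:2].hex())   # '3800' -> the same bits in big-endian order
```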
Sorry for the late response. I started to learn gguf and set up an environment on s390x during my holiday. Update to this issue: I tried to convert the baichuan2 model on s390x because sentencepiece supports s390x and big endian. The conversion succeeded, but I encountered the same issue. Now I will focus on the magic number.

$ ./main -m /aivol/cqy/Baichuan2-7B-Chat/ggml-model-f16.gguf -p "Write a song for my 20th working year" -n 400
Log start
main: build = 1299 (f5ef5cf)
main: built with cc (GCC) 10.2.1 20201112 (Red Hat 10.2.1-8) for s390x-redhat-linux
main: seed = 1696236188
gguf_init_from_file: invalid magic number 47475546
error loading model: llama_model_loader: failed to load model from /aivol/cqy/Baichuan2-7B-Chat/ggml-model-f16.gguf
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/aivol/cqy/Baichuan2-7B-Chat/ggml-model-f16.gguf'
main: error: unable to load model
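That "invalid magic number 47475546" is just the four bytes G, G, U, F read back in the wrong byte order. A quick check (the GGUF_MAGIC value here is an assumption based on the gguf.py of that era):

```
import struct

GGUF_MAGIC = 0x46554747  # assumed constant; it packs to b"GGUF" in little-endian order

header = b"GGUF"  # first four bytes of a file written on a little-endian machine
print(hex(struct.unpack("<I", header)[0]))  # 0x46554747 -> matches the expected magic
print(hex(struct.unpack(">I", header)[0]))  # 0x47475546 -> the value in the error above
```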
@KerfuffleV2 I made some progress by changing "<" to ">". But using the Baichuan2-7B-Chat model, the response is very strange. I got the same response with "Baichuan-7B" as well. I will try the numpy suggestion later. Thanks.

Command I used:

Output:
My updates to gguf.py:

--- a/gguf-py/gguf/gguf.py
+++ b/gguf-py/gguf/gguf.py
@@ -483,10 +483,10 @@ class GGUFWriter:
self.tensors = []
def write_header_to_file(self):
- self.fout.write(struct.pack("<I", GGUF_MAGIC))
- self.fout.write(struct.pack("<I", GGUF_VERSION))
- self.fout.write(struct.pack("<Q", self.ti_data_count))
- self.fout.write(struct.pack("<Q", self.kv_data_count))
+ self.fout.write(struct.pack(">I", GGUF_MAGIC))
+ self.fout.write(struct.pack(">I", GGUF_VERSION))
+ self.fout.write(struct.pack(">Q", self.ti_data_count))
+ self.fout.write(struct.pack(">Q", self.kv_data_count))
self.flush()
# print("tensors " + str(self.ti_data_count) + " kv " + str(self.kv_data_count))
@@ -559,16 +559,16 @@ class GGUFWriter:
self.add_val(val, GGUFValueType.ARRAY)
_simple_value_packing = {
- GGUFValueType.UINT8: "<B",
- GGUFValueType.INT8: "<b",
- GGUFValueType.UINT16: "<H",
- GGUFValueType.INT16: "<h",
- GGUFValueType.UINT32: "<I",
- GGUFValueType.INT32: "<i",
- GGUFValueType.FLOAT32: "<f",
- GGUFValueType.UINT64: "<Q",
- GGUFValueType.INT64: "<q",
- GGUFValueType.FLOAT64: "<d",
+ GGUFValueType.UINT8: ">B",
+ GGUFValueType.INT8: ">b",
+ GGUFValueType.UINT16: ">H",
+ GGUFValueType.INT16: ">h",
+ GGUFValueType.UINT32: ">I",
+ GGUFValueType.INT32: ">i",
+ GGUFValueType.FLOAT32: ">f",
+ GGUFValueType.UINT64: ">Q",
+ GGUFValueType.INT64: ">q",
+ GGUFValueType.FLOAT64: ">d",
GGUFValueType.BOOL: "?" ,
}
def add_val(self, val: Any, vtype: GGUFValueType | None = None, add_vtype: bool = True):
@@ -576,7 +576,7 @@ class GGUFWriter:
vtype = GGUFValueType.get_type(val)
if add_vtype:
- self.kv_data += struct.pack("<I", vtype)
+ self.kv_data += struct.pack(">I", vtype)
self.kv_data_count += 1
pack_fmt = self._simple_value_packing.get(vtype)
@@ -584,14 +584,14 @@ class GGUFWriter:
self.kv_data += struct.pack(pack_fmt, val)
elif vtype == GGUFValueType.STRING:
encoded_val = val.encode("utf8") if isinstance(val, str) else val
- self.kv_data += struct.pack("<Q", len(encoded_val))
+ self.kv_data += struct.pack(">Q", len(encoded_val))
self.kv_data += encoded_val
elif vtype == GGUFValueType.ARRAY and isinstance(val, Sequence) and len(val) > 0:
ltype = GGUFValueType.get_type(val[0])
if not all(GGUFValueType.get_type(i) is ltype for i in val[1:]):
raise ValueError("All items in a GGUF array should be of the same type")
- self.kv_data += struct.pack("<I", ltype)
- self.kv_data += struct.pack("<Q", len(val))
+ self.kv_data += struct.pack(">I", ltype)
+ self.kv_data += struct.pack(">Q", len(val))
for item in val:
self.add_val(item, add_vtype=False)
else:
@@ -605,18 +605,18 @@ class GGUFWriter:
assert raw_dtype is not None or tensor_dtype in (np.float32, np.float16), "Only F32 and F16 tensors are supported for now"
encoded_name = name.encode("utf8")
- self.ti_data += struct.pack("<Q", len(encoded_name))
+ self.ti_data += struct.pack(">Q", len(encoded_name))
self.ti_data += encoded_name
n_dims = len(tensor_shape)
- self.ti_data += struct.pack("<I", n_dims)
+ self.ti_data += struct.pack(">I", n_dims)
for i in range(n_dims):
- self.ti_data += struct.pack("<Q", tensor_shape[n_dims - 1 - i])
+ self.ti_data += struct.pack(">Q", tensor_shape[n_dims - 1 - i])
if raw_dtype is None:
dtype = GGMLQuantizationType.F32 if tensor_dtype == np.float32 else GGMLQuantizationType.F16
else:
dtype = raw_dtype
- self.ti_data += struct.pack("<I", dtype)
- self.ti_data += struct.pack("<Q", self.offset_tensor)
+ self.ti_data += struct.pack(">I", dtype)
+ self.ti_data += struct.pack(">Q", self.offset_tensor)
self.offset_tensor += GGUFWriter.ggml_pad(tensor_nbytes, self.data_alignment)
self.ti_data_count += 1
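As a side note, the same change can be made without touching every pack() call by choosing the byte-order prefix in one place. This is only a sketch of that idea; the PackHelper class below is hypothetical, not part of gguf.py:

```
import struct

class PackHelper:
    """Pick the byte-order prefix once instead of editing every struct.pack call."""

    def __init__(self, big_endian: bool):
        self.prefix = ">" if big_endian else "<"

    def pack(self, fmt: str, *values) -> bytes:
        return struct.pack(self.prefix + fmt, *values)

w = PackHelper(big_endian=True)
print(w.pack("I", 0x46554747).hex())  # '46554747' regardless of the host's endianness
```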
You should be very close! I think your issue is because the actual weights are still in little endian. Try this change in add_tensor:

def add_tensor(self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None, raw_dtype: GGMLQuantizationType | None = None):
    tensor.byteswap(inplace=True)
    if self.use_temp_file and self.temp_file is None:
        fp = tempfile.SpooledTemporaryFile(mode="w+b", max_size=256*1024*1024)

(The change is just to immediately do the byteswap on the tensor at the top of the function.)
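A standalone illustration of what that byteswap(inplace=True) call does to the raw bytes (the exact hex shown assumes a little-endian host):

```
import numpy as np

t = np.array([0.5, 1.0], dtype=np.float16)
print(t.tobytes().hex())   # '0038003c' -> little-endian fp16 bits of 0.5 and 1.0
t.byteswap(inplace=True)   # reverse the bytes of every element in place
print(t.tobytes().hex())   # '38003c00' -> the same bits, now big-endian on disk
```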
@KerfuffleV2 I tried it on s390x. I am converting the model on s390x. Is there a way to dump the model in memory to a file always in little-endian mode? Then I could compare the result on s390x and x86. My script:

def add_tensor(self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None, raw_dtype: GGMLQuantizationType | None = None):
    tensor.byteswap(inplace=True)
    if self.use_temp_file and self.temp_file is None:
        fp = tempfile.SpooledTemporaryFile(mode="w+b", max_size=256*1024*1024)
        fp.seek(0)
        self.temp_file = fp
    shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape
    self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype = raw_dtype)

I will go through the gguf python script again today.

```
....
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
Write a song for my 20th working year<h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/>...
```

(The response is just the <h4/> token repeated for the rest of the output.)

Conversion output:
Umm, unfortunately at this point I'm really not sure what the issue is. I pretty much just took a guess that fixing the actual tensor data was the problem, but you got an exactly identical result that way, apparently. One thing I should mention though: have you tested the same build and model on x86? The recent changes to decoding (#3228) broke Baichuan 13B for me (it just crashes instead of producing incorrect output). It's possible Baichuan 7B inference is broken in a different way, so it wouldn't work regardless of the endian stuff. I don't have a 7B model on hand to test with right now. It's also not impossible that the actual operations in GGML make assumptions that don't work on big endian. Unfortunately identifying/fixing that is past my level of expertise. Hopefully it's something simpler, but that problem is possible. (I guess it might also be possible that the endian fixing stuff has to occur before the unpacking/permuting step, although I'm not really too sure why it would matter.)
@KerfuffleV2 I tried Baichuan 7B on x86 several days ago. It worked well. I can verify it again. Will the padding-related functions affect the result? I will have a try. I think for little endian, padding zeros are appended after the data, but for big endian, they should be appended before the data. If the answer is yes, could you help me understand how to update the following code?

def write_padding(self, fp: BinaryIO, n: int, align: int | None = None):
    pad = GGUFWriter.ggml_pad(n, align if align is not None else self.data_alignment) - n
    if pad != 0:
        fp.write(bytes([0] * pad))

def write_tensor_data(self, tensor: np.ndarray[Any, Any]):
    self.write_padding(self.fout, self.fout.tell())
    tensor.tofile(self.fout)
    self.write_padding(self.fout, tensor.nbytes)

def write_tensors_to_file(self):
    self.write_ti_data_to_file()
    self.write_padding(self.fout, self.fout.tell())
    if self.temp_file is None:
        for (currtensor, currpad) in self.tensors:
            currtensor.tofile(self.fout)
            if currpad != 0:
                self.fout.write(bytes([0] * currpad))
        return

def add_tensor(self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None, raw_dtype: GGMLQuantizationType | None = None):
    tensor.byteswap(inplace=True)
    if self.use_temp_file and self.temp_file is None:
        fp = tempfile.SpooledTemporaryFile(mode="w+b", max_size=256*1024*1024)
        fp.seek(0)
        self.temp_file = fp
    shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape
    self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype = raw_dtype)
    pad = GGUFWriter.ggml_pad(tensor.nbytes, self.data_alignment) - tensor.nbytes
    if self.temp_file is None:
        self.tensors.append((tensor, pad))
        return
    tensor.tofile(self.temp_file)
    if pad != 0:
        self.temp_file.write(bytes([0] * pad))
Depends on what you mean by "several days". The pull I thought may have broken stuff got merged about 5 days ago.
The padding is already before the tensor data, but that wouldn't change based on endianness. It's just empty space to make sure the start of the data is aligned to a multiple of 32 bytes. (The current code is sort of confusing because it adds an extra padding chunk after the last tensor too, which is kind of useless. Not something that would cause problems though.)
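For reference, the alignment computation behind write_padding boils down to rounding a size up to the next multiple of the alignment; a minimal sketch mirroring what GGUFWriter.ggml_pad computes, assuming the default 32-byte alignment:

```
def ggml_pad(x: int, n: int) -> int:
    # round x up to the next multiple of n
    return ((x + n - 1) // n) * n

# a 100-byte tensor with 32-byte alignment gets 28 zero bytes written before it,
# so that its data starts on a 32-byte boundary
print(ggml_pad(100, 32) - 100)  # 28
```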
7B works well with the latest code on x86.
But it is broken on s390x.

@KerfuffleV2 thank you for the help. I will continue to use gdb to print the llama_model to understand the difference.
Further findings: after I compiled llama.cpp with the LLAMA_DEBUG option on, an assertion in ggml fails on s390x when running the ./main program, no matter whether under gdb or executing the command directly. But it is fine on x86. Command I use:
ggml code snippet around ggml.c line 12731 where the assertion fails: it seems some value from the source tensor is NaN. I guess this is why it crashed. @KerfuffleV2 What does it check?
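One reason byte-swapped weights tend to end up tripping NaN checks: a float32 whose bytes are in the wrong order almost never decodes to a sensible number, and the garbage then propagates through the math. A small illustration, not taken from the thread:

```
import numpy as np

vals = np.array([0.0688476563, 1.0, -2.5], dtype=np.float32)
print(vals)             # the intended weights
print(vals.byteswap())  # the same bytes reversed: tiny denormals and other garbage
```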
I also saw an error when using the gdb commands below.

GDB commands to run llama.cpp:

gdb ./main
gdb> break main.cpp:187
gdb> run -m /aivol/cqy/Baichuan-7B/ggml-model-f16.gguf -p "Steps to build a web site: first," -n 50

Error Output:
I think the difference is that asserts probably don't run when compiled normally with optimization, so calculations just produce weird values instead of failing outright. There are two possible causes I can think of: the first is that the tensor data is just incorrect in the actual model file; the other is that the way the GGML operations are implemented just doesn't work on big endian for some reason. Hmm... I don't know how much time you want to put into trying various random stuff, but one thing to try would be just printing out some fairly small tensor. Maybe that could even be done from the conversion script. You could possibly try something like:

def add_tensor(self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None, raw_dtype: GGMLQuantizationType | None = None):
    if name == 'blk.0.attn_norm.weight':
        print('BEFORE', tensor)
    tensor.byteswap(inplace=True)
    if name == 'blk.0.attn_norm.weight':
        print('AFTER', tensor)

That just dumps the tensor before and after the byteswap stuff. If byteswapping was needed, I'd expect to see a bunch of crazy values in the "before" and more normal values in the "after". The other place to do something similar would be in the loading code.
I finally fixed it. @KerfuffleV2 Many thanks for the help. Along the journey I did a lot of wrong things, but got to the right direction in the end. I was using convert.py instead of convert-baichuan-hf-to-gguf.py. What are the differences between them? That script does not call add_tensor, but does the same thing in its write_all function.
The result:
@KerfuffleV2 It seems there is an endian conversion issue. I printed the first 16 words of the data member of model layer 0's attn_norm with gdb. The first float32 is 0x3d8d0000, with value 0.0688476563, on x86. But it is 0x8d3d0000 on s390, while I would have expected the big-endian representation to be 0x00008d3d there.

x86:

(gdb) x/16xw model.layers[0].attn_norm.data
0x7ffce5df99a0: 0x3d8d0000 0x3d250000 0x3de70000 0x3d630000
0x7ffce5df99b0: 0x3d3b0000 0x3d590000 0x3d3b0000 0x3d120000
0x7ffce5df99c0: 0x3d460000 0x3d4d0000 0x3d380000 0x3d270000
0x7ffce5df99d0: 0x3d3d0000 0x3dae0000 0x3d310000 0x3d040000

s390:

(gdb) x/16xw model.layers[0].attn_norm.data
0x3fcec14b9a0: 0x8d3d0000 0x253d0000 0xe73d0000 0x633d0000
0x3fcec14b9b0: 0x3b3d0000 0x593d0000 0x3b3d0000 0x123d0000
0x3fcec14b9c0: 0x463d0000 0x4d3d0000 0x383d0000 0x273d0000
0x3fcec14b9d0: 0x3d3d0000 0xae3d0000 0x313d0000 0x043d0000
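A quick way to double-check what those words mean, runnable on any machine; the hex strings below are the byte sequences implied by the gdb dumps (x/xw prints the word value in the target's byte order):

```
import struct

# x86 prints the word 0x3d8d0000, i.e. the bytes 00 00 8d 3d in little-endian memory:
print(struct.unpack("<f", bytes.fromhex("00008d3d"))[0])  # 0.06884765625

# a correct big-endian copy of the same value would hold the bytes 3d 8d 00 00:
print(struct.unpack(">f", bytes.fromhex("3d8d0000"))[0])  # 0.06884765625

# the bytes actually observed on s390x (8d 3d 00 00) decode to garbage there:
print(struct.unpack(">f", bytes.fromhex("8d3d0000"))[0])  # about -5.8e-31
```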
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Allow llama.cpp to be executed on the s390x architecture.
I am curious whether there is a big-endian/little-endian issue with the gguf model. My system is big endian.
BTW, if you can point me to how to add support for new sets of SIMD instructions, I can try to add s390x SIMD instruction support myself.
Thank you.
Current Behavior
I can compile this program on s390x by commenting out k_quants.c line 50:
#if !defined(__riscv)
//#include <immintrin.h>
#endif
And I can execute ./main -h
But if I execute it with a real model, then I get an invalid magic number.
Is there an endianness issue?
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
$ lscpu
$ uname -a
Linux 4.18.0-305.el8.s390x #1 SMP Thu Apr 29 09:06:01 EDT 2021 s390x s390x s390x GNU/Linux
Python 3.9.2
GNU Make 4.2.1
Built for s390x-ibm-linux-gnu
g++ (GCC) 10.2.1 20201112 (Red Hat 10.2.1-8)