[User] Failed to execute any models on s390x #3298

Closed
chenqiny opened this issue Sep 21, 2023 · 18 comments · Fixed by #3552

Comments

@chenqiny
Contributor

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Allow llama.cpp to execute on the s390x architecture.

I am curious whether there is a big-endian/little-endian issue with the GGUF model format. My system is big-endian.

By the way, if you can point me to how to add support for a new set of SIMD instructions, I can try to add s390x SIMD support myself.

Thank you.

Current Behavior

I can compile this program on s390x by commenting out the include at k_quants.c line 50:
#if !defined(__riscv)
//#include <immintrin.h>
#endif

And I can execute ./main -h

But if I execute it with a real model, I get an invalid magic number.
Is there an endianness issue?

[root@aiu llama.cpp]# ./main -m ./models/ggml-vocab-llama.gguf
Log start
main: build = 1265 (324f340)
main: built with cc (GCC) 10.2.1 20201112 (Red Hat 10.2.1-8) for s390x-redhat-linux
main: seed  = 1695309361
gguf_init_from_file: invalid magic number 47475546
error loading model: llama_model_loader: failed to load model from ./models/ggml-vocab-llama.gguf

llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/ggml-vocab-llama.gguf'
main: error: unable to load model
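
For reference, that value looks like the four GGUF magic bytes simply read with the opposite byte order; a quick standalone Python check (not llama.cpp code) seems consistent with that:

    import struct

    magic = b"GGUF"                             # the four bytes at the start of a GGUF file
    print(hex(struct.unpack("<I", magic)[0]))   # 0x46554747 -- what a little-endian read gives
    print(hex(struct.unpack(">I", magic)[0]))   # 0x47475546 -- what a big-endian read gives (the value in the error)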

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu

Architecture:        s390x
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Big Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  8
Socket(s) per book:  2
Book(s) per drawer:  4
Drawer(s):           4
NUMA node(s):        1
Vendor ID:           IBM/S390
Machine type:        3931
CPU dynamic MHz:     5200
CPU static MHz:      5200
BogoMIPS:            3331.00
Hypervisor:          PR/SM
Hypervisor vendor:   IBM
Virtualization type: full
Dispatching mode:    horizontal
L1d cache:           128K
L1i cache:           128K
L2 cache:            32768K
L3 cache:            262144K
NUMA node0 CPU(s):   0-3
Flags:               esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx vxd vxe gs vxe2 vxp sort dflt sie
  • Operating System, e.g. for Linux:

$ uname -a

Linux 4.18.0-305.el8.s390x #1 SMP Thu Apr 29 09:06:01 EDT 2021 s390x s390x s390x GNU/Linux

  • SDK version, e.g. for Linux:
$ python3 --version
$ make --version
$ g++ --version

Python 3.9.2
GNU Make 4.2.1
Built for s390x-ibm-linux-gnu
g++ (GCC) 10.2.1 20201112 (Red Hat 10.2.1-8)

@staviq
Contributor

staviq commented Sep 21, 2023

Cool, people have IBM mainframes at home now :)

The model magic is reversed, so that is indeed an endianness problem. This is either going to be really easy, or particularly painful, to solve.

You could try downloading a raw model and converting/quantizing it directly on that particular machine.

@KerfuffleV2 Do you happen to know if the convert scripts are endian-aware? Supposedly the pytorch library is, so in case convert can't do that, maybe "reconverting" the pytorch model would solve this? What do you think?

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 21, 2023

Do you happen to know if the convert scripts are endian-aware?

Pretty sure they are: they use numpy and Python's struct with the correct format for little-endian. (Going by memory here.)

What I'd wonder about is actually loading the model. llama.cpp just mmaps the file, which is going to be full of little-endian data. I don't think there's any conversion or special handling to convert to big endian after loading, so... I'd expect that won't work so great, even if you could get past the file magic being wrong.

You could try just hex-editing it to what is expected (or hacking the code to ignore the error and proceed). I doubt it'll work though; I think all the quantizations use at least f16s, which will be wonky on big endian. Not 100% sure though.

@staviq
Contributor

staviq commented Sep 21, 2023

Basically, I found this: pytorch/pytorch#65300

In the comments they explain that pytorch can open models created on a machine with different endianness, which made me think: perhaps using pytorch on the target machine to load a model and write it out again to another file would "fix" the endianness, but then the pytorch model would have to be converted to gguf? I found #707 mentioning use of the old convert script for pytorch to ggml, which could then be converted to gguf.

As long as the model is in the correct endianness for the host, it shouldn't matter which endianness the host uses; all file and memory reads/writes will be symmetric and compatible.

Even if code uses 0xNNNN literals, like for the model magic, as long as the binary writing that value has the same endianness as the one reading it, the value will be correct during execution.

The only thing that could break is union-like downcasting, but I have no idea if it's used anywhere.

Edit: Ok, this will be a problem:

llama.cpp/ggml.c

Lines 374 to 390 in 36b904e

static inline float fp32_from_bits(uint32_t w) {
    union {
        uint32_t as_bits;
        float    as_value;
    } fp32;
    fp32.as_bits = w;
    return fp32.as_value;
}

static inline uint32_t fp32_to_bits(float f) {
    union {
        float    as_value;
        uint32_t as_bits;
    } fp32;
    fp32.as_value = f;
    return fp32.as_bits;
}

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 21, 2023

but then the pytorch model would have to be converted to gguf?

Unfortunately, that still wouldn't help you. The gguf Python code explicitly saves everything as little-endian, so things like the lengths of items, etc. are all going to be little-endian.

I think probably the easiest "fix", if you can convert the file yourself, is to edit gguf-py/gguf/gguf.py and change all the struct formats to use either native (=) or big-endian (>). Also, don't use convert.py to quantize to q8_0, or else change the struct formats there as well.

gguf-py/gguf/gguf.py
486:        self.fout.write(struct.pack("<I", GGUF_MAGIC))
487:        self.fout.write(struct.pack("<I", GGUF_VERSION))
488:        self.fout.write(struct.pack("<Q", self.ti_data_count))
489:        self.fout.write(struct.pack("<Q", self.kv_data_count))
562:        GGUFValueType.UINT8:   "<B",
563:        GGUFValueType.INT8:    "<b",
564:        GGUFValueType.UINT16:  "<H",
565:        GGUFValueType.INT16:   "<h",
566:        GGUFValueType.UINT32:  "<I",
567:        GGUFValueType.INT32:   "<i",
568:        GGUFValueType.FLOAT32: "<f",
569:        GGUFValueType.UINT64:  "<Q",
570:        GGUFValueType.INT64:   "<q",
571:        GGUFValueType.FLOAT64: "<d",
579:            self.kv_data += struct.pack("<I", vtype)
587:            self.kv_data += struct.pack("<Q", len(encoded_val))
593:            self.kv_data += struct.pack("<I", ltype)
594:            self.kv_data += struct.pack("<Q", len(val))
608:        self.ti_data += struct.pack("<Q", len(encoded_name))
611:        self.ti_data += struct.pack("<I", n_dims)
613:            self.ti_data += struct.pack("<Q", tensor_shape[n_dims - 1 - i])
618:        self.ti_data += struct.pack("<I", dtype)
619:        self.ti_data += struct.pack("<Q", self.offset_tensor)

Not too hard to replace, since they all start with "<".

The current situation undoubtedly isn't ideal though, this is just a hack to get it working by any means.
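
If the struct format characters are unfamiliar, this is roughly what "<", ">", and "=" control (a standalone Python example, not llama.cpp code):

    import struct

    value = 7
    print(struct.pack("<I", value).hex())  # '07000000' -- little-endian, what gguf.py currently writes
    print(struct.pack(">I", value).hex())  # '00000007' -- big-endian, what a big-endian reader expects
    print(struct.pack("=I", value).hex())  # native byte order of the machine running the script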

@staviq
Contributor

staviq commented Sep 21, 2023

To be honest, I really wanted this to work :) I'm absolutely in love with the concept of putting an LLM on museum-grade hardware. I've been hunting for a green phosphor CRT on eBay for months for that exact reason :) Fallout-style terminal and stuff :)

@KerfuffleV2
Collaborator

To be honest, I really wanted this to work :)

I think what I suggested should get you there as far as metadata-type stuff goes. When loading with Torch, it should deal with endian stuff properly since I assume that's in the pickle format. BUT convert.py doesn't actually use Torch so there will probably have to be something to convert the data in the numpy array to big endian before anything else like permuting occurs.

Writing it out just uses numpy's tofile thing: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.tofile.html - it doesn't care about endianness.

I think only the conversion scripts need to change to make this work and the conversion doesn't even have to happen on the big endian machine. So to get it working just:

  1. Make the struct format changes I suggested above.
  2. In convert.py, just use numpy functions to convert to big endian. (It should be really simple, especially for f16, which is just a byte swap; see the sketch below.)
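
Roughly something like this for step 2 (an untested sketch of the numpy part, not the actual convert.py change):

    import numpy as np

    # pretend this is an f16 tensor produced by the convert script
    tensor = np.arange(8, dtype=np.float16)

    # swap the underlying bytes in place; the dtype stays the same,
    # so a later tensor.tofile(...) writes big-endian data
    tensor.byteswap(inplace=True)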

I'm actually not really sure what the best way to handle this outside of temporary hacks is though. I guess add support in the C code for loading LE GGUF files on BE machines, at least enough to parse the GGUF metadata stuff. Could also add an endianness flag to the metadata and prevent running GGUF files that are the wrong endianness.

It wouldn't be impossible to convert the actual tensor data, but it would be a pain because the quantized formats have individual bytes interspersed with things like f16s. So conversion would have to happen with something that's aware of all the quantization formats. Even if you could dynamically convert it, you couldn't mmap the model.

@chenqiny
Contributor Author

chenqiny commented Oct 2, 2023

Sorry for the late response. I started to learn GGUF and set up the environment on s390x during my holiday.

Update to this issue:

I tried to convert the Baichuan2 model on s390x, because sentencepiece supports s390x and big endian. The conversion succeeded.

But I encountered the same issue.

Now I will focus on the magic number.

$ ./main -m /aivol/cqy/Baichuan2-7B-Chat/ggml-model-f16.gguf -p "Write a song for my 20th working year" -n 400
Log start
main: build = 1299 (f5ef5cf)
main: built with cc (GCC) 10.2.1 20201112 (Red Hat 10.2.1-8) for s390x-redhat-linux
main: seed  = 1696236188
gguf_init_from_file: invalid magic number 47475546
error loading model: llama_model_loader: failed to load model from /aivol/cqy/Baichuan2-7B-Chat/ggml-model-f16.gguf

llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/aivol/cqy/Baichuan2-7B-Chat/ggml-model-f16.gguf'
main: error: unable to load model

@chenqiny
Contributor Author

chenqiny commented Oct 2, 2023

but then the pytorch model would have to be converted to gguf?

Unfortunately, that still wouldn't help you. The gguf Python code explicitly saves everything as little-endian, so things like the lengths of items, etc. are all going to be little-endian.

I think probably the easiest "fix", if you can convert the file yourself, is to edit gguf-py/gguf/gguf.py and change all the struct formats to use either native (=) or big-endian (>). Also, don't use convert.py to quantize to q8_0, or else change the struct formats there as well.


@KerfuffleV2 I made some progress by changing "<" to ">", but with the Baichuan2-7B-Chat model the response is very strange. I got the same response with "Baichuan-7B" as well. I will try the numpy suggestion later. Thanks.

Command I used: ./main -m /aivol/cqy/Baichuan-7B/ggml-model-f16.gguf -p "Write a song for my 20th working year" -n 400

Output:

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Write a song for my 20th working year<h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/>

My updates to gguf.py

--- a/gguf-py/gguf/gguf.py
+++ b/gguf-py/gguf/gguf.py
@@ -483,10 +483,10 @@ class GGUFWriter:
         self.tensors = []

     def write_header_to_file(self):
-        self.fout.write(struct.pack("<I", GGUF_MAGIC))
-        self.fout.write(struct.pack("<I", GGUF_VERSION))
-        self.fout.write(struct.pack("<Q", self.ti_data_count))
-        self.fout.write(struct.pack("<Q", self.kv_data_count))
+        self.fout.write(struct.pack(">I", GGUF_MAGIC))
+        self.fout.write(struct.pack(">I", GGUF_VERSION))
+        self.fout.write(struct.pack(">Q", self.ti_data_count))
+        self.fout.write(struct.pack(">Q", self.kv_data_count))
         self.flush()
 #        print("tensors " + str(self.ti_data_count) + " kv " + str(self.kv_data_count))

@@ -559,16 +559,16 @@ class GGUFWriter:
         self.add_val(val, GGUFValueType.ARRAY)

     _simple_value_packing = {
-        GGUFValueType.UINT8:   "<B",
-        GGUFValueType.INT8:    "<b",
-        GGUFValueType.UINT16:  "<H",
-        GGUFValueType.INT16:   "<h",
-        GGUFValueType.UINT32:  "<I",
-        GGUFValueType.INT32:   "<i",
-        GGUFValueType.FLOAT32: "<f",
-        GGUFValueType.UINT64:  "<Q",
-        GGUFValueType.INT64:   "<q",
-        GGUFValueType.FLOAT64: "<d",
+        GGUFValueType.UINT8:   ">B",
+        GGUFValueType.INT8:    ">b",
+        GGUFValueType.UINT16:  ">H",
+        GGUFValueType.INT16:   ">h",
+        GGUFValueType.UINT32:  ">I",
+        GGUFValueType.INT32:   ">i",
+        GGUFValueType.FLOAT32: ">f",
+        GGUFValueType.UINT64:  ">Q",
+        GGUFValueType.INT64:   ">q",
+        GGUFValueType.FLOAT64: ">d",
         GGUFValueType.BOOL:    "?" ,
     }
     def add_val(self, val: Any, vtype: GGUFValueType | None = None, add_vtype: bool = True):
@@ -576,7 +576,7 @@ class GGUFWriter:
             vtype = GGUFValueType.get_type(val)

         if add_vtype:
-            self.kv_data += struct.pack("<I", vtype)
+            self.kv_data += struct.pack(">I", vtype)
             self.kv_data_count += 1

         pack_fmt = self._simple_value_packing.get(vtype)
@@ -584,14 +584,14 @@ class GGUFWriter:
             self.kv_data += struct.pack(pack_fmt, val)
         elif vtype == GGUFValueType.STRING:
             encoded_val = val.encode("utf8") if isinstance(val, str) else val
-            self.kv_data += struct.pack("<Q", len(encoded_val))
+            self.kv_data += struct.pack(">Q", len(encoded_val))
             self.kv_data += encoded_val
         elif vtype == GGUFValueType.ARRAY and isinstance(val, Sequence) and len(val) > 0:
             ltype = GGUFValueType.get_type(val[0])
             if not all(GGUFValueType.get_type(i) is ltype for i in val[1:]):
                 raise ValueError("All items in a GGUF array should be of the same type")
-            self.kv_data += struct.pack("<I", ltype)
-            self.kv_data += struct.pack("<Q", len(val))
+            self.kv_data += struct.pack(">I", ltype)
+            self.kv_data += struct.pack(">Q", len(val))
             for item in val:
                 self.add_val(item, add_vtype=False)
         else:
@@ -605,18 +605,18 @@ class GGUFWriter:
         assert raw_dtype is not None or tensor_dtype in (np.float32, np.float16), "Only F32 and F16 tensors are supported for now"

         encoded_name = name.encode("utf8")
-        self.ti_data += struct.pack("<Q", len(encoded_name))
+        self.ti_data += struct.pack(">Q", len(encoded_name))
         self.ti_data += encoded_name
         n_dims = len(tensor_shape)
-        self.ti_data += struct.pack("<I", n_dims)
+        self.ti_data += struct.pack(">I", n_dims)
         for i in range(n_dims):
-            self.ti_data += struct.pack("<Q", tensor_shape[n_dims - 1 - i])
+            self.ti_data += struct.pack(">Q", tensor_shape[n_dims - 1 - i])
         if raw_dtype is None:
             dtype = GGMLQuantizationType.F32 if tensor_dtype == np.float32 else GGMLQuantizationType.F16
         else:
             dtype = raw_dtype
-        self.ti_data += struct.pack("<I", dtype)
-        self.ti_data += struct.pack("<Q", self.offset_tensor)
+        self.ti_data += struct.pack(">I", dtype)
+        self.ti_data += struct.pack(">Q", self.offset_tensor)
         self.offset_tensor += GGUFWriter.ggml_pad(tensor_nbytes, self.data_alignment)
         self.ti_data_count += 1

Full Output

[cqy@aiu llama.cpp]$ ./main -m /aivol/cqy/Baichuan2-7B-Chat/ggml-model-f16.gguf -p "Write a song for my 20th working year" -n 400
Log start
main: build = 1299 (f5ef5cf)
main: built with cc (GCC) 10.2.1 20201112 (Red Hat 10.2.1-8) for s390x-redhat-linux
main: seed  = 1696261001
llama_model_loader: loaded meta data with 18 key-value pairs and 291 tensors from /aivol/cqy/Baichuan2-7B-Chat/ggml-model-f16.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  4096, 125696,     1,     1 ]
llama_model_loader: - tensor    1:         blk.0.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    7:         blk.1.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    8:            blk.1.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    9:            blk.1.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   11:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   12:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   13:         blk.2.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   14:            blk.2.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   15:            blk.2.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   16:              blk.2.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   17:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   18:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   19:         blk.3.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   20:            blk.3.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   21:            blk.3.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   22:              blk.3.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
...
llama_model_loader: - tensor  249:             blk.18.attn_q.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor  290:             blk.31.attn_v.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                          general.file_type u32
llama_model_loader: - kv  11:                       tokenizer.ggml.model str
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  226 tensors
llm_load_print_meta: format           = GGUF V2 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 125696
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly F16
llm_load_print_meta: model params     = 7.51 B
llm_load_print_meta: model size       = 13.98 GiB (16.00 BPW)
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: mem required  = 14317.11 MB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size = 259.38 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

@KerfuffleV2
Collaborator

You should be very close! I think your issue is because the actual weights are still in little endian. Try this change in gguf.py:

    def add_tensor(self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None, raw_dtype: GGMLQuantizationType | None = None):
        tensor.byteswap(inplace=True)
        if self.use_temp_file and self.temp_file is None:
            fp = tempfile.SpooledTemporaryFile(mode="w+b", max_size=256*1024*1024)

(The change is just to immediately do the byteswap at the start of the function.)

@chenqiny
Contributor Author

chenqiny commented Oct 3, 2023

You should be very close! I think your issue is because the actual weights are still in little endian. Try this change in gguf.py:

    def add_tensor(self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None, raw_dtype: GGMLQuantizationType | None = None):
        tensor.byteswap(inplace=True)
        if self.use_temp_file and self.temp_file is None:
            fp = tempfile.SpooledTemporaryFile(mode="w+b", max_size=256*1024*1024)

(The change is just to immediately do the byteswap at the start of the function.)

@KerfuffleV2 I tried it on s390x. I am converting the model on s390x, but the result remains the same. Is there a way to debug and check which part is wrong?

Is there a way to dump the model in memory to a file, always in little-endian mode? Then I could compare the results on s390x and x86.
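
One possible approach (an untested sketch; dump_little_endian is just a made-up helper name) would be to force a fixed byte order when writing:

    import numpy as np

    def dump_little_endian(arr: np.ndarray, path: str) -> None:
        # convert to an explicitly little-endian dtype, then write the raw bytes,
        # so dumps produced on s390x and x86 can be compared byte-for-byte
        arr.astype(arr.dtype.newbyteorder('<')).tofile(path)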

My script:

    def add_tensor(self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None, raw_dtype: GGMLQuantizationType | None = None):
        tensor.byteswap(inplace=True)
        if self.use_temp_file and self.temp_file is None:
            fp = tempfile.SpooledTemporaryFile(mode="w+b", max_size=256*1024*1024)
            fp.seek(0)
            self.temp_file = fp

        shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape
        self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype = raw_dtype)

I will go through the gguf Python script again today.

[cqy@aiu llama.cpp]$ ./main -m /aivol/cqy/Baichuan2-7B-Chat/ggml-model-f16.gguf -p "Write a song for my 20th working year" -n 400
Log start
main: build = 1299 (f5ef5cf)
main: built with cc (GCC) 10.2.1 20201112 (Red Hat 10.2.1-8) for s390x-redhat-linux
main: seed  = 1696313018
llama_model_loader: loaded meta data with 18 key-value pairs and 291 tensors from /aivol/cqy/Baichuan2-7B-Chat/ggml-model-f16.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  4096, 125696,     1,     1 ]
llama_model_loader: - tensor    1:         blk.0.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    7:         blk.1.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    8:            blk.1.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    9:            blk.1.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   10:              blk.1.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   11:           blk.1.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   12:            blk.1.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   13:         blk.2.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   14:            blk.2.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   15:            blk.2.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   16:              blk.2.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   17:           blk.2.attn_norm.weight f32      [  4096,     1,     1,     1 ]

....
llama_model_loader: - tensor   18:            blk.2.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   19:         blk.3.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   20:            blk.3.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   21:            blk.3.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor   22:              blk.3.ffn_up.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   23:           blk.3.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   24:            blk.3.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   25:         blk.4.attn_output.weight f16      [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor   26:            blk.4.ffn_gate.weight f16      [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor   27:            blk.4.ffn_down.weight f16      [ 11008,  4096,     1,     1 ]
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly F16
llm_load_print_meta: model params     = 7.51 B
llm_load_print_meta: model size       = 13.98 GiB (16.00 BPW)
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: mem required  = 14317.11 MB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  256.00 MB
llama_new_context_with_model: compute buffer total size = 259.38 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0

Write a song for my 20th working year<h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/>

Conversion output:


[cqy@aiu llama.cpp]$ python3.9 convert.py /aivol/cqy/Baichuan2-7B-Chat/
Loading model file /aivol/cqy/Baichuan2-7B-Chat/pytorch_model.bin
params = Params(n_vocab=125696, n_embd=4096, n_layer=32, n_ctx=4096, n_ff=11008, n_head=32, n_head_kv=32, f_norm_eps=1e-06, f_rope_freq_base=None, f_rope_scale=None, ftype=None, path_model=PosixPath('/aivol/cqy/Baichuan2-7B-Chat'))
Loading vocab file '/aivol/cqy/Baichuan2-7B-Chat/tokenizer.model', type 'spm'
normalizer.cc(51) LOG(INFO) precompiled_charsmap is empty. use identity normalization.
Unpacking and permuting layer 0
Unpacking and permuting layer 1
Unpacking and permuting layer 2
Unpacking and permuting layer 3
Unpacking and permuting layer 4
Unpacking and permuting layer 5
Unpacking and permuting layer 6
Unpacking and permuting layer 7
Unpacking and permuting layer 8
Unpacking and permuting layer 9
Unpacking and permuting layer 10
Unpacking and permuting layer 11
Unpacking and permuting layer 12
Unpacking and permuting layer 13
Unpacking and permuting layer 14
Unpacking and permuting layer 15
Unpacking and permuting layer 16
Unpacking and permuting layer 17
Unpacking and permuting layer 18

@KerfuffleV2
Collaborator

KerfuffleV2 commented Oct 3, 2023

Umm, unfortunately at this point I'm really not sure what the issue is. I pretty much just took a guess that fixing the actual tensor data was the problem, but you got an exactly identical result that way apparently.

One thing I should mention though: Have you tested that same build and model on x86? The recent changes to decoding (#3228) broke Baichuan 13B for me (it just crashes instead of producing incorrect output). It's possible Baichuan 7B inference is broken in a different way, so it wouldn't work regardless of the endian stuff. I don't have a 7B model on hand to test with right now.

It's also not impossible that the actual operations in GGML make assumptions that don't work on big endian. Unfortunately identifying/fixing that is past my level of expertise. Hopefully it's something simpler, but that problem is possible. (I guess it might also be possible that the endian fixing stuff has to occur before the unpacking/permuting step, although I'm not really too sure why it would matter.)

@chenqiny
Contributor Author

chenqiny commented Oct 3, 2023

Umm, unfortunately at this point I'm really not sure what the issue is. I pretty much just took a guess that fixing the actual tensor data was the problem, but you got an exactly identical result that way apparently.

One thing I should mention though: Have you tested that same build and model on x86? The recent changes to decoding (#3228) broke Baichuan 13B for me (it just crashes instead of producing incorrect output). It's possible Baichuan 7B inference is broken in a different way, so it wouldn't work regardless of the endian stuff. I don't have a 7B model on hand to test with right now.

It's also not impossible that the actual operations in GGML make assumptions that don't work on big endian. Unfortunately identifying/fixing that is past my level of expertise. Hopefully it's something simpler, but that problem is possible. (I guess it might also be possible that the endian fixing stuff has to occur before the unpacking/permuting step, although I'm not really too sure why it would matter.)

@KerfuffleV2 I tried Baichuan 7B on x86 several days ago. It worked well. I can verify it again.

Will the padding-related functions affect the result? I will have a try. I think for little endian, padding zeros are appended after the data, but for big endian, they should be appended before the data. If the answer is yes, could you help me understand how to update the following code?
 

    def write_padding(self, fp: BinaryIO, n: int, align: int | None = None):
        pad = GGUFWriter.ggml_pad(n, align if align is not None else self.data_alignment) - n
        if pad != 0:
            fp.write(bytes([0] * pad))

    def write_tensor_data(self, tensor: np.ndarray[Any, Any]):
        self.write_padding(self.fout, self.fout.tell())
        tensor.tofile(self.fout)
        self.write_padding(self.fout, tensor.nbytes)

    def write_tensors_to_file(self):
        self.write_ti_data_to_file()

        self.write_padding(self.fout, self.fout.tell())

        if self.temp_file is None:
            for (currtensor, currpad) in self.tensors:
                currtensor.tofile(self.fout)
                if currpad != 0:
                    self.fout.write(bytes([0] * currpad))
            return
    
    def add_tensor(self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None, raw_dtype: GGMLQuantizationType | None = None):
        tensor.byteswap(inplace=True)
        if self.use_temp_file and self.temp_file is None:
            fp = tempfile.SpooledTemporaryFile(mode="w+b", max_size=256*1024*1024)
            fp.seek(0)
            self.temp_file = fp

        shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape
        self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype = raw_dtype)

        pad = GGUFWriter.ggml_pad(tensor.nbytes, self.data_alignment) - tensor.nbytes

        if  self.temp_file is None:
            self.tensors.append((tensor, pad))
            return

        tensor.tofile(self.temp_file)

        if pad != 0:
            self.temp_file.write(bytes([0] * pad))

@KerfuffleV2
Collaborator

I tried Baichuan 7B on x86 several days ago.

Depends on what you mean by "several days". The pull I thought may have broken stuff got merged about 5 days ago.

I think for little endian, padding zeros are appended after the data, but for big endian, they should be appended before the data.

It's already before the tensor data, but that wouldn't change based on endianness. It's just empty space to make sure the start of the data is aligned to a multiple of 32 bytes. (The current code is sort of confusing because it adds an extra padding chunk after the last tensor too, which is kind of useless. Not something that would cause problems though.)
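
For reference, the padding amount is just the usual round-up-to-alignment calculation; this is roughly what GGUFWriter.ggml_pad computes, as far as I can tell:

    # a sketch of the rounding GGUFWriter.ggml_pad appears to do (assuming the default 32-byte alignment)
    def ggml_pad(n: int, align: int = 32) -> int:
        # round n up to the next multiple of align
        return ((n + align - 1) // align) * align

    # e.g. a 1234-byte tensor would get 14 zero bytes of padding before the next aligned offset
    print(ggml_pad(1234) - 1234)  # 14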

@chenqiny
Contributor Author

chenqiny commented Oct 3, 2023

I tried Baichuan 7B on x86 several days ago.

Depends on what you mean by "several days". The pull I thought may have broken stuff got merged about 5 days ago.

I think for little endian, padding zeros are appended after the data, but for big endian, they should be appended before the data.

It's already before the tensor data, but that wouldn't change based on endianness. It's just empty space to make sure the start of the data is aligned to a multiple of 32 bytes. (The current code is sort of confusing because it adds an extra padding chunk after the last tensor too, which is kind of useless. Not something that would cause problems though.)

7B works well with the latest code on x86.
 

 Steps to build a web site: first, you need an idea of what kind. [end of text]

But it is broken on s390x.

Steps to build a web site: first,<h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/><h4/>

@chenqiny
Contributor Author

chenqiny commented Oct 3, 2023

@KerfuffleV2 thank you for the help. I will continue to use gdb to print the llama_model to understand the difference.

@chenqiny
Contributor Author

chenqiny commented Oct 3, 2023

Further findings:

After I compiled llama.cpp with the LLAMA_DEBUG option on, the assertion in ggml fails on s390x when running the ./main program, whether under gdb or running the command directly. But it is fine on x86.

Command I used:

 make -j8 LLAMA_DEBUG=1

ggml code snippet around ggml.c line 12731, where the assertion fails:

It seems some values from the source tensor are NaN. I guess this is why it crashed.

@KerfuffleV2  What does it check? 

static void ggml_compute_forward_soft_max_f32(
        const struct ggml_compute_params * params,
        const struct ggml_tensor * src0,
        struct ggml_tensor * dst) {
    GGML_ASSERT(ggml_is_contiguous(src0));
    GGML_ASSERT(ggml_is_contiguous(dst));
    GGML_ASSERT(ggml_are_same_shape(src0, dst));

    if (params->type == GGML_TASK_INIT || params->type == GGML_TASK_FINALIZE) {
        return;
    }

    // TODO: handle transposed/permuted matrices

    const int ith = params->ith;
    const int nth = params->nth;

    const int nc = src0->ne[0];
    const int nr = ggml_nrows(src0);

    // rows per thread
    const int dr = (nr + nth - 1)/nth;

    // row range for this thread
    const int ir0 = dr*ith;
    const int ir1 = MIN(ir0 + dr, nr);

    for (int i1 = ir0; i1 < ir1; i1++) {
        float *sp = (float *)((char *) src0->data + i1*src0->nb[1]);
        float *dp = (float *)((char *)  dst->data +  i1*dst->nb[1]);

#ifndef NDEBUG
        for (int i = 0; i < nc; ++i) {
            //printf("p[%d] = %f\n", i, p[i]);
assertion failure-->            assert(!isnan(sp[i]));
        }
#endif

        float max = -INFINITY;
        ggml_vec_max_f32(nc, &max, sp);
        

I also saw an error when I used the gdb command print *model to show the model variable.

mapping = Python Exception <class 'ValueError'> Unsupported implementation for unique_ptr: std::__uniq_ptr_data<llama_mmap, std::default_delete<llama_mmap>, true, true>: 
{_M_t = {<std::__uniq_ptr_impl<llama_mmap, std::default_delete<llama_mmap> >> = {_M_t = std::tuple containing = {[1] = 0x18f15b0, [2] = {<std::default_delete<llama_mmap>> = {<No data fields>}, <No data fields>}}}, <No data fields>}},

GDB command to run llama.cpp

gdb ./main
gdb> break main.cpp:187
gdb> run -m /aivol/cqy/Baichuan-7B/ggml-model-f16.gguf -p "Steps to build a web site: first," -n 50

Error Output:

llama_new_context_with_model: compute buffer total size = 138.88 MB
[New Thread 0x3fca1d1c910 (LWP 65075)]
[New Thread 0x3fca151b910 (LWP 65076)]
[New Thread 0x3fca0d1a910 (LWP 65077)]
main: ggml.c:12731: ggml_compute_forward_soft_max_f32: Assertion `!isnan(sp[i])' failed.
main: ggml.c:12731: ggml_compute_forward_soft_max_f32: Assertion `!isnan(sp[i])' failed.
main: ggml.c:12731: ggml_compute_forward_soft_max_f32: Assertion `!isnan(sp[i])' failed.
main: ggml.c:12731: ggml_compute_forward_soft_max_f32: Assertion `!isnan(sp[i])' failed.

Thread 1 "main" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50        return ret;
(gdb) backtrace
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x000003fffd923308 in __GI_abort () at abort.c:79
#2  0x000003fffd93c408 in __assert_fail_base (fmt=0x3fffda63df6 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x1174f94 "!isnan(sp[i])", file=file@entry=0x1172b30 "ggml.c", line=line@entry=12731,
    function=function@entry=0x1177570 <__PRETTY_FUNCTION__.31> "ggml_compute_forward_soft_max_f32") at assert.c:92
#3  0x000003fffd93c484 in __GI___assert_fail (assertion=0x1174f94 "!isnan(sp[i])", file=0x1172b30 "ggml.c", line=<optimized out>, function=0x1177570 <__PRETTY_FUNCTION__.31> "ggml_compute_forward_soft_max_f32") at assert.c:101
#4  0x0000000001057432 in ggml_compute_forward_soft_max_f32 (params=0x3fffffead28, src0=0x3fcaa2c0010, dst=0x3fcaa2c0160) at ggml.c:12731
#5  0x00000000010576ea in ggml_compute_forward_soft_max (params=0x3fffffead28, src0=0x3fcaa2c0010, dst=0x3fcaa2c0160) at ggml.c:12775
#6  0x0000000001066f96 in ggml_compute_forward (params=0x3fffffead28, tensor=0x3fcaa2c0160) at ggml.c:16378
#7  0x000000000106c2d0 in ggml_graph_compute_thread (data=0x3fffffeae68) at ggml.c:17968
#8  0x000000000106d75e in ggml_graph_compute (cgraph=0x3fcaa21e030, cplan=0x3fffffeb030) at ggml.c:18445
#9  0x000000000107bc44 in ggml_graph_compute_helper (buf=std::vector of length 44224, capacity 44224 = {...}, graph=0x3fcaa21e030, n_threads=4) at llama.cpp:478
#10 0x000000000108f0c8 in llama_decode_internal (lctx=..., batch=...) at llama.cpp:4140
#11 0x000000000109cefe in llama_decode (ctx=0x18f15d0, batch=...) at llama.cpp:7449
#12 0x000000000114911a in llama_init_from_gpt_params (params=...) at common/common.cpp:843
#13 0x0000000001010f18 in main (argc=7, argv=0x3ffffffe5b8) at examples/main/main.cpp:181

@KerfuffleV2
Collaborator

I think the difference is asserts probably don't run when compiled normally with optimization, so calculations just produce weird values instead of failing outright. There are two possible causes I can think of, the first is that the tensor data is just incorrect in the actual model file. The other is that the way the GGML operations are implemented just doesn't work on big endian for some reason.

Hmm... I don't know how much time you want to put into trying various random stuff, but one thing to try would be just printing out some fairly small tensor. Maybe that could even be done from the conversion script. blk.0.attn_norm.weight looks like a good candidate, since it's 1-dimensional.

You could possibly try something like:

    def add_tensor(self, name: str, tensor: np.ndarray[Any, Any], raw_shape: Sequence[int] | None = None, raw_dtype: GGMLQuantizationType | None = None):
        if name == 'blk.0.attn_norm.weight':
            print('BEFORE', tensor)
        tensor.byteswap(inplace=True)
        if name == 'blk.0.attn_norm.weight':
            print('AFTER', tensor)

That just dumps the tensor before and after the byteswap stuff. If byteswapping was needed, I'd expect to see a bunch of crazy values in the "before" and more normal values in "after".

The other place to do something similar would be in the main example or something, to just look up a small tensor like that one and dump some of the values right after loading the model. Waste time following my advice at your own peril. :)

@chenqiny
Contributor Author

chenqiny commented Oct 7, 2023

I fixed it in the end.

@KerfuffleV2 Many thanks for the help. Along the journey I did a lot of wrong things, but got to the right direction in the end.

I was using convert.py instead of convert-baichuan-hf-to-gguf.py. What is the difference between them?

This script does not call add_tensor, but does the same thing in its write_all function.

    @staticmethod
    def write_all(fname_out: Path, ftype: GGMLFileType, params: Params, model: LazyModel, vocab: Vocab, svocab: gguf.SpecialVocab, concurrency: int = DEFAULT_CONCURRENCY) -> None:
        check_vocab_size(params, vocab)

        of = OutputFile(fname_out)

        # meta data
        of.add_meta_arch(params)
        of.add_meta_vocab(vocab)
        of.add_meta_special_vocab(svocab)

        # tensor info
        for name, lazy_tensor in model.items():
            of.add_tensor_info(name, lazy_tensor)

The result:

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Build web site with 10 steps: first step, make sure the software program will be compatible using your computer or laptop.
Download our free demo to see how we can help you get more customers and increase sales! This is a very easy-to follow guide that shows even those who have never done anything like this before how it_s done. Just click on _free download_, enter the information, then sit back as your brand new website comes up fast with all of its features included for free _ no credit card necessary or any other payment plan required!
As you can see from our web site that we are one of the best SEO companies around and have been helping businesses get


@KerfuffleV2 it seems there is an endianness conversion issue.

I printed the first 16 words of the data member of the model's layer 0 attn_norm with gdb.

The first float32 is 0x3d8d0000, with value 0.0688476563, on x86. But it is 0x8d3d0000 on s390x, while in fact the big-endian representation should be 0x00008d3d on s390x.

x86:

(gdb) x/16xw model.layers[0].attn_norm.data
0x7ffce5df99a0: 0x3d8d0000      0x3d250000      0x3de70000      0x3d630000
0x7ffce5df99b0: 0x3d3b0000      0x3d590000      0x3d3b0000      0x3d120000
0x7ffce5df99c0: 0x3d460000      0x3d4d0000      0x3d380000      0x3d270000
0x7ffce5df99d0: 0x3d3d0000      0x3dae0000      0x3d310000      0x3d040000

s390:

(gdb) x/16xw model.layers[0].attn_norm.data
0x3fcec14b9a0:  0x8d3d0000      0x253d0000      0xe73d0000      0x633d0000
0x3fcec14b9b0:  0x3b3d0000      0x593d0000      0x3b3d0000      0x123d0000
0x3fcec14b9c0:  0x463d0000      0x4d3d0000      0x383d0000      0x273d0000
0x3fcec14b9d0:  0x3d3d0000      0xae3d0000      0x313d0000      0x043d0000
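
As a quick sanity check, the x86 pattern does decode to the expected weight value (standalone snippet, not from llama.cpp):

    import struct

    raw = bytes.fromhex("00008d3d")           # the four bytes behind the x86 word 0x3d8d0000
    print(struct.unpack("<f", raw)[0])        # 0.06884765625
    print(hex(struct.unpack("<I", raw)[0]))   # 0x3d8d0000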
