Update to latest llama.cpp #118

Merged on Mar 31, 2023 (3 commits)
2 changes: 1 addition & 1 deletion Dockerfile
@@ -21,7 +21,7 @@ RUN apt update && \
echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/6.0 multiverse" | tee /etc/apt/sources.list.d/mongodb-org-6.0.list && \
apt-get update && \
apt-get install -y mongodb-org && \
- git clone https://github.com/ggerganov/llama.cpp.git --branch master-5a5f8b1
+ git clone https://github.com/ggerganov/llama.cpp.git --branch master-ee0c40d

RUN pip install --upgrade pip

1,300 changes: 1,300 additions & 0 deletions api/poetry.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions api/pyproject.toml
@@ -64,3 +64,4 @@ beanie = "^1.17.0"
dnspython = "^2.3.0"
lazy-model = "^0.0.5"
requests = "^2.28.2"
+ numpy = "^1.24.2"
4 changes: 2 additions & 2 deletions api/src/serge/routers/chat.py
@@ -124,7 +124,7 @@ async def event_generator():
prompt=full_prompt,
params=chat.parameters,
):
- await asyncio.sleep(0.1)
+ await asyncio.sleep(0.01)
Member: What's the purpose of these sleeps?

Member Author: On a fast machine we generate a token roughly every 100 ms at best, so there's no point in checking the program's output buffer more often than that. The sleep was there to keep the infinite loop from locking up resources by running constantly.

What I realized was that we had a chunk size of 4 bytes and checked the buffer every 0.1 s, so we were fetching at most (1/0.1) * 4 = 40 bytes per second. That's usually enough, but when we load the initial prompt the output arrives much faster than that, and we were slowing things down for no reason. It was bad design on my side :/

The symptom was that CPU activity would decrease but the answer still took a while to appear in the chat: it had been fully generated and was just being read slowly from the output buffer. 🤦 Now, with a sleep of 0.01 s and a chunk size of 64 bytes, I don't expect we'll have a problem.
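To illustrate the numbers above, here is a minimal sketch of the polling pattern being tuned (not serge's actual generate() implementation; the process handle and names are assumptions). The read throughput is capped at roughly CHUNK_SIZE divided by the sleep interval, i.e. about 4 / 0.1 = 40 bytes/s before this change and 64 / 0.01 = 6,400 bytes/s after:

```python
import asyncio

CHUNK_SIZE = 64       # bytes read per poll (was 4)
POLL_INTERVAL = 0.01  # seconds between polls (was 0.1)

async def stream_output(proc: asyncio.subprocess.Process):
    """Yield decoded chunks from the subprocess's stdout as they arrive."""
    while True:
        await asyncio.sleep(POLL_INTERVAL)          # avoid spinning the event loop constantly
        chunk = await proc.stdout.read(CHUNK_SIZE)  # returns at most CHUNK_SIZE bytes
        if not chunk:                               # b"" means the process closed stdout
            break
        yield chunk.decode("utf-8", errors="ignore")
```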


chunks.append(output)
full_answer += output
@@ -161,7 +161,7 @@ async def ask_a_question(chat_id: str, prompt: str):
prompt=full_prompt,
params=chat.parameters,
):
- await asyncio.sleep(0.1)
+ await asyncio.sleep(0.01)
answer += output
except Exception as e:
error = e.__str__()
3 changes: 3 additions & 0 deletions api/src/serge/utils/convert.py
@@ -8,7 +8,9 @@
import os
import struct
import sys

from sentencepiece import SentencePieceProcessor
+ from serge.utils.migrate import migrate

HPARAMS = keys = ["vocab_size", "dim", "multiple_of", "n_heads", "n_layers"]

@@ -115,6 +117,7 @@ def convert_all(dir_model: str, tokenizer_model: str):
tokenizer = SentencePieceProcessor(tokenizer_model)
for file in files:
convert_one_file(file, tokenizer)
+ migrate(file)
except OSError:
print("Missing tokenizer, don't forget to download it!")

2 changes: 1 addition & 1 deletion api/src/serge/utils/generate.py
@@ -10,7 +10,7 @@ async def generate(
prompt: str,
params: ChatParameters
):
- CHUNK_SIZE = 4
+ CHUNK_SIZE = 64
await params.fetch_all_links()

args = (
305 changes: 305 additions & 0 deletions api/src/serge/utils/migrate.py
@@ -0,0 +1,305 @@
# Migrate ggml file(s) with ggmf magic to ggml file with ggjt magic
Member: Can't we just cp this file in the Dockerfile after the git clone?

Member Author: The file is different: I modified the original script so it can be called as a function, and it shuffles files around differently than the original so that it works better with serge. I hope the content doesn't change too often 😅

#
# We caused a breaking change to the file format on 2023-03-30 in:
# https://github.com/ggerganov/llama.cpp/pull/613
#
# (1) If you still have the Meta LLaMA .pth files, then close this
# file now; you can just run `convert-pth-to-ggml.py` again to
# migrate to the new format. The tool is easier to use too. It
# isn't necessary anymore to manage split output files because
# the new format always combines things into a single file.
#
# (2) If you deleted the Meta LLaMA .pth files to save on disk
# space, then this tool is intended to help you. Please check
# out the instructions below.
#
# USAGE
#
# python migrate-ggml-2023-03-30-pr613.py INPUT OUTPUT
#
# PREREQUISITES
#
# pip install numpy
# cd llama.cpp
# make -j4
#
# EXAMPLE (7B MODEL)
#
# # you can replace all the 'f16' with 'q4_0' if you're using quantized weights
# python migrate-ggml-2023-03-30-pr613.py models/7B/ggml-model-f16.bin models/7B/ggml-model-f16-ggjt.bin
#
# # check that it works
# ./main -m models/7B/ggml-model-f16-ggjt.bin -p 'Question: Do you love me?'
#
# # you can delete the old files
# rm -f models/7B/ggml-model-f16.bin
# mv models/7B/ggml-model-f16-ggjt.bin models/7B/ggml-model-f16.bin
#
# EXAMPLE (13B MODEL)
#
# # you can replace all the 'f16' with 'q4_0' if you're using quantized weights
# python migrate-ggml-2023-03-30-pr613.py models/13B/ggml-model-f16.bin models/13B/ggml-model-f16-ggjt.bin
#
# # check that it works
# ./main -m models/13B/ggml-model-f16-ggjt.bin -p 'Question: Do you love me?'
#
# # you can delete the old files
# rm -f models/13B/ggml-model-f16.bin*
# mv models/13B/ggml-model-f16-ggjt.bin models/13B/ggml-model-f16.bin
#

import argparse
import os
import sys
import json
import struct
import numpy as np

QK = 32

GGML_TYPE_Q4_0 = 0
GGML_TYPE_Q4_1 = 1
GGML_TYPE_I8 = 2
GGML_TYPE_I16 = 3
GGML_TYPE_I32 = 4
GGML_TYPE_F16 = 5
GGML_TYPE_F32 = 6

WTYPE_NAMES = {
0: "F32",
1: "F16",
2: "Q4_0",
3: "Q4_1",
}

WTYPES = {
0: GGML_TYPE_F32,
1: GGML_TYPE_F16,
2: GGML_TYPE_Q4_0,
3: GGML_TYPE_Q4_1,
}

GGML_BLCK_SIZE = {
GGML_TYPE_Q4_0: QK,
GGML_TYPE_Q4_1: QK,
GGML_TYPE_I8: 1,
GGML_TYPE_I16: 1,
GGML_TYPE_I32: 1,
GGML_TYPE_F16: 1,
GGML_TYPE_F32: 1,
}

GGML_TYPE_SIZE = {
GGML_TYPE_Q4_0: 4 + QK//2,
GGML_TYPE_Q4_1: 4*2 + QK//2,
GGML_TYPE_I8: 1,
GGML_TYPE_I16: 2,
GGML_TYPE_I32: 4,
GGML_TYPE_F16: 2,
GGML_TYPE_F32: 4,
}

HPARAMS = [
'magic', # int32
'version', # int32
'n_vocab', # int32
'n_embd', # int32
'n_mult', # int32
'n_head', # int32
'n_layer', # int32
'n_rot', # int32
'f16', # int32
]

def read_hparams(fin):
struct_fmt = "i" * len(HPARAMS)
struct_size = struct.calcsize(struct_fmt)
buf = fin.read(struct_size)
ints = struct.unpack(struct_fmt, buf)
hparams = dict(zip(HPARAMS, ints))
return hparams

def write_hparams(fout, hparams):
struct_fmt = "i" * len(HPARAMS)
struct_size = struct.calcsize(struct_fmt)
ints = [hparams[h] for h in HPARAMS]
fout.write(struct.pack(struct_fmt, *ints))

def read_tokens(fin, hparams):
tokens = []
for i in range(hparams['n_vocab']):
len_b = fin.read(4)
(length,) = struct.unpack("i", len_b)
word = fin.read(length)
score_b = fin.read(4)
(score,) = struct.unpack("f", score_b)
tokens.append((word, score))
return tokens

def write_tokens(fout, tokens):
for word, score in tokens:
fout.write(struct.pack("i", len(word)))
fout.write(word)
fout.write(struct.pack("f", score))

def ggml_nelements(shape):
r = 1
for i in shape:
r *= i
return r

def ggml_nbytes(shape, ftype):
x = ggml_nelements(shape)
t = WTYPES[ftype]
x *= GGML_TYPE_SIZE[t]
x //= GGML_BLCK_SIZE[t]
return x

def copy_tensors(fin, fout, part_id, n_parts):
while True:

b = fin.read(4)
if not b: break
(n_dims,) = struct.unpack("i", b)
b = fin.read(4)
(length,) = struct.unpack("i", b)
b = fin.read(4)
(ftype,) = struct.unpack("i", b)

assert n_dims in (1, 2)

partshape = list(range(n_dims))
for i in range(n_dims):
b = fin.read(4)
partshape[i] = struct.unpack("i", b)[0]
partshape = list(reversed(partshape))

name = fin.read(length)
data = fin.read(ggml_nbytes(partshape, ftype))

blck_size = GGML_BLCK_SIZE[WTYPES[ftype]]
type_size = GGML_TYPE_SIZE[WTYPES[ftype]]

print(f"Processing tensor {name} with shape: {partshape} and type: {WTYPE_NAMES[ftype]}")

# determine dimension along which multipart tensor is sharded
#
# split_dim 0 regex:
# - output.*
# - layers.*.attention.wq.weight
# - layers.*.attention.wk.weight
# - layers.*.attention.wv.weight
# - layers.*.feed_forward.w1.weight
# - layers.*.feed_forward.w3.weight
#
# split_dim 1 regex:
# - tok_embeddings.*
# - layers.*.attention.wo.weight
# - layers.*.feed_forward.w2.weight
#
if n_dims > 1:
split_dim = 1
if b"tok_embeddings" in name:
split_dim = 1
elif b"layers" in name:
if b"attention.wo.weight" in name:
split_dim = 1
elif b"feed_forward.w2.weight" in name:
split_dim = 1
else:
split_dim = 0
elif b"output" in name:
split_dim = 0

# output tensor header
fullshape = list(partshape)
if n_dims > 1:
fullshape[split_dim] *= n_parts
fout.write(struct.pack("iii", n_dims, len(name), ftype))
for dim in reversed(fullshape):
fout.write(struct.pack("i", dim))
fout.write(name)

# ensure tensor data is aligned
tensor_data_offset = fout.tell()
while tensor_data_offset % QK != 0:
fout.write(struct.pack("B", 0))
tensor_data_offset += 1

# output unified mappable tensor data
if n_dims == 1 or n_parts == 1:
# copy tensor which we thankfully received in one piece
if part_id == 0:
fout.write(data)
elif split_dim == 0:
# reassemble multifile tensor containing some of the rows
rows_per_chunk = partshape[0]
current_row = part_id * rows_per_chunk
bytes_per_row = fullshape[1] // blck_size * type_size
offset = current_row * bytes_per_row
fout.seek(tensor_data_offset + offset)
fout.write(data)
elif split_dim == 1:
# reassemble multifile tensor containing some of the cols
cols_per_chunk = partshape[1]
current_col = part_id * cols_per_chunk
bpr = partshape[1] // blck_size * type_size
bytes_per_row = fullshape[1] // blck_size * type_size
offset_current_col = current_col // blck_size * type_size
for row in range(partshape[0]):
offset_row = row * bytes_per_row
offset = offset_row + offset_current_col
fout.seek(tensor_data_offset + offset)
fout.write(data[row * bpr:row * bpr + bpr])

# advance file position to next tensor
fout.seek(tensor_data_offset + ggml_nbytes(fullshape, ftype))

def migrate(fin_path):
assert fin_path
assert os.path.exists(fin_path)

with open(fin_path, "rb") as fin:
hparams = read_hparams(fin)
tokens = read_tokens(fin, hparams)

if hparams['magic'] == 0x67676a74: # ggjt
print("%s: input ggml has already been converted to 'ggjt' magic\n" %
(fin_path))
return

if hparams['magic'] != 0x67676d66: # ggmf
print("%s: input ggml file doesn't have expected 'ggmf' magic: %#x\n" %
(fin_path, hparams['magic']))
return

hparams['magic'] = 0x67676a74 # ggjt

# count number of multipart files by convention
n_parts = 1
while True:
if os.path.exists("%s.%d" % (fin_path, n_parts)):
n_parts += 1
else:
break

# we output a single file for ggml
with open(fin_path+".migrated", "wb") as fout:
write_hparams(fout, hparams)
write_tokens(fout, tokens)
offset_of_tensors = fout.tell()
# the tensors we load could be split across multiple files
for part_id in range(n_parts):
fout.seek(offset_of_tensors)
print(f"Processing part {part_id+1} of {n_parts}\n")
            # keep fin_path pointing at the base file; build each part's path separately
            part_path = fin_path
            if part_id > 0:
                part_path += ".%d" % (part_id)
            with open(part_path, "rb") as fin:
read_tokens(fin, read_hparams(fin))
copy_tensors(fin, fout, part_id, n_parts)

os.remove(fin_path)
os.rename(fin_path+".migrated", fin_path)

print(f"Done. Output file: {fin_path+'.migrated'}\n")