convert-hf : support direct Q8_0 conversion #7234
Conversation
Commits:
* convert-hf : support q8_0 conversion
* convert-hf : add missing ftype (this was messing with the checksums otherwise)
* convert-hf : add missing ftype to Baichuan and Xverse (I didn't notice these on my first pass)
Hi, this PR breaks model conversion on my system.
I was using an older version of numpy. However, I am hoping that it is possible to allow it to work with numpy 1.22 as it did before this commit, as a fallback? A lot of toolchains might still be on slightly older versions of numpy, and forcing the use of the newest version may not be ideal.
```python
if tensor_dtype == np.uint8:
    # Pre-quantized tensors are passed as raw bytes, so the last dimension is
    # converted from a byte count back to an element count for the stored shape.
    block_size, type_size = GGML_QUANT_SIZES[raw_dtype]
    if tensor_shape[-1] % type_size != 0:
        raise ValueError(f"Quantized tensor row size ({tensor_shape[-1]}) is not a multiple of {dtype.name} type size ({type_size})")
    tensor_shape = tuple(tensor_shape[:-1]) + (tensor_shape[-1] // type_size * block_size,)
```
This has broken copying of tensors on i-quants (and probably several others as well). Using

```
./gguf-new-metadata.py foo.IQ4_NL.gguf bar.gguf
```

you now get

```
ValueError: Quantized tensor row size (4096) is not a multiple of IQ4_NL type size (18)
```
The issue seems to be that the `type_size` is off by 2. However, I don't see why the tensor should be reshaped in this scenario, so this should probably be re-evaluated.
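For context, a minimal arithmetic sketch of why that check fires, assuming the standard `IQ4_NL` layout of 32 elements packed into 18 bytes (a 2-byte scale plus 16 bytes of 4-bit values); the row size in the error (4096) looks like an element count, while the check expects a byte count:

```python
# Illustrative (elements_per_block, bytes_per_block) pairs, in the spirit of
# gguf's GGML_QUANT_SIZES table; IQ4_NL packs 32 4-bit values plus a 2-byte scale.
QUANT_SIZES = {"IQ4_NL": (32, 18), "Q8_0": (32, 34)}

block_size, type_size = QUANT_SIZES["IQ4_NL"]

n_elements = 4096                               # logical row size, as reported by GGUFReader
print(n_elements % type_size)                   # 10 -> not a multiple of 18, hence the ValueError
n_bytes = n_elements // block_size * type_size  # 2304 bytes of packed IQ4_NL data
print(n_bytes % type_size)                      # 0 -> a byte-sized last dimension would pass
```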
Thanks for finding this!
I think it also breaks copying of all other quantized tensors in gguf-new-metadata. Sorry about that.

I think I found a way to fix this while also simplifying what happens to the shape in the round-trip between GGUFReader and GGUFWriter. See #7483.
This adds `Q8_0` conversion to `convert-hf-to-gguf.py`, and it results in EXACTLY the same files as if converted with `./quantize` from an `f32` model.

Note that this was NOT the case for `convert.py`, because it rounds to nearest even and divides by the scale, while the reference implementation in `ggml-quants.c` rounds away from zero and multiplies by the inverse of the scale.
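To make the rounding difference concrete, here is a minimal numpy sketch contrasting the two behaviours (illustrative code, not taken from either script; the per-block `amax / 127` scale follows the usual `Q8_0` scheme, and both function names are made up):

```python
import numpy as np

def q8_0_reference_style(block: np.ndarray) -> np.ndarray:
    """Round away from zero and multiply by the inverse scale, as ggml-quants.c does."""
    d = np.abs(block).max() / 127          # per-block scale
    inv_d = 1 / d if d != 0 else 0.0
    x = block.astype(np.float32) * inv_d
    # trunc(x + copysign(0.5, x)) rounds halfway cases away from zero, like C roundf()
    return np.trunc(x + np.copysign(0.5, x)).astype(np.int8)

def q8_0_convert_py_style(block: np.ndarray) -> np.ndarray:
    """Round to nearest even and divide by the scale, as the old convert.py path does."""
    d = np.abs(block).max() / 127
    x = block.astype(np.float32) / d if d != 0 else np.zeros_like(block, dtype=np.float32)
    return np.rint(x).astype(np.int8)      # np.rint rounds halfway cases to even

# Halfway cases expose the difference: 2.5 -> 3 vs 2, -64.5 -> -65 vs -64, etc.
block = np.array([2.5, -2.5, 127.0, -64.5], dtype=np.float32)
print(q8_0_reference_style(block))   # [  3  -3 127 -65]
print(q8_0_convert_py_style(block))  # [  2  -2 127 -64]
```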
Summary of changes:

- Add the missing `self.gguf_writer.add_file_type(self.ftype)` for `StableLMModel`, `InternLM2Model`, `PlamoModel`, `QwenModel`, `BaichuanModel`, and `XverseModel`.
- Add `gguf-py/gguf/quants.py` to put the `Q8_0` implementation there, and also move the `bf16` conversion in there too (a rough sketch of a vectorized `Q8_0` quantizer follows this list).
  - This sped up `bf16` conversion, from 40-60 MB/s on my machine to 104 MB/s.
- Make `GGUFWriter` support arbitrary quants with `np.uint8` dtype and an explicit `raw_dtype`.
TODO:

- Rename `Model.extra_f16_tensors` to `Model.extra_quantized_tensors`?
- Change `Q8_0` in `convert.py` to round in the same way as the reference implementation?

Testing
To be sure this Python implementation of `Q8_0` really is working in the exact same way as the reference implementation from `ggml-quants.c`, I'm testing conversion and quantization of a bunch of different model architectures. I recently got a big external hard drive, which makes storing the output of these tests much easier.

I'm doing pretty much the same thing for every model architecture tested below, using the `outfile` `{ftype}` templating introduced in #7158; a rough sketch of the per-model procedure is shown below.
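Here is that sketch (hedged: the local `model_dir`, the output filenames, and the exact `--outtype`/`--outfile` and `./quantize` invocations are assumptions, and explicit filenames stand in for the `{ftype}` templating):

```python
import hashlib
import subprocess

def sha256(path: str) -> str:
    """Hex digest of a file, computed in chunks to keep memory use low."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Local checkout of one of the Hugging Face models listed below (hypothetical path)
model_dir = "models/TinyLlama-1.1B-Chat-v1.0"

# Convert to f32, and directly to Q8_0, with the updated script
subprocess.run(["python3", "convert-hf-to-gguf.py", model_dir,
                "--outtype", "f32", "--outfile", "test-f32.gguf"], check=True)
subprocess.run(["python3", "convert-hf-to-gguf.py", model_dir,
                "--outtype", "q8_0", "--outfile", "test-q8_0-direct.gguf"], check=True)

# Reference path: quantize the f32 conversion with the C++ tool
subprocess.run(["./quantize", "test-f32.gguf", "test-q8_0-ref.gguf", "Q8_0"], check=True)

# The whole point: the direct conversion should be bit-identical to the reference
print(sha256("test-q8_0-direct.gguf"))
print(sha256("test-q8_0-ref.gguf"))
```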
I'd say there is some suspense when the checksums begin to appear. Will they match?

- (`torch.float32`) https://huggingface.co/BAAI/bge-small-en-v1.5
- (`torch.bfloat16`) https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
- (`torch.float32`) https://huggingface.co/jtatman/TinyMistral-248m-v2.5-4x-Moe
- (`torch.bfloat16`) https://huggingface.co/smallcloudai/Refact-1_6B-fim
- (`torch.float16`) https://huggingface.co/pansophic/rocket-3B
  - The original model is in `f16`, but I'm quantizing from `f32` anyway for consistency with the other tests.
  - AhA! Looking at the diff of the `gguf-dump` of each model shows this is a metadata problem! The `ftype` was missing!
  - After adding the `ftype` in 2b1e5ea: CHECKSUMS MATCH!!!
- (`torch.float16`)
  - Is the problem specific to `torch.float16` models? Bloom is in `float16`, yet it still worked for it.
  - AHA, the problem is that the `ftype` is not put in the model!!!
  - After adding the `ftype` in 2b1e5ea: CHECKSUMS MATCH!!!!
- (`torch.float32`) https://huggingface.co/jondurbin/bagel-dpo-2.8b-v0.2
- (`torch.float16`) https://huggingface.co/bigscience/bloom-560m
- (`torch.bfloat16`) https://huggingface.co/Qwen/Qwen-1_8B-Chat
  - Like `StableLMModel`, the `ftype` is missing. After adding the `ftype` in 2b1e5ea: checksums match!
- (`torch.bfloat16`) https://huggingface.co/internlm/internlm2-chat-1_8b
  - The `ftype` was missing, but I noticed it before first converting. After adding the `ftype` in 2b1e5ea: checksums match!