convert-hf : support direct Q8_0 conversion #7234

Merged: 3 commits into master on May 13, 2024

Conversation

compilade (Collaborator) commented on May 12, 2024:

This adds Q8_0 conversion to convert-hf-to-gguf.py, and it results in EXACTLY the same files as if converted with ./quantize from an f32 model.

Note that this was NOT the case for convert.py, because it rounds to nearest even and divides by the scale, while the reference implementation in ggml-quants.c rounds away from zero and multiplies by the inverse of the scale.
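For illustration, here is a minimal NumPy sketch of the reference scheme (each block of 32 floats becomes a 2-byte fp16 scale followed by 32 int8 values); this is not the actual gguf-py/gguf/quants.py code, and bit-identical output additionally requires matching ggml's float32 arithmetic exactly:

import numpy as np

QK8_0 = 32  # elements per Q8_0 block

def roundf(v: np.ndarray) -> np.ndarray:
    # C roundf(): round half away from zero (np.round would round half to even)
    return np.sign(v) * np.floor(np.abs(v) + np.float32(0.5))

def quantize_q8_0(x: np.ndarray) -> np.ndarray:
    # x: float32 tensor whose last dimension is a multiple of QK8_0
    blocks = x.astype(np.float32).reshape(-1, QK8_0)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    d = amax / np.float32(127)                       # per-block scale
    with np.errstate(divide="ignore"):
        inv_d = np.where(d == 0, np.float32(0), np.float32(1) / d)
    q = roundf(blocks * inv_d).astype(np.int8)       # multiply by the inverse of the scale
    # 2-byte fp16 scale followed by 32 int8 values -> 34 bytes per block
    return np.concatenate([d.astype(np.float16).view(np.uint8), q.view(np.uint8)], axis=-1)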

Summary of changes

  • Add missing self.gguf_writer.add_file_type(self.ftype) for StableLMModel, InternLM2Model, PlamoModel, QwenModel, BaichuanModel, and XverseModel.
    • This was messing up the checksums otherwise
  • Add gguf-py/gguf/quants.py to put the Q8_0 implementation there and also move bf16 conversion in there too.
  • Make lazy tensors support shape and dtype changes from operations which won't run on meta tensors
    • Useful with NumPy, which doesn't have true meta tensors
  • Performance improvement for bf16 conversion, from 40-60 MB/s on my machine to 104 MB/s
  • Make GGUFWriter support arbitrary quants with np.uint8 dtype
    • It now corrects the shape using the type size and the block size of the raw_dtype (a worked example follows this list)
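As a worked example of that shape correction, using the Q8_0 block layout (32 elements stored in 34 bytes: a 2-byte fp16 scale plus 32 int8 values, as listed in GGML_QUANT_SIZES):

block_size, type_size = 32, 34            # GGML_QUANT_SIZES[GGMLQuantizationType.Q8_0]
byte_row = 4352                           # bytes in one quantized row of 4096 elements
elem_row = byte_row // type_size * block_size
assert elem_row == 4096                   # logical (element) row size recorded in the tensor info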

TODO:

  • Maybe rename Model.extra_f16_tensors to Model.extra_quantized_tensors
  • Maybe also fix Q8_0 in convert.py to round in the same way as the reference implementation?

Testing

To be sure this Python implementation of Q8_0 really is working in the exact same way as the reference implementation from ggml-quants.c, I'm testing conversion and quantization of a bunch of different model architectures.

I recently got a big external hard drive, which makes storing the output of these tests much easier.

I'm doing pretty much this for every model architecture tested below
(using the {ftype} templating for --outfile introduced in #7158):

$ python3 convert-hf-to-gguf.py --outtype f32 --outfile /srv/LLMstash/tmp/model-name-lazy-convert.{ftype}.gguf /srv/LLMstash/src/model-dir/
$ python3 convert-hf-to-gguf.py --outtype bf16 --outfile /srv/LLMstash/tmp/model-name-lazy-convert.{ftype}.gguf /srv/LLMstash/src/model-dir/
$ python3 convert-hf-to-gguf.py --outtype q8_0 --outfile /srv/LLMstash/tmp/model-name-lazy-convert.{ftype}.gguf /srv/LLMstash/src/model-dir/
$ ./build/bin/quantize /srv/LLMstash/tmp/model-name-{lazy-convert.f32,quantize.bf16}.gguf bf16
$ ./build/bin/quantize /srv/LLMstash/tmp/model-name-{lazy-convert.f32,quantize.q8_0}.gguf q8_0
$ sha256sum /srv/LLMstash/tmp/model-name-*.{bf16,q8_0}.gguf

I'd say there is some suspense when the checksums begin to appear. Will they match?

  • BERT bge-small (torch.float32) https://huggingface.co/BAAI/bge-small-en-v1.5
    • @compilade checksums match
      $ sha256sum bge-small-*.{bf16,q8_0}.gguf
      95d9de5f3b6118b62d668b3e1cf1a675ed1a7334d37c0e08904fd2fe89afa880  bge-small-lazy-convert.bf16.gguf
      95d9de5f3b6118b62d668b3e1cf1a675ed1a7334d37c0e08904fd2fe89afa880  bge-small-quantize.bf16.gguf
      2940cc3f35e94fcc9e2cc35ed75e6524941d4c45ecec8a29a4cb013934bf3b1a  bge-small-lazy-convert.q8_0.gguf
      2940cc3f35e94fcc9e2cc35ed75e6524941d4c45ecec8a29a4cb013934bf3b1a  bge-small-quantize.q8_0.gguf
  • Tinyllama (torch.bfloat16) https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
    • @compilade checksums match
      $ sha256sum tinyllama-*.{bf16,q8_0}.gguf
      5af5abc9122fd3cd113b5bdd828e0f1c1737a432e3de0b80cd403a32659f07a6  tinyllama-lazy-convert.bf16.gguf
      5af5abc9122fd3cd113b5bdd828e0f1c1737a432e3de0b80cd403a32659f07a6  tinyllama-quantize.bf16.gguf
      06d3a29d46ac6d2a128a4c357544f24e311a582eb9af18134db76fe3044111e8  tinyllama-lazy-convert.q8_0.gguf
      06d3a29d46ac6d2a128a4c357544f24e311a582eb9af18134db76fe3044111e8  tinyllama-quantize.q8_0.gguf
  • TinyMistral MoE (torch.float32) https://huggingface.co/jtatman/TinyMistral-248m-v2.5-4x-Moe
    • @compilade checksums match
      $ sha256sum tinymistral-moe-*.{bf16,q8_0}.gguf
      d6e6e1977ffa9cc365d425c107d7a79770b9425cab9db04d695e092fedd00d72  tinymistral-moe-lazy-convert.bf16.gguf
      d6e6e1977ffa9cc365d425c107d7a79770b9425cab9db04d695e092fedd00d72  tinymistral-moe-quantize.bf16.gguf
      316c22ee97cd3717736ff7cd4ee6cabfdb811b389bc23c4d7cf071c0b514144b  tinymistral-moe-lazy-convert.q8_0.gguf
      316c22ee97cd3717736ff7cd4ee6cabfdb811b389bc23c4d7cf071c0b514144b  tinymistral-moe-quantize.q8_0.gguf
  • Refact 1.6B fim (torch.bfloat16) https://huggingface.co/smallcloudai/Refact-1_6B-fim
    • @compilade checksums match
      $ sha256sum refact-*.{bf16,q8_0}.gguf
      7536808b072cb49466d4346e9421379cb20b3730a22615ffc32d2e8fad1a794d  refact-lazy-convert.bf16.gguf
      7536808b072cb49466d4346e9421379cb20b3730a22615ffc32d2e8fad1a794d  refact-quantize.bf16.gguf
      0f904f322b3857bfa40e608b6d18fad96cb85bfa000769ccf67957f9a91194e8  refact-lazy-convert.q8_0.gguf
      0f904f322b3857bfa40e608b6d18fad96cb85bfa000769ccf67957f9a91194e8  refact-quantize.q8_0.gguf
  • Rocket 3B (StableLM) (torch.float16) https://huggingface.co/pansophic/rocket-3B
    • @compilade It's already in f16, but I'm quantizing from f32 anyway for consistency with the other tests.
      ⚠️ checksums DON'T match. Not sure why. Anyone know of a program for diffing very large files?
      Aha! Diffing the gguf-dump output of each model shows this is a metadata problem: the ftype was missing! (A sketch of this kind of metadata comparison follows the list.)
      $ sha256sum rocket-3b-*.{bf16,q8_0}.gguf
      b9f3713f40ae8c567125b58e714f40af6fd4345fa86ca5ba95beb6f63ba84915  rocket-3b-lazy-convert.bf16.gguf
      356822c055c353e2291c147632607f61d16c66b4de117f00cf3acb88975c7057  rocket-3b-quantize.bf16.gguf
      356822c055c353e2291c147632607f61d16c66b4de117f00cf3acb88975c7057  rocket-3b-quantize-from-f16.bf16.gguf
      64ac7a549278d9dcf0bc13317d779409ae8e86a0894dad9678b923d634197021  rocket-3b-lazy-convert.q8_0.gguf
      3c9664651c5b5319c1c1d8b592527a228442a56d2ceb05ff349818d1ad707d4a  rocket-3b-quantize.q8_0.gguf
      3c9664651c5b5319c1c1d8b592527a228442a56d2ceb05ff349818d1ad707d4a  rocket-3b-quantize-from-f16.q8_0.gguf
      ✔️ After adding the ftype in 2b1e5ea: CHECKSUMS MATCH!!!
      $ sha256sum rocket-3b-{lazy-convert,quantize}.{bf16,q8_0}.gguf
      5c7478a782b593835422b865b9a639809e2cdac23ce479bb571e94e8124b275e  rocket-3b-lazy-convert.bf16.gguf
      5c7478a782b593835422b865b9a639809e2cdac23ce479bb571e94e8124b275e  rocket-3b-quantize.bf16.gguf
      9a9c298964e4d3ee1ff318e4b4d8235b7fa2f7912a52748a4fc4ab1eef502793  rocket-3b-lazy-convert.q8_0.gguf
      9a9c298964e4d3ee1ff318e4b4d8235b7fa2f7912a52748a4fc4ab1eef502793  rocket-3b-quantize.q8_0.gguf
  • StableLM 2 1.6B (torch.float16)
    • @compilade Had to ignore the pre-tokenizer detection error.
      ⚠️ checksums DON'T match. Is this specific to StableLM or to torch.float16 models? Bloom is also in float16, yet it still worked there.
      Aha, the problem is that the ftype is not put in the model!!!
      $ sha256sum stablelm2-1_6b-*.{bf16,q8_0}.gguf
      8125ba3fb52d8c23b7775a6cbb4349e6d3ee327a4fb8eabf0b967af5f1f0ec52  stablelm2-1_6b-lazy-convert.bf16.gguf
      503fb278646f090c3a9bff44c4e863ed7bdf5e641c4b5654013ea675e8bb1e83  stablelm2-1_6b-quantize.bf16.gguf
      4ea2ed24eda9dc0014cd9d114bff312dde96538b7a1e4b3471f15ccdfa0342d8  stablelm2-1_6b-lazy-convert.q8_0.gguf
      3b208039d839de88d21c22d50d3ebb50fa5f9ff96ee36ecaeb7c3ed04b5e55ae  stablelm2-1_6b-quantize.q8_0.gguf
      ✔️ After adding the ftype in 2b1e5ea: CHECKSUMS MATCH!!!!
      $ sha256sum stablelm2-1_6b-{lazy-convert,quantize}.{bf16,q8_0}.gguf
      44070519faa81b81b5720a0441e82e070698f9e7dc1f2be777878697a8ddd3a1  stablelm2-1_6b-lazy-convert.bf16.gguf
      44070519faa81b81b5720a0441e82e070698f9e7dc1f2be777878697a8ddd3a1  stablelm2-1_6b-quantize.bf16.gguf
      a41c8ef9e3728506c27d5fd0bb73257b6ee781bd7d2ad83623d4e85319a36423  stablelm2-1_6b-lazy-convert.q8_0.gguf
      a41c8ef9e3728506c27d5fd0bb73257b6ee781bd7d2ad83623d4e85319a36423  stablelm2-1_6b-quantize.q8_0.gguf
  • Mamba 2.8B (torch.float32) https://huggingface.co/jondurbin/bagel-dpo-2.8b-v0.2
    • @compilade checksums match
      $ sha256sum mamba-bagel-*.{bf16,q8_0}.gguf
      5243cbf5e394709db46332c71d66841bd077a195c714fa942ad2e1952501c1e4  mamba-bagel-lazy-convert.bf16.gguf
      5243cbf5e394709db46332c71d66841bd077a195c714fa942ad2e1952501c1e4  mamba-bagel-quantize.bf16.gguf
      0d11e763fae7d8b2186d6b72a90974832e1f095a49f7c9bb66e21c8f1fc7484d  mamba-bagel-lazy-convert.q8_0.gguf
      0d11e763fae7d8b2186d6b72a90974832e1f095a49f7c9bb66e21c8f1fc7484d  mamba-bagel-quantize.q8_0.gguf
  • Bloom 560m (torch.float16) https://huggingface.co/bigscience/bloom-560m
    • @compilade checksums match
      $ sha256sum bloom-560m-*.{bf16,q8_0}.gguf
      00914f5c6cda007f183ff748a97ee0d0ae409328837f4099a502927e6f2e3a9e  bloom-560m-lazy-convert.bf16.gguf
      00914f5c6cda007f183ff748a97ee0d0ae409328837f4099a502927e6f2e3a9e  bloom-560m-quantize.bf16.gguf
      1f389f7c25659580269e4f1d986ac93ba6307d0dbcb14f4d05a120406b287580  bloom-560m-lazy-convert.q8_0.gguf
      1f389f7c25659580269e4f1d986ac93ba6307d0dbcb14f4d05a120406b287580  bloom-560m-quantize.q8_0.gguf
  • Qwen 1.6B Chat (torch.bfloat16) https://huggingface.co/Qwen/Qwen-1_8B-Chat
    • @compilade ⚠️ Same problem as with StableLMModel: the ftype is missing.
      $ sha256sum qwen-1_8-*.{bf16,q8_0}.gguf
      73168f805cec5ac836c6de8f62babbb57268a11bf1a5971704bdf28eb3989bfa  qwen-1_8-lazy-convert.bf16.gguf
      3168513b9a2e1d09f1e29ff165dd29a9808f8fbee4c67a3ce27186c73701d033  qwen-1_8-quantize.bf16.gguf
      71a1a117122290dac59bc3b6d0a94e481d6bd333c4c7a1cb50671a44d98fd248  qwen-1_8-lazy-convert.q8_0.gguf
      d5beaa3052d82fcae0098d641a03c2fb7e36f7db8c1100463f4f38d995237ab7  qwen-1_8-quantize.q8_0.gguf
      ✔️ After adding the ftype in 2b1e5ea: checksums match!
      $ sha256sum qwen-1_8-{lazy-convert,quantize}.{bf16,q8_0}.gguf
      f3e6576265bb0a9a3a5143c55980539d46feed959fdce141d7607ae98a075408  qwen-1_8-lazy-convert.bf16.gguf
      f3e6576265bb0a9a3a5143c55980539d46feed959fdce141d7607ae98a075408  qwen-1_8-quantize.bf16.gguf
      62ed4a9fe70fdf2c32f4d127e3a231fa6000d9fed9ae838a22297032f6409e7f  qwen-1_8-lazy-convert.q8_0.gguf
      62ed4a9fe70fdf2c32f4d127e3a231fa6000d9fed9ae838a22297032f6409e7f  qwen-1_8-quantize.q8_0.gguf
  • InternLM2 (torch.bfloat16) https://huggingface.co/internlm/internlm2-chat-1_8b
    • @compilade The ftype was missing, but I noticed it before first converting. After adding the ftype in 2b1e5ea: checksums match!
      $ sha256sum internlm2-chat-1_8b-{lazy-convert,quantize}.{bf16,q8_0}.gguf
      6eb940c38b768314b57c3df7bb42fd4b5d87e0f7cb9297a6685f9c602a9f310c  internlm2-chat-1_8b-lazy-convert.bf16.gguf
      6eb940c38b768314b57c3df7bb42fd4b5d87e0f7cb9297a6685f9c602a9f310c  internlm2-chat-1_8b-quantize.bf16.gguf
      2ba4217854da0ba0a7489ef0c44692aa298343a907e77acf486ba9f6d4915e9f  internlm2-chat-1_8b-lazy-convert.q8_0.gguf
      2ba4217854da0ba0a7489ef0c44692aa298343a907e77acf486ba9f6d4915e9f  internlm2-chat-1_8b-quantize.q8_0.gguf
  • Other architectures (will most likely work)
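Below is that kind of metadata comparison as a rough Python sketch (assuming the gguf-py GGUFReader attributes fields and tensors; the file names are placeholders, and diffing gguf-dump output works just as well):

from gguf import GGUFReader

def summary(path: str) -> tuple[list[str], list[tuple[str, str]]]:
    reader = GGUFReader(path)
    kv_keys = sorted(reader.fields.keys())                            # metadata key-value keys
    tensors = [(t.name, t.tensor_type.name) for t in reader.tensors]  # tensor names and quant types
    return kv_keys, tensors

keys_a, tensors_a = summary("rocket-3b-lazy-convert.q8_0.gguf")
keys_b, tensors_b = summary("rocket-3b-quantize.q8_0.gguf")
print(set(keys_a) ^ set(keys_b))  # e.g. {'general.file_type'} when the ftype is missing
print(tensors_a == tensors_b)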

compilade added the enhancement, python, and Review Complexity: Medium labels on May 12, 2024.
compilade merged commit ee52225 into master on May 13, 2024. 25 checks passed.
teleprint-me pushed a commit to teleprint-me/llama.cpp referencing this pull request on May 17, 2024:
* convert-hf : support q8_0 conversion

* convert-hf : add missing ftype

This was messing with the checksums otherwise.

* convert-hf : add missing ftype to Baichuan and Xverse

I didn't notice these on my first pass.
LostRuins (Collaborator) commented:

Hi, this PR breaks model conversion on my system.

  File "E:\LLaMA\llamacpp\gguf-py\gguf\lazy.py", line 9, in <module>
    from numpy._typing import _Shape
ModuleNotFoundError: No module named 'numpy._typing'

I was using numpy-1.22.3. After force-upgrading my env to the latest numpy-1.26.4, the script works.

However, I am hoping it can fall back to working with numpy 1.22 as it did before this commit. A lot of toolchains might still be on slightly older versions of numpy, and forcing the newest version may not be ideal.

#7380
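For reference, one possible shape of such a fallback (a sketch only, assuming _Shape is needed solely for type annotations in lazy.py; not necessarily how #7380 resolves it):

# gguf-py/gguf/lazy.py: guard the private-module import for older NumPy releases
try:
    from numpy._typing import _Shape
except ImportError:  # e.g. numpy 1.22, which does not expose numpy._typing
    _Shape = tuple   # placeholder; only used in type annotations here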

Comment on lines +231 to +235
if tensor_dtype == np.uint8:
    block_size, type_size = GGML_QUANT_SIZES[raw_dtype]
    if tensor_shape[-1] % type_size != 0:
        raise ValueError(f"Quantized tensor row size ({tensor_shape[-1]}) is not a multiple of {dtype.name} type size ({type_size})")
    tensor_shape = tuple(tensor_shape[:-1]) + (tensor_shape[-1] // type_size * block_size,)
CISC (Contributor) commented on May 23, 2024:

This has broken copying of tensors on i-quants (and probably several others as well), using

./gguf-new-metadata.py foo.IQ4_NL.gguf bar.gguf

you now get

ValueError: Quantized tensor row size (4096) is not a multiple of IQ4_NL type size (18)

CISC (Contributor) added:

The issue seems to be that the type_size is off by 2; however, I don't see why the tensor should be reshaped in this scenario, so this should probably be re-evaluated.
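Some arithmetic makes the "off by 2" observation concrete (IQ4_NL stores 32 elements in 18 bytes: a 2-byte fp16 scale plus 16 bytes of 4-bit indices), and is consistent with the check receiving an element-count row size where a byte-count one is expected:

block_size, type_size = 32, 18        # GGML_QUANT_SIZES[GGMLQuantizationType.IQ4_NL]
row = 4096                            # the row size reported by the error, in elements
print(row % type_size)                # 10 -> fails the check above
print(row % (type_size - 2))          # 0  -> would pass if the 2-byte scale were excluded
print(row // block_size * type_size)  # 2304 -> the byte-count row size the check expects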

compilade (Collaborator, Author) replied:

Thanks for finding this!
I think it also breaks copying of all other quantized tensors in gguf-new-metadata.
Sorry about that.

I think I found a way to fix this while also simplifying what happens to the shape in the round-trip between GGUFReader and GGUFWriter. See #7483
