preemptive request: regarding possible bit shuffling sync (just in case) #150

Open

LostRuins opened this issue May 12, 2023 · 10 comments

@LostRuins
Contributor

LostRuins commented May 12, 2023

Hello, this is just a pre-emptive request. On the chance that the llama.cpp bit shuffling changes are synced to this repo, would it be possible to add some indication to these models to differentiate them from the old ones already in existence? A new field, a magic change, a version indicator, or something similar would be very useful. Perhaps #147 could be considered too?

The reason is that the models are otherwise indistinguishable (same file format and structure), so it will be hard to tell whether a model file uses bit shuffling or not (old models will load perfectly fine but just generate gibberish).

If the bit shuffling changes are not planned to be upstreamed, then please disregard this issue.

Thanks in advance!

  • Concedo
@ggerganov
Owner

I'm planning to back-port the changes soon.

The way I see it is that users of the ggml library have to implement their own versioning scheme for the models.
The examples in the ggml repo serve just as sample implementations. They are not meant to be used in production.

I'm open to suggestions, but I don't see it as necessary to extend the examples with versioning, since they provide scripts for generating quantized models from scratch from the originally distributed Python model files.

@LostRuins
Contributor Author

Ah, I get that. But I would say that this repo has sort of become the de facto standard, as all implementations I know of are based on the code here. Implementing my own koboldcpp versioning would fracture the ecosystem, since it wouldn't be supported by, for example, Rustformers LLM or llama-cpp-python, and vice versa.

Plus, there are quite a few people who use this GGML repo directly, converting their models here and sharing them for downstream use on HF, various forums, and Discord servers! (You have no idea how popular GGML has become, haha.) There are already hundreds of existing quantized models out there.

So since this is the base repo, I can be reasonably confident that all other integrators will follow whatever versioning approach you take, rather than leaving it to multiple individual downstream actors.

@philpax any thoughts on this?

@henk717

henk717 commented May 13, 2023

Plus one on this. It's basically impossible for end users to reliably differentiate the formats, and the only way we have been keeping things manageable on the user-support side is by supporting all of them, which this change would allow.

@jebcarter-space

Throwing my support in here as a user, and as someone who expects to make the case for deploying the upcoming open-source LLaMA-based models in a business environment. Having some kind of versioning on the quantization format will be a big help for support, not just in the near future but long term as branches fall behind.

I know the quantization scripts are available and the base models can always be re-quantized, but not everyone has the technical capacity for that.

Also, of course, offering my gratitude for the development of llama.cpp and the democratization of this technology that it is enabling. My "let's see what an AI co-writer that can't be taken from me looks like" project dropped its starting cost by a thousand bucks thanks to CPU inference.

  • Best wishes and good health

@LostRuins
Contributor Author

LostRuins commented May 13, 2023

To get the ball rolling, this is my rough proposal (anyone feel free to chip in or modify!)

Change the file magic from ggml to ggmf (0x67676d66), similar to what llama.cpp did when it started adding versioning.
Then add a 4-byte field after the magic for the file version.

Currently, the existing users of the ggmf magic are llama.cpp (which used file version == 1) and RWKV.cpp (which uses file versions >= 100).
To avoid collisions, I recommend starting the file version at the integer value 1000 and incrementing from there: breaking changes would then use increasing file versions 1001, 1002, and so on. This also avoids conflicting with the file versions already in use by llama.cpp.
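
For illustration, a minimal sketch of what writing and checking such a header could look like (the ggmf magic and the 1000+ version range come from the proposal above; the constant names and surrounding I/O are hypothetical, not actual code from this repo):

    #include <cstdint>
    #include <fstream>

    // Hypothetical constants illustrating the proposal; not from the actual codebase.
    static const uint32_t GGMF_MAGIC        = 0x67676d66; // "ggmf"
    static const uint32_t GGMF_FILE_VERSION = 1000;       // proposed starting version for this repo

    // Write the proposed header: magic followed by a 4-byte file version.
    void write_header(std::ofstream & fout) {
        fout.write((const char *) &GGMF_MAGIC,        sizeof(GGMF_MAGIC));
        fout.write((const char *) &GGMF_FILE_VERSION, sizeof(GGMF_FILE_VERSION));
    }

    // Read the header back and report the file version; returns false for unknown magic.
    bool read_header(std::ifstream & fin, uint32_t & version) {
        uint32_t magic = 0;
        fin.read((char *) &magic, sizeof(magic));
        if (magic != GGMF_MAGIC) {
            return false; // old "ggml" file or some other format
        }
        fin.read((char *) &version, sizeof(version));
        // version == 1    -> llama.cpp, version >= 100 -> RWKV.cpp,
        // version >= 1000 -> this repo under the proposal
        return true;
    }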

Thoughts?

@ggerganov
Owner

ggerganov commented May 13, 2023

Here is another approach:

  • Do not change the magic
  • Change this to GGML_FILE_VERSION 1000
  • Change this for all examples to:
    ftype += GGML_FILE_VERSION
    fout.write((char *) &ftype,           sizeof(hparams.f16));
  • In the loading code, when we read ftype from the header, we divide it by 1000 to get the quantization version and the remainder is the actual quantization type. I.e., the "old" quantized models, as well as the old and new F16 models, will have a quantization version of 0, and the new quantized models will have a version of 1
  • Upon a breaking change, we bump GGML_FILE_VERSION by 1000
  • llama.cpp models keep their current versioning, as they have a different magic anyway

The benefit of this approach is that all existing F16 model files will remain compatible and we don't have to update the existing Python conversion scripts. This will simplify my work, since otherwise I would need to update the F16 ggml and whisper.cpp models that I am hosting, for no reason.
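
A minimal sketch of the encode/decode round trip described above (the constant name follows the comment; the function wrappers and I/O are illustrative, not the actual example code):

    #include <cstdint>
    #include <fstream>

    // Sketch of the multiplexing scheme: the quantization version is folded into
    // the existing ftype header field, so old files decode to version 0.
    static const int32_t GGML_FILE_VERSION = 1000; // bumped by 1000 on each breaking change

    // Writing side (quantization examples; the F16 Python conversion scripts stay
    // unchanged, so F16 files keep decoding to version 0):
    void write_ftype(std::ofstream & fout, int32_t ftype) {
        ftype += GGML_FILE_VERSION;                      // e.g. quantization type 2 -> 1002
        fout.write((const char *) &ftype, sizeof(ftype));
    }

    // Loading side:
    void read_ftype(std::ifstream & fin, int32_t & ftype, int32_t & qnt_version) {
        fin.read((char *) &ftype, sizeof(ftype));
        qnt_version = ftype / 1000; // 0 for old quantized and all F16 files, 1 for new quantized files
        ftype       = ftype % 1000; // the actual quantization type
    }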

@LostRuins
Contributor Author

That is pretty clever. Hooray for multiplexing!

@philpax
Contributor

philpax commented May 14, 2023

That would work for us. If the bit shuffling sync changes are brought to this repo, can it be done in such a way that both quantization methods are available?

@ggerganov
Owner

@philpax

No - it would be very difficult to maintain so many SIMD routines.

I will now proceed with implementing the proposed versioning and syncing the changes from llama.cpp.

@ggerganov
Owner

I just added the GGML_QNT_VERSION constant to ggml.h.
It signifies the current quantization format version - currently 0.

When I merge the llama.cpp changes, I will bump this version to 1.
See the examples for how to use this information to determine whether a model file has an old quantization format.
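
For reference, a sketch of how a loader could check this, following the multiplexing scheme above (GGML_QNT_VERSION comes from ggml.h; the 1000 factor constant and the surrounding function are assumptions here, so check the examples for the exact usage):

    #include "ggml.h"  // provides GGML_QNT_VERSION
    #include <cstdint>
    #include <cstdio>

    // Assumed divisor from the scheme above (the repo may expose its own constant for this).
    static const int32_t QNT_VERSION_FACTOR = 1000;

    // Check the quantization version encoded in the ftype header field.
    bool check_qnt_version(int32_t ftype) {
        const int32_t qntvr = ftype / QNT_VERSION_FACTOR; // quantization format version of the file
        if (qntvr != GGML_QNT_VERSION) {
            fprintf(stderr, "unsupported quantization version %d (expected %d)\n",
                    (int) qntvr, GGML_QNT_VERSION);
            return false;
        }
        return true;
    }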

github-actions bot pushed a commit to KerfuffleV2/ggml-sys-bleedingedge that referenced this issue May 14, 2023
== Relevant log messages from source repo:

commit 601a033475645370483973817d987928ea95f36c
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Sun May 14 10:20:19 2023 +0300

    ggml : add GGML_QNT_VERSION to track quantization format changes

    ggerganov/ggml#150 (comment)