Preemptive request: regarding possible bit shuffling sync (just in case) #150
Comments
I'm planning to back-port the changes soon. The way I see it is that users of the […] I'm open to suggestions, but I don't see it as necessary to extend the examples with versions, as they provide scripts for generating quantized models from scratch using the originally distributed Python model files.
Ah, I get that. But I would say that this repo has sort of become the de-facto standard, as all implementations I know of are based on the code here. Implementing my own koboldcpp versioning would fracture the ecosystem, since it wouldn't be supported by, for example, Rustformers LLM or llama-cpp-python, and vice versa. Plus there are quite a few people who use this GGML repo directly, converting their models here and sharing them for downstream use on HF, various forums and over Discord servers! (You have no idea how popular GGML has become haha) There are already hundreds of existing quantized models out there. So, as you are the base repo, I can be reasonably confident that all other integrators will follow whatever versioning approach you take, compared to leaving it to the multiple individual downstream actors. @philpax any thoughts on this?
Plus one on this. It's basically impossible for end users to reliably differentiate the formats, and the only way we have been keeping it manageable on the user-support side is by supporting all of them, which this would allow.
Throwing my support behind this as a user, and as someone who expects to make the case for deploying the upcoming open-source models based on the llama architecture in a business environment: having some kind of versioning on the quantization style will be a big help for support, not just in the near future but long term as older branches fall behind. I know the quantization scripts are available and the base models can be returned to and re-quantized, but not everyone has the technical capacity for that. Also, of course, offering my gratitude for the development of llamacpp and the democratization of this technology that it is enabling. My "let's see what an AI co-writer that can't be taken from me looks like" project dropped its cost of entry by a thousand bucks thanks to CPU inference.
To get the ball rolling, here is my rough proposal (anyone feel free to chip in or modify!): change the file magic from […]. Currently, the existing users of the […]. Thoughts?
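For illustration, a minimal sketch of what the loading-side check could look like under a magic-change proposal like this; the magic values, constant names and version number below are hypothetical, not something agreed on in this thread:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical constants for illustration only -- the actual magic values
// and version numbers adopted by the project may differ.
#define EXAMPLE_FILE_MAGIC_OLD 0x67676d6c  // "ggml": existing, unversioned files
#define EXAMPLE_FILE_MAGIC_NEW 0x67676d66  // some new magic for versioned files
#define EXAMPLE_FILE_VERSION   1           // bumped on breaking quantization changes

// Returns true if the header identifies a file this build can load.
static bool example_check_header(FILE * fin) {
    uint32_t magic = 0;
    if (fread(&magic, sizeof(magic), 1, fin) != 1) {
        return false;
    }

    if (magic == EXAMPLE_FILE_MAGIC_OLD) {
        fprintf(stderr, "unversioned model file: quantized data may use the old bit layout\n");
        return false;
    }

    if (magic != EXAMPLE_FILE_MAGIC_NEW) {
        fprintf(stderr, "unknown file magic: %08x\n", magic);
        return false;
    }

    uint32_t version = 0;
    if (fread(&version, sizeof(version), 1, fin) != 1) {
        return false;
    }

    if (version != EXAMPLE_FILE_VERSION) {
        fprintf(stderr, "unsupported file version: %u (this build expects %u)\n",
                version, EXAMPLE_FILE_VERSION);
        return false;
    }

    return true;
}
```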
Here is another approach: […]
The benefit of this approach is that all existing F16 model files will remain compatible and we don't have to update the existing Python conversion scripts. This will simplify my work, as otherwise I would need to update the F16 […].
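As a rough sketch of how the multiplexing could work (the constant names follow the GGML_QNT_VERSION mentioned in the commit further down, but the concrete values and helper functions here are assumptions, not the final implementation):

```c
#include <stdint.h>

// Illustrative sketch: fold a quantization-format version into the existing
// ftype field so the file layout itself does not change.
#define GGML_QNT_VERSION        1     // bumped whenever the quantized data layout changes
#define GGML_QNT_VERSION_FACTOR 1000  // multiplier used to pack the version into ftype

// Writer side: pack the current quantization version into the stored ftype.
// Unversioned F16/F32 files keep their plain ftype (implicit version 0).
static uint32_t pack_ftype(uint32_t ftype) {
    return ftype + GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR;
}

// Reader side: split the stored value back into (qnt_version, ftype).
static void unpack_ftype(uint32_t stored, uint32_t * qnt_version, uint32_t * ftype) {
    *qnt_version = stored / GGML_QNT_VERSION_FACTOR;
    *ftype       = stored % GGML_QNT_VERSION_FACTOR;
}
```

The nice property is that an old F16 file decodes to version 0 and loads unchanged, while any freshly quantized file carries a non-zero version that loaders can inspect.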
That is pretty clever. Hooray for multiplexing!
That would work for us. If the bit shuffling sync changes are brought to this repo, can it be done in such a way that both quantization methods are available?
No, it would be very difficult to maintain so many SIMD routines. I will now proceed with implementing the proposed versioning and syncing the changes from llama.cpp.
I just added the GGML_QNT_VERSION […]. When I merge the […].
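For downstream integrators, a hedged sketch of what a corresponding load-time check might look like (helper names and values are illustrative only, building on the pack/unpack sketch above):

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

// Same illustrative constants as in the sketch above.
#define GGML_QNT_VERSION        1
#define GGML_QNT_VERSION_FACTOR 1000

// Hypothetical loader-side check: accept unversioned F16/F32 files (their
// tensor layout never changed), reject quantized files whose format version
// does not match the one this build was compiled against.
static bool example_check_qnt_version(uint32_t stored_ftype, bool is_quantized) {
    const uint32_t qnt_version = stored_ftype / GGML_QNT_VERSION_FACTOR;

    if (!is_quantized) {
        return true;
    }

    if (qnt_version != GGML_QNT_VERSION) {
        fprintf(stderr,
                "quantized model uses format version %u, this build expects %u; "
                "please re-quantize from the original model files\n",
                qnt_version, GGML_QNT_VERSION);
        return false;
    }

    return true;
}
```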
== Relevant log messages from source repo:

commit 601a033475645370483973817d987928ea95f36c
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Sun May 14 10:20:19 2023 +0300

    ggml : add GGML_QNT_VERSION to track quantization format changes

    ggerganov/ggml#150 (comment)
Hello, this is just a pre-emptive request: on the chance that the llamacpp bit shuffling changes are synced to this repo, would it be possible to add some indication to these models to differentiate them from the old ones already in existence? A new field, a magic change, a version indicator or something would be very useful. Perhaps #147 could be considered too?
The reason is that the models are otherwise indistinguishable (same file format and structure), so it will be hard to tell whether a model file has bit shuffling or not (old models will load perfectly fine but just generate gibberish).
If the bit shuffling changes are not planned to be upstreamed, then please disregard this issue.
Thanks in advance!