- I could help to sync up the model implementations, but I don't know about the GGML stuff. Your plan sounds reasonable, but the faster we can move towards a new format/backend, the better.
- What do you mean by this?
- G'day, everyone!
In the last week, we've seen two quantization format changes. Our strategy was to bridge them by dequantizing old models and requantizing in the new format at launch, but I didn't implement it in time, and there are a couple of minor technical issues remaining. At the time of writing, we do not yet support version 3/qnt2 (#252).
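For context, a dequantize-then-requantize bridge could look something like the minimal sketch below. Everything here is hypothetical: `Tensor`, `FormatVersion`, `dequantize_v2`, and `quantize_v3` are invented names for illustration, not existing APIs in this codebase or in `ggml-sys`.

```rust
/// Raw tensor bytes as read from a model file. Hypothetical type for this
/// sketch; the real loader has its own representation.
#[derive(Clone)]
struct Tensor {
    name: String,
    data: Vec<u8>,
}

/// Quantization format versions we know about.
#[derive(Clone, Copy, PartialEq)]
enum FormatVersion {
    V2,
    V3,
}

// Stub codecs: real implementations would call into the appropriate ggml
// bindings for each format version.
fn dequantize_v2(_bytes: &[u8]) -> Result<Vec<f32>, String> {
    Err("not implemented in this sketch".to_string())
}
fn quantize_v3(_floats: &[f32]) -> Result<Vec<u8>, String> {
    Err("not implemented in this sketch".to_string())
}

/// Bridge a tensor from an older quantization version to the current one by
/// round-tripping through f32. Lossy on top of lossy, but it keeps old model
/// files loadable after a format change.
fn bridge_tensor(tensor: &Tensor, from: FormatVersion) -> Result<Tensor, String> {
    if from == FormatVersion::V3 {
        // Already in the current format; nothing to do.
        return Ok(tensor.clone());
    }

    let floats = dequantize_v2(&tensor.data)?;
    let data = quantize_v3(&floats)?;

    Ok(Tensor {
        name: tensor.name.clone(),
        data,
    })
}
```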
Going forward, we'll need a better way to handle this. I've been thinking about potential solutions, and this is what I'm considering:

- Abstract over the `ggml` backend so that we can support others (backend #31 / Build and execute our own computation graph #137). This would mean that there would be a mapping between the N backends and M format versions, with each backend supporting different formats. We would likely select the best backend for the user by default, with an option for the user to override the backend.
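To make the backend/format mapping concrete, here's a minimal sketch of what the selection logic could look like. Everything in it (`Backend`, `FormatVersion`, `select_backend`) is invented for illustration and is not a proposal for the actual trait design.

```rust
/// Format versions a backend might understand. Hypothetical for this sketch.
#[derive(Clone, Copy, PartialEq, Debug)]
enum FormatVersion {
    V1,
    V2,
    V3,
}

/// A compute backend (ggml, a future Rust-native graph executor, etc.).
trait Backend {
    fn name(&self) -> &'static str;
    /// The format versions this backend can load and run.
    fn supported_formats(&self) -> &[FormatVersion];
}

/// Pick a backend for a given model format. Backends are assumed to be
/// registered in preference order, so the first supporting backend is the
/// "best" default; an explicit user override takes priority when it matches.
fn select_backend<'a>(
    backends: &'a [Box<dyn Backend>],
    format: FormatVersion,
    user_override: Option<&str>,
) -> Option<&'a dyn Backend> {
    backends
        .iter()
        .map(|b| b.as_ref())
        .filter(|b| b.supported_formats().contains(&format))
        .find(|b| user_override.map_or(true, |name| b.name() == name))
}
```

The default-with-override behaviour falls out of the ordering: the filter keeps only backends that can run the format, and the find either honours the override or takes the first (most preferred) match.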
In terms of how we'll actually do this:

- Pin `ggml`, then introduce new versions of `ggml-sys` that provide defined last-known-good versions of each quantization format (see the sketch after this list).
- Add support for `.pt`, ONNX, and whatever else the ML world may throw at us.

Does this sound like a reasonable plan of attack? What should we watch out for? Is there anything you'd like to see prioritised?
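To illustrate the pinned-`ggml-sys` item: the idea is roughly one set of bindings per last-known-good upstream revision, so older codecs remain callable after upstream moves on. The module names, function names, and signatures below are all invented for this sketch; the real ggml quantization functions are C FFI with different shapes.

```rust
// Hypothetical layout: one module (or crate version) of bindings per pinned
// ggml revision, each exposing that era's quantization codecs.

mod ggml_sys_v2 {
    /// Codec as it existed at the revision that defined format version 2.
    pub fn dequantize_q4_0(_src: &[u8], dst: &mut [f32]) {
        // Real code would FFI into the pinned v2 ggml; stubbed out here.
        dst.fill(0.0);
    }
}

mod ggml_sys_v3 {
    /// Codec as it existed at the revision that defined format version 3.
    pub fn quantize_q4_0(_src: &[f32], dst: &mut [u8]) {
        // Real code would FFI into the pinned v3 ggml; stubbed out here.
        dst.fill(0);
    }
}

/// The load-time bridge described earlier would then pair an old decoder
/// with the current encoder: v2 bytes -> f32 -> v3 bytes.
fn upgrade_q4_0(old_bytes: &[u8], n_elements: usize, new_size: usize) -> Vec<u8> {
    let mut floats = vec![0.0f32; n_elements];
    ggml_sys_v2::dequantize_q4_0(old_bytes, &mut floats);

    let mut new_bytes = vec![0u8; new_size];
    ggml_sys_v3::quantize_q4_0(&floats, &mut new_bytes);
    new_bytes
}
```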
Let us know!