[Request/Enhancement] 1-bit quants #5390
Comments
I'm not that knowledgeable when it comes to efficient quantization techniques (@ikawrakow is the expert for that) but I don't expect 1-bit quantization to be usable. Do you have any references for papers or code where someone has previously achieved usable 1-bit quantization?
I know of BitNet (https://arxiv.org/abs/2310.11453) and QMoE (https://arxiv.org/abs/2310.16795). And there is this approach which is similar to GPTQ: https://arxiv.org/abs/2310.00034
Highly relevant fresh paper describing the binarization (1-bit quant) SOTA: https://huggingface.co/papers/2402.04291
@benxh1995 Have you ever interacted with a model that has a perplexity of 32? (value for LLaMA-v2-7B from the SOTA paper you are quoting). A different question: do you think that the 1-bit quantized LLaMA-v2-70B model with a perplexity of 8.4 will be competitive with a 4-bit quantized 7B model? Don't get me wrong, the results of the paper are remarkable for 1-bit quantization, but that does not make them useful in practice. Btw., the current SOTA for 2-bit quantization has a perplexity of 3.94 for LLaMA-v2-70B. I guess putting that into the paper instead of the hopelessly outdated GPTQ 2-bit result would make the 1-bit result look much less impressive. In this repo you have a functioning 2-bit quantization with a LLaMA-v2-70B perplexity of 4.07. Have you tried it? If not, please do (you can download ready 2-bit quantized models from here). If you did, and you thought that it was not adequate, you can be assured that you will like 1-bit models even less.
@ikawrakow Yes sir, I regularly use the Yi iq2_xxs quants, as well as the Mixtral quants. I am following your work quite often. Props to you for achieving what is pretty much SOTA. My motivation in this request was for anything more that could be squeezed out, even at higher perplexity: just like iq2_xxs is around 2.03(?) bpw, could an iq1_s be around 1.5/1.7 bpw, and would that be feasible? I'm sorry for my ignorance. I'm just excited about the technology and about squeezing as much as possible out of constrained memory setups.
There are these now:
BiLLM: https://huggingface.co/papers/2402.04291 https://arxiv.org/abs/2402.04291 https://github.com/Aaronhuang-778/BiLLM
AQLM: "... In this paper, we revisit the problem of 'extreme' LLM compression--defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter, from the point of view of classic methods in Multi-Codebook Quantization (MCQ). ..." https://github.com/Vahe1994/AQLM
Interesting. This may allow 120B models to run at decent speeds on consumer GPUs. This quant could be used to target 70B+ models. I feel like anything smaller may be useless.
Yes, that's what I've been thinking. I don't know enough about the research and its history to tell whether this has been done or is reasonable, but I find it intuitively odd to train at high bits per parameter and then jump through hoops to quantize down to 1-4 bits per parameter, without really knowing what has been lost in the process, instead of designing the model to train at 1-4 bits per parameter and letting the training process "optimally" set each low-resolution weight. But I can see why researchers with access to vast SOTA GPU training/inference farms could hardly care less about the VRAM problems of end users running inference, when they just want to achieve maximum-quality SOTA results to publish.
Just guessing out loud: at 1 bpw, do we reach a point where bitwise operators can come into play to speed up the low-level computation?
Not for this variant. Quants take values of -1, 0, 1. If we one day arrive at the point where we can separate salient from non-salient weights, then one would hope to be able to use binary quants for the non-salient part. This is what BiLLM does. But then again, looking at the massive difference in quantization error between this PR and BiLLM, this may turn out not to be valuable. But given that you are bringing this up, are you dissatisfied with the performance? I get 212 t/s for TG-128 of a 7B model on my GPU (RTX-4080), which is ~60% higher than
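(To make the distinction concrete, here is a minimal sketch, not llama.cpp code: a dot product over pure binary {-1, +1} weights reduces to XNOR plus popcount, which is where bitwise operators would pay off, whereas the ternary {-1, 0, 1} values used here have no such single-operation mapping because of the zero state.)

```python
# Minimal illustration (not llama.cpp code) of why {-1, +1} weights admit a
# purely bitwise dot product while ternary {-1, 0, 1} quants do not.

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n vectors with entries in {-1, +1},
    packed LSB-first into integers (bit = 1 means +1, bit = 0 means -1)."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: positions where the signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # matches * (+1) + (n - matches) * (-1)

# a = [+1, -1, +1, +1], b = [+1, +1, -1, +1]  ->  1 - 1 - 1 + 1 = 0
a = 0b1101  # LSB-first packing of a
b = 0b1011  # LSB-first packing of b
print(binary_dot(a, b, 4))  # 0
```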
I didn't test the performance of the new quantization myself but generally speaking the improvements from more efficient compute at low batch sizes are relatively small. I expect bitwise operations to only make a large difference if there were custom matrix multiplication kernels for large batch sizes (like mul_mat_q) that are compute bound rather than I/O bound. |
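(A rough, assumption-laden sketch of the I/O-bound argument: at batch size 1 every weight has to be streamed from memory once per generated token, so the token rate is capped by bandwidth divided by model size, regardless of how cheap the arithmetic is. The bandwidth and bits-per-weight numbers below are illustrative assumptions, not figures from this thread.)

```python
# Rough roofline-style sketch of why single-token generation is memory-bound.
# Assumptions (not measured values): ~717 GB/s memory bandwidth for an
# RTX 4080, a 7B model at roughly 2 bpw, and every weight read once per
# generated token with no other traffic.
bandwidth_gb_s = 717   # assumed GPU memory bandwidth
n_params = 7e9         # 7B parameters
bpw = 2.0              # assumed effective bits per weight
weights_gb = n_params * bpw / 8 / 1e9
ceiling_tps = bandwidth_gb_s / weights_gb
print(f"weights ≈ {weights_gb:.2f} GB, bandwidth ceiling ≈ {ceiling_tps:.0f} t/s")
# Cheaper arithmetic alone cannot push token generation past this bandwidth
# ceiling; only larger batches shift the bottleneck back toward compute.
```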
This issue is stale because it has been open for 30 days with no activity. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
I would like to request an enhancement of the quantization process to allow 1-bit quants. They don't have to be SOTA, just usable enough for users.
Motivation
The motivation for this request is to allow users with 8GB RAM and 16GB RAM access to the higher end of models (with 1-bit quants, ~70B should approximately fit in 16GB RAM).
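(As a rough sanity check on the memory claim, a minimal back-of-envelope sketch, ignoring block scales, KV cache, and runtime overhead:)

```python
# Back-of-envelope weight-memory estimate (ignores quantization block scales,
# KV cache, context buffers, and runtime overhead).
def weights_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1024**3

for bpw in (1.0, 1.5, 2.0):
    print(f"70B at {bpw} bpw ≈ {weights_gib(70e9, bpw):.1f} GiB")
# ≈ 8.1 GiB at 1.0 bpw, ≈ 12.2 GiB at 1.5 bpw, ≈ 16.3 GiB at 2.0 bpw
```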