
[Request/Enhancement] 1-bit quants #5390

Closed
benxh1995 opened this issue Feb 7, 2024 · 13 comments
Labels
enhancement (New feature or request), stale

Comments

@benxh1995

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [ x ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [ x ] I carefully followed the README.md.
  • [ x ] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [ x ] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

I would like to request an enhancement of the quantization process to allow 1-bit quants. They don't have to be SOTA, just usable enough for users.

Motivation

The motivation for this request is to give users with 8 GB or 16 GB of RAM access to the higher end of models (with 1-bit quants, a ~70B model should approximately fit in 16 GB of RAM).
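(Rough arithmetic behind that estimate, assuming roughly 1.5-1.8 bits per weight and ignoring the KV cache and runtime overhead: 70e9 weights × 1.75 bits ÷ 8 ≈ 15.3 GB, so such a model would only just fit in 16 GB.)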

benxh1995 added the enhancement label on Feb 7, 2024
@JohannesGaessler
Collaborator

JohannesGaessler commented Feb 7, 2024

I'm not that knowledgeable when it comes to efficient quantization techniques (@ikawrakow is the expert for that), but I don't expect 1-bit quantization to be usable. Do you have any references for papers or code where someone has previously achieved usable 1-bit quantization?

@benxh1995
Author

I know of BitNet and QMoE:

https://arxiv.org/abs/2310.16795

https://arxiv.org/abs/2310.11453

And there is this approach which is similar to GPTQ: https://arxiv.org/abs/2310.00034

@benxh1995
Author

Highly relevant fresh paper describing binarization (1-bit quantization) SOTA: https://huggingface.co/papers/2402.04291

@ikawrakow
Contributor

@benxh1995 Have you ever interacted with a model that has a perplexity of 32? (That is the value for LLaMA-v2-7B from the SOTA paper you are quoting.) A different question: do you think that a 1-bit quantized LLaMA-v2-70B model with a perplexity of 8.4 will be competitive with a 4-bit quantized 7B model?

Don't get me wrong, the results of the paper are remarkable for 1-bit quantization, but that does not make them useful in practice. Btw., the current SOTA for 2-bit quantization has a perplexity of 3.94 for LLaMA-v2-70B. I guess putting that into the paper instead of the hopelessly outdated GPTQ 2-bit result would make the 1-bit results look much less impressive. In this repo you have a functioning 2-bit quantization with a LLaMA-v2-70B perplexity of 4.07. Have you tried it? If not, please do (you can download ready 2-bit quantized models from here). If you did, and you thought that it was not adequate, you can be assured that you will like 1-bit models even less.

@benxh1995
Author

@ikawrakow Yes sir, I regularly use the Yi iq2_xxs quants, as well as the Mixtral quants. I follow your work quite closely. Props to you for achieving what is pretty much SOTA. My motivation for this request was anything more that could be squeezed out, even at higher perplexity: just as iq2_xxs is around 2.03(?) bpw, could an iq1_s land around 1.5-1.7 bpw, and would that be feasible?

I'm sorry for my ignorance. I'm just excited about the technology and about squeezing as much as possible out of constrained memory setups.

@ghchris2021

There are these now:

BiLLM:
...In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. ...

https://huggingface.co/papers/2402.04291

https://arxiv.org/abs/2402.04291

https://github.com/Aaronhuang-778/BiLLM

AQLM:

... In this paper, we revisit the problem of "extreme" LLM compression--defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter, from the point of view of classic methods in Multi-Codebook Quantization (MCQ)....

https://github.com/Vahe1994/AQLM

https://arxiv.org/abs/2401.06118

https://huggingface.co/BlackSamorez

@nelsonhurstdev

Interesting. This may allow 120B models to run at decent speeds on consumer GPUs. This quant could be used to target 70B+ models. I feel like anything smaller may be useless.

@ghchris2021

Interesting. This may allow 120B models to run at decent speeds on consumer GPUs. This quant could be used to target 70B+ models. I feel like anything smaller may be useless.

Yes, that's what I've been thinking.
For 34B models in the near future, I can envision calling it good enough to use Q5-Q8 quants with multi-GPU to get acceptable quality / performance.
But I'd really like to run things in the 70-120B range locally this year or next, with good quality and useful performance, without assuming I'll have more than 24-32 GB of VRAM in a box, and ideally less than that if feasible.
So 1-4 bits/parameter memory range is very interesting indeed.

I don't know enough about the research and its history to know whether this has been done or is reasonable, but I find it intuitively odd to train at high bits per parameter and then jump through hoops to quantize down to 1-4 bits per parameter, without really knowing what was lost in the process, versus designing the model to train at 1-4 bit depths from the start and letting the training process "optimally set" each low-resolution weight.

But I can see why researchers with access to vast SOTA GPU training / inference farms could hardly care less about the VRAM constraints of end users running inference, when they just want to achieve SOTA maximum-quality results to publish.

@mechanicmuthu

Just guessing out loud. On 1 bpw do we reach a point where bitwise operators can come into play to speed up the low level computation?

@ikawrakow
Contributor

Just guessing out loud. On 1 bpw do we reach a point where bitwise operators can come into play to speed up the low level computation?

Not for this variant. The quants take values of -1, 0, 1. If we one day arrive at the point where we can separate salient from non-salient weights, then one could hope to use binary quants for the non-salient weights. This is what BiLLM does. But then again, looking at the massive difference in quantization error between this PR and BiLLM, that may turn out not to be valuable.

But given that you are bringing this up, are you dissatisfied with the performance? I get 212 t/s for TG-128 of a 7B model on my GPU (RTX 4080), which is ~60% higher than Q4_0. My best guess is that at this matrix multiplication speed, around 40% of the time goes into thread synchronization and other kernels that are independent of the quantization used.
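For readers wondering what bitwise tricks could and could not buy here, below is a small illustrative C++ sketch (not llama.cpp code; the function names and bit packing are hypothetical, and real activations are not binarized like this). It shows how a dot product over binary {-1, +1} values collapses to XNOR + popcount, and why the {-1, 0, +1} ternary values mentioned above need an extra bitmask (or, equivalently, 2 bits per weight).

```cpp
// Illustrative sketch only -- not llama.cpp code. Function names and the bit
// packing are hypothetical; the point is just to show which part of the trick
// needs what.
#include <bit>       // std::popcount (C++20)
#include <cstdint>
#include <cstdio>

// Binary values in {-1, +1}, encoded as one bit each (0 -> -1, 1 -> +1).
// Over 32 values the dot product collapses to XNOR + popcount:
// dot = (#matching bits) - (#mismatching bits) = 2*popcount(~(a ^ w)) - 32.
static int binary_dot32(uint32_t a_bits, uint32_t w_bits) {
    return 2 * std::popcount(static_cast<uint32_t>(~(a_bits ^ w_bits))) - 32;
}

// Ternary weights in {-1, 0, +1} need a second bit (here: a "nonzero" mask),
// so a single XNOR no longer does the whole job.
static int ternary_dot32(uint32_t a_bits, uint32_t w_sign, uint32_t w_nonzero) {
    const uint32_t agree  = static_cast<uint32_t>(~(a_bits ^ w_sign)) & w_nonzero; // +1 terms
    const uint32_t differ = (a_bits ^ w_sign) & w_nonzero;                         // -1 terms
    return std::popcount(agree) - std::popcount(differ);
}

int main() {
    // All 32 binary weights match the activations -> dot product of +32.
    printf("%d\n", binary_dot32(0xFFFFFFFFu, 0xFFFFFFFFu));
    // 16 of the ternary weights are zero, the rest match -> dot product of +16.
    printf("%d\n", ternary_dot32(0xFFFFFFFFu, 0xFFFFFFFFu, 0x0000FFFFu));
    return 0;
}
```

Even where such a kernel were possible, the points above and below about synchronization overhead and memory traffic still limit how much it could help at batch size 1.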

@JohannesGaessler
Collaborator

Just guessing out loud. On 1 bpw do we reach a point where bitwise operators can come into play to speed up the low level computation?

I didn't test the performance of the new quantization myself, but generally speaking the improvements from more efficient compute at low batch sizes are relatively small. I would expect bitwise operations to only make a large difference if there were custom matrix multiplication kernels for large batch sizes (like mul_mat_q) that are compute bound rather than I/O bound.
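(To make the I/O-bound point concrete with illustrative, assumed numbers: at batch size 1 every weight is read from VRAM once per generated token, so a 7B model at ~1.5 bpw is about 7e9 × 1.5 ÷ 8 ≈ 1.3 GB of weights; at ~700 GB/s of memory bandwidth that alone caps generation at roughly 500 t/s no matter how cheap the arithmetic is, which is why cheaper bitwise math would mainly pay off in compute-bound, large-batch kernels.)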

Contributor

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label on Mar 18, 2024
Contributor

github-actions bot commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 2, 2024