GGUF compatible quantization (2, 3, 4 bit / any bit) #285
Conversation
I must be missing something, but what exactly does "AWQ 3-bit + GGUF Q2_K" mean? What is the exact pipeline?
It means you first apply scales and clipping from AWQ based on 3-bit calculations. Weights are kept in FP16. Then you quantize to the specified GGUF format.
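For concreteness, a minimal sketch of that pipeline, assuming the `export_compatible` option from this integration and the standard llama.cpp conversion scripts (model path, output directory, and the exact llama.cpp commands/flags are illustrative):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"
scaled_path = "mistral-7b-awq-scaled-fp16"   # illustrative output directory

# w_bit only controls the bit-width used when *searching* for scales/clipping;
# with export_compatible=True the saved weights stay in FP16.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 3, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Apply AWQ scales and clipping, but skip the INT packing step.
model.quantize(tokenizer, quant_config=quant_config, export_compatible=True)
model.save_quantized(scaled_path)

# Then hand the FP16 checkpoint to llama.cpp (commands illustrative):
#   python convert.py mistral-7b-awq-scaled-fp16 --outtype f16 --outfile scaled-f16.gguf
#   ./quantize scaled-f16.gguf scaled-Q2_K.gguf Q2_K
```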
OK, I get it, so the only impact is here: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/quantizer.py#L48
So you could use AWQ to apply scaling in a way that fits exactly how GGUF works, for full compatibility. Am I understanding correctly?
This does not quantize the weights with AWQ; it only uses the scaling of weights and keeps them in FP16. Since it's just FP16 weights, we can then apply GGUF quantization.
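To illustrate what the linked line controls: during the scale/clip search, AWQ only simulates quantization at `w_bit` (quantize, then immediately dequantize) to measure error, so what gets saved is still FP16. A rough sketch of that group-wise pseudo-quantization, not the exact AutoAWQ code:

```python
import torch

def pseudo_quantize(w: torch.Tensor, w_bit: int = 3, group_size: int = 128) -> torch.Tensor:
    """Quantize-dequantize w in groups; used only to measure quantization error.
    Assumes w.numel() is divisible by group_size."""
    shape = w.shape
    w = w.reshape(-1, group_size)
    max_val = w.amax(dim=1, keepdim=True)
    min_val = w.amin(dim=1, keepdim=True)
    max_int = 2 ** w_bit - 1
    scales = (max_val - min_val).clamp(min=1e-5) / max_int
    zeros = (-torch.round(min_val / scales)).clamp(0, max_int)
    q = torch.clamp(torch.round(w / scales) + zeros, 0, max_int)
    # Dequantize straight away: the returned tensor is floating point again.
    return ((q - zeros) * scales).reshape(shape)
```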
AWQ 2-bit + GGUF Q2_K (6.7290 +/- 0.04008) has higher perplexity than GGUF Q2_K (6.1640 +/- 0.03474).
We are not actually doing AWQ quantization. Like I referenced earlier, we only scale the weights, which is different from quantizing. The model's weights are adjusted according to the scales but not quantized, which is a separate process that we let llama.cpp run. This means the BPW and file size are the same as if you were to just use GGUF.
If AutoAWQ here is only used for applying scales, what's the benefit of using lower-bit AWQ if the final file size depends solely on GGUF quantization? Isn't it better to just use 8-bit AWQ for the sake of better scaling factors? Please help elaborate.
No, and here is why. Quantization is just about packing weights into INT4; nothing special happens during that process that is related to minimizing the impact of quantization. In other words, for the FP16 -> INT4 conversion to have the least quantization error, we must first compute optimal scales and then apply them before converting to a quantized model.
The practical step of quantizing to a specific format is handled by llama.cpp.
The main benefit is a higher-quality model, as can be observed in most cases from the perplexity numbers, except for Mixtral, for which I am working on better quantization.
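To make the "apply optimal scales first, quantize later" point concrete, here is a tiny illustrative sketch (not AutoAWQ code): scaling weight columns up and the matching activations down is an exact identity in FP16, so the checkpoint handed to llama.cpp computes the same function, just with a weight distribution that loses less when rounded to low-bit integers.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 128)             # activations entering a linear layer
w = torch.randn(256, 128) * 0.02    # FP16-style weight [out_features, in_features]
s = torch.rand(128) * 1.5 + 0.5     # per-input-channel scales (what AWQ searches for)

# Exact identity in full precision: the scaled checkpoint is the same model.
y_ref = x @ w.T
y_scaled = (x / s) @ (w * s).T
assert torch.allclose(y_ref, y_scaled, atol=1e-4)

# AWQ's job is to pick s so that rounding (w * s) to a low-bit format loses as
# little accuracy as possible; the actual low-bit packing is left to llama.cpp.
```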
Thank you for the elaboration. I am curious: if AutoAWQ here is only used for calculating scales, what's the deal with mixing different bit widths between AWQ and GGUF? My understanding is that if the scales are calculated using 3-bit, for example, the GGUF quantization target should also be 3-bit to maintain consistency. Your experiment data, however, shows AWQ 4-bit + GGUF Q3_K_M > AWQ 3-bit + GGUF Q3_K_M. Is it because 3-bit AWQ is in general inaccurate/broken?
The reason is that Q3_K_M is a mixed-bit quantization that GGUF applies. That means the Q3_K_M format is not just INT3; it also has INT4 weights. We observe that INT4 is more effective for scaling in this case, likely because scaling with INT3 makes the quantization error much larger when you apply INT3 scales to INT4 weights. That is likely why we see that AWQ 4-bit works better for the Q3_K_M format. A future optimization in AutoAWQ could include the ability to do mixed-bit scaling. This could likely even improve AWQ quantization if applied thoughtfully, i.e. maybe some losses are higher than others and you could adjust the w_bit and retry to find a better scale.
Some results from a Qwen 14B model. Looking forward to future mixed-bit scaling for further improvement.
@sorasoras Thanks for these numbers! These look particularly good to me. Great improvements; Q2 in particular is a large improvement. I just outlined the combinations below, and they look good!
Q2:
Q3:
Q4:
Sidenote: ggml-org/llama.cpp#4773 (comment). Perhaps AWQ could do optimization for these new quants? I am not so sure though.
I checked their reference code for their new SOTA KNN/QuIP method. Many elements are similar to AWQ, but there are many unique aspects of this new method that are directly taken from QuIP#. You could certainly try to implement the unique aspects of QuIP# into AutoAWQ, like the importance matrix and the modification for the E8 lattice search. However, I don't think it is feasible for me to do these things alone, as AutoAWQ is already a large task to maintain mostly by myself.
https://huggingface.co/ikawrakow/mistral-7b-quantized-gguf/blob/main/README.md has Mistral-7B quants in GGUF format where the perplexity seems lower throughout than what I see for AWQ in the above table. For convenience, here is a copy of the table that you will find there:
@ikawrakow These are certainly large improvements. I will need to implement the llama.cpp perplexity computation like I did for AutoGPTQ to see whether this beats what I get in AWQ if I pack and run native inference with 4-bit. Do you know why your numbers from the main branch are different from my results? E.g. for Q4_K_M yours is 5.7539 and mine is 5.7518.
Isn't a difference of 0.0021 just noise?
Yes, I remember we talked about this issue half a year ago regarding the PPL calculation methods. It should be aligned.
Yeah, llama.cpp should probably update their computation to match the original perplexity, but for now, I implemented it in AutoGPTQ and can just import that code into AutoAWQ. Either way, another good measure is the quantization error in %, like what was provided above. The relative % is easier to read and understand anyway. https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/utils/perplexity_utils.py
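For reference, a bare-bones fixed-window perplexity over wikitext-2 looks roughly like the sketch below (along the lines of that utility, not the exact AutoGPTQ/AutoAWQ code; the model name and context length are placeholders):

```python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"   # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Tokenize the full wikitext-2 test split as one long stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

ctx, nll, n_tokens = 2048, 0.0, 0
with torch.no_grad():
    for start in range(0, ids.shape[1] - 1, ctx):
        chunk = ids[:, start : start + ctx + 1]          # inputs plus next-token targets
        logits = model(chunk[:, :-1]).logits
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.shape[-1]),
            chunk[:, 1:].reshape(-1),
            reduction="sum",
        )
        nll += loss.item()
        n_tokens += chunk.shape[1] - 1

print("perplexity:", math.exp(nll / n_tokens))
```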
Is there a good way to measure chat/finetuned models? Perplexity doesn't seem to make sense for finetuned models.
There is no single golden measurement we can use. Even perplexity has its faults, but it's good enough as a proxy for quantization error. I always look at MMLU values for instruct/chat, but that takes a long time to evaluate. You can also evaluate perplexity, but you would need special handling and a special dataset for this.
I think using MT-bench is better. MT-bench should be able to load AWQ models with some mild changes, but I'm not sure how to load GGUF models with their framework.
MT-bench uses GPT-4 to judge the model. I'm not a particular fan of this, for many reasons; it can be very misleading.
Looks on par with or better than AWQ. Are you ready to make your private repo publicly available? Are these changes in ggml-org/llama.cpp#4773, or is it just built on top of the master branch?
My repo, where I play with various quantization approaches (but also semi-regularly update with mainline
It would be nice if I could play around with the new SOTA 2-bit quants for other models.
Is the "q_group_size" in If so, should "q_group_size" be set to 16 when using Q2_K where each block in Q2_K has 16 weights? |
AWQ has only ever been able to run 4-bit quantization. However, with this integration, we can run any-bit quantization and export to llama.cpp for inference. This results in lower perplexity while ensuring compatibility with the GGUF ecosystem. The difference between GGUF and AWQ is most pronounced on the q_0 and q_1 models, but I include most perplexity numbers for the K method from llama.cpp since it reaches the lowest perplexity.
Perplexity
Perplexity measured with:
./perplexity -m <gguf_model> -f wikitext-2-raw/wiki.test.raw -ngl 33
Base Model: Mistral 7B (mistralai/Mistral-7B-v0.1)
FP16: 5.6934
Mixture of Experts Model: Mixtral 8x7B (mistralai/Mixtral-8x7B-v0.1)
Chat Model: Llama 2 7B Chat (TheBloke/Llama-2-7B-Chat-fp16)