Minimize Quantization Output Error with Bias Compensation

This repo is the official implementation of "Minimize Quantization Output Error with Bias Compensation" (CAAI AIR 2024).

Autthors: Cheng Gong, Haoshuai Zheng, Mengting Hu, Zheng Lin, Deng-Ping Fan, Yuzhi Zhang, Tao Li

[arXiv][Code]

Description

Bias Compensation (BC) is novel method aiming to minimize the output error caused by quantization, thus realizing ultra-low-precision quantization without model fine-tuning. Instead of optimizing the non-convex quantization process as in most previous methods, the proposed BC bypasses the step to directly minimize the quantizing output error by identifying a bias vector for compensation. We have established that the minimization of output error through BC is a convex problem and provides an efficient strategy to procure optimal solutions associated with minimal output error, without the need for training or fine-tuning. We conduct extensive experiments on Vision Transformer models and Large Language Models, and the results show that our method notably reduces quantization output error, thereby permitting ultra-low-precision post-training quantization and enhancing the task performance of models. Especially, BC improves the accuracy of ViT-B with 4-bit PTQ4ViT by 36.89% on the ImageNet-1k task, and decreases the perplexity of OPT-350M with 3-bit GPTQ by 5.97 on WikiText2.

Figure 1 shows the comparison between the previous PTQ methods and our proposed bias compensation. (A) indicates local quantizer optimization methods. (B) and (C) are layer-wise quantizer and parameter optimization methods, respectively. Previous methods optimize the quantizer parameter or layer-wise weights to minimize the quantization loss or output error, which is non-convex and difficult to solve. Our method shown in (D) directly minimizes the output error by solving the best bias vector, which is convex and guarantees minimal output error.

Figure 2 shows the detailed application of bias compensation for a quantized layer or quantized block. We use absolute error as output error in this figure for easy understanding. Applying bias compensation after quantization can significantly reduce the output error without increasing additional computational complexity.

In summary, BC is a valuable approach as it provides another method to improve the task performance of quantized models other than optimizing quantizers. BC is both simple (without increasing time and space complexity) and effective (outperforming the latest methods, such as GPTQ and AWQ) while being decoupled with quantizers (can be combined with different quantizers). It differs from previous PTQ and QAT methods, as it is the first approach to be directly applied to correct the output error after the quantization layer.

Installation

You can clone this project, then enter this bias-compensation folder, and install this project as shown below.

$ git clone https://github.com/GongCheng1919/bias-compensation.git
$ cd bias-compensation
$ python setup.py install

Usage

You can utilize bias compensation as a torch nn module and incorporate it at any position within a neural network architecture. The pseudocode example below demonstrates its usage:

from bias_compensation.quantizers.BiasCompensation import BiasCompensation
from bias_compensation.quantize.quantize import quantize_module_act
layer = Float_Layer
## Your quantiziation code should be here ##
q_layer = Quantized_Layer
channel_size = Bias_Vector_Size
channel_axis = Bias_Vector_Axis
bc = BiasCompensation(channel_size=channel_size, channel_axis=channel_axis)
## Apply this quantizer to the layer output ##
quantize_module_act(layer,bc,act_id=0,pre=False)
## Conduct your Calibra_Data ##
calibra_data = Calibra_Data
float_output = layer(calibra_data)
quantized_output = q_layer(calibra_data)
## Optimize the bias vector ##
bc.update(float_output,quantized_output)

For a detailed usage example, please refer to this file.

Experimental Results

Results on ViTs: We take the PTQ4ViT as the base quantizer for BC and the quantizers including Base PTQ, EasyQuant, APQ-ViT, NoisyQuant, and RepQ-ViT as baselines for comparison. The experimental results of the baselines are cited from their original paper, and the results of PTQ4ViT are obtained based on the open implementation. The results are shown below (The default input resolution is 224 × 224, and * means 384 × 384).

Model	W/A	ViT-S	ViT-B	ViT-B*	Deit-S	Deit-B	Deit-B*	Swin-T	Swin-S	Swin-B	Swin-B*
# Parameter(M)	-	22.1	86.6	86.9	22.4	86.6	86.9	28.3	49.9	88.1	90.8
FP32	32/32	81.39	84.54	86.00	79.85	81.80	83.11	81.39	83.23	85.27	86.44
Base PTQ	8/8	80.46	83.89	85.35	77.65	80.94	82.33	80.96	82.75	84.79	86.16
EasyQuant	8/8	80.75	83.89	85.53	78.98	81.19	82.10	80.95	83.00	85.10	86.39
PTQ4ViT	8/8	81.00	84.25	85.82	79.47	81.48	82.97	81.24	83.10	85.14	86.39
APQ-ViT	8/8	81.25	84.26	-	79.78	81.72	-	-	83.16	85.16	86.40
NoisyQuant	8/8	81.15	84.22	85.86	79.51	81.45	82.49	81.25	83.13	85.20	86.44
PTQ4ViT+BC	8/8	81.15	84.33	85.96	79.58	81.76	83.12	81.34	83.18	85.21	86.44
Base PTQ	6/6	70.24	75.66	46.88	72.26	78.78	68.44	78.45	81.74	83.35	85.22
EasyQuant	6/6	75.13	81.42	82.02	75.27	79.47	81.26	79.51	82.45	84.30	85.89
PTQ4ViT	6/6	78.63	81.65	83.34	76.28	80.25	81.55	80.47	82.38	84.01	85.38
APQ-ViT	6/6	79.10	82.21	-	77.76	80.42	-	-	82.67	84.18	85.60
NoisyQuant	6/6	78.65	82.32	83.22	77.43	80.70	81.65	80.51	82.86	84.68	86.03
RepQ-ViT	6/6	80.43	83.62	-	78.90	81.27	-	-	82.79	84.57	-
PTQ4ViT+BC	6/6	79.22	83.00	85.00	78.60	81.29	82.44	80.56	82.46	84.28	85.60
PTQ4ViT	4/4	34.02	35.30	31.40	24.05	60.72	75.93	74.46	76.46	74.49	77.26
APQ-ViT	4/4	47.95	41.41	-	43.55	67.48	-	-	77.15	76.48	80.84
RepQ-ViT	4/4	65.05	68.48	-	69.03	75.61	-	-	79.45	78.32	-
PTQ4ViT+BC	4/4	54.74	67.27	68.29	67.47	75.78	80.15	74.74	76.79	77.78	79.97

You can use following commend to obtain these results.

$ bash ./script/vits/run-vit-deit-swin.sh

Results on LLMs: We take the naive Rounding-To-Nearest (NTR) (implemented in GPTQ), GPTQ and AWQ as baselines in this experiment. We use the open implementation of GPTQ and pre-trained quantized LLM models in HuggingFace to obtain reproducible results.

Method	Bits	Wiki. (OPT-125M)	PTB (OPT-125M)	C4 (OPT-125M)	Wiki. (OPT-350M)	PTB (OPT-350M)	C4 (OPT-350M)	Wiki. (BLOOM-560M)	PTB (BLOOM-560M)	C4 (BLOOM-560M)
FP16	16	27.66	32.55	24.61	22.00	26.08	20.71	22.42	41.26	24.38
RTN	4	37.28	45.11	31.64	25.94	31.12	23.94	25.89	48.57	27.42
GPTQ	4	31.22	36.93	26.94	24.20	28.89	22.60	23.98	44.53	25.60
GPTQ+BC	4	29.90	36.11	26.03	22.85	27.23	21.34	23.77	44.44	25.46
RTN	3	1276.92	1209.34	731.60	64.56	81.85	50.14	56.98	117.15	58.96
GPTQ	3	54.68	65.74	38.27	33.75	38.66	28.53	32.45	62.66	32.25
GPTQ+BC	3	45.19	52.15	33.78	27.78	33.79	24.79	31.49	60.36	31.35
RTN	2g64	7042.44	5057.60	3869.38	4354.61	3560.32	2346.04	502.39	627.17	326.20
GPTQ	2g64	192.96	200.53	114.07	519.97	506.01	231.55	74.06	182.64	59.23
GPTQ+BC	2g64	108.98	137.12	70.11	88.09	95.57	56.17	69.12	178.73	56.43

You can use following commend to obtain these results.

$ bash ./script/llms/run-opt-bloom.sh

Besides, we also conduct a comparison of GPTQ, AWQ and GPTQ+BC below.

Method	Bits	Wiki. (OPT-125M)	PTQ (OPT-125M)	C4 (OPT-125M)	Wiki. (BLOON-560M)	PTQ (BLOON-560M)	C4 (BLOON-560M)
FP16	16	27.66	32.55	24.61	22.42	41.26	24.38
RTN	4	37.28	45.11	31.64	25.89	48.57	27.42
GPTQ	4	31.22	36.93	26.94	23.98	44.53	25.60
GPTQ+BC	4	29.90	36.11	26.03	23.77	44.44	25.46
GPTQ	4g256	30.16	35.78	26.38	23.39	42.99	25.08
AWQ	4g256	30.38	35.23	26.35	28.45	52.93	34.96
GPTQ+BC	4g256	28.99	33.96	25.40	23.24	42.68	24.97
GPTQ	4g128	29.79	35.25	25.95	23.23	42.72	24.93
AWQ	4g128	29.14	34.96	25.90	28.22	55.37	34.93
GPTQ+BC	4g128	28.50	33.77	25.21	23.10	42.68	24.83
RTN	3	1276.92	1209.34	731.60	56.98	117.15	58.96
GPTQ	3	54.68	65.74	38.27	32.45	62.66	32.25
GPTQ+BC	3	45.19	52.15	33.78	31.49	60.36	31.35
GPTQ	3g256	41.64	52.96	32.46	26.76	51.06	28.11
AWQ	3g256	41.87	50.92	35.55	33.21	62.04	41.32
GPTQ+BC	3g256	35.65	44.33	29.04	26.35	50.11	27.69
GPTQ	3g128	37.78	43.20	30.00	25.52	48.44	27.15
AWQ	3g128	36.52	44.02	32.02	31.77	62.97	39.66
GPTQ+BC	3g128	33.63	39.54	27.79	25.15	47.51	26.84
RTN	2g64	7042.44	5057.60	3869.38	502.39	627.17	326.20
GPTQ	2g64	192.96	200.53	114.07	74.06	182.64	59.23
AWQ	2g64	133.27	147.97	97.18	111.49	316.53	145.12
GPTQ+BC	2g64	108.98	137.12	70.11	69.12	178.73	56.43

It can be seen that AWQ performs better than GPTQ on OPT, and GPTQ+BC consistently outperforms GPTQ and AWQ on all datasets and configurations.

You can use following commend to obtain these results.

$ bash ./script/llms/run-grouping-q.sh

We conduct a comparison of the latest BiLLM quantization, BiLLM and BiLLM+BC below.

Method	Bits	Wiki. (OPT-125M)	PTB (OPT-125M)	C4 (OPT-125M)	Wiki. (OPT-350M)	PTB (OPT-350M)	C4 (OPT-350M)	Wiki. (LLaMA-7B)	PTB (LLaMA-7B)	C4 (LLaMA-7B)
FP16	16	27.66	32.55	24.61	22.00	26.08	20.71	5.68	41.15	7.34
BiLLM	2-order binary	2409.61	2581.31	1632.83	1949.35	2567.82	822.90	35.10	351.01	40.32
BiLLM+BC	2-order binary	348.64	362.20	199.67	116.31	138.91	100.99	22.51	348.88	26.17

The results show that BC can significantly improve the performance of BiLLM across various LLMs and datasets.

Reference

If you find bias compensation useful or relevant to your research, you can cite their paper:

@article{gong2024minimize,
    title={Minimize Quantization Output Error with Bias Compensation},
    author={Cheng Gong and Haoshuai Zheng and Mengting Hu and Zheng Lin and Deng-Ping Fan and Yuzhi Zhang and Tao Li},
    year={2024},
    journal = {CAAI Artificial Intelligence Research (CAAI AIR)},
}

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
bias_compensation		bias_compensation
imgs		imgs
scripts		scripts
submodules		submodules
.gitignore		.gitignore
README.md		README.md
quantize-demo.ipynb		quantize-demo.ipynb
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Minimize Quantization Output Error with Bias Compensation

Description

Installation

Usage

Experimental Results

Reference

About

Releases

Packages

Languages

GongCheng1919/bias-compensation

Folders and files

Latest commit

History

Repository files navigation

Minimize Quantization Output Error with Bias Compensation

Description

Installation

Usage

Experimental Results

Reference

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages