Official PyTorch implementation of Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models, NeurIPS 2022.
The outlier suppression framework suppresses activation outliers in transformer language models to enable accurate low-bit quantization. It contains two components: Gamma Migration and Token-Wise Clipping.
The framework can be adopted in both post-training quantization (PTQ) and quantization-aware training (QAT) pipelines and can be applied to different models and tasks. We give examples here of running our framework on BERT, RoBERTa, and BART models across text classification, question answering, and summarization tasks.
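To give a sense of what Gamma Migration does, below is a minimal, hedged sketch (not the repo's implementation) that folds the LayerNorm scale gamma into a single following Linear layer, so that the tensor to be quantized is the un-scaled LayerNorm output. The real framework has to handle every consumer of the LayerNorm output (residual branches, multiple projections); the helper name migrate_gamma is ours.

import torch
import torch.nn as nn

@torch.no_grad()
def migrate_gamma(ln: nn.LayerNorm, next_fc: nn.Linear):
    # Conceptual sketch: LN(x) = gamma * x_hat + beta = gamma * (x_hat + beta / gamma),
    # so the outer gamma can be folded into the input columns of the next Linear.
    # Assumes gamma has no zeros and that next_fc is the only consumer of LN's output.
    gamma = ln.weight.clone()        # (hidden,)
    ln.bias.div_(gamma)              # LayerNorm now outputs x_hat + beta / gamma
    ln.weight.fill_(1.0)
    next_fc.weight.mul_(gamma)       # W' = W @ diag(gamma), broadcast over rows

# Quick equivalence check on random data.
ln, fc = nn.LayerNorm(768), nn.Linear(768, 768)
with torch.no_grad():
    ln.weight.uniform_(0.5, 2.0)     # give gamma a non-trivial value
    ln.bias.normal_()
x = torch.randn(2, 8, 768)
ref = fc(ln(x))
migrate_gamma(ln, fc)
print(torch.allclose(fc(ln(x)), ref, atol=1e-5))   # True: the model output is unchanged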
Install Hugging Face transformers (version 4.18.0) and datasets (version 1.17.0) in advance. We run the code on a Tesla V100.
Download the datasets and metrics in advance and put them in your ~/.cache/huggingface directory; details can be found in the Hugging Face documentation. Alternatively, run the code directly and the data will be downloaded automatically.
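For example, a GLUE task's dataset and metric can be pre-fetched into the cache with the datasets library (shown here for CoLA; other task names work the same way):

from datasets import load_dataset, load_metric

# Pre-download the CoLA dataset and its GLUE metric into ~/.cache/huggingface
# so that later runs do not need network access.
load_dataset("glue", "cola")
load_metric("glue", "cola")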
We provide the fine-tuned FP32 models at https://huggingface.co/ModelTC. You can download them and put them in your own directory.
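A sketch of downloading one of the checkpoints with transformers; the repo id and local directory below are placeholders, so pick the actual model name from the ModelTC page:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "ModelTC/<model-name>"        # placeholder: choose a model from https://huggingface.co/ModelTC
local_dir = "/path/to/your/model_dir"    # placeholder: where config.yaml's model_name_or_path should point

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer.save_pretrained(local_dir)
model.save_pretrained(local_dir)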
We give the config for each task in the exp directory.
exp/
├── bert_ptq
├── xsum
└── cnn_dailymail
We take BERT models as an example. For RoBERTa and BART, you only need to change the model path ('model_name_or_path') in config.yaml.
The configurations for BERT models are:
exp/bert_ptq/
├── twc_fine_gamma
│   ├── cola
│   │   ├── config.yaml
│   │   └── run.sh
│   ├── mnli
│   ├── ...
│   └── stsb
├── minmax
├── quantile
└── mse
Take the CoLA task as an example of running our method; the other tasks work the same way. Run the following to get the results.
cd exp/bert_ptq/twc_fine_gamma/cola
bash run.sh
Again, we take BERT models as an example. For RoBERTa and BART, you only need to change the model path ('model_name_or_path') in config.yaml.
The configurations for BERT models are:
exp/bert_ptq/twc_fine_gamma
├── squad_v1
│   ├── config.yaml
│   └── run.sh
└── squad_v2
Run the following to get the result.
cd exp/bert_ptq/twc_fine_gamma/squad_v1
bash run.sh
exp/xsum/twc_fine_gamma
├── config.yaml
└── run.sh
Run the following to get the result.
cd exp/xsum/twc_fine_gamma
bash run.sh
We are still sorting out the code for this part. Stay tuned!
We provide brief descriptions of the config items for better understanding.
config.yaml
quant:
    is_remove_padding: True
    calibrate: 256
    ln:
        delay: False
    a_qconfig:
        quantizer: FixedFakeQuantize
        observer: AvgMinMaxObserver
        bit: 6
        symmetric: False
        ch_axis: -1
    w_qconfig:
        quantizer: FixedFakeQuantize
        observer: MinMaxObserver
        bit: 6
        symmetric: True
        ch_axis: 0
This is an example of MinMax quantization in our config. Below is an explanation of each item.
- is_remove_padding (bool) - whether to remove pad tokens during calibration: False | True. Because pad tokens do not influence the FP accuracy, they should be removed during calibration so that they have no influence on the quantization parameters. Default: True
- calibrate (int) - the number of calibration examples. Default: 256
- ln - config for Gamma Migration
- delay (bool) - whether to activate the Gamma Migration component: False | True. Default: False
- a_qconfig / w_qconfig - config for the quantization scheme of activations or weights
- bit (int) - quantization bit
- symmetric (bool) - whether to use symmetric quantization: False | True. Default: True for weights, False for activations
- ch_axis (int) - per-tensor or per-channel quantization: -1 | 0. -1: per-tensor quantization, 0: per-channel quantization along dim 0. Default: -1 for activations, 0 for weights
- quantizer (string, optional) - quantizer type: FixedFakeQuantize | LSQFakeQuantize | LSQPlusFakeQuantize. FixedFakeQuantize: a normal quantizer with fixed quantization parameters, LSQFakeQuantize: treats the scale as a learnable parameter, LSQPlusFakeQuantize: treats both the scale and zero-point as learnable parameters. Default: FixedFakeQuantize
- observer (string, optional) - collects activation/weight statistics to determine the initial scale and zero-point. For activations: AvgMinMaxObserver | AvgMSEFastObserver | AvgQuantileObserver | AvgPruneMinMaxObserver. AvgPruneMinMaxObserver: activates the coarse-grained phase of Token-Wise Clipping. Default: AvgMinMaxObserver. For weights: MinMaxObserver | MSEFastObserver | LSQPlusObserver. Default: MinMaxObserver. Usually, MinMaxObserver is enough for 6/8-bit weights; for lower bits (4-bit), we suggest MSEFastObserver or LSQPlusObserver (the latter is only suitable for QAT).
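To make the scheme fields concrete, here is a minimal, self-contained sketch of per-tensor quantize-dequantize (ch_axis: -1) driven by bit and symmetric, using the observed min/max as a MinMax-style observer would. It only illustrates the arithmetic and is not the repo's FixedFakeQuantize.

import torch

def fake_quantize(x: torch.Tensor, bit: int = 6, symmetric: bool = False) -> torch.Tensor:
    # Per-tensor quantize-dequantize from the observed min/max (ch_axis = -1).
    if symmetric:
        qmin, qmax = -(2 ** (bit - 1)), 2 ** (bit - 1) - 1
        scale = (x.abs().max() / qmax).clamp(min=1e-8)
        zero_point = torch.zeros((), dtype=torch.long)
    else:
        qmin, qmax = 0, 2 ** bit - 1
        x_min, x_max = x.min().clamp(max=0), x.max().clamp(min=0)   # keep 0 exactly representable
        scale = ((x_max - x_min) / (qmax - qmin)).clamp(min=1e-8)
        zero_point = (qmin - torch.round(x_min / scale)).long()
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

x = torch.randn(4, 128, 768)                       # e.g. a batch of BERT activations
x_q = fake_quantize(x, bit=6, symmetric=False)     # same scheme as the a_qconfig above
print((x - x_q).abs().max())                       # quantization error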
Based on these, below is the config of our outlier suppression framework. Set delay to True to enable Gamma Migration. Set the activation observer to AvgPruneMinMaxObserver to enable the coarse-grained phase of Token-Wise Clipping. Set the activation quantizer to LSQPlusFakeQuantize to additionally perform the fine-grained stage of Token-Wise Clipping.
quant:
    is_remove_padding: True
    calibrate: 256
    ln:
        delay: True
    a_qconfig:
        quantizer: LSQPlusFakeQuantize
        observer: AvgPruneMinMaxObserver
        bit: 6
        symmetric: False
        ch_axis: -1
    w_qconfig:
        quantizer: FixedFakeQuantize
        observer: MinMaxObserver
        bit: 6
        symmetric: True
        ch_axis: 0
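For intuition only, the snippet below illustrates the idea behind the coarse-grained phase of Token-Wise Clipping: the harmful outliers concentrate in a small fraction of tokens, so a clipping range taken from a quantile over per-token extrema (instead of the global min/max) sacrifices little signal. This is a conceptual illustration, not the repo's AvgPruneMinMaxObserver, and the 0.99 ratio is just an example value.

import torch

def token_wise_clip_range(x: torch.Tensor, ratio: float = 0.99):
    # x: (num_tokens, hidden). Take the clipping bounds from a quantile over
    # per-token extrema so that a few outlier tokens do not stretch the range.
    token_max = x.max(dim=-1).values
    token_min = x.min(dim=-1).values
    return torch.quantile(token_min, 1.0 - ratio), torch.quantile(token_max, ratio)

tokens = torch.randn(512, 768)
tokens[:5] *= 50                                    # a handful of outlier tokens
lo, hi = token_wise_clip_range(tokens)
print(lo.item(), hi.item(), tokens.min().item(), tokens.max().item())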
If you find this repo useful for your research, please consider citing the paper:
@article{wei2022outlier,
  title={Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models},
  author={Wei, Xiuying and Zhang, Yunchen and Zhang, Xiangguo and Gong, Ruihao and Zhang, Shanghang and Zhang, Qi and Yu, Fengwei and Liu, Xianglong},
  journal={arXiv preprint arXiv:2209.13325},
  year={2022}
}