QUIK

This repository contains the code for QUIK, a post-training method for quantizing the majority of the weights and activations of a model to 4 bits.

QUIK is described in the following paper: https://arxiv.org/abs/2310.09259
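
To make "the majority" concrete, here is a minimal sketch, assuming that a small number of feature columns are kept in full precision while the remaining columns are the ones quantized to 4 bits. The function name `split_outlier_columns`, the `num_fp_columns` parameter, and the selection rule (largest absolute value per column) are illustrative assumptions, not the repository's implementation; see the paper for the actual procedure.

```python
def split_outlier_columns(matrix, num_fp_columns):
    """Split column indices into (full-precision, 4-bit) groups.

    Hypothetical helper: columns are ranked by their largest absolute
    entry, which is an assumed outlier criterion for illustration only.
    """
    num_cols = len(matrix[0])
    col_max = [max(abs(row[c]) for row in matrix) for c in range(num_cols)]
    ranked = sorted(range(num_cols), key=lambda c: col_max[c], reverse=True)
    fp_cols = sorted(ranked[:num_fp_columns])    # kept in full precision
    int4_cols = sorted(ranked[num_fp_columns:])  # candidates for 4-bit
    return fp_cols, int4_cols


# Toy activations: column 2 has much larger magnitudes than the others.
activations = [
    [0.1, -0.3, 12.0, 0.2],
    [0.4, 0.2, -9.5, -0.1],
    [-0.2, 0.5, 11.0, 0.3],
]
fp_cols, int4_cols = split_outlier_columns(activations, num_fp_columns=1)
print("full precision:", fp_cols)
print("4-bit:", int4_cols)
```

In the Llama example further down, the `--fp_features_num` flag appears to play the role that `num_fp_columns` plays in this sketch.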

Install

Dependencies

cmake
C++ compiler (GCC/clang/...)
nvcc

Instructions

git clone https://github.com/IST-DASLab/QUIK.git
cd QUIK
pip install -e .  # or pip install .

Example

Llama example

cd experiments
pip install -r requirements.txt
python llama.py --fp_features_num 256 --model meta-llama/Llama-2-7b-hf --hf_token <your_hf_token> --dataset c4 \
    --w_bits 4 --w_clip --a_bits 4 --save_qmodel_path save_gptq_model_path --int8_down_proj --sim_eval --benchmark

The benchmark runs on all available GPUs.

Linear layer benchmarks

Linear layer benchmarks can be run with python layer_benchmark.py. The input size can be varied with command-line parameters.

Adapting a model to QUIK

First, quantize the model weights with the GPTQ algorithm; in llama.py this is done by the llama_sequential function. This yields quantized weights that are still stored as torch.float16. Next, create QUIK linear layers with qlinear.MixedQLinear.from_float and use them to replace the original Linear layers; see llama_replace_with_kernels in llama.py. The quantized model is then ready for use.

Fake Quantization examples

To run the fake quantization example, see the fake_quant directory.
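
For readers unfamiliar with the term, fake quantization usually means rounding values to a low-bit grid and immediately mapping them back to floating point, so accuracy can be measured without low-precision kernels. The sketch below is a minimal illustration under that assumption, using symmetric per-tensor 4-bit rounding in plain Python. The function `fake_quantize` is hypothetical and is not taken from the fake_quant directory.

```python
def fake_quantize(values, bits=4):
    """Round values to a symmetric signed grid, then return them as floats."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4 bits
    qmin = -(2 ** (bits - 1))    # -8 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    dequantized = []
    for v in values:
        level = max(qmin, min(qmax, round(v / scale)))  # integer grid index
        dequantized.append(level * scale)               # back to float
    return dequantized, scale


weights = [0.7, -0.32, 0.11, 0.0]
fq_weights, scale = fake_quantize(weights)
for original, quantized in zip(weights, fq_weights):
    print(f"{original:+.3f} -> {quantized:+.3f}")
```

The outputs are still ordinary floats, but each one is an integer multiple of `scale`, so at most 16 distinct values can occur for 4 bits.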

Citation

The full paper is available on arXiv. The full citation is:

@article{QUIK,
  title={QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models},
  author={Ashkboos, Saleh and Markov, Ilia and Frantar, Elias and Zhong, Tingxuan and Wang, Xincheng and Ren, Jie and Hoefler, Torsten and Alistarh, Dan},
  journal={arXiv preprint arXiv:2310.09259},
  year={2023}
}
