Finetuning large language models (LLMs) has been empirically effective on a variety of downstream tasks. Existing approaches to finetuning an LLM either focus on parameter-efficient finetuning, which only updates a small number of trainable parameters, or attempt to reduce the memory footprint during the training phase of finetuning. Typically, the memory footprint during finetuning stems from three contributors: model weights, optimizer states, and intermediate activations. However, existing works still require considerable memory, and none can simultaneously mitigate the memory footprint from all three sources. In this paper, we present Quantized Side Tuning (QST), which enables memory-efficient and fast finetuning of LLMs through a dual-stage process. First, QST quantizes an LLM's model weights into 4-bit to reduce the memory footprint of the LLM's original weights. Second, QST introduces a side network separated from the LLM, which utilizes the hidden states of the LLM to make task-specific predictions. Using a separate side network avoids performing backpropagation through the LLM, thus reducing the memory requirement of the intermediate activations. Furthermore, QST leverages several low-rank adaptors and gradient-free downsample modules to significantly reduce the number of trainable parameters, saving the memory footprint of the optimizer states. Experiments show that QST can reduce the total memory footprint by up to 2.3 times and speed up the finetuning process by up to 3 times while achieving competitive performance compared with the state-of-the-art. Compared with full finetuning, QST can reduce the total memory footprint by up to 7 times.
- Feature 1: 4-bit Quantization.
- Feature 2: Side tuning.
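A minimal conceptual sketch of how these two features fit together is shown below. This is not the actual QST code: the names (`SideBlock`, `side_dim`, `r`, `gate`) and the exact layer structure are illustrative assumptions; only the general idea (frozen quantized backbone, gradient-free downsampling, trainable low-rank side path) follows the description above.

```python
# Conceptual sketch of side tuning -- NOT the actual QST implementation.
# A frozen (quantized) backbone produces hidden states; a small trainable
# side network downsamples them and mixes them with its own low-rank path,
# so backpropagation never flows through the backbone.
import torch
import torch.nn as nn

class SideBlock(nn.Module):
    def __init__(self, hidden_dim=4096, side_dim=256, r=16):
        super().__init__()
        # Gradient-free downsample of the backbone hidden state.
        self.down = nn.Linear(hidden_dim, side_dim, bias=False)
        self.down.requires_grad_(False)
        # Low-rank adaptor on the side stream (the only trainable weights).
        self.lora_a = nn.Linear(side_dim, r, bias=False)
        self.lora_b = nn.Linear(r, side_dim, bias=False)
        self.gate = nn.Parameter(torch.zeros(1))  # learned mixing weight

    def forward(self, backbone_hidden, side_hidden):
        mix = torch.sigmoid(self.gate)
        fused = mix * self.down(backbone_hidden) + (1 - mix) * side_hidden
        return fused + self.lora_b(self.lora_a(fused))

with torch.no_grad():                            # the backbone runs gradient-free
    backbone_hidden = torch.randn(1, 8, 4096)
side_hidden = torch.zeros(1, 8, 256)
out = SideBlock()(backbone_hidden, side_hidden)  # only the side path is trained
```

Because the loss depends only on the side network's parameters, the optimizer states and intermediate activations that must be stored belong to the small side network rather than the full LLM.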
- Clone the repo:
```bash
git clone https://github.com/YouAreSpecialToMe/QST.git
```
- Install the requirements:
```bash
cd QST
pip install -r requirements.txt
```
- Leverage the Hugging Face and bitsandbytes libraries to load the 4-bit pre-trained model:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    YourModelPath,  # local path or Hugging Face model ID
    load_in_4bit=True,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    ),
    torch_dtype=torch.bfloat16,
)
```
- Initialize the hyperparameters of the side network:
```python
qst_config = QSTConfig(
    add_layer_norm_before_adapter=False,
    add_layer_norm_after_adapter=True,
    r=16,
    dropout=0.1,
    activation="swish",
    fan_in_fan_out=False,
    peft_hidden_size=16,
)
```
- Initialize the QST model based on the 4-bit pre-trained model:
```python
# Llama series
model = QSTLlamaForCausalLM(model, config, qst_config)

# OPT series
model = QSTOPTForCausalLM(model, config, qst_config)
```
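After wrapping, you can sanity-check that only the side network and adaptors are trainable. The snippet below uses only standard PyTorch `nn.Module` methods, so it should work on the wrapped model; the exact ratio depends on your configuration.

```python
# Count trainable vs. total parameters after wrapping the backbone with QST.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```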
- You can use the Hugging Face `Trainer` or a custom training loop in PyTorch to finetune QST (a minimal custom-loop sketch follows the snippet below):
```python
trainer = Trainer(
    model,
    ...  # other training args
)
trainer.train()
```
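If you prefer a custom PyTorch loop instead of the `Trainer`, a minimal sketch is below. It assumes the QST model behaves like a standard Hugging Face causal LM (returning `.loss` when `labels` are passed) and that `dataloader` is a DataLoader you have already built; adjust it to your setup.

```python
import torch

# Optimize only the parameters that require gradients (the side network).
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-4)

model.train()
for batch in dataloader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    outputs = model(**batch)   # expects input_ids, attention_mask, labels
    outputs.loss.backward()    # gradients flow only through the side network
    optimizer.step()
    optimizer.zero_grad()
```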
You can use `qst-70b.sh` to finetune the Llama-2-70B model:
```bash
bash qst-70b.sh
```
You can download the checkpoint from https://huggingface.co/YouAreSpecialToMe/QST-70B-checkpoint/tree/main.
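You can also fetch it programmatically with the `huggingface_hub` library (assuming it is installed; any local directory works for `local_dir`):

```python
from huggingface_hub import snapshot_download

# Download the QST-70B checkpoint into a local folder.
ckpt_dir = snapshot_download(
    repo_id="YouAreSpecialToMe/QST-70B-checkpoint",
    local_dir="QST-70B-checkpoint",
)
print(ckpt_dir)
```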
You first need to download the QST-70B checkpoint and modify the path in `chatbot_sample.py`:
```python
model.load_qst_state("YourPath/QST-70B-checkpoint/")
```
Then run the following script:
```bash
python chatbot_sample.py
```
A QST Docker image is available on Docker Hub.
- Pull the Docker image:
```bash
docker pull geniedan/qst-docker:v5
```
- Run the Docker image:
- If you have the model saved on your local device, please run:
```bash
docker run --gpus all -it --rm -v /path/to/your/model:/models geniedan/qst-docker:v5 bash -c "source /opt/conda/etc/profile.d/conda.sh && conda activate QST && ./qst-70b.sh /models/your-model-name"
```
Remember to replace `/path/to/your/model` and `your-model-name` with your local path and model name.
- If you wish to load a model from Hugging Face, please run:
```bash
docker run --gpus all -it --rm geniedan/qst-docker:v5 bash -c "source /opt/conda/etc/profile.d/conda.sh && conda activate QST && ./qst-70b.sh model-name"
```
If you find our work helpful, please consider citing it:
```bibtex
@misc{zhang2024quantizedtuningfastmemoryefficient,
    title={Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models},
    author={Zhengxin Zhang and Dan Zhao and Xupeng Miao and Gabriele Oliaro and Qing Li and Yong Jiang and Zhihao Jia},
    year={2024},
    eprint={2401.07159},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2401.07159},
}
```
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
This code is based on the QLoRA, Stanford Alpaca, and FastChat repos.