LoftQ finds a quantized LoRA initialization: a quantized backbone Q and LoRA adapters A and B, given a pre-trained weight W. It alternates between quantization and low-rank approximation so that Q + AB^T stays close to W (see the LoftQ paper).
Steps:
- Apply LoftQ to a full-precision pre-trained weight and save.
- Load LoftQ initialization and train.
For step 1, we provide off-the-shelf LoftQ initializations for the supported models listed below, under the LoftQ organization on the Huggingface Hub. If you want to do it yourself, jump to LoftQ DIY.
For step 2, below is an example of loading a 4-bit Mistral-7B backbone with 64-rank LoRA adapters from the Huggingface Hub.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/Mistral-7B-v0.1-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # you may change it with different models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 is recommended
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="loftq_init",
    is_trainable=True,
)
# Do training with peft_model ...
```
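The training loop itself is standard. For instance, a minimal sketch with the transformers `Trainer` could look like the following; the tiny in-memory dataset, output paths, and hyper-parameters are illustrative placeholders, not part of the LoftQ example:

```python
# Minimal training sketch; the stand-in dataset and hyper-parameters are placeholders.
from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default

# Stand-in dataset: replace with your own tokenized corpus.
train_dataset = Dataset.from_dict({"text": ["Hello world!"] * 8}).map(
    lambda example: tokenizer(example["text"], truncation=True, max_length=64),
    remove_columns=["text"],
)

trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="exp_results/loftq_mistral",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=1e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # pads and builds labels
)
trainer.train()
peft_model.save_pretrained("exp_results/loftq_mistral/adapter")  # saves only the LoRA adapters
```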
We provide `quantize_save_load.py` as an example of applying LoftQ with different bits (`--bits`), ranks (`--rank`), and alternating steps (`--iter`, a hyper-parameter in LoftQ; see Algorithm 1 in the LoftQ paper). Currently, this example supports `llama-2`, `falcon`, `mistral`, `bart`, `t5`, `deberta`, `bert`, and `roberta`.
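To make the role of `--iter` concrete, the alternating procedure of Algorithm 1 can be sketched as follows. This is only a schematic illustration: the `fake_quantize` helper below is a made-up uniform quantizer, whereas the real script uses proper NF4 quantization.

```python
import torch

def fake_quantize(weight: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    # Placeholder quantize/dequantize step, only for illustration.
    levels = 2 ** num_bits - 1
    w_min, w_max = weight.min(), weight.max()
    scale = (w_max - w_min).clamp_min(1e-8) / levels
    return torch.round((weight - w_min) / scale) * scale + w_min

def loftq_init(weight: torch.Tensor, rank: int = 16, num_iter: int = 5):
    # Alternate between quantizing the residual and refitting the low-rank
    # factors so that Q + A @ B.T stays close to the original weight W.
    lora_A = torch.zeros(weight.shape[0], rank)
    lora_B = torch.zeros(weight.shape[1], rank)
    for _ in range(num_iter):
        q_weight = fake_quantize(weight - lora_A @ lora_B.T)            # Q_t
        U, S, Vh = torch.linalg.svd(weight - q_weight, full_matrices=False)
        lora_A = U[:, :rank] * S[:rank]                                  # rank-r SVD of W - Q_t
        lora_B = Vh[:rank, :].T
    return q_weight, lora_A, lora_B

# Example: Q, A, B = loftq_init(torch.randn(1024, 1024), rank=16, num_iter=5)
```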
Below is an example of obtaining 4-bit LLAMA-2-7b with 16-rank LoRA adapters using 5 alternating steps.
```sh
SAVE_DIR="model_zoo/loftq/"
# --model_name_or_path: the high-precision model id on the HF Hub
# --token: your HF token, needed if the model is gated/private, e.g., llama-2
python quantize_save_load.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --token HF_TOKEN \
    --bits 4 \
    --iter 5 \
    --rank 16 \
    --save_dir $SAVE_DIR
```
The above commands create the model directory under `$SAVE_DIR`. Specifically, the model directory is named as

`MODEL_DIR = SAVE_DIR + f"{args.model_name_or_path.split('/')[-1]}-{args.bits}bit-{args.rank}rank"`

In this example, `MODEL_DIR="model_zoo/loftq/Llama-2-7b-hf-4bit-16rank"`, where the backbone is stored in `$MODEL_DIR` and the LoRA adapters are in the sub-folder `$MODEL_DIR/loftq_init`.

Similar to loading from the Huggingface Hub, we only need to change `MODEL_ID` to `MODEL_DIR`:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_DIR = "model_zoo/loftq/Llama-2-7b-hf-4bit-16rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_DIR,
    subfolder="loftq_init",
    is_trainable=True,
)
# Do training with peft_model ...
```
We also provide an example of fine-tuning a LoftQ-initialized model on GSM8K, loading the quantized backbone and LoRA adapters from the LoftQ Huggingface Hub.
```sh
python train_gsm8k_llama.py \
    --model_name_or_path LoftQ/Llama-2-13b-hf-4bit-64rank \
    --output_dir exp_results/gsm8k/llama-2-13b/bit4-rank64/lr1e-4 \
    --learning_rate 1e-4 \
    --weight_decay 0.1 \
    --lr_scheduler_type cosine \
    --num_warmup_steps 100 \
    --seed 202 \
    --dataset_name gsm8k \
    --dataset_config main \
    --pad_to_max_length \
    --max_source_length 128 \
    --max_target_length 256 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --with_tracking \
    --report_to tensorboard
```
Off-the-shelf LoftQ initializations are currently available on the Huggingface Hub for the following models:

Model Name | Bits | Ranks |
---|---|---|
LLAMA-2-7b | 4 | 64 |
LLAMA-2-13b | 4 | 64 |
LLAMA-2-70b | 4 | 64 |
Mistral | 4 | 64 |
Mistral | 4 | 32 |
BART-large | 4 | 8 |
BART-large | 4 | 16 |
BART-large | 4 | 32 |
BART-large | 2 | 8 |
PEFT provides a convenience function `replace_lora_weights_loftq` to apply LoftQ initialization in-place to a quantized model. Check out this notebook for an example.
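A minimal sketch of how this could be used is shown below; the model id and LoRA settings are placeholders. Note that `replace_lora_weights_loftq` operates on a model quantized with bitsandbytes 4-bit and, at the time of writing, expects the base checkpoint to be stored as safetensors.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, replace_lora_weights_loftq

model_id = "mistralai/Mistral-7B-v0.1"  # illustrative; substitute your own model

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
)
peft_model = get_peft_model(base_model, LoraConfig(task_type="CAUSAL_LM", r=16))

# Overwrite the randomly initialized LoRA weights with LoftQ-initialized ones, in place.
replace_lora_weights_loftq(peft_model)
```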