LoftQ finds a quantized LoRA initialization: a quantized backbone Q and LoRA adapters A and B, given a pre-trained weight W. It alternates between quantization and low-rank approximation so that Q + AB^T stays close to W (see the LoftQ paper).
Steps:
- Apply LoftQ to a full-precision pre-trained weight and save.
- Load LoftQ initialization and train.
For step 1, we provide off-the-shelf LoftQ initializations for the supported models listed below, under the LoftQ organization on the Huggingface Hub. If you want to do it yourself, jump to LoftQ DIY.
For step 2, below is an example of loading a 4-bit Mistral-7B backbone with 64-rank LoRA adapters from the Huggingface Hub.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_ID = "LoftQ/Mistral-7B-v0.1-4bit-64rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # you may change it with different models
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 is recommended
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_ID,
    subfolder="loftq_init",
    is_trainable=True,
)
# Do training with peft_model ...
```
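The training loop itself is standard. For instance, a minimal sketch with the transformers `Trainer` could look like the following; the tiny in-memory dataset, output paths, and hyper-parameters are illustrative placeholders, not part of the LoftQ example:

```python
# Minimal training sketch; the stand-in dataset and hyper-parameters are placeholders.
from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default

# Stand-in dataset: replace with your own tokenized corpus.
train_dataset = Dataset.from_dict({"text": ["Hello world!"] * 8}).map(
    lambda example: tokenizer(example["text"], truncation=True, max_length=64),
    remove_columns=["text"],
)

trainer = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="exp_results/loftq_mistral",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=1e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # pads and builds labels
)
trainer.train()
peft_model.save_pretrained("exp_results/loftq_mistral/adapter")  # saves only the LoRA adapters
```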
We provide `quantize_save_load.py` as an example of applying LoftQ with different bits (`--bits`), ranks (`--rank`), and alternating steps (`--iter`, a hyper-parameter in LoftQ; see Algorithm 1 in the LoftQ paper). Currently, this example supports `llama-2`, `falcon`, `mistral`, `bart`, `t5`, `deberta`, `bert`, and `roberta`.
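To make the role of `--iter` concrete, the alternating procedure of Algorithm 1 can be sketched as follows. This is only a schematic illustration: the `fake_quantize` helper below is a made-up uniform quantizer, whereas the real script uses proper NF4 quantization.

```python
import torch

def fake_quantize(weight: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    # Placeholder quantize/dequantize step, only for illustration.
    levels = 2 ** num_bits - 1
    w_min, w_max = weight.min(), weight.max()
    scale = (w_max - w_min).clamp_min(1e-8) / levels
    return torch.round((weight - w_min) / scale) * scale + w_min

def loftq_init(weight: torch.Tensor, rank: int = 16, num_iter: int = 5):
    # Alternate between quantizing the residual and refitting the low-rank
    # factors so that Q + A @ B.T stays close to the original weight W.
    lora_A = torch.zeros(weight.shape[0], rank)
    lora_B = torch.zeros(weight.shape[1], rank)
    for _ in range(num_iter):
        q_weight = fake_quantize(weight - lora_A @ lora_B.T)            # Q_t
        U, S, Vh = torch.linalg.svd(weight - q_weight, full_matrices=False)
        lora_A = U[:, :rank] * S[:rank]                                  # rank-r SVD of W - Q_t
        lora_B = Vh[:rank, :].T
    return q_weight, lora_A, lora_B

# Example: Q, A, B = loftq_init(torch.randn(1024, 1024), rank=16, num_iter=5)
```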
Below is an example of obtaining 4-bit LLAMA-2-7b with 16-rank LoRA adapters using 5 alternating steps.
```sh
SAVE_DIR="model_zoo/loftq/"
# --model_name_or_path: the high-precision model id on the HF Hub
# --token: your HF token, needed if the model is gated/private, e.g., llama-2
python quantize_save_load.py \
    --model_name_or_path meta-llama/Llama-2-7b-hf \
    --token HF_TOKEN \
    --bits 4 \
    --iter 5 \
    --rank 16 \
    --save_dir $SAVE_DIR
```
The above commands create the model directory under `$SAVE_DIR`. Specifically, the model directory is named as

`MODEL_DIR = SAVE_DIR + f"{args.model_name_or_path.split('/')[-1]}-{args.bits}bit-{args.rank}rank"`

In this example, `MODEL_DIR="model_zoo/loftq/Llama-2-7b-hf-4bit-16rank"`, where the backbone is stored in `$MODEL_DIR` and the LoRA adapters are in the sub-folder `$MODEL_DIR/loftq_init`.

Similar to loading from the Huggingface Hub, we only need to change `MODEL_ID` to `MODEL_DIR`:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

MODEL_DIR = "model_zoo/loftq/Llama-2-7b-hf-4bit-16rank"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type='nf4',
    ),
)
peft_model = PeftModel.from_pretrained(
    base_model,
    MODEL_DIR,
    subfolder="loftq_init",
    is_trainable=True,
)
# Do training with peft_model ...
```
We also provide an example of fine-tuning a LoftQ-initialized model on GSM8K, loading the quantized backbone and LoRA adapters from the LoftQ Huggingface Hub.
```sh
python train_gsm8k_llama.py \
    --model_name_or_path LoftQ/Llama-2-13b-hf-4bit-64rank \
    --output_dir exp_results/gsm8k/llama-2-13b/bit4-rank64/lr1e-4 \
    --learning_rate 1e-4 \
    --weight_decay 0.1 \
    --lr_scheduler_type cosine \
    --num_warmup_steps 100 \
    --seed 202 \
    --dataset_name gsm8k \
    --dataset_config main \
    --pad_to_max_length \
    --max_source_length 128 \
    --max_target_length 256 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --with_tracking \
    --report_to tensorboard
```
Off-the-shelf LoftQ initializations are currently available on the Huggingface Hub for the following models:

Model Name | Bits | Ranks |
---|---|---|
LLAMA-2-7b | 4 | 64 |
LLAMA-2-13b | 4 | 64 |
LLAMA-2-70b | 4 | 64 |
Mistral | 4 | 64 |
Mistral | 4 | 32 |
BART-large | 4 | 8 |
BART-large | 4 | 16 |
BART-large | 4 | 32 |
BART-large | 2 | 8 |
PEFT provides a convenience function `replace_lora_weights_loftq` to apply LoftQ initialization in-place to a quantized model. Check out this notebook for an example.
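A minimal sketch of how this could be used is shown below; the model id and LoRA settings are placeholders. Note that `replace_lora_weights_loftq` operates on a model quantized with bitsandbytes 4-bit and, at the time of writing, expects the base checkpoint to be stored as safetensors.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, replace_lora_weights_loftq

model_id = "mistralai/Mistral-7B-v0.1"  # illustrative; substitute your own model

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
)
peft_model = get_peft_model(base_model, LoraConfig(task_type="CAUSAL_LM", r=16))

# Overwrite the randomly initialized LoRA weights with LoftQ-initialized ones, in place.
replace_lora_weights_loftq(peft_model)
```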