Help Needed: Implementing QLoRA on AutoGPTQ Model - Unsupported Module Error #1858
-
Hello everyone, I'm fairly new to working with advanced model optimizations and have recently been exploring the PEFT library for applying LoRA to a quantized model using AutoGPTQ. I wanted to test the integration of GPTQ and LoRA for a project, but I've run into an issue I can't seem to resolve.
Objective: quantize a small Llama model (BEE-spoke-data/smol_llama-101M-GQA) with AutoGPTQ and then fine-tune it with LoRA via PEFT.
Issue: when I apply LoRA to the quantized model, I hit an unsupported module error. (The full code snippet is reposted with proper formatting in the first reply below.)
Is there a specific configuration or step I am missing to properly apply LoRA to a quantized model? Thank you!
-
Apologies for the formatting in the initial post. Here’s the code snippet with proper formatting:
import os
import torch
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from peft import LoraConfig, get_peft_model
import logging
# Setup basic configuration for logging
logging.basicConfig(format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO)
# Constants for model paths and device setup
model_path = "BEE-spoke-data/smol_llama-101M-GQA"
quantized_model_dir = "tiny-llama-quantized-4bit"
new_model_path = "output" # Define the path where the trained models will be saved
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = "cuda:0"
# Initialize tokenizer with explicit EOS and PAD token setup
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
# Log the tokenizer configurations
logging.info(f"Tokenizer EOS token: {tokenizer.eos_token}")
logging.info(f"Tokenizer padding side: {tokenizer.padding_side}")
# Setup the quantization configuration
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
# Load the model and move it to the appropriate device
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config=quantize_config)
model.to(device)
# Prepare inputs for quantization
text = "Hello, world! Auto-GPTQ is a model quantization tool."
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True, padding=True, truncation=True, max_length=512)
inputs = {key: value.to(device) for key, value in inputs.items()}
# Ensure all required inputs are present
if 'attention_mask' not in inputs:
    inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
# Attempt quantization and handle any errors
try:
    model.quantize([{'input_ids': inputs['input_ids'], 'attention_mask': inputs['attention_mask']}])
    model.save_quantized(quantized_model_dir)
    logging.info("Model quantized and saved successfully.")
except Exception as e:
    logging.error(f"Error during quantization: {e}")
# Reload the quantized model
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device)
# Apply LoRA configuration
lora_config = LoraConfig(r=4, lora_alpha=8, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
# Setup training configuration
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
training_args = TrainingArguments(
    output_dir=new_model_path,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    warmup_steps=100,
    learning_rate=2e-4,
    logging_steps=100,
    save_steps=500,
    evaluation_strategy="steps",
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False
)
# NOTE: a train_dataset (and an eval_dataset, given evaluation_strategy="steps"
# and load_best_model_at_end=True) still needs to be supplied before training.
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
# Start training
trainer.train()
# Setup a text generation pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
output = pipeline("This is a test of the quantized and LoRA model.")
print(output[0]['generated_text'])
-
I resolved the issue by experimenting with two different approaches instead of trying to quantize with AutoGPTQ and then apply LoRA directly.
Approach 1: I applied 4-bit quantization using the bitsandbytes library and then applied LoRA using PEFT.
Approach 2: I loaded the GPTQ model via Transformers, applied the GPTQ configuration, and then applied LoRA using PEFT.
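For Approach 1, here is a minimal sketch reusing the base model and LoRA settings from the script above; the specific 4-bit options (nf4 quantization, float16 compute) are illustrative assumptions rather than values stated in this thread:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model_path = "BEE-spoke-data/smol_llama-101M-GQA"
# 4-bit quantization via bitsandbytes (these settings are illustrative)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
# Load the model already quantized to 4 bits
model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=bnb_config, device_map="auto")
# Prepare the quantized model for k-bit training, then attach the LoRA adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=4, lora_alpha=8, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()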
-
This can be resolved by loading the GPTQ-quantized model through Transformers (rather than through AutoGPTQForCausalLM) before applying LoRA with PEFT.
Best regards,
Shuyue
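A minimal sketch of that approach (which matches Approach 2 above), assuming the quantized checkpoint saved earlier in tiny-llama-quantized-4bit carries its quantization config and reusing the same LoRA settings; device_map="auto" and the prepare step are my assumptions, not something stated in the thread:
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
quantized_model_dir = "tiny-llama-quantized-4bit"
# Load the GPTQ checkpoint through Transformers instead of AutoGPTQForCausalLM
# so that PEFT recognizes the quantized linear layers. If the checkpoint does not
# embed its quantization config, pass quantization_config=transformers.GPTQConfig(bits=4).
model = AutoModelForCausalLM.from_pretrained(quantized_model_dir, device_map="auto")
# Prepare for k-bit training, then attach the LoRA adapters as before
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=4, lora_alpha=8, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()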
-
Following the advice given, I loaded the quantized model through Transformers before applying LoRA, which resolved the compatibility issues.