Help Needed: Implementing QLoRA on AutoGPTQ Model - Unsupported Module Error #1858
-
Hello everyone, I'm fairly new to working with advanced model optimizations and have recently been exploring the PEFT library for applying LoRA to a quantized model using AutoGPTQ. I wanted to test the integration of GPTQ and LoRA for a project, but I've run into an issue I can't seem to resolve.
Objective: quantize a small Llama model (BEE-spoke-data/smol_llama-101M-GQA) with AutoGPTQ and then fine-tune it with LoRA via PEFT.
Issue: when I apply LoRA to the quantized model, I hit an unsupported module error. (The full code snippet is reposted with proper formatting in the first reply below.)
Is there a specific configuration or step I am missing to properly apply LoRA to a quantized model? Thank you!
-
Apologies for the formatting in the initial post. Here’s the code snippet with proper formatting:
import os
import torch
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from peft import LoraConfig, get_peft_model
import logging
# Setup basic configuration for logging
logging.basicConfig(format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO)
# Constants for model paths and device setup
model_path = "BEE-spoke-data/smol_llama-101M-GQA"
quantized_model_dir = "tiny-llama-quantized-4bit"
new_model_path = "output" # Define the path where the trained models will be saved
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = "cuda:0"
# Initialize tokenizer with explicit EOS and PAD token setup
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
# Log the tokenizer configurations
logging.info(f"Tokenizer EOS token: {tokenizer.eos_token}")
logging.info(f"Tokenizer padding side: {tokenizer.padding_side}")
# Setup the quantization configuration
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
# Load the model and move it to the appropriate device
model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config=quantize_config)
model.to(device)
# Prepare inputs for quantization
text = "Hello, world! Auto-GPTQ is a model quantization tool."
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=True, padding=True, truncation=True, max_length=512)
inputs = {key: value.to(device) for key, value in inputs.items()}
# Ensure all required inputs are present
if 'attention_mask' not in inputs:
    inputs['attention_mask'] = torch.ones_like(inputs['input_ids'])
# Attempt quantization and handle any errors
try:
    model.quantize([{'input_ids': inputs['input_ids'], 'attention_mask': inputs['attention_mask']}])
    model.save_quantized(quantized_model_dir)
    logging.info("Model quantized and saved successfully.")
except Exception as e:
    logging.error(f"Error during quantization: {e}")
# Reload the quantized model
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device=device)
# Apply LoRA configuration
lora_config = LoraConfig(r=4, lora_alpha=8, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
# Setup training configuration
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
training_args = TrainingArguments(
    output_dir=new_model_path,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    num_train_epochs=1,
    warmup_steps=100,
    learning_rate=2e-4,
    logging_steps=100,
    save_steps=500,
    evaluation_strategy="steps",
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="loss",
    greater_is_better=False
)
# NOTE: a train_dataset (and an eval_dataset, given evaluation_strategy="steps"
# and load_best_model_at_end=True) still needs to be supplied before training.
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
# Start training
trainer.train()
# Setup a text generation pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
output = pipeline("This is a test of the quantized and LoRA model.")
print(output[0]['generated_text'])
-
I resolved the issue by experimenting with two different approaches instead of trying to quantize with AutoGPTQ and then apply LoRA directly.
Approach 1: I applied 4-bit quantization using the bitsandbytes library and then applied LoRA using PEFT.
Approach 2: I loaded the GPTQ model via Transformers, applied the GPTQ configuration, and then applied LoRA using PEFT.
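For Approach 1, here is a minimal sketch reusing the base model and LoRA settings from the script above; the specific 4-bit options (nf4 quantization, float16 compute) are illustrative assumptions rather than values stated in this thread:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
model_path = "BEE-spoke-data/smol_llama-101M-GQA"
# 4-bit quantization via bitsandbytes (these settings are illustrative)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token
# Load the model already quantized to 4 bits
model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=bnb_config, device_map="auto")
# Prepare the quantized model for k-bit training, then attach the LoRA adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=4, lora_alpha=8, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()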
-
This can be resolved by loading the GPTQ-quantized model through Transformers (rather than through AutoGPTQForCausalLM) before applying LoRA with PEFT.
Best regards,
Shuyue
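A minimal sketch of that approach (which matches Approach 2 above), assuming the quantized checkpoint saved earlier in tiny-llama-quantized-4bit carries its quantization config and reusing the same LoRA settings; device_map="auto" and the prepare step are my assumptions, not something stated in the thread:
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
quantized_model_dir = "tiny-llama-quantized-4bit"
# Load the GPTQ checkpoint through Transformers instead of AutoGPTQForCausalLM
# so that PEFT recognizes the quantized linear layers. If the checkpoint does not
# embed its quantization config, pass quantization_config=transformers.GPTQConfig(bits=4).
model = AutoModelForCausalLM.from_pretrained(quantized_model_dir, device_map="auto")
# Prepare for k-bit training, then attach the LoRA adapters as before
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=4, lora_alpha=8, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()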
-
Following the advice given, I loaded the quantized model through Transformers before applying LoRA, which resolved the compatibility issues.