The motivation behind this tool is to reduce model size and memory usage for inference. Storing large models can be burdensome, especially when dealing with edge devices. A model with a smaller footprint often simplifies and streamlines the inference process.
The qllm tool allows developers to quantize any Hugging Face model from the HF Hub without losing much performance or accuracy. The underlying method is a simple linear quantization technique (both symmetric and asymmetric), enabling quick, ready-to-go quantization on the fly.
Note: This tool is not ready for production environments and should only be used for building intuition. Consider it a tool for model quantization experimentation.
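To build intuition for the underlying method, the snippet below sketches plain symmetric and asymmetric linear quantization of a single tensor. The function names are illustrative only and are not part of qllm's API.

import torch

def quantize_symmetric(w: torch.Tensor, bits: int = 8):
    # Symmetric: zero point is fixed at 0; the scale maps the max magnitude to the int range.
    qmax = 2 ** (bits - 1) - 1                            # 127 for int8
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale                                       # dequantize with q * scale

def quantize_asymmetric(w: torch.Tensor, bits: int = 8):
    # Asymmetric: scale and zero point map [min, max] onto the unsigned int range.
    qmin, qmax = 0, 2 ** bits - 1                         # 0..255 for uint8
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = int(torch.round(-w.min() / scale))
    q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax).to(torch.uint8)
    return q, scale, zero_point                           # dequantize with (q - zero_point) * scale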
Download and load any model from the Hugging Face Hub. This example uses the facebook/opt-350m model.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
model_id = "facebook/opt-350m"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Original footprint of the model: 662.392832 MB
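The footprint figures quoted in this walkthrough can be reproduced with the standard transformers helper (this line is illustrative and not part of qllm):

# get_memory_footprint() returns the size of parameters and buffers in bytes.
print(f"Model footprint: {model.get_memory_footprint() / 1e6} MB")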
Run and test the original model.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("What are we having for dinner?"))
# [{'generated_text': "What are we having for dinner?\nI'm having a steak and a salad.\nI'm"}]
Replace the model's Linear layers with qllm's W8A16LL layers (8-bit weights, 16-bit activations) and run quantization.
from qllm import LinearQuantizer
from qllm.layers import W8A16LL
LinearQuantizer.replace_and_quantize_modules(model, W8A16LL, ['lm_head'])
# Model footprint after quantization: 359.799808 MB
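For intuition about what such a layer does, here is a hypothetical, simplified stand-in for a W8A16 linear layer: int8 weights with per-row scales, dequantized to the activation dtype during the forward pass. It is a sketch of the general technique, not qllm's actual W8A16LL implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveW8A16Linear(nn.Module):
    """Hypothetical sketch: int8 weights, 16-bit activations."""
    def __init__(self, in_features, out_features, bias=True, dtype=torch.bfloat16):
        super().__init__()
        # int8 weights and one symmetric scale per output row (channel)
        self.register_buffer("int8_weight",
                             torch.zeros(out_features, in_features, dtype=torch.int8))
        self.register_buffer("scales", torch.ones(out_features, dtype=dtype))
        self.register_buffer("bias", torch.zeros(out_features, dtype=dtype) if bias else None)

    @torch.no_grad()
    def quantize(self, fp_weight):
        # Per-row symmetric quantization of the original float weight matrix.
        scales = fp_weight.abs().max(dim=-1).values / 127
        self.int8_weight = torch.round(fp_weight / scales.unsqueeze(-1)).to(torch.int8)
        self.scales = scales.to(self.scales.dtype)

    def forward(self, x):
        # Dequantize the weights to the activation dtype on the fly, then do a normal matmul.
        w = self.int8_weight.to(x.dtype) * self.scales.to(x.dtype).unsqueeze(-1)
        return F.linear(x, w, self.bias)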
Test the model after quantization.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("What are we having for dinner?"))
# [{'generated_text': "What are we having for dinner?\nI'm having a steak dinner.\nI'm having a"}]
That is around a 46% reduction in model footprint, which is outstanding. The savings will not be identical for every model, but a reduction of anywhere around 30% is significantly helpful from the inference point of view.
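For reference, the quoted reduction follows directly from the two footprints reported above:

# Reduction computed from the footprints printed earlier in this walkthrough.
original_mb, quantized_mb = 662.392832, 359.799808
print(f"Footprint reduction: {100 * (1 - quantized_mb / original_mb):.1f}%")  # ~45.7%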