This example script shows how to use the ModelOpt-Windows toolkit to optimize ONNX (Open Neural Network Exchange) models through quantization. The toolkit is aimed at developers who want to improve model performance, reduce model size, and accelerate inference, while preserving the accuracy of neural networks deployed with backends like DirectML on local RTX GPUs running Windows.
Quantization is a technique that converts models from floating-point to lower-precision formats, such as integers, which are more computationally efficient. This process can significantly speed up execution on supported hardware, while also reducing memory and bandwidth requirements.
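To make the idea concrete, the short NumPy sketch below applies naive 4-bit, block-wise, symmetric quantization to a random weight matrix and measures the reconstruction error. It only illustrates the underlying arithmetic; it is not the calibrated AWQ scheme that ModelOpt-Windows actually applies, and the matrix size and block size are arbitrary.

```python
# Illustrative only: naive 4-bit block-wise symmetric quantization in NumPy.
# ModelOpt-Windows uses calibrated algorithms (e.g., AWQ), not this naive scheme.
import numpy as np

def quantize_blockwise_int4(w: np.ndarray, block_size: int = 128):
    """Quantize a 2-D FP32 weight matrix to signed 4-bit values, one scale per block."""
    rows, cols = w.shape
    assert rows % block_size == 0, "illustration assumes rows divide evenly into blocks"
    blocks = w.reshape(rows // block_size, block_size, cols)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # signed int4 range is [-8, 7]
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scales.squeeze(1)

def dequantize_blockwise_int4(q: np.ndarray, scales: np.ndarray, block_size: int = 128):
    blocks = q.reshape(-1, block_size, q.shape[1]).astype(np.float32)
    return (blocks * scales[:, None, :]).reshape(q.shape)

w = np.random.randn(512, 512).astype(np.float32)
q, s = quantize_blockwise_int4(w)
w_hat = dequantize_blockwise_int4(q, s)
print("mean absolute reconstruction error:", np.abs(w - w_hat).mean())
```

Storage drops from 32 bits to roughly 4 bits per weight (plus one scale per block), which is where the memory and bandwidth savings come from; calibrated algorithms such as AWQ choose the scales more carefully to limit the accuracy loss.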
This example takes an ONNX model as input, along with the necessary quantization settings, and generates a quantized ONNX model as output. The script can be used to quantize popular Large Language Models (LLMs) that were built in ONNX format with ONNX Runtime GenAI.
- Install ModelOpt-Windows. Refer to the installation instructions.

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
You can generate the base model using the model builder that comes with onnxruntime-genai. The ORT-GenAI model builder downloads the original PyTorch model from Hugging Face and produces an ORT-GenAI-compatible base model in ONNX format. See the example command line below:
```bash
python -m onnxruntime_genai.models.builder -m meta-llama/Meta-Llama-3-8B -p fp16 -e dml -o E:\llama3-8b-fp16-dml-genai
```
To begin quantization, run the script as shown below:

```bash
python quantize.py --model_name=meta-llama/Meta-Llama-3-8B \
    --onnx_path="E:\model_store\genai\llama3-8b-fp16-dml-genai\opset_21\model.onnx" \
    --output_path="E:\model_store\genai\llama3-8b-fp16-dml-genai\opset_21\cnn_32_lite_0.1_16\model.onnx" \
    --calib_size=32 --algo=awq_lite --dataset=cnn
```
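Once the script finishes, a quick, generic sanity check is to load the output model and print a histogram of its operator types; the quantized graph should contain the low-precision weight ops emitted by your ModelOpt-Windows version. The path below simply reuses the `--output_path` from the command above; adjust it to your own location.

```python
# Generic sanity check: print an operator histogram of the quantized ONNX model.
# Replace the path with your own --output_path.
from collections import Counter

import onnx

quantized_path = r"E:\model_store\genai\llama3-8b-fp16-dml-genai\opset_21\cnn_32_lite_0.1_16\model.onnx"
model = onnx.load(quantized_path, load_external_data=False)  # graph structure only, skip weights

op_counts = Counter(node.op_type for node in model.graph.node)
for op, count in sorted(op_counts.items(), key=lambda kv: -kv[1]):
    print(f"{op:>24}: {count}")
```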
The table below lists key command-line arguments of the ONNX PTQ example script.
| Argument | Supported Values | Description |
|---|---|---|
| `--calib_size` | 32 (default), 64, 128 | Specifies the calibration size (number of samples). |
| `--dataset` | cnn (default), pileval | Choose the calibration dataset: cnn_dailymail or pile-val. |
| `--algo` | awq_lite (default), awq_clip | Select the quantization algorithm. |
| `--onnx_path` | input .onnx file path | Path to the input ONNX model. |
| `--output_path` | output .onnx file path | Path to save the quantized ONNX model. |
| `--use_zero_point` | True, False (default) | Enable zero-point based quantization. |
| `--block-size` | 32, 64, 128 (default) | Block size for AWQ. |
| `--awqlite_alpha_step` | 0.1 (default) | User-defined step size for the AWQ scale search. |
| `--awqlite_run_per_subgraph` | True, False (default) | If True, runs the AWQ scale search at the subgraph level. |
| `--awqlite_fuse_nodes` | True (default), False | If True, fuses input scales into parent nodes. |
| `--awqclip_alpha_step` | 0.05 (default) | User-defined step size for AWQ weight clipping. |
| `--awqclip_alpha_min` | 0.5 (default) | User-defined minimum AWQ weight-clipping threshold. |
| `--awqclip_bsz_col` | 1024 (default) | User-defined chunk size, in columns, used during weight clipping. |
| `--calibration_eps` | dml, cuda, cpu (default: [dml,cpu]) | List of execution providers (EPs) to use for calibration. |
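As a further illustration, the command below combines several of these options (pile-val calibration data, a larger calibration set, awq_clip, zero-point quantization, a smaller block size, and explicit calibration execution providers). The paths are placeholders, and the exact accepted formats for boolean and list values can be confirmed with `--help`.

```bash
python quantize.py --model_name=meta-llama/Meta-Llama-3-8B \
    --onnx_path="E:\model_store\genai\llama3-8b-fp16-dml-genai\opset_21\model.onnx" \
    --output_path="E:\model_store\genai\llama3-8b-int4-awq-dml-genai\model.onnx" \
    --calib_size=128 --dataset=pileval --algo=awq_clip \
    --block-size=64 --use_zero_point=True --calibration_eps=dml,cpu
```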
Run the following command to view all available parameters in the script:
```bash
python quantize.py --help
```
Please refer to `quantize.py` for further details on the command-line parameters.
To evaluate the quantized model, refer to the accuracy benchmarking and onnxruntime-genai performance benchmarking guides.
Once an ONNX FP16 model is quantized using ModelOpt-Windows, the resulting quantized ONNX model can be deployed on the DirectML backend using ORT-GenAI or ORT.
Refer to the following example scripts and tutorials for deployment:
| Model | int4_awq |
|---|---|
| Llama3.1-8B-Instruct | Yes |
| Phi3.5-mini-Instruct | Yes |
| Mistral-7B-Instruct-v0.3 | Yes |
| Llama3.2-3B-Instruct | Yes |
| Gemma-2b-it | Yes |
| Nemotron Mini 4B Instruct | Yes |
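As a rough sketch of the ORT-GenAI route, the snippet below generates text from a quantized model folder (the ONNX model alongside its genai_config.json and tokenizer files). The folder path is a placeholder, and the generation API shown follows recent onnxruntime-genai releases, so it may need small adjustments for other versions.

```python
# Minimal sketch: run a quantized model with onnxruntime-genai on DirectML.
# The folder path is a placeholder; the generation loop follows recent
# onnxruntime-genai releases and may differ slightly in older versions.
import onnxruntime_genai as og

model = og.Model(r"E:\model_store\genai\llama3-8b-int4-awq-dml-genai")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Explain INT4 AWQ quantization in one sentence."))
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```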
- **Configure Directories**
  - Update the `cache_dir` variable in the `main()` function to specify the path where you want to store Hugging Face files (optional).
  - If you're low on space on the C: drive, change the `TMP` and `TEMP` environment variables to a different drive (e.g., `D:\temp`; see the example commands after this list).

- **Authentication for Restricted Models**

  If the model you wish to use is hosted on Hugging Face and requires authentication, log in using the huggingface-cli before running the quantization script:

  ```bash
  huggingface-cli login --token <HF_TOKEN>
  ```

- **Check Read/Write Permissions**

  Ensure that both the input and output model paths have the necessary read and write permissions to avoid permission-related errors.
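For the TMP/TEMP change mentioned in the first note above, the following commands redirect temporary files for the current command-prompt session; `D:\temp` is only an example location.

```bat
:: Example only: point temp directories at a drive with more free space
:: (applies to the current cmd session).
mkdir D:\temp
set TMP=D:\temp
set TEMP=D:\temp
```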