- [2024/11/19] Microsoft and NVIDIA Supercharge AI Development on RTX AI PCs
- [2024/11/18] Quantized INT4 ONNX models available on Hugging Face for download
- Overview
- Installation
- Techniques
- Examples
- Support Matrix
- Benchmark Results
- Collection of Optimized ONNX Models
- Release Notes
The TensorRT Model Optimizer - Windows (ModelOpt-Windows) is engineered to deliver advanced model compression techniques, including quantization, to Windows RTX PC systems. Specifically tailored to the needs of Windows users, ModelOpt-Windows is optimized for rapid and efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and fast processing times. The primary objective of ModelOpt-Windows is to generate optimized, standards-compliant ONNX-format models for DirectML backends. This makes it an ideal solution for seamless integration with ONNX Runtime (ORT) and DirectML (DML) frameworks, ensuring broad compatibility with any inference framework supporting the ONNX standard. Furthermore, ModelOpt-Windows integrates smoothly with the Windows ecosystem, with full support for tools and SDKs such as Olive and ONNX Runtime, enabling deployment of quantized models across various independent hardware vendors (IHVs) through the DML and TensorRT paths.
Model Optimizer is available for free for all developers on NVIDIA PyPI. This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.
ModelOpt-Windows can be installed either as a standalone toolkit or through Microsoft's Olive.
To install ModelOpt-Windows as a standalone toolkit on CUDA 12.x systems, run the following commands:
pip install nvidia-modelopt[onnx] --extra-index-url https://pypi.nvidia.com
pip install cupy-cuda12x
To install ModelOpt-Windows through Microsoft's Olive, use the following commands:
pip install olive-ai[nvmo]
pip install "onnxruntime-genai-directml>=0.4.0"
pip install onnxruntime-directml==1.20.0
For more details, please refer to the detailed installation instructions.
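After either installation path, a quick sanity check along the following lines can confirm that ModelOpt-Windows imports correctly and that ONNX Runtime sees the DirectML execution provider. This is a minimal sketch; it assumes the onnxruntime-directml package is the active ONNX Runtime build on the system.

```python
# Sanity check: ModelOpt-Windows is importable and DirectML is visible to ONNX Runtime.
import modelopt
import onnxruntime as ort

print("ModelOpt version:", modelopt.__version__)

# With onnxruntime-directml installed, "DmlExecutionProvider" should be listed here.
providers = ort.get_available_providers()
print("ONNX Runtime providers:", providers)
assert "DmlExecutionProvider" in providers, "DirectML execution provider not found"
```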
Quantization is an effective model optimization technique for large models. Quantization with ModelOpt-Windows can compress model size by 2x-4x, speeding up inference while preserving model quality. ModelOpt-Windows enables highly performant quantization formats including INT4, FP8*, and INT8*, and supports advanced algorithms such as AWQ and SmoothQuant*, focusing on post-training quantization (PTQ) for ONNX and PyTorch* models with DirectML and TensorRT* inference backends.
For more details, please refer to the detailed quantization guide.
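As a rough illustration of the PTQ flow described above, the sketch below quantizes an ONNX LLM to INT4 with AWQ calibration. Treat it as a minimal sketch only: the `modelopt.onnx.quantization.int4.quantize` entry point, the `calibration_method="awq_lite"` value, the `calibration_data_reader` argument, and the dummy calibration inputs are assumptions modeled on the ModelOpt-Windows documentation; confirm the exact API against the detailed quantization guide.

```python
# Minimal INT4 AWQ PTQ sketch (entry point and argument names are assumptions;
# see the detailed quantization guide for the authoritative API).
import numpy as np
import onnx
from modelopt.onnx.quantization.int4 import quantize as quantize_int4

onnx_path = "model.onnx"          # FP16/FP32 ONNX model exported from the LLM
output_path = "model.int4.onnx"   # where the quantized model is written

# A few dummy calibration batches; in practice these come from tokenized prompts,
# and the input names ("input_ids", "attention_mask") depend on the exported model.
calib_data_reader = [
    {
        "input_ids": np.random.randint(0, 32000, (1, 128), dtype=np.int64),
        "attention_mask": np.ones((1, 128), dtype=np.int64),
    }
    for _ in range(32)
]

quantized_model = quantize_int4(
    onnx_path,
    calibration_method="awq_lite",              # AWQ-style calibration
    calibration_data_reader=calib_data_reader,
)

# Large LLMs usually exceed the 2 GB protobuf limit, so store weights as external data.
onnx.save_model(quantized_model, output_path, save_as_external_data=True)
```

INT4 AWQ is a weight-only scheme: only the weights of the matrix-multiply layers are quantized, which is where most of the 2x-4x size reduction for LLMs comes from.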
- PTQ for LLMs covers how to apply ONNX Post-Training Quantization (PTQ) to LLMs and deploy the quantized models with DirectML.
- MMLU Benchmark provides an example script for the MMLU benchmark and demonstrates how to run it with popular backends such as DirectML and TensorRT-LLM*, and with model formats such as ONNX and PyTorch*.
Please refer to the support matrix for a full list of supported features and models.
Please refer to the benchmark results for performance and accuracy comparisons of popular Large Language Models (LLMs).
The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available in the NVIDIA collections on Hugging Face. These models can be deployed using the DirectML backend. Follow the instructions provided along with the published models for deployment.
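As an illustration of how these models can be consumed, the following minimal sketch runs a downloaded model with onnxruntime-genai on the DirectML backend. It assumes the high-level generate API of onnxruntime-genai 0.4.x and a local folder (here called `model_dir`) holding the published model files; follow the instructions that accompany each model for the authoritative deployment steps.

```python
# Minimal generation sketch with onnxruntime-genai-directml (0.4.x-style API assumed).
import onnxruntime_genai as og

# Folder containing the downloaded ONNX model and its genai_config.json.
model_dir = "path/to/downloaded/model"

model = og.Model(model_dir)
tokenizer = og.Tokenizer(model)

prompt = "What is post-training quantization?"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)
params.input_ids = input_tokens

# Generate and decode the first (and only) output sequence.
output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))
```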
Please refer to the changelog.
* Experimental support