- [2024/11/19] Microsoft and NVIDIA Supercharge AI Development on RTX AI PCs
- [2024/11/18] Quantized INT4 ONNX models available on Hugging Face for download
- Overview
- Installation
- Techniques
- Examples
- Support Matrix
- Benchmark Results
- Collection of Optimized ONNX Models
- Release Notes
The TensorRT Model Optimizer - Windows (ModelOpt-Windows) is engineered to deliver advanced model compression techniques, including quantization, to Windows RTX PC systems. Specifically tailored to the needs of Windows users, ModelOpt-Windows is optimized for rapid and efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and fast processing times. The primary objective of ModelOpt-Windows is to generate optimized, standards-compliant ONNX-format models for DirectML backends. This makes it an ideal solution for seamless integration with ONNX Runtime (ORT) and DirectML (DML) frameworks, ensuring broad compatibility with any inference framework supporting the ONNX standard. Furthermore, ModelOpt-Windows integrates smoothly with the Windows ecosystem, with full support for tools and SDKs such as Olive and ONNX Runtime, enabling deployment of quantized models across various independent hardware vendors (IHVs) through the DML and TensorRT paths.
Model Optimizer is available for free for all developers on NVIDIA PyPI. This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.
ModelOpt-Windows can be installed either as a standalone toolkit or through Microsoft's Olive.
To install ModelOpt-Windows as a standalone toolkit on CUDA 12.x systems, run the following commands:
pip install nvidia-modelopt[onnx] --extra-index-url https://pypi.nvidia.com
pip install cupy-cuda12x
To install ModelOpt-Windows through Microsoft's Olive, use the following commands:
pip install olive-ai[nvmo]
pip install "onnxruntime-genai-directml>=0.4.0"
pip install onnxruntime-directml==1.20.0
For more details, please refer to the detailed installation instructions.
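After either installation path, a quick sanity check along the following lines can confirm that ModelOpt-Windows imports correctly and that ONNX Runtime sees the DirectML execution provider. This is a minimal sketch; it assumes the onnxruntime-directml package is the active ONNX Runtime build on the system.

```python
# Sanity check: ModelOpt-Windows is importable and DirectML is visible to ONNX Runtime.
import modelopt
import onnxruntime as ort

print("ModelOpt version:", modelopt.__version__)

# With onnxruntime-directml installed, "DmlExecutionProvider" should be listed here.
providers = ort.get_available_providers()
print("ONNX Runtime providers:", providers)
assert "DmlExecutionProvider" in providers, "DirectML execution provider not found"
```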
Quantization is an effective model optimization technique for large models. Quantization with ModelOpt-Windows can compress model size by 2x-4x, speeding up inference while preserving model quality. ModelOpt-Windows enables highly performant quantization formats including INT4, FP8*, and INT8*, and supports advanced algorithms such as AWQ and SmoothQuant*, focusing on post-training quantization (PTQ) for ONNX and PyTorch* models with DirectML and TensorRT* inference backends.
For more details, please refer to the detailed quantization guide.
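As a rough illustration of the PTQ flow described above, the sketch below quantizes an ONNX LLM to INT4 with AWQ calibration. Treat it as a minimal sketch only: the `modelopt.onnx.quantization.int4.quantize` entry point, the `calibration_method="awq_lite"` value, the `calibration_data_reader` argument, and the dummy calibration inputs are assumptions modeled on the ModelOpt-Windows documentation; confirm the exact API against the detailed quantization guide.

```python
# Minimal INT4 AWQ PTQ sketch (entry point and argument names are assumptions;
# see the detailed quantization guide for the authoritative API).
import numpy as np
import onnx
from modelopt.onnx.quantization.int4 import quantize as quantize_int4

onnx_path = "model.onnx"          # FP16/FP32 ONNX model exported from the LLM
output_path = "model.int4.onnx"   # where the quantized model is written

# A few dummy calibration batches; in practice these come from tokenized prompts,
# and the input names ("input_ids", "attention_mask") depend on the exported model.
calib_data_reader = [
    {
        "input_ids": np.random.randint(0, 32000, (1, 128), dtype=np.int64),
        "attention_mask": np.ones((1, 128), dtype=np.int64),
    }
    for _ in range(32)
]

quantized_model = quantize_int4(
    onnx_path,
    calibration_method="awq_lite",              # AWQ-style calibration
    calibration_data_reader=calib_data_reader,
)

# Large LLMs usually exceed the 2 GB protobuf limit, so store weights as external data.
onnx.save_model(quantized_model, output_path, save_as_external_data=True)
```

INT4 AWQ is a weight-only scheme: only the weights of the matrix-multiply layers are quantized, which is where most of the 2x-4x size reduction for LLMs comes from.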
- PTQ for LLMs covers how to apply ONNX Post-Training Quantization (PTQ) to LLMs and deploy the quantized models with DirectML.
- MMLU Benchmark provides an example script for the MMLU benchmark and demonstrates how to run it with popular backends such as DirectML and TensorRT-LLM*, and with model formats such as ONNX and PyTorch*.
Please refer to the support matrix for a full list of supported features and models.
Please refer to the benchmark results for performance and accuracy comparisons of popular Large Language Models (LLMs).
The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available in the NVIDIA collections on Hugging Face. These models can be deployed using the DirectML backend. Follow the instructions provided along with the published models for deployment.
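As an illustration of how these models can be consumed, the following minimal sketch runs a downloaded model with onnxruntime-genai on the DirectML backend. It assumes the high-level generate API of onnxruntime-genai 0.4.x and a local folder (here called `model_dir`) holding the published model files; follow the instructions that accompany each model for the authoritative deployment steps.

```python
# Minimal generation sketch with onnxruntime-genai-directml (0.4.x-style API assumed).
import onnxruntime_genai as og

# Folder containing the downloaded ONNX model and its genai_config.json.
model_dir = "path/to/downloaded/model"

model = og.Model(model_dir)
tokenizer = og.Tokenizer(model)

prompt = "What is post-training quantization?"
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)
params.input_ids = input_tokens

# Generate and decode the first (and only) output sequence.
output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))
```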
Please refer to the changelog.
* Experimental support