OLive, meaning ONNX Runtime (ORT) Go Live, is a Python package that automates the process of accelerating models with ONNX Runtime (ORT). It contains two parts: model conversion to ONNX with correctness checking, and auto performance tuning with ORT. Users can run these two together through a single pipeline or run them independently as needed.
OLive simplifies the conversion of models from multiple frameworks to ONNX by integrating existing ONNX conversion tools into a single package and validating the converted models' correctness. Currently supported frameworks are PyTorch and TensorFlow.
- TensorFlow: OLive supports conversion of TensorFlow models in saved model, frozen graph, and checkpoint format. The user needs to provide input names and output names for frozen graph and checkpoint conversion.
- PyTorch: The user needs to provide input names and shapes to convert a PyTorch model. In addition, output names and shapes are required to convert a TorchScript PyTorch model (see the sketch below).
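For context, this requirement comes from how ONNX export works underneath: the model is traced with a concrete dummy tensor, so input names and shapes must be known up front. A minimal sketch using torch.onnx.export directly (the toy model, file name, and tensor names are illustrative assumptions, not OLive's own API):

```python
import torch

# Illustrative model; any torch.nn.Module can be exported the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
).eval()

# Export traces the model with a concrete dummy tensor, which is why
# the input shape must be supplied.
dummy_input = torch.randn(1, 4)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)
```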
ONNX Runtime (ORT) is a high-performance inference engine for running ONNX models. It exposes many advanced tuning knobs that let users further optimize inference performance. OLive heuristically explores the optimization search space in ORT to select the best ORT settings for a specific model on specific hardware. It outputs the option combinations with the best performance for latency or for throughput.
Optimization fields:
- Execution Providers:
  - MLAS (default CPU EP), Intel DNNL, and OpenVINO for CPU
  - NVIDIA CUDA and TensorRT for GPU
- Environment Variables:
  - OMP_WAIT_POLICY
  - OMP_NUM_THREADS
  - KMP_AFFINITY
  - OMP_MAX_ACTIVE_LEVELS
- Session Options:
  - inter_op_num_threads
  - intra_op_num_threads
  - execution_mode
  - graph_optimization_level
- INT8 Quantization
- Transformer Model Optimization
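To make the tuning idea concrete, the sketch below shows the kind of search OLive automates: try a few combinations of session options, measure average latency, and keep the fastest. This is an illustration only, not OLive's implementation; the model path, input name, input shape, and candidate values are assumptions.

```python
import itertools
import time

import numpy as np
import onnxruntime as ort

model_path = "model.onnx"  # assumed model file
feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}  # assumed input name/shape

best = None
for intra, mode in itertools.product(
    [1, 2, 4],
    [ort.ExecutionMode.ORT_SEQUENTIAL, ort.ExecutionMode.ORT_PARALLEL],
):
    sess_options = ort.SessionOptions()
    sess_options.intra_op_num_threads = intra
    sess_options.execution_mode = mode
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess = ort.InferenceSession(model_path, sess_options, providers=["CPUExecutionProvider"])

    # Warm up once, then average latency over a few runs.
    sess.run(None, feed)
    start = time.perf_counter()
    for _ in range(10):
        sess.run(None, feed)
    latency = (time.perf_counter() - start) / 10

    if best is None or latency < best[0]:
        best = (latency, intra, mode)

print(f"Best: {best[0] * 1000:.2f} ms with intra_op_num_threads={best[1]}, execution_mode={best[2]}")
```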
The OLive package can be downloaded here and installed with the command `pip install onnxruntime_olive-0.4.0-py3-none-any.whl`.
Supported Python versions: 3.7, 3.8, 3.9
The user needs to install CUDA and cuDNN dependencies for performance tuning with OLive on GPU. The table below shows the ORT version and the required CUDA and cuDNN versions in the latest OLive.
| ONNX Runtime | CUDA | cuDNN |
|---|---|---|
| 1.11.0 | 11.4 | 8.2 |
There are three ways to use OLive:
- Use With Command Line: Run OLive from the command line using Python.
- Use With Jupyter Notebook: Quickstart tutorial for OLive using Jupyter Notebook.
- Use With OLive Server: Set up a local OLive server for model conversion, optimization, and visualization services.
- Get the best tuning result with best_test_name, which includes inference session settings, environment variable settings, and the latency result.
- Set the related environment variables in your environment (see the sketch after this list):
  - OMP_WAIT_POLICY
  - OMP_NUM_THREADS
  - KMP_AFFINITY
  - OMP_MAX_ACTIVE_LEVELS
  - ORT_TENSORRT_FP16_ENABLE
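For example, assuming the tuning result reported the values below (the specific values are placeholders, not recommendations), they can be set from Python before onnxruntime is imported, or exported in the shell:

```python
import os

# Placeholder values -- substitute the ones reported in best_test_name.
# Set these before importing onnxruntime so the OpenMP runtime picks them up.
os.environ["OMP_WAIT_POLICY"] = "ACTIVE"
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"
os.environ["OMP_MAX_ACTIVE_LEVELS"] = "1"
os.environ["ORT_TENSORRT_FP16_ENABLE"] = "1"  # only relevant when using the TensorRT EP
```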
- Create an onnxruntime inference session with the related settings:
  - inter_op_num_threads
  - intra_op_num_threads
  - execution_mode
  - graph_optimization_level
  - execution_provider
```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.inter_op_num_threads = inter_op_num_threads
sess_options.intra_op_num_threads = intra_op_num_threads
sess_options.execution_mode = execution_mode
sess_options.graph_optimization_level = ort.GraphOptimizationLevel(graph_optimization_level)
onnx_session = ort.InferenceSession(model_path, sess_options, providers=[execution_provider])
```
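Once the session is created, inference runs as usual. A short usage sketch; the input shape below is an assumption for illustration:

```python
import numpy as np

# Query the model's input name and feed a dummy tensor of the matching shape
# (the shape here is assumed for illustration).
input_name = onnx_session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = onnx_session.run(None, {input_name: dummy_input})
```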
10/28/2021
Updated OLive from Docker-container-based usage to Python-package-based usage for more flexibility.
Enabled more optimization options for performance tuning with ORT, including INT8 quantization, mixed precision in ORT-TensorRT, and transformer model optimization.
We’d love to have your contributions to OLive. Please refer to CONTRIBUTING.md.
Copyright (c) Microsoft Corporation. All rights reserved.
Licensed under the MIT License.