From 5c2d09498f6610a581d9218def21b5e9b6b11f43 Mon Sep 17 00:00:00 2001
From: Dave Welsch 
Date: Sun, 15 Sep 2024 15:56:24 -0700
Subject: [PATCH 1/5] Edited Quantization User Guide.

Signed-off-by: Dave Welsch 
---
 Docs/user_guide/adaround.rst                  |  90 ++++---
 Docs/user_guide/auto_quant.rst                |  43 ++--
 Docs/user_guide/bn_reestimation.rst           |   6 +-
 Docs/user_guide/index.rst                     |  48 ++--
 Docs/user_guide/model_quantization.rst        | 231 +++++++++---------
 .../post_training_quant_techniques.rst        |   8 +-
 Docs/user_guide/quant_analyzer.rst            |  80 +++---
 .../quantization_aware_training.rst           |  53 ++--
 .../quantization_feature_guidebook.rst        |  94 ++++---
 Docs/user_guide/quantization_sim.rst          | 173 ++++++-------
 Docs/user_guide/visualization_quant.rst       |  35 +--
 11 files changed, 399 insertions(+), 462 deletions(-)

diff --git a/Docs/user_guide/adaround.rst b/Docs/user_guide/adaround.rst
index 364775a302..2c5898d39f 100644
--- a/Docs/user_guide/adaround.rst
+++ b/Docs/user_guide/adaround.rst
@@ -1,84 +1,82 @@
 .. _ug-adaround:
 
-=====================
+##############
 AIMET AdaRound
-=====================
+##############
 
-    AIMET quantization features, by default, use the "nearest rounding" technique for achieving quantization.
-    In the following figure, a single weight value in a weight tensor is shown as an illustrative example. When using the
-    "nearest rounding" technique, this weight value is quantized to the nearest integer value. The Adaptive Rounding
-    (AdaRound) feature, uses a smaller subset of the unlabelled training data to adaptively round the weights of modules
-    with weights. In the following figure, the weight value is quantized to the integer value far from it. AdaRound,
-    optimizes a loss function using the unlabelled training data to adaptively decide whether to quantize a specific
-    weight to the integer value near it or away from it. Using the AdaRound quantization, a model is able to achieve an
-    accuracy closer to the FP32 model, while using low bit-width integer quantization.
-
-    When creating a QuantizationSimModel using the AdaRounded model, use the QuantizationSimModel provided API for
-    setting and freezing parameter encodings before computing the encodings. Please refer the code example in the AdaRound
-    API section.
+By default, AIMET uses *nearest rounding* for quantization. A single weight value in a weight tensor is illustrated in the following figure. In nearest rounding, this weight value is quantized to the nearest integer value.
+The Adaptive Rounding (AdaRound) feature uses a smaller subset of the unlabeled training data to adaptively round weights. In the following figure, the weight value is quantized to the integer value far from it.
 
 .. image:: ../images/adaround.png
    :width: 900px
 
-AdaRound Use Cases
-=====================
+AdaRound optimizes a loss function using the unlabeled training data to decide whether to quantize a weight to the closer or further integer value. AdaRound quantization achieves accuracy closer to the FP32 model while using low bit-width integer quantization.
+
+When creating a QuantizationSimModel from an AdaRounded model, use the QuantizationSimModel API to set and freeze parameter encodings before computing the encodings. Refer to the code example in the AdaRound API and the sketch below.
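A minimal sketch of this flow for the PyTorch variant follows. The module paths are taken from the AIMET PyTorch API, but the model, data loader, callback, bit-widths, and input shape are illustrative assumptions; check the AdaRound API for your AIMET version for the exact signatures and defaults.

.. code-block:: python

    # Hypothetical sketch: apply AdaRound, then set and freeze the resulting
    # parameter encodings in QuantizationSimModel *before* computing encodings.
    import torch
    from aimet_common.defs import QuantScheme
    from aimet_torch.adaround.adaround_weight import Adaround, AdaroundParameters
    from aimet_torch.quantsim import QuantizationSimModel

    dummy_input = torch.randn(1, 3, 224, 224)              # shape assumed for illustration
    params = AdaroundParameters(data_loader=data_loader,   # ~500-1000 unlabeled samples
                                num_batches=16)

    # Returns a model with adaptively rounded weights and writes the rounded
    # parameter encodings to <path>/<filename_prefix>.encodings.
    ada_model = Adaround.apply_adaround(model, dummy_input, params,
                                        path='./adaround_out',
                                        filename_prefix='adaround',
                                        default_param_bw=8)

    sim = QuantizationSimModel(ada_model, dummy_input=dummy_input,
                               quant_scheme=QuantScheme.post_training_tf_enhanced,
                               default_param_bw=8, default_output_bw=8)
    # Set and freeze parameter encodings before computing activation encodings.
    sim.set_and_freeze_param_encodings(encoding_path='./adaround_out/adaround.encodings')
    sim.compute_encodings(forward_pass_callback, forward_pass_callback_args=None)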
+ +AdaRound use cases +================== + +Terminology +----------- -Common terminology -===================== - * BC - Bias Correction - * BNF - Batch Norm Folding - * CLE - Cross Layer Equalization - * HBF - High Bias Folding - * QAT - Quantization Aware Training - * { } - An optional step in the use case +The following abbreviations are used in the following use case descriptions: +BC + Bias Correction +BNF + Batch Norm Folding +CLE + Cross Layer Equalization +HBF + High Bias Folding +QAT + Quantization Aware Training +{ } + An optional step in the use case -Use Cases -===================== +Recommended +----------- #. {BNF} --> {CLE} --> AdaRound - Applying BNF and CLE are optional steps before applying AdaRound. Some models benefit from applying CLE - while some don't get any benefit. + Applying BNF and CLE are optional steps before applying AdaRound. Some models benefit from applying CLE while some don't. #. AdaRound --> QAT - AdaRound is a post-training quantization feature. But, for some models applying BNF and CLE may not be beneficial. - For these models, QAT after AdaRound may be beneficial. AdaRound is considered as a better weights initialization - step which helps for faster QAT. + AdaRound is a post-training quantization feature, but for some models applying BNF and CLE may not help. For these models, applying AdaRound before QAT might help. AdaRound is a better weights initialization step that speeds up QAT. +Not recommended +---------------- - Not recommended -===================== Applying BC either before or after AdaRound is not recommended. #. AdaRound --> BC #. BC --> AdaRound - - AdaRound Hyper parameters guidelines +AdaRound hyper parameters guidelines ===================================== -There are couple of hyper parameters required during AdaRound optimization and are exposed to users. But some of them -are with their default values which lead to good and stable results over many models and not recommended to change often. - -Following is guideline for Hyper parameters: - -#. Hyper Parameters to be changed often: number of batches (approximately 500-1000 images, if batch size of data loader - is 64, then 16 number of batches leads to 1024 images), number of iterations(default 10000) +A number of hyper parameters used during AdaRound optimization are exposed to users. The default values of some of these parameters lead to stable, good results over many models; we recommend that you not change these. -#. Hyper Parameters to be changed moderately: regularization parameter (default 0.01) +Use the following guideline for adjusting hyper parameters with AdaRound. -#. Hyper Parameters to be changed least: beta range(default (20, 2)), warm start period (default 20%) +* Hyper Parameters to be changed often + * Number of batches (approximately 500-1000 images. 
For example, if the batch size of the data loader is 64, then 16 batches yields 1024 images)
+  * Number of iterations (default 10000)
 
-#. Hyper Parameters to be changed moderately: regularization parameter (default 0.01)
+* Hyper Parameters to change with caution
+  * Regularization parameter (default 0.01)
 
-#. Hyper Parameters to be changed least: beta range(default (20, 2)), warm start period (default 20%)
+* Hyper Parameters to avoid changing
+  * Beta range (default (20, 2))
+  * Warm start period (default 20%)
 
-|
 
 AdaRound API
 ============
 
-Please refer to the links below to view the AdaRound API for each AIMET variant:
+See the AdaRound API for your AIMET variant:
 
 - :ref:`AdaRound for PyTorch`
 - :ref:`AdaRound for Keras`
diff --git a/Docs/user_guide/auto_quant.rst b/Docs/user_guide/auto_quant.rst
index 179193ab43..0a80f42694 100644
--- a/Docs/user_guide/auto_quant.rst
+++ b/Docs/user_guide/auto_quant.rst
@@ -1,48 +1,47 @@
 .. _ug-auto-quant:
 
-===============
+###############
 AIMET AutoQuant
-===============
+###############
 
 Overview
 ========
 
-AIMET offers a suite of neural network post-training quantization techniques. Often, applying these techniques in a
-specific sequence, results in better accuracy and performance. Without the AutoQuant feature, the AIMET
-user needs to manually try out various combinations of AIMET quantization features. This manual process is
-error-prone and often time-consuming.
-The AutoQuant feature, analyzes the model, determines the sequence of AIMET quantization techniques and applies these
-techniques. In addition, the user can specify the amount of accuracy drop that can be tolerated, in the AutoQuant API.
-As soon as this threshold accuracy is reached, AutoQuant stops applying any additional quantization technique. In
-summary, the AutoQuant feature saves time and automates the quantization of the neural networks.
+AIMET offers a suite of neural network post-training quantization techniques. Often, applying these techniques in a specific sequence results in better accuracy and performance.
+
+The AutoQuant feature analyzes the model, determines the best sequence of AIMET quantization techniques, and applies these techniques. You can specify the accuracy drop that can be tolerated in the AutoQuant API.
+As soon as this threshold accuracy is reached, AutoQuant stops applying quantization techniques.
+
+Without the AutoQuant feature, you must manually try combinations of AIMET quantization techniques. This manual process is error-prone and time-consuming.
 
 Workflow
 ========
 
-Before entering the optimization workflow, AutoQuant performs the following preparation steps:
+The workflow looks like this:
 
-    1) Check the validity of the model and convert it into an AIMET quantization-friendly format (denoted as `Prepare Model` below).
-    2) Select the best-performing quantization scheme for the given model (denoted as `QuantScheme Selection` below)
-After the prepration steps, AutoQuant mainly consists of the following three stages:
+    .. image:: ../images/auto_quant_v2_flowchart.png
 
-    1) BatchNorm folding
-    2) :ref:`Cross-Layer Equalization `
-    3) :ref:`AdaRound `
-These techniques are applied in a best-effort manner until the model meets the allowed accuracy drop.
-If applying AutoQuant fails to satisfy the evaluation goal, AutoQuant will return the model to which the best combination
-of the above techniques is applied.
 
-    .. image:: ../images/auto_quant_v2_flowchart.png
+Before entering the optimization workflow, AutoQuant prepares by:
 
+1. Checking the validity of the model and converting the model into an AIMET quantization-friendly format (`Prepare Model`).
+2. Selecting the best-performing quantization scheme for the given model (`QuantScheme Selection`).
+
+After the preparation steps, AutoQuant proceeds to try three techniques:
+
+1. BatchNorm folding
+2. :ref:`Cross-Layer Equalization (CLE) `
+3. :ref:`AdaRound `
+
+These techniques are applied in a best-effort manner until the model meets the allowed accuracy drop.
+If applying AutoQuant fails to satisfy the evaluation goal, AutoQuant returns the model produced by the best-performing combination of these techniques.
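A rough sketch of invoking AutoQuant through the PyTorch API is shown below. The module path and argument names differ between AIMET versions, and the model, data loader, callback, and tolerated accuracy drop are illustrative assumptions; consult the AutoQuant API for your variant.

.. code-block:: python

    # Hypothetical sketch: let AutoQuant search for the best PTQ combination
    # subject to an allowed accuracy drop.
    import torch
    from aimet_torch.auto_quant_v2 import AutoQuant   # module path varies by version

    dummy_input = torch.randn(1, 3, 224, 224)          # shape assumed for illustration

    auto_quant = AutoQuant(model,
                           dummy_input=dummy_input,
                           data_loader=unlabeled_data_loader,   # calibration samples
                           eval_callback=eval_callback)         # returns an accuracy metric

    # Stops as soon as the estimated accuracy is within the allowed drop of the
    # FP32 baseline; otherwise returns the best combination it found.
    model, accuracy, encoding_path = auto_quant.optimize(allowed_accuracy_drop=0.01)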
 AutoQuant API
 =============
 
-Please refer to the links below to view the AutoQuant API for each AIMET variant:
+See the AutoQuant API for your AIMET variant:
 
 - :ref:`AutoQuant for PyTorch`
 - :ref:`AutoQuant for ONNX`
diff --git a/Docs/user_guide/bn_reestimation.rst b/Docs/user_guide/bn_reestimation.rst
index f07d0ebcf7..c69bd083b8 100644
--- a/Docs/user_guide/bn_reestimation.rst
+++ b/Docs/user_guide/bn_reestimation.rst
@@ -2,14 +2,14 @@
 
 ==============================
-AIMET BN Re-estimation
+AIMET Batch Norm Re-estimation
 ==============================
 
 Overview
 ========
 
-The BN Re-estimation feature utilizes a small subset of training data to individually re-estimate the statistics of the
-Batch Normalization (BN) layers in a model. These BN statistics are then used to adjust the quantization scale parameters
+The Batch Norm (BN) re-estimation feature utilizes a small subset of training data to individually re-estimate the statistics of the
+BN layers in a model. These BN statistics are then used to adjust the quantization scale parameters
 of the preceeding Convolution or Linear layers. Effectively, the BN layers are folded.
 
 The BN Re-estimation feature is applied after performing Quantization Aware Training (QAT) with Range Learning, with
diff --git a/Docs/user_guide/index.rst b/Docs/user_guide/index.rst
index ec16f074c8..0691d4d38a 100644
--- a/Docs/user_guide/index.rst
+++ b/Docs/user_guide/index.rst
@@ -2,9 +2,9 @@
    :class: hideitem
 .. _ug-index:
 
-======================================
+######################################
 AI Model Efficiency Toolkit User Guide
-======================================
+######################################
 
 Overview
 ========
@@ -12,45 +12,36 @@ Overview
 AI Model Efficiency Toolkit (AIMET) is a software toolkit that enables users to quantize and compress models.
 Quantization is a must for efficient edge inference using fixed-point AI accelerators.
 
-AIMET optimizes pre-trained models (e.g., FP32 trained models) using post-training and fine-tuning techniques that
-minimize accuracy loss incurred during quantization or compression.
+AIMET optimizes pre-trained models (for example, FP32 trained models) using post-training and fine-tuning techniques that minimize accuracy loss incurred during quantization or compression.
 
 AIMET currently supports PyTorch, TensorFlow, and Keras models.
 
+The following picture shows a high-level view of the AIMET workflow.
+
 .. image:: ../images/AIMET_index_no_fine_tune.png
 
-The above picture shows a high-level view of the workflow when using AIMET. The user will start with a trained
-model in either the PyTorch, TensorFlow, or Keras training framework. This trained model is passed to AIMET using APIs
-for compression and quantization. AIMET returns a compressed/quantized version of the model
-that the users can fine-tune (or train further for a small number of epochs) to recover lost accuracy. Users can then
-export via ONNX/meta/h5 to an on-target runtime like Qualcomm\ |reg| Neural Processing SDK. 
+You train a model in the PyTorch, TensorFlow, or Keras training framework, then pass the model to AIMET, using APIs for compression and quantization. AIMET returns a compressed and/or quantized version of the model that you can fine-tune (or train further for a small number of epochs) to recover lost accuracy. You can then export the model using ONNX, meta/checkpoint, or h5 to an on-target runtime like the Qualcomm\ |reg| Neural Processing SDK. Features ======== -AIMET supports two sets of model optimization techniques: - -- Model Quantization: AIMET can simulate behavior of quantized HW for a given trained - model. This model can be optimized using Post-Training Quantization (PTQ) and fine-tuning (Quantization Aware Training - - QAT) techniques. - -- Model Compression: AIMET supports multiple model compression techniques that allow the - user to take a trained model and remove redundancies, resulting in a smaller model that runs faster on target. +AIMET supports two model optimization techniques: -Release Information -=================== +Model Quantization + AIMET can simulate the behavior of quantized hardware for a trained model. This model can be optimized using Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) fine-tuning techniques. -For information specific to this release, please see :ref:`Release Notes ` and :ref:`Known Issues `. +Model Compression + AIMET supports multiple model compression techniques that remove redundancies from a trained model, resulting in a smaller model that runs faster on target. -Installation Guide -================== +Installing AIMET +================ -Please visit the :ref:`AIMET Installation ` for more details. +For installation instructions, see :ref:`AIMET Installation `. Getting Started =============== -Please refer to the following documentation: +To get started using AIMET, refer to the following documentation: - :ref:`Quantization User Guide ` - :ref:`Compression User Guide ` @@ -58,6 +49,11 @@ Please refer to the following documentation: - :ref:`Examples Documentation ` - :ref:`Installation ` +Release Information +=================== + +For information specific to this release, see :ref:`Release Notes ` and :ref:`Known Issues `. + :hideitem:`toc tree` ------------------------------------ .. toctree:: @@ -69,10 +65,6 @@ Please refer to the following documentation: Examples Documentation Installation <../install/index> -| - -| - | |project| is a product of |author| | Qualcomm\ |reg| Neural Processing SDK is a product of Qualcomm Technologies, Inc. and/or its subsidiaries. diff --git a/Docs/user_guide/model_quantization.rst b/Docs/user_guide/model_quantization.rst index ff1684b5ed..d8ac0811bf 100644 --- a/Docs/user_guide/model_quantization.rst +++ b/Docs/user_guide/model_quantization.rst @@ -2,51 +2,49 @@ :class: hideitem .. _ug-model-quantization: -======================== -AIMET Model Quantization -======================== -Models are generally trained on floating-point hardware like CPUs and GPUs. However, when these trained models are run -on quantized hardware that support fixed-precision operations, model parameters are converted from floating-point -precision to fixed precision. As an example, when running on hardware that supports 8-bit integer operations, the -floating point parameters in the trained model need to be converted to 8-bit integers. 
It is observed that for some -models, running on an 8-bit fixed-precision runtime introduces a loss in accuracy due to noise added from the use -of fixed precision parameters and fixed precision operations. - -AIMET provides multiple techniques and tools which help to create quantized models with a minimal loss in accuracy -relative to floating-point models. - -This section provides information on typical use cases and AIMET's quantization features. - -Use Cases +######################## +AIMET model quantization +######################## + +Models are trained on floating-point hardware like CPUs and GPUs. However, when you run these models on quantized hardware with fixed-precision operations, the model parameters must be fixed-precision. For example, when running on hardware that supports 8-bit integer operations, the floating point parameters in the trained model need to be converted to 8-bit integers. For some models, reduction to 8-bit fixed-precision introduces noise that causes a loss of accuracy. + +AIMET provides techniques and tools to help create quantized models that minimize loss of accuracy relative to floating-point models. + +Use cases ========= -1. **Predict on-target accuracy**: AIMET enables a user to simulate the effects of quantization to get a first order -estimate of the model's accuracy when run on quantized targets. This is useful to get an estimate of on-target accuracy -without needing an actual target platform. Note that to create a simulation model, AIMET uses representative data -samples to compute per-layer quantization encodings. + +This section briefly describes how AIMET's quantization features apply to typical use cases. + +Quantization simulation + AIMET enables you to simulate running models on quantized targets. This helps you estimate on-target accuracy without requiring you to move the model to a quantized target platform. + + A quantization simulation workflow is illustrated here: .. image:: ../images/quant_use_case_1.PNG -2. **Post-Training Quantization (PTQ)**: PTQ techniques attempt to make a model more quantization friendly without -requiring model re-training/fine-tuning. PTQ (as opposed to fine-tuning) is recommended as a first step in a -quantization workflow due to the following advantages: +Post-training quantization (PTQ) + PTQ techniques make a model more quantization-friendly without requiring model retraining or fine-tuning. PTQ is recommended as a first step in a quantization workflow because: -- No need for the original training pipeline; an evaluation pipeline is sufficient -- Only requires a small unlabeled dataset for calibration (can even be data-free in some scenarios) -- Fast, simple, and easy to use + - PTQ does not require the original training pipeline; an evaluation pipeline is sufficient + - PTQ requires only a small, unlabeled dataset for calibration + - PTQ is fast and easy to use + + The PTQ workflow is illustrated here: .. image:: ../images/quant_use_case_3.PNG -Note that with PTQ techniques, the quantized model accuracy may still have a gap relative to the floating-point model. -In such a scenario, or to even further improve the model accuracy, fine-tuning is recommended. + With PTQ techniques, model accuracy may still be reduced. In such cases, fine-tuning is recommended. -3. **Quantization-Aware Training (QAT)/Fine-Tuning**: QAT enables a user to fine-tune a model with quantization -operations inserted in network graph, which in effect adapts the model parameters to be robust to quantization noise. 
-While QAT requires access to a training pipeline and dataset, and takes longer to run due to needing a few epochs of -fine-tuning, it can provide better accuracy especially at low bitwidths. A typical QAT workflow is illustrated below. +Quantization-aware training (QAT) and fine-tuning + QAT enable you to fine-tune a model with quantization operations inserted in the network graph. In effect, it makes the model parameters robust to quantization noise. + + Compared to PTQ, QAT requires a training pipeline and dataset and takes longer because it needs some fine-tuning, but it can provide better accuracy, especially at low bitwidths. + + A typical QAT workflow is illustrated here: .. image:: ../images/quant_use_case_2.PNG -AIMET Quantization Features +AIMET quantization features =========================== .. toctree:: @@ -56,60 +54,62 @@ AIMET Quantization Features Quantization Simulation Quantization-Aware Training (QAT) -- :doc:`Quantization Simulation`: - QuantSim enables a user to modify a model by adding quantization simulation ops. When an evaluation is run on a - model with these quantization simulation ops, the user can observe a first-order simulation of expected accuracy on - quantized hardware. +:doc:`Quantization Simulation (QuantSim)` +----------------------------------------------------------- -- :ref:`Quantization-Aware Training (QAT)`: - QAT allows users to take a QuantSim model and further fine-tune the model parameters by taking quantization into - account. +QuantSim modifies a model by inserting quantization simulation operations, providing a first-order estimate of expected runtime accuracy on quantized hardware. - Two modes of QAT are supported: +:ref:`Quantization-Aware Training (QAT)` +------------------------------------------------------------------------ - - Regular QAT: +QAT enables fine-tuning of QuantSim model parameters by taking quantization into account. + +Two modes of QAT are supported: + + Regular QAT Fine-tuning of model parameters. Trainable parameters such as module weights, biases, etc. can be - updated. The scale and offset quantization parameters for activation quantizers remain constant. Scale and - offset parameters for weight quantizers will update to reflect new weight values after each training step. + updated. The scale and offset quantization parameters for activation quantizers remain constant. Scale and offset parameters for weight quantizers will update to reflect new weight values after each training step. - - QAT with Range Learning: + QAT with range learning In addition to trainable module weights and scale/offset parameters for weight quantizers, scale/offset parameters for activation quantizers are also updated during each training step. :hideitem:`Post-Training Quantization` ------------------------------------------ -- Post-Training Quantization (PTQ) Techniques: - Post-training quantization techniques help a model improve quantized accuracy without needing to re-train. +Post-training quantization (PTQ) techniques +------------------------------------------- - .. toctree:: - :titlesonly: - :hidden: +Post-training quantization techniques help improve quantized model accuracy without needing to re-train. - AutoQuant - Adaptive Rounding (AdaRound) - Cross-Layer Equalization - BN Re-estimation - Bias Correction [Depricated] +.. toctree:: + :titlesonly: + :hidden: - - :ref:`AutoQuant`: - AIMET provides an API that integrates the post-training quantization techniques described below. AutoQuant is - recommended for PTQ. 
If desired, individual techniques can be invoked using standalone feature specific APIs. + AutoQuant + Adaptive Rounding (AdaRound) + BN Re-estimation + Bias Correction [Deprecated] - - :ref:`Adaptive Rounding (AdaRound)`: - Determines optimal rounding for weight tensors to improve quantized performance. +:ref:`AutoQuant` + AIMET provides an API that integrates the post-training quantization techniques described below. AutoQuant is recommended for PTQ. If desired, individual techniques can be invoked using standalone feature specific APIs. - - :ref:`Cross-Layer Equalization`: - Equalizes weight ranges in consecutive layers. +:ref:`Adaptive rounding (AdaRound)` + Determines optimal rounding for weight tensors to improve quantized performance. - - :ref:`BN Re-estimation`: - Re-estimates Batch Norm layer statistics before folding the Batch Norm layers. +Cross-layer equalization + Equalizes weight ranges in consecutive layers. Implementation is variant-specific; see the API for your platform: + :ref:`PyTorch` + :ref:`Keras` + :ref:`ONNX` - - :ref:`Bias Correction` [Deprecated]: - Bias Correction is considered deprecated. It is advised to use AdaRound instead. +:ref:`BN re-estimation` + Re-estimates Batch Norm layer statistics before folding the Batch Norm layers. -:hideitem:`Debugging/Analysis Tools` ------------------------------------- +Bias correction (Deprecated) + Bias correction is deprecated. Use :ref:`AdaRound` instead. + +:hideitem:`Debugging and Analysis Tools` +---------------------------------------- .. toctree:: :titlesonly: @@ -118,91 +118,84 @@ AIMET Quantization Features QuantAnalyzer Visualizations -- Debugging/Analysis Tools - - :ref:`QuantAnalyzer`: - Automated debugging of the model to understand sensitivity to weight and/or activation quantization, individual - layer sensitivity, etc. +:ref:`QuantAnalyzer`: + Automated debugging of the model to understand sensitivity to weight and/or activation quantization, individual layer sensitivity, etc. - - :ref:`Visualizations`: - Visualizations and histograms of weight and activation ranges. +:ref:`Visualizations`: + Visualizations and histograms of weight and activation ranges. -AIMET Quantization Workflow +AIMET quantization workflow =========================== This section describes the recommended workflow for quantizing a neural network. .. image:: ../images/quantization_workflow.PNG -**1. Model prep and validation** +1. Prep and validate the model +------------------------------ -Before attempting quantization, ensure that models have been defined in accordance to model guidelines. These guidelines -depend on the ML framework the model is written in. +Before attempting quantization, ensure that models are defined according to model guidelines. These guidelines depend on the ML framework (PyTorch or TensorFlow) that the model is written in. :hideitem:`PyTorch` -------------------- -- Pytorch: - - :doc:`PyTorch Model Guidelines<../api_docs/torch_model_guidelines>` +:doc:`PyTorch Model Guidelines<../api_docs/torch_model_guidelines>` - In the case of PyTorch, there exists the Model Validator utility, to automate the checking of certain PyTorch model - requirements, as well as the Model Preparer utility, to automate the updating of the model definition to align with - certain requirements. 
+   PyTorch has two utilities to automate model compliance:
+
+   - The Model Validator utility automates checking PyTorch model requirements
+   - The Model Preparer utility automates updating the model definition to align with requirements
 
-  In this model prep and validation phase, we advise the following flow:
+   In model prep and validation using PyTorch, we recommend the following flow:
 
   .. image:: ../images/pytorch_model_prep_and_validate.PNG
 
-  Users can use the model validator utility first to check if the model can be run with AIMET. If validator checks
-  fail, users can first try using model preparer in their pipeline, an automated feature for updating models, and
-  retry the model validator to see if checks now pass. If the validator continues to print warnings, users will need
-  to update the model definition by hand prior to using AIMET features.
+   Use the Model Validator utility to check if the model can be run with AIMET. If validator checks fail, put Model Preparer in the pipeline and retry Model Validator. If the validator continues to generate warnings, update the model definition by hand.
 
-  For more information on model validator and preparer, refer to the corresponding sections in
+   For more information on Model Validator and Model Preparer, see
   :doc:`AIMET PyTorch Quantization APIs<../api_docs/torch_quantization>`.
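   A minimal sketch of this validate-then-prepare flow is shown below. The module paths come from the AIMET PyTorch API, while the model and dummy input are illustrative assumptions; see the PyTorch quantization API documentation for the exact signatures.

   .. code-block:: python

       # Hypothetical sketch: validate the model, prepare it if checks fail, re-validate.
       import torch
       from aimet_torch.model_preparer import prepare_model
       from aimet_torch.model_validator.model_validator import ModelValidator

       dummy_input = torch.randn(1, 3, 224, 224)   # shape assumed for illustration

       if not ModelValidator.validate_model(model, model_input=dummy_input):
           # Model Preparer rewrites common issues (for example, functional calls and
           # reused modules) into a form AIMET can handle; then re-run the checks.
           model = prepare_model(model)
           ModelValidator.validate_model(model, model_input=dummy_input)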
 
+:hideitem:`TensorFlow`
+----------------------
+   Guidelines for TensorFlow and Keras models are provided in the corresponding AIMET TensorFlow and Keras quantization API documentation.
+
+2. Apply PTQ and AutoQuant
+--------------------------
 
-**2. PTQ/AutoQuant**
 
-The user can apply various PTQ techniques to the model to adjust model parameters and make the model more robust to
-quantization. We recommend trying AutoQuant first, a PTQ feature which internally tries various other PTQ methods and
-finds the best combination of methods to apply. Refer to the
-AIMET Quantization Features section for more details on PTQ/AutoQuant.
+Apply PTQ techniques to adjust model parameters and make the model more robust to quantization. We recommend trying AutoQuant first. AutoQuant tries various other PTQ methods and finds the best combination of methods to apply. See :ref:`AIMET quantization features`.
 
-**3. QAT**
+3. Use QAT
+----------
 
-If model accuracy is still not satisfactory after PTQ/AutoQuant, the user can use QAT to fine-tune the model. Refer to
-the AIMET Quantization Features section for more details on QAT.
+If model accuracy is still not satisfactory after PTQ/AutoQuant, use QAT to fine-tune the model. See :doc:`AIMET Quantization Features `.
 
-**4. Exporting models**
+4. Export models
+----------------
 
-In order to bring the model onto the target, users will need two things:
+To move the model onto the target, you need:
 
-- a model with updated weights
-- an encodings file containing quantization parameters associated with each quantization op
+- A model with updated weights
+- An encodings file containing quantization parameters associated with each quantization operation
 
-AIMET QuantSim provides export functionality to generate both items. The exported model type will differ based on the ML
-framework used:
+AIMET QuantSim can export both items. The exported model type differs based on the ML framework used:
 
-- .onnx for PyTorch
-- meta/checkpoint for TensorFlow
-- .h5 and .pb for Keras
+- `.onnx` for PyTorch
+- `meta` / `checkpoint` for TensorFlow
+- `.h5` and `.pb` for Keras
 
+The exact steps to export the model and encodings file depend on which AIMET Quantization features are used:
+
+- Calling AutoQuant automatically exports the model and encodings file.
+- If you use QAT, you call `.export()` on the QuantSim object.
+- If you use lower-level PTQ techniques like CLE, you first create a QuantSim object from the modified model, then call `.export()` on the QuantSim object.
 
-Depending on which AIMET Quantization features were used, the user may need to take different steps to export the model
-and encodings file. For example, calling AutoQuant will automatically export the model and encodings file as part of its
-processing. If QAT is used, users will need to call .export() on the QuantSim object. If lower level PTQ techniques like
-CLE are used, users will need to first create a QuantSim object from the modified model, and then call .export() on the
-QuantSim object.
 
-Debugging Guidelines
-====================
+Debugging
+=========
+
 .. toctree::
    :titlesonly:
    :hidden:
 
-   Quantization Guidebook
+   Quantization Diagnostics
 
-Applying AIMET Quantization features may involve some trial and error in order to find the best optimizations to apply
-on a particular model. We have included some debugging steps in the :ref:`Quantization Guidebook`
-that can be tried when quantization accuracy does not seem to improve right off the bat.
+Applying AIMET Quantization features may involve some trial and error in order to find the best optimizations to apply on a particular model. If quantization accuracy does not seem to improve, see the debugging steps in the :ref:`Quantization Guidebook`.
diff --git a/Docs/user_guide/post_training_quant_techniques.rst b/Docs/user_guide/post_training_quant_techniques.rst
index 473b182470..e6f7dae2d4 100644
--- a/Docs/user_guide/post_training_quant_techniques.rst
+++ b/Docs/user_guide/post_training_quant_techniques.rst
@@ -2,14 +2,14 @@
 
 .. _ug-post-training-quantization:
 
-===========================================
-AIMET Post-Training Quantization Techniques
-===========================================
+###########################################
+AIMET post-training quantization techniques
+###########################################
 
 Overview
 ========
 
-It is observed that some ML models show reduced inference accuracy when run on quantized hardware due to approximation noises. AIMET provides post-training quantization techniques that help adjust the parameters in the model such that the model becomes more quantization-friendly. AIMET post-training quantizations are designed to be applied on pre-trained ML models. These techniques are explained as part of the "Data-Free Quantization Through Weight Equalization and Bias Correction” paper at ICCV 2019 - https://arxiv.org/abs/1906.04721
+Some ML models show reduced inference accuracy when run on quantized hardware due to approximation noise. AIMET provides post-training quantization techniques that help adjust the parameters in the model such that the model becomes more quantization-friendly. AIMET post-training quantizations are designed to be applied on pre-trained ML models. These techniques are explained in the "Data-Free Quantization Through Weight Equalization and Bias Correction” paper presented at ICCV 2019 - https://arxiv.org/abs/1906.04721
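A minimal sketch of invoking one of these techniques, Cross-Layer Equalization, through the PyTorch API is shown below. The module path is from the AIMET PyTorch API; the model and input shape are illustrative assumptions, and the call differs in other AIMET variants.

.. code-block:: python

    # Hypothetical sketch: equalize weight ranges across consecutive layers.
    # This high-level call folds batch norms, applies cross-layer scaling, and
    # absorbs high biases, modifying the model in place.
    from aimet_torch.cross_layer_equalization import equalize_model

    input_shape = (1, 3, 224, 224)       # assumed input shape
    equalize_model(model, input_shape)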
 
 
 User Flow
diff --git a/Docs/user_guide/quant_analyzer.rst b/Docs/user_guide/quant_analyzer.rst
index 442ab60f67..8e127240af 100644
--- a/Docs/user_guide/quant_analyzer.rst
+++ b/Docs/user_guide/quant_analyzer.rst
@@ -1,91 +1,87 @@
 .. _ug-quant-analyzer:
 
-===================
+###################
 AIMET QuantAnalyzer
-===================
+###################
 
 Overview
 ========
 
-The QuantAnalyzer feature analyzes the model for quantization and points out sensitive parts/hotspots in the model.
-The analyses are performed automatically, and only requires the user to pass in callbacks for performing forward pass and evaluation, and optionally a dataloader for MSE loss analysis.
+The QuantAnalyzer performs several analyses to identify sensitive areas and hotspots in the model. These analyses are performed automatically. To use QuantAnalyzer, you pass in callbacks to perform forward pass and evaluation, and optionally a dataloader for MSE loss analysis.
 
-For each analysis, QuantAnalyzer outputs json and/or html files containing data and plots for easy visualization.
+For each analysis, QuantAnalyzer outputs JSON and/or HTML files containing data and plots for visualization.
 
 Requirements
 ============
-To call the QuantAnalyzer API, users need to provide the following:
-    - An FP32 pretrained model for analysis
-    - A dummy input for the model which can contain random values, but must match the shape of the model's expected input
-    - A user defined function for passing 500-1000 representative data samples through the model for quantization calibration.
-    - A user defined function for passing labeled data through the model for evaluation, returning an accuracy metric
+
+To call the QuantAnalyzer API, you must provide the following:
+    - An FP32 pre-trained model for analysis
+    - A dummy input for the model that can contain random values but which must match the shape of the model's expected input
+    - A user-defined function for passing 500-1000 representative data samples through the model for quantization calibration
+    - A user-defined function for passing labeled data through the model for evaluation, returning an accuracy metric
    - (Optional, for runing MSE loss analysis) A dataloader providing unlabeled data to be passed through the model
 
-Other quantization related settings are also provided in the call to analyze a model.
-Please refer to :doc:`PyTorch QuantAnalyzer API Docs<../api_docs/torch_quant_analyzer>` for more information on how to call the QuantAnalyzer feature.
+Other quantization-related settings are also provided in the call to analyze a model.
+See :doc:`PyTorch QuantAnalyzer API Docs<../api_docs/torch_quant_analyzer>` for more about how to call the QuantAnalyzer feature.
 
-**Note**: Typically on quantized runtimes, batch normalization layers will be folded where possible.
-So that users do not have to call a separate API to do so, QuantAnalyzer automatically performs Batch Norm Folding prior to running its analyses.
+.. note::
+    Typically on quantized runtimes, batch normalization (BN) layers are folded where possible. So that you don't have to call a separate API to do so, QuantAnalyzer automatically performs Batch Norm Folding before running its analyses.
 
-Detailed Analysis Descriptions
+Detailed analysis descriptions
 ==============================
+
 QuantAnalyzer performs the following analyses:
 
-1. 
Sensitivity analysis to weight and activation quantization: - QuantAnalyzer compares the accuracies of the original FP32 model, an activation-only quantized model, and a weight-only quantized model. +Sensitivity analysis to weight and activation quantization + QuantAnalyzer compares the accuracies of the original FP32 model, an activation-only quantized model, and a weight-only quantized model. This helps users determine which AIMET quantization technique(s) will be more beneficial for the model. - This helps users determine which AIMET quantization technique(s) will be more beneficial for the model. For example, in situations where the model is more sensitive to activation quantization, PTQ techniques like Adaptive Rounding or Cross Layer Equalization might not be very helpful. Accuracy values for each model are printed as part of AIMET logging. -2. Per layer quantizer enablement analysis: - Sometimes the accuracy drop incurred from quantization can be attributed to only a subset of quantizers within the model. - QuantAnalyzer performs analyses to find such layers by enabling and disabling individual quantizers to observe how the model accuracy changes. +Per-layer quantizer enablement analysis + Sometimes the accuracy drop incurred from quantization can be attributed to only a subset of quantizers within the model. QuantAnalyzer finds such layers by enabling and disabling individual quantizers to observe how the model accuracy changes. The following two types of quantizer enablement analyses are performed: - 1. Disable all quantizers across the model and, for each layer, enable only that layer's output quantizer and perform evaluation with the provided callback. - This results in accuracy values obtained for each layer in the model when only that layer's quantizer is enabled, allowing users to observe effects of individual layer quantization and pinpoint culprit layer(s) and hotspots. + 1. Disable all quantizers across the model and, for each layer, enable only that layer's output quantizer and perform evaluation with the provided callback. This results in accuracy values obtained for each layer in the model when only that layer's quantizer is enabled, exposing the effects of individual layer quantization and pinpointing culprit layer(s) and hotspots. - 2. Enable all quantizers across the model and, for each layer, disable only that layer's output quantizer and perform evaluation with the provided callback. - Once again, accuracy values are produced for each layer in the model when only that layer's quantizer is disabled. + 2. Enable all quantizers across the model and, for each layer, disable only that layer's output quantizer and perform evaluation with the provided callback. Once again, accuracy values are produced for each layer in the model when only that layer's quantizer is disabled. - As a result of these analyses, AIMET outputs per_layer_quant_enabled.html and per_layer_quant_disabled.html respectively, containing plots mapping layers on the x-axis to model accuracy on the y-axis. + As a result of these analyses, AIMET outputs `per_layer_quant_enabled.html` and `per_layer_quant_disabled.html` respectively, containing plots mapping layers on the x-axis to model accuracy on the y-axis. - JSON files per_layer_quant_enabled.json and per_layer_quant_disabled.json are also produced, containing the data shown in the .html plots. + JSON files `per_layer_quant_enabled.json` and `per_layer_quant_disabled.json` are also produced, containing the data shown in the .html plots. -3. 
Per layer encodings min-max range analysis: +Per-layer encodings min-max range analysis As part of quantization, encoding parameters for each quantizer must be obtained. - These parameters include scale, offset, min, and max, and are used for mapping floating point values to quantized integer values. + These parameters include scale, offset, min, and max, and are used to map floating point values to quantized integer values. QuantAnalyzer tracks the min and max encoding parameters computed by each quantizer in the model as a result of forward passes through the model with representative data (from which the scale and offset values can be directly obtained). - As a result of this analysis, AIMET outputs html plots and json files for each activation quantizer and each parameter quantizer (contained in the min_max_ranges folder), containing the encoding min/max values for each. + As a result of this analysis, AIMET outputs html plots and json files for each activation quantizer and each parameter quantizer (contained in the min_max_ranges folder) containing the encoding min/max values for each. - If Per Channel Quantization (PCQ) is enabled, encoding min and max values for all the channels of each weight will be shown. + If Per Channel Quantization (PCQ) is enabled, encoding min and max values for all the channels of each weight are shown. -4. Per layer statistics histogram: - Under the TF Enhanced quantization scheme, encoding min/max values for each quantizer are obtained by collecting a histogram of tensor values seen at that quantizer and potentially tossing out outliers. +Per-layer statistics histogram + Under the TF Enhanced quantization scheme, encoding min/max values for each quantizer are obtained by collecting a histogram of tensor values seen at that quantizer and deleting outliers. - When this quantization scheme is selected, QuantAnalyzer will output plots for each quantizer in the model, displaying the histogram of tensor values seen at that quantizer. - These plots are available as part of the activations_pdf and weights_pdf folders, containing a separate .html plot for each quantizer. + When this quantization scheme is selected, QuantAnalyzer outputs plots for each quantizer in the model, displaying the histogram of tensor values seen at that quantizer. + These plots are available as part of the `activations_pdf` and `weights_pdf folders`, containing a separate .html plot for each quantizer. -5. Per layer MSE loss: - An optional analysis QuantAnalyzer can do is to monitor each layer's output in the original FP32 model as well as the corresponding layer output in the quantized model, and calculate the MSE loss between the two. +Per layer mean-square-error (MSE) loss (optional) + QuantAnalyzer can monitor each layer's output in the original FP32 model as well as the corresponding layer output in the quantized model and calculate the MSE loss between the two. This helps identify which layers may contribute more to quantization noise. - To enable this optional analysis, users need to pass in a dataloader for QuantAnalyzer to read from. - Approximately 256 samples/images are sufficient. + To enable this optional analysis, you pass in a dataloader that QuantAnalyzer reads from. + Approximately 256 samples/images are sufficient for the analysis. - A per_layer_mse_loss.html file will be generated containing a plot mapping layer quantizers on the x-axis to MSE loss on the y-axis. 
A corresponding per_layer_mse_loss.json file will also be generated containing data corresponding to the .html file.
+    A `per_layer_mse_loss.html` file is generated containing a plot that maps layer quantizers on the x-axis to MSE loss on the y-axis. A corresponding `per_layer_mse_loss.json` file is generated containing data corresponding to the .html file.
 
 
 QuantAnalyzer API
 =================
 
-Please refer to the links below to view the QuantAnalyzer API for each AIMET variant:
+See the links below to view the QuantAnalyzer API for each AIMET variant:
 
 - :ref:`QuantAnalyzer for PyTorch`
 - :ref:`QuantAnalyzer for Keras`
diff --git a/Docs/user_guide/quantization_aware_training.rst b/Docs/user_guide/quantization_aware_training.rst
index 7be8b70e63..345d4929ca 100644
--- a/Docs/user_guide/quantization_aware_training.rst
+++ b/Docs/user_guide/quantization_aware_training.rst
@@ -1,57 +1,50 @@
 .. _ug-quantization-aware-training:
 
-=================================
+#################################
 AIMET Quantization Aware Training
-=================================
+#################################
 
 Overview
 ========
-In cases where PTQ techniques are not sufficient for mitigating quantization error, users can use quantization-aware
-training (QAT). QAT models the quantization noise during training and allows the model to find better solutions
-than post-training quantization. However, the higher accuracy comes with the usual costs of neural
-network training, i.e. longer training times, need for labeled data and hyperparameter search.
+
+When post-training quantization (PTQ) doesn't sufficiently reduce quantization error, the next step is to use quantization-aware training (QAT). QAT finds more accurate solutions than PTQ by modeling the quantization noise during training. This higher accuracy comes at the usual cost of neural network training, including longer training times and the need for labeled data and hyperparameter search.
 
 QAT workflow
 ============
-The QAT workflow is largely similar to the flow for using Quantization Simulation for inference. The only difference is
-that a user can take the sim.model and use it in their training pipeline in order to fine-tune model parameters while
-taking quantization noise into account. The user's training pipeline will not need to change in order to train the
-sim.model compared to training the original model.
-A typical pipeline is as follows:
+
+Using QAT is similar to using Quantization Simulation for inference. The only difference is that you use the sim.model in your training pipeline to fine-tune model parameters while taking quantization noise into account. Your training pipeline doesn't need to change to train the sim.model.
+
+A typical QAT workflow is as follows:
 
 1. Create a QuantSim sim object from a pretrained model.
-2. Calibrate the sim using representative data samples to come up with initial encoding values for each quantizer node.
-3. Pass the sim.model into a training pipeline to fine-tune the model parameters.
+2. Calibrate the sim using representative data samples to calculate initial encoding values for each quantizer node.
+3. Pass the sim.model into a training pipeline to fine-tune the model parameters (see the sketch below).
 4. Evaluate the sim.model using an evaluation pipeline to check whether model accuracy has improved.
-5. Export the sim to generate a model with updated weights and no quantization nodes, along with the accompanying
-   encodings file containing quantization scale/offset parameters for each quantization node.
+5. Export the sim to generate a model with updated weights and no quantization nodes, along with an encodings file containing quantization scale and offset parameters for each quantization node.
 
-Observe that as compared to QuantSim inference, step 3 is the only addition when performing QAT.
+Compared to QuantSim inference, step 3 is the only addition when performing QAT.
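A minimal sketch of step 3, fine-tuning sim.model with an ordinary PyTorch training loop, is shown below. The optimizer settings, loss function, epoch count, and data loader are illustrative assumptions; any existing training pipeline can be used unchanged.

.. code-block:: python

    # Hypothetical sketch: the QuantSim model trains exactly like the original model.
    import torch

    optimizer = torch.optim.SGD(sim.model.parameters(), lr=1e-6)   # learning rate assumed
    loss_fn = torch.nn.CrossEntropyLoss()

    sim.model.train()
    for epoch in range(20):                      # 15-20 epochs is typically sufficient
        for images, labels in train_loader:      # existing labeled training data
            optimizer.zero_grad()
            loss = loss_fn(sim.model(images), labels)
            loss.backward()                      # gradients flow through the quantization ops
            optimizer.step()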
 
 QAT modes
 =========
 
-There are two variants of QAT, referred to as QAT without Range Learning and QAT with Range Learning.
-In QAT without Range Learning, encoding values for activation quantizers are found once in the beginning during the
-calibration step after QuantSim has been instantiated, and are not updated again subsequently throughout training.
+There are two versions of QAT: without Range Learning and with Range Learning.
+
+Without range learning
+    In QAT without Range Learning, encoding values for activation quantizers are found once during calibration and are not updated again.
 
-In QAT with Range Learning, encoding values for activation quantizers are initially set during the calibration step, but
-are free to update during training, allowing a more optimal set of scale/offset quantization parameters to be found
-as training takes place.
+With range learning
+    In QAT with Range Learning, encoding values for activation quantizers are set during calibration and can be updated during training, resulting in better scale and offset quantization parameters.
 
-In both variants, parameter quantizer encoding values will continue to update in accordance with the parameters
-themselves updating during training.
+In both versions, parameter quantizer encoding values continue to be updated with the parameters themselves during training.
 
-Recommendations for Quantization-Aware Training
+Recommendations for quantization-aware training
 ===============================================
 
-Here are some general guidelines that can aid in improving performance or faster convergence with Quantization-aware Training (QAT):
+Here are some guidelines that can improve performance and speed convergence with QAT:
 
-* Initialization:
-    - Often it can be beneficial to first apply post training quantization techniques like :ref:`AutoQuant` before applying QAT.
-      This is especially beneficial if there is large drop in INT8 performance compared to the FP32 baseline.
+* Initialization
+    It often helps to first apply post-training quantization techniques like :ref:`AutoQuant` before applying QAT, especially if there is a large drop in INT8 performance from the FP32 baseline.
 
 * Hyper-parameters:
-    - Number of epochs: 15-20 epochs are generally sufficient for convergence
+    - Number of epochs: 15-20 epochs are usually sufficient for convergence
     - Learning rate: Comparable (or one order higher) to FP32 model's final learning rate at convergence.
      Results in AIMET are with learning of the order 1e-6.
    - Learning rate schedule: Divide learning rate by 10 every 5-10 epochs
diff --git a/Docs/user_guide/quantization_feature_guidebook.rst b/Docs/user_guide/quantization_feature_guidebook.rst
index 15cc1cfea8..719dd85a15 100644
--- a/Docs/user_guide/quantization_feature_guidebook.rst
+++ b/Docs/user_guide/quantization_feature_guidebook.rst
@@ -1,65 +1,59 @@
-.. _ug-quant-guidebook:
+.. 
_ug-quant-debug: +############################## +AIMET quantization diagnostics +############################## -===================================== -AIMET Quantization Features Guidebook -===================================== +AIMET supports various neural network quantization techniques. See :ref:`User Guide`. -AIMET supports various neural network quantization techniques. A more in-depth discussion on various techniques and -their usage is provided in :ref:`User Guide` +If the model's performance is still not satisfactory after applying an AIMET quantization feature, we recommend a set of diagnostic steps to identify the bottlenecks and improve performance. These debugging steps can provide insights as to why a quantized model underperforms and help to address the underlying issues, but they are not algorithmic. Some trial and error might be required. -After applying an AIMET Quantization feature, if the model's performance is still not satisfactory, we recommend a set -of diagnostics steps to identify the bottlenecks and improve the performance. While this is not strictly an algorithm, -these debugging steps can provide insights on why a quantized model underperforms and help to tackle the underlying -issues. These steps are shown as a flow chart in figure 9 and are described in more detail below: +The steps are shown as a flow chart in the following figure and are described in more detail below: -**FP32 sanity check** -An important initial debugging step is to ensure that the floating-point and quantized model behave similarly in the -forward pass, especially when using custom quantization pipelines. Set the quantized model bit-width to 32 bits for -both weights and activation, or by-pass the quantization operation, if possible, and check that the accuracy matches -that ofthe FP32 model. +.. image:: /images/quantization_debugging_flow_chart.png + :height: 800 + :width: 700 -**Weights or activations quantization** -The next debugging step is to identify how activation or weight quantization impact the performance independently. Does -performance recover if all weights are quantized to a higher bit-width while activations are kept in a lower bitwidth, -or conversely if all activations use a high bit-width and activations a low bit-width? This step can show the relative -contribution of activations and weight quantization to the overall performance drop and point us towards the -appropriate solution. +1. FP32 confidence check +======================== -**Fixing weight quantization** -If the previous step shows that weight quantization does cause significant accuracy drop, then there are a few solutions -to try: -1. Apply CLE if not already implemented, especially for models with depth-wise separable convolutions. -2. Try per-channel quantization. This will address the issue of uneven per-channel weight distribution. -3. Apply bias correction or AdaRound if calibration data is available +First, ensure that the floating-point and quantized model behave similarly in the forward pass, especially when using custom quantization pipelines. Set the quantized model bit-width to 32 bits for both weights and activation, or by-pass the quantization operation if possible, and check that the accuracy matches that of the FP32 model. +2. Weights or activations quantization +====================================== -.. image:: /images/quantization_debugging_flow_chart.png - :height: 800 - :width: 700 +Next, identify how activation or weight quantization impacts the performance independently. 
Does performance recover if all weights are quantized to a higher bit-width while activations are kept in a lower bit-width, or vice versa? This step can show the relative contribution of activations and weight quantization to the overall performance drop and point toward the appropriate solution.
+
+3. Fixing weight quantization
+=============================
+
+If the previous step shows that weight quantization causes a significant accuracy drop, try the following solutions:
+
+1. Apply cross-layer equalization (CLE) if not already implemented, especially for models with depth-wise separable convolutions.
+2. Try per-channel quantization. This addresses the issue of uneven per-channel weight distribution.
+3. Apply bias correction or AdaRound if calibration data is available.
+
+4. Fixing activation quantization
+=================================
+
+Generic CLE can lead to uneven activation distribution. To reduce the quantization error from activation quantization, try using different range setting methods or adjust CLE to take activation quantization ranges into account.
+
+5. Doing per-layer analysis
+===========================
+
+If global solutions have not restored accuracy to acceptable levels, consider each quantizer individually. Set each quantizer sequentially to the target bit-width while holding the rest of the network at 32 bits (see the inner `for` loop in the figure above).
+
+6. Visualizing layers
+=====================
 
-**Fixing activation quantization**
-To reduce the quantization error from activation quantization, we can also try using different range setting methods or
-adjust CLE to take activation quantization ranges into account, as vanilla CLE can lead to uneven activation
-distribution.
+If the quantization of an individual tensor leads to a significant accuracy drop, try visualizing the tensor distribution at different granularities, for example per-channel, and dimensions, for example per-token or per-embedding for activations in BERT.
 
-**Per-layer analysis**
-If the global solutions have not restored accuracy to acceptable levels, we consider each quantizer individually. We set
-each quantizer sequentially, to the target bit-width while keeping the rest of the network to 32 bits
-(see inner for loop in figure above).
+7. Fixing individual quantizers
+===============================
 
-**Visualizing layers**
-If the quantization of a individual tensor leads to significant accuracy drop, we recommended visualizing the tensor
-distribution at different granularities, e.g. per-channel as in figure 5, and dimensions, e.g., per-token or per-embedding
-for activations in BERT.
+The previous step (visualization) can reveal the source of a tensor's sensitivity to quantization. Some common solutions involve custom range setting for this quantizer or allowing a higher bit-width for a problematic quantizer. If the problem is fixed and the accuracy recovers, continue to the next quantizer. If not, you might have to resort to other methods, such as quantization-aware training (QAT).
 
-**Fixing individual quantizers**
-The visualization step can reveal the source of the tensor's sensitivity to quantization. Some common solutions involve
-custom range setting for this quantizer or allowing a higher bit-width for problematic quantizer. If the problem is
-fixed and the accuracy recovers, we continue to the next quantizer. If not, we may have to resort to other methods,
-such as quantization-aware training (QAT).
 
+8. 
Quantize the model +===================== -After completing the above steps, the last step is to quantize the complete model to the desired bit-width. If the -accuracy is acceptable, we have our final quantized model ready to use. Otherwise, we can consider higher bit-widths and -smaller granularities or revert to more powerful quantization methods, such as quantization-aware training. +After you complete these steps, quantize the complete model to the desired bit-width. If the accuracy is acceptable, this yields a final quantized model ready to use. Otherwise, consider higher bit-widths and smaller granularities or revert to more powerful quantization methods, such as quantization-aware training. diff --git a/Docs/user_guide/quantization_sim.rst b/Docs/user_guide/quantization_sim.rst index cedf62833e..f657705564 100644 --- a/Docs/user_guide/quantization_sim.rst +++ b/Docs/user_guide/quantization_sim.rst @@ -1,159 +1,130 @@ .. _ug-quantsim: -============================= -AIMET Quantization Simulation -============================= +############################# +AIMET quantization simulation +############################# + Overview ======== -AIMET’s Quantization Simulation feature provides functionality to simulate the effects of quantized hardware. This -allows the user to then apply post-training and/or fine-tuning techniques in AIMET to recover the loss in accuracy, and -ultimately deploy the model on the target device. -When applying QuantSim by itself, optimal quantization scale/offset parameters for each quantizer are found, but no -techniques for mitigating accuracy loss from quantization are applied. Users can either pass their original model -directly to QuantSim to simulate quantization noise on the starting model, or apply Post-Training Quantization -techniques to obtain an updated model to then pass into QuantSim to observe a difference in quantization accuracy as a -result of applying the techniques. +AIMET’s Quantization Simulation feature simulates the effects of quantized hardware. This enables you to apply post-training and/or fine-tuning techniques in AIMET to recover the accuracy lost in quantization before deploying the model to the target device. -Once a QuantSim object has been created, users can fine-tune the model within the QuantSim object using their -existing pipeline. This method is described in the :ref:`Quantization Aware Training` page. +When QuantSim is applied by itself, AIMET finds optimal quantization scale/offset parameters for each quantizer but does not apply techniques to mitigate accuracy loss. You can apply QuantSim directly to the original model or to a model updated using Post-Training Quantization. + +Once a QuantSim object has been created, you can fine-tune the model using its existing pipeline. This technique is described in :ref:`Quantization Aware Training`. The quantization nodes used in QuantSim are custom quantizers defined in AIMET, and are not recognized by targets. -QuantSim provides an export functionality that will save a copy of the model with quantization nodes removed, as well as -generate an encodings file containing quantization scale/offset parameters for each activation and weight tensor in -the model. +QuantSim provides an export functionality that saves a copy of the model with quantization nodes removed and generates an encodings file containing quantization scale and offset parameters for activation and weight tensors in the model. 
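+
+The following minimal sketch shows what this flow can look like with the PyTorch variant of AIMET. The toy model, the random calibration data, and the specific constructor arguments (quant scheme and bit-widths) are illustrative assumptions rather than a prescribed recipe; check the Quantization Simulation API documentation for the exact signatures in your AIMET version.
+
+.. code-block:: python
+
+    import os
+    import torch
+    from aimet_common.defs import QuantScheme
+    from aimet_torch.quantsim import QuantizationSimModel
+
+    # Toy stand-in for a pretrained FP32 model (assumption for illustration only)
+    model = torch.nn.Sequential(
+        torch.nn.Conv2d(3, 8, kernel_size=3), torch.nn.ReLU(),
+        torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(8, 10)).eval()
+    dummy_input = torch.randn(1, 3, 32, 32)   # must match the model's expected input shape
+
+    # Insert quantization simulation ops into the model graph
+    sim = QuantizationSimModel(model,
+                               dummy_input=dummy_input,
+                               quant_scheme=QuantScheme.post_training_tf_enhanced,
+                               default_param_bw=8,
+                               default_output_bw=8)
+
+    # Callback that feeds representative samples through the model; random data is
+    # used here only to keep the sketch self-contained.
+    def pass_calibration_data(sim_model, _):
+        sim_model.eval()
+        with torch.no_grad():
+            for _ in range(10):
+                sim_model(torch.randn(32, 3, 32, 32))
+
+    # Compute scale/offset encodings for every quantizer
+    sim.compute_encodings(pass_calibration_data, forward_pass_callback_args=None)
+
+    # Evaluate sim.model with your existing pipeline to estimate on-target accuracy,
+    # then export the model and the encodings file.
+    os.makedirs('./output', exist_ok=True)
+    sim.export(path='./output', filename_prefix='quantized_model', dummy_input=dummy_input)
+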
-A hardware runtime can ingest the encodings file and match it with the exported model to find what scale/offset values -to apply on each tensor in the model. +A hardware runtime can ingest the encodings file and match it with the exported model to apply scale and offset values in the model. -QuantSim Workflow +QuantSim workflow ================= -A typical workflow for using AIMET quantization simulation to simulate on-target quantized accuracy is described below. +Following is a typical workflow for using AIMET QuantSim to simulate on-target quantized accuracy. -1. The user starts with a pretrained floating-point FP32 model. +1. Start with a pretrained floating-point FP32 model. -2. AIMET creates a simulation model by inserting quantization simulation ops into the model graph as explained in the - sub-section below. +2. Use AIMET to create a simulation model. AIMET inserts quantization simulation operations into the model graph (explained in the sub-section below). -3. AIMET also configures the inserted simulation ops. The configuration of these ops can be controlled via a - configuration file as discussed in sub-section below. +3. AIMET configures the inserted simulation operations. The configuration of these operations can be controlled via a configuration file as discussed below. -4. AIMET finds optimal quantization parameters, such as scale/offsets, for the inserted quantization simulation ops. To - do this, AIMET requires the user to provide a callback method that feeds a few representative data samples through - the model. These samples can either be from the training or calibration datasets. Generally, samples in the order of - 1,000-2,000 have been sufficient for AIMET to find optimal quantization parameters. +4. Provide a callback method that feeds representative data samples through the model. AIMET uses this method to find optimal quantization parameters, such as scales and offsets, for the inserted quantization simulation operations. These samples can be from the training or calibration datasets. 1,000-2,000 samples are usually sufficient to optimize quantization parameters. 5. AIMET returns a quantization simulation model that can be used as a drop-in replacement for the original model in - their evaluation pipeline. Running this simulation model through the evaluation pipeline yields a quantized accuracy + your evaluation pipeline. Running this simulation model through the evaluation pipeline yields a quantized accuracy metric that closely simulates on-target accuracy. -6. The user can call .export() on the sim object to save a copy of the model with quantization nodes removed, along with - an encodings file containing quantization scale/offset parameters for each activation and weight tensor in the model. +6. Call `.export()` on the sim object to save a copy of the model with quantization nodes removed, along with + an encodings file containing quantization scale and offset parameters for each activation and weight tensor in the model. -Simulating Quantization Noise +Simulating quantization noise ============================= -The diagram below explains how quantization noise is introduced to a model when its input, output or parameters are -quantized and dequantized. + +The diagram below illustrates how quantization noise is introduced to a model when its inputs, outputs, or parameters are quantized and de-quantized. .. 
image:: ../images/quant_3.png

-Since dequantizated value may not be exactly the same as quantized value, the difference between the two values is the
-quantization noise.
+A value that is quantized and then de-quantized is not exactly equal to the original floating-point value. The difference between the two values is the quantization noise.

-In order to simulate quantization noise, AIMET QuantSim adds quantizer ops to the PyTorch/TensorFlow/Keras model graph.
-The resulting model graph can be used as is in the user’s evaluation or training pipeline.
+To simulate quantization noise, AIMET QuantSim adds quantizer operations to the PyTorch, TensorFlow, or Keras model graph. The resulting model graph can be used as-is in your evaluation or training pipeline.

-Determining Quantization Parameters (Encodings)
+Determining quantization parameters (encodings)
 ===============================================
-Using a QuantSim model, AIMET analyzes and determines the optimal quantization encodings (scale and offset parameters)
-for each quantizer op.
-To do this, AIMET passes some calibration samples through the model. Using hooks, tensor data is intercepted while
-flowing through the model. A histogram is created to model the distribution of the floating point numbers in the output
-tensor for each layer.
+Using a QuantSim model, AIMET determines the optimal quantization encodings (scale and offset parameters) for each quantizer operation.
+
+To do this, AIMET passes calibration samples through the model and, using hooks, intercepts tensor data flowing through the model. AIMET creates a histogram to model the distribution of the floating point values in the output tensor for each layer.

 .. image:: ../images/quant_2.png

-Using the distribution of the floating point numbers in the output tensor for each layer, quantization encodings are
-computed using the specified quantization calibration technique. An encoding for a layer consists of four numbers:
+An encoding for a layer consists of four numbers:

-- Min (q\ :sub:`min`\ ): Numbers below these are clamped
-- Max (q\ :sub:`max`\ ): Numbers above these are clamped
-- Delta: Granularity of the fixed point numbers (is a function of the bit-width selected)
-- Offset: Offset from zero
+Min (q\ :sub:`min`\ )
+    Numbers below these are clamped
+Max (q\ :sub:`max`\ )
+    Numbers above these are clamped
+Delta
+    Granularity of the fixed point numbers (a function of the bit-width selected)
+Offset
+    Offset from zero

-The Delta and Offset can be calculated using Min and Max and vice versa using the equations:
+The Delta and Offset are calculated using Min and Max and vice versa using the equations:

 :math:`\textrm{Delta} = \dfrac{\textrm{Max} - \textrm{Min}}{{2}^{\textrm{bitwidth}} - 1} \quad \textrm{Offset} = \dfrac{-\textrm{Min}}{\textrm{Delta}}`

-Quantization Schemes
+Using the floating point distribution in the output tensor for each layer, AIMET calculates quantization encodings using the specified quantization calibration technique described in the next section.
+
+Quantization schemes
 ====================
-AIMET supports various techniques for coming up with min and max values for encodings, also called quantization schemes:

-- Min-Max: Also referred to as "TF" in AIMET (The name TF represents the origin of this technique and
- has no relation to what framework the user is using). To cover the whole dynamic range of the tensor, we can define
- the quantization parameters Min and Max to be the observed Min and Max during the calibration process. This leads to
- no clipping error. 
However, this approach is sensitive to outliers, as strong outliers may cause excessive rounding
- errors.
+AIMET supports various techniques, also called quantization schemes, for calculating min and max values for encodings:

-- Signal-to-Quantization-Noise (SQNR): Also referred to as “TF Enhanced” in AIMET (The name TF
- represents the origin of this technique and has no relation to what framework the user is using). The SQNR approach is
- similar to the Mean Square Error (MSE) minimization approach. In the SQNR range setting method, we find qmin and qmax
- that minimize the total MSE between the original and the quantized tensor. Quantization noise and saturation noise are
- different types of erros which are weighted differently.
+Min-Max (also referred to as "TF" in AIMET. The name TF represents the origin of the technique and has no relation to which framework is using it.)
+    To cover the whole dynamic range of the tensor, the quantization parameters Min and Max are defined as the observed Min and Max during the calibration process. This approach eliminates clipping error but is sensitive to outliers since extreme values induce rounding errors.

-For each quantization scheme, there are "post training" and "training range learning" variants. The "post training"
-variants are used during regular QuantSim inference as well as QAT without Range Learning, to come up with initial
-encoding values for each quantization node. In QAT without Range Learning, encoding values for activation quantizers
-will remain static (encoding values for parameter quantizers will change in accordance with changing parameter values
-during training).
+Signal-to-Quantization-Noise (SQNR; also called “TF Enhanced” in AIMET. The name TF represents the origin of the technique and has no relation to what framework is using it).
+    The SQNR approach is similar to the mean square error (MSE) minimization approach. The qmin and qmax values are chosen to minimize the total MSE between the original and the quantized tensor.
+
+    Quantization noise and saturation noise are different types of errors which are weighted differently.

-The "training range learning" variants are used during QAT with Range Learning. The schemes define how to come up with
-initial encoding values for each quantization node, but also allow encoding values for activations to be learned
-alongside parameter quantizer encodings during training.
+For each quantization scheme, there are "post training" and "training range learning" variants. The "post training" variants are used during regular QuantSim inference and QAT without Range Learning to compute initial encoding values for each quantization node. In QAT without Range Learning, encoding values for activation quantizers remain static (encoding values for parameter quantizers change with changing parameter values during training).

-For more details on QAT, refer to :ref:`Quantization Aware Training`.
+The "training range learning" variants are used during QAT with Range Learning. The schemes define how to compute initial encoding values for each quantization node, but also allow encoding values for activations to be learned alongside parameter quantizer encodings during training.

-Configuring Quantization Simulation Ops
-=======================================
+For more details on QAT, see :ref:`Quantization Aware Training`.

-Different hardware and on-device runtimes may support different quantization choices for neural network inference. 
For -example, some runtimes may support asymmetric quantization for both activations and weights, whereas other ones may -support asymmetric quantization just for weights. +Configuring quantization simulation operations +============================================== -As a result, we need to make quantization choices during simulation that best reflect our target runtime and hardware. -AIMET provides a default configuration file, which can be modified. This file is used during quantization simulation if -no other configuration file is specified. By default, following configuration is used for quantization simulation: +Different hardware and on-device runtimes support different quantization choices for neural network inference. For example, some runtimes support asymmetric quantization for both activations and weights, whereas others support asymmetric quantization just for weights. -- Weight quantization: Per-channel, symmetric quantization, INT8 +As a result, quantization choices during simulation need to best reflect the target runtime and hardware. AIMET provides a default configuration file that can be modified. By default, the following configuration is used for quantization simulation: -- Activation or layer output quantization: Per-tensor, asymmetric quantization, INT8 +Weight quantization + Per-channel, symmetric quantization, INT8 -Quantization options that can be controlled via the configuration file include the following: +Activation or layer output quantization + Per-tensor, asymmetric quantization, INT8 -- Enabling/disabling of input and output quantizer ops -- Enabling/disabling of parameter quantizer ops -- Enabling/disabling of model input quantizer -- Enabling/disabling of model output quantizer -- Symmetric/Asymmetric quantization -- Unsigned/signed symmetric quantization -- Strict/non strict symmetric quantization -- Per channel/per tensor quantization -- Defining groups of layers to be fused (no quantization done on intermediate tensors within fused layers) +Quantization options settable in the configuration file include: -Please see the :ref:`Quantization Simulation Configuration ` page which describes the configuration -options in detail. +- Enabling or disabling input and output quantizer ops +- Enabling or disabling parameter quantizer ops +- Enabling or disabling model input quantizer +- Enabling or disabling model output quantizer +- Symmetric or asymmetric quantization +- Unsigned or signed symmetric quantization +- Strict or non-strict symmetric quantization +- Per-channel or per-tensor quantization +- Defining groups of layers to be fused (no quantization is done on intermediate tensors within fused layers) + +See the :ref:`Quantization Simulation Configuration ` page, which describes the configuration options in detail. Quantization Simulation APIs ============================ -Please refer to the links below to view the Quantization Simulation API for each AIMET variant: +See the AIMET Quantization Simulation API for your platform: - :ref:`Quantization Simulation for PyTorch` - :ref:`Quantization Simulation for Keras` - :ref:`Quantization Simulation for ONNX` - -Frequently Asked Questions -========================== -- Q: How many samples are needed in the calibration step (compute encodings)? - A: 1,000 - 2,000 unlabeled representative data samples are sufficient. 
\ No newline at end of file diff --git a/Docs/user_guide/visualization_quant.rst b/Docs/user_guide/visualization_quant.rst index 62a6447a05..dc3cddc5a4 100644 --- a/Docs/user_guide/visualization_quant.rst +++ b/Docs/user_guide/visualization_quant.rst @@ -2,39 +2,40 @@ .. _ug-quantization-visualization: -==================================== -AIMET Visualization for Quantization -==================================== - +#################################### +AIMET visualization for quantization +#################################### Overview ======== -AIMET Visualization adds analytical capability to the AIMET tool (which helps quantize and compress ML models) through visualization. It provides more detailed insights into AIMET features as users are able to analyze a model's layers in terms of compressibility and also highlight potential issues when applying quantization. The tool also assists in displaying progress for computationally heavy tasks. The visualizations get saved as an HTML file under the specified directory. + +AIMET Visualization provides detailed insights into AIMET features. You can analyze model layers' compressibility and highlight potential issues when applying quantization. The tool also displays progress for computationally heavy tasks. The visualizations get saved as an HTML file. Quantization ============ -During quantization, common parameters are used throughout a layer for converting the floating point weight values to INT8. If the dynamic range in weights is very high, the quantization will not be very granular. To equalize the weight range we apply Cross Layer Equalization. -In order to understand if we need to apply Cross Layer Equalization, we can visualize the weight range for every channel in a layer. If the weight range varies a lot over the various channels, applying cross layer equalization helps in improving the Quantization accuracy. + +During quantization, common parameters are used throughout a layer for converting the floating point weight values to INT8. If the dynamic range in weights is very high, the quantization is not very granular. The weight range can be equalized by applying cross layer equalization. + +To determine if you need to apply cross layer equalization, visualize the weight range for every channel in a layer. If the weight range varies a lot across channels, applying cross layer equalization helps in improving the Quantization accuracy. .. image:: ../images/vis_3.png PyTorch ------- -In PyTorch, we can visualize the weights for a model. We can also visualize the weight ranges for a model before and after Cross Layer Equalization. -There are three main functions a user can invoke: +In PyTorch, you can visualize the weights for a model. You can also visualize the weight ranges for a model before and after cross layer equalization. +There are three main functions you can invoke: -#. User can analyze relative weight ranges of model to see potentially problematic layers for quantization -#. User can understand each layer in the model -#. User can visualize the model, comparing weights before and after quantization. +#. Analyze relative weight ranges of the model to see potentially problematic layers for quantization +#. Understand each layer in the model +#. Visualize the model, comparing weights before and after quantization TensorFlow ---------- -In TensorFlow, we can visualize the weight ranges and relative weight ranges over various channels in a layer. 
-User can also use the same functions to see the changes in a layer weight ranges before and after Cross Layer Equalization. +In TensorFlow, you can visualize the weight ranges and relative weight ranges over various channels in a layer. You can also use the same functions to see the changes in a layer's weight ranges before and after cross layer equalization. -There are two main functions a user can invoke: +There are two main functions you can invoke: -#. User can analyze relative weight ranges of a layer to see potentially problematic layers for quantization -#. User can visualize weight ranges of a layer and see the various statistics for weights +#. Analyze relative weight ranges of a layer to see potentially problematic layers for quantization +#. Visualize weight ranges of a layer and see the various statistics for weights From db6d7e8028c2c4d06206123efbb2d23736aaaf1b Mon Sep 17 00:00:00 2001 From: Dave Welsch Date: Mon, 16 Sep 2024 15:38:51 -0700 Subject: [PATCH 2/5] Quantization Guide - more edits. Signed-off-by: Dave Welsch --- Docs/conf.py | 2 +- Docs/user_guide/adaround.rst | 8 +- Docs/user_guide/bn_reestimation.rst | 25 +-- Docs/user_guide/index.rst | 22 +-- Docs/user_guide/model_guidelines.rst | 28 +-- Docs/user_guide/model_quantization.rst | 15 +- Docs/user_guide/quant_analyzer.rst | 8 +- .../quantization_aware_training.rst | 11 +- .../user_guide/quantization_configuration.rst | 187 +++++++++--------- Docs/user_guide/quantization_sim.rst | 14 +- Docs/user_guide/visualization_compression.rst | 11 +- 11 files changed, 169 insertions(+), 162 deletions(-) diff --git a/Docs/conf.py b/Docs/conf.py index 2006057761..cf8ea7ce0e 100644 --- a/Docs/conf.py +++ b/Docs/conf.py @@ -112,7 +112,7 @@ def setup(app): # # This is also used if you do content translation via gettext catalogs. # Usually you set "language" from the command line for these cases. -language = None +language = 'en' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. diff --git a/Docs/user_guide/adaround.rst b/Docs/user_guide/adaround.rst index 2c5898d39f..9c05b39216 100644 --- a/Docs/user_guide/adaround.rst +++ b/Docs/user_guide/adaround.rst @@ -7,7 +7,7 @@ AIMET AdaRound By default, AIMET uses *nearest rounding* for quantization. A single weight value in a weight tensor is illustrated in the following figure. In nearest rounding, this weight value is quantized to the nearest integer value. -The Adaptive Rounding (AdaRound) feature uses a smaller subset of the unlabeled training data to adaptively round weights. In the following figure, the weight value is quantized to the integer value far from it. +The Adaptive Rounding (AdaRound) feature uses a subset of the unlabeled training data to adaptively round weights. In the following figure, the weight value is quantized to the integer value far from it. .. image:: ../images/adaround.png :width: 900px @@ -40,6 +40,8 @@ QAT Recommended ----------- +The following sequences are recommended: + #. {BNF} --> {CLE} --> AdaRound Applying BNF and CLE are optional steps before applying AdaRound. Some models benefit from applying CLE while some don't. @@ -49,7 +51,7 @@ Recommended Not recommended ---------------- -Applying BC either before or after AdaRound is not recommended. +Applying bias correction (BC) either before or after AdaRound is *not* recommended. #. AdaRound --> BC @@ -70,7 +72,7 @@ Use the following guideline for adjusting hyper parameters with AdaRound. 
* Regularization parameter (default 0.01) * Hyper Parameters to avoid changing - * Beta range(default (20, 2)) + * Beta range (default (20, 2)) * Warm start period (default 20%) AdaRound API diff --git a/Docs/user_guide/bn_reestimation.rst b/Docs/user_guide/bn_reestimation.rst index c69bd083b8..f22fc610c0 100644 --- a/Docs/user_guide/bn_reestimation.rst +++ b/Docs/user_guide/bn_reestimation.rst @@ -8,19 +8,14 @@ AIMET Batch Normal Re-estimation Overview ======== -The Batch Normal (BN) re-estimation feature utilizes a small subset of training data to individually re-estimate the statistics of the -BN layers in a model. These BN statistics are then used to adjust the quantization scale parameters -of the preceeding Convolution or Linear layers. Effectively, the BN layers are folded. +The Batch Normal (BN) re-estimation feature utilizes a small subset of training data to individually re-estimate the statistics of the BN layers in a model. These BN statistics are then used to adjust the quantization scale parameters of the preceeding Convolution or Linear layers. Effectively, the BN layers are folded. -The BN Re-estimation feature is applied after performing Quantization Aware Training (QAT) with Range Learning, with -Per Channel Quantization (PCQ) enabled. It is very important NOT to fold the BN layers before performing QAT. The BN layers are -folded ONLY after QAT and the re-estimation of the BN statistics are completed. The Workflow section below, covers -the exact sequence of steps. +The BN re-estimation feature is applied after performing Quantization Aware Training (QAT) with Range Learning, with Per Channel Quantization (PCQ) enabled. It is important *not* to fold the BN layers before performing QAT. Fold the BN layers only after QAT and the re-estimation of the BN statistics are completed. See the Workflow section below for the exact sequence of steps. -The BN Re-estimation feature is specifically recommended for the following scenarios: +The BN re-estimation feature is specifically recommended for the following scenarios: - Low-bitwidth weight quantization (e.g., 4-bits) -- Models for which Batch Norm Folding leads to decreased performance. +- Models for which Batch Norm Folding leads to decreased performance - Models where the main issue is weight quantization (including higher bitwidth quantization) - Low bitwidth quantization of depthwise separable layers since their Batch Norm Statistics are affected by oscillations @@ -28,12 +23,12 @@ The BN Re-estimation feature is specifically recommended for the following scena Workflow ======== -BN-Re-estimation requires that +BN re-estimation requires that: 1. BN layers not be folded before QAT. 2. Per Channel Quantization is enabled. -To use the BN-Re-estimation feature, the following sequence of steps must be followed in the correct order. +To use the BN re-estimation feature, the following sequence of steps must be followed in order: 1. Create the QuantizationSimModel object with Range Learning Quant Scheme 2. Perform QAT with Range Learning @@ -41,10 +36,10 @@ To use the BN-Re-estimation feature, the following sequence of steps must be fol 4. Fold the BN layers 5. Using the QuantizationSimModel, export the model and encodings. -Once the above steps are completed, the model can be run on the target for inference. +Once the steps are completed, the model can be run on the target for inference. -The following high level call flow diagrams, enumerates the work flow for PyTorch. -The workflow is the same for TensorFlow and Keras. 
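+
+A condensed sketch of this sequence for the PyTorch variant is shown below. The model, data loaders, calibration callback, and training loop are placeholders from your own pipeline, and the helper names and signatures (``reestimate_bn_stats``, ``fold_all_batch_norms_to_scale``) are assumptions to be verified against the BN re-estimation API documentation for your AIMET version.
+
+.. code-block:: python
+
+    from aimet_common.defs import QuantScheme
+    from aimet_torch.quantsim import QuantizationSimModel
+    from aimet_torch.bn_reestimation import reestimate_bn_stats
+    from aimet_torch.batch_norm_fold import fold_all_batch_norms_to_scale
+
+    # model, dummy_input, calibration_callback, train_one_epoch, train_loader, and
+    # bn_reestimation_loader are assumed to be defined by your existing pipeline.
+
+    # 1. Create the QuantizationSimModel with a range-learning quant scheme
+    #    (per-channel quantization is enabled through the config file passed to QuantSim).
+    sim = QuantizationSimModel(model,
+                               dummy_input=dummy_input,
+                               quant_scheme=QuantScheme.training_range_learning_with_tf_init)
+    sim.compute_encodings(calibration_callback, forward_pass_callback_args=None)
+
+    # 2. Perform QAT with range learning, using your existing training loop.
+    train_one_epoch(sim.model, train_loader)
+
+    # 3. Re-estimate the BN statistics on a small subset of the training data.
+    reestimate_bn_stats(sim.model, bn_reestimation_loader, num_batches=100)
+
+    # 4. Fold the BN layers into the quantization scale parameters of the preceding layers.
+    fold_all_batch_norms_to_scale(sim)
+
+    # 5. Export the model and encodings.
+    sim.export(path='./output', filename_prefix='model_after_bn_reestimation', dummy_input=dummy_input)
+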
+The following sequence diagram shows the workflow for PyTorch. +The workflow is the same for TensorFlow and Keras. .. image:: ../images/bn_reestimation.png :width: 1200px @@ -53,7 +48,7 @@ The workflow is the same for TensorFlow and Keras. BN Re-estimation API ==================== -Please refer to the links below to view the BN Re-estimation API for each AIMET variant: +See the links below to view the BN re-estimation API for each AIMET variant: - :ref:`BN Re-estimation for PyTorch` - :ref:`BN Re-estimation for Keras` diff --git a/Docs/user_guide/index.rst b/Docs/user_guide/index.rst index 0691d4d38a..06901291ad 100644 --- a/Docs/user_guide/index.rst +++ b/Docs/user_guide/index.rst @@ -14,13 +14,13 @@ Quantization is a must for efficient edge inference using fixed-point AI acceler AIMET optimizes pre-trained models (for example, FP32 trained models) using post-training and fine-tuning techniques that minimize accuracy loss incurred during quantization or compression. -AIMET currently supports PyTorch, TensorFlow, and Keras models. +AIMET supports PyTorch, TensorFlow, and Keras models. -The following picture shows a high-level view of the AIMET workflow. +The following diagram shows a high-level view of the AIMET workflow. .. image:: ../images/AIMET_index_no_fine_tune.png -You train a model in the PyTorch, TensorFlow, or Keras training framework, then pass the model to AIMET, using APIs for compression and quantization. AIMET returns a compressed and/or quantized version of the model that you can fine-tune (or train further for a small number of epochs) to recover lost accuracy. You can then export the model using ONNX, meta/checkpoint, or h5 to an on-target runtime like the Qualcomm\ |reg| Neural Processing SDK. +You train a model in the PyTorch, TensorFlow, or Keras training framework, then pass the model to AIMET, using its APIs for compression and quantization. AIMET returns a compressed and/or quantized version of the model that you can fine-tune (or train further for a small number of epochs) to recover lost accuracy. You can then export the model using ONNX, meta/checkpoint, or h5 to an on-target runtime like the Qualcomm\ |reg| Neural Processing SDK. Features ======== @@ -33,21 +33,16 @@ Model Quantization Model Compression AIMET supports multiple model compression techniques that remove redundancies from a trained model, resulting in a smaller model that runs faster on target. -Installing AIMET +More Information ================ -For installation instructions, see :ref:`AIMET Installation `. - -Getting Started -=============== - -To get started using AIMET, refer to the following documentation: +For more information about AIMET, see the following documentation: +- :ref:`Installation ` - :ref:`Quantization User Guide ` - :ref:`Compression User Guide ` -- :ref:`API Documentation ` - :ref:`Examples Documentation ` -- :ref:`Installation ` +- :ref:`API Documentation ` Release Information =================== @@ -59,11 +54,12 @@ For information specific to this release, see :ref:`Release Notes Quantization User Guide Compression User Guide API Documentation<../api_docs/index> Examples Documentation - Installation <../install/index> + | |project| is a product of |author| | Qualcomm\ |reg| Neural Processing SDK is a product of Qualcomm Technologies, Inc. and/or its subsidiaries. 
diff --git a/Docs/user_guide/model_guidelines.rst b/Docs/user_guide/model_guidelines.rst index 8da79361f4..daa953373d 100644 --- a/Docs/user_guide/model_guidelines.rst +++ b/Docs/user_guide/model_guidelines.rst @@ -1,17 +1,19 @@ :orphan: -============================ +############################ Model Guidelines for PyTorch -============================ +############################ -To implement the Cross Layer Equalization API, aimet_torch.cross_layer_equalization.equalize_model(), AIMET creates a computing graph to analyze the sequence of Operations in the model. -If your model is defined using certain constructs, it restricts AIMET from successfully creating and analyzing the computing graph. The following table lists the potential issues and workarounds. +To implement the Cross Layer Equalization API, `aimet_torch.cross_layer_equalization.equalize_model()`, AIMET creates a computing graph to analyze the sequence of operations in the model. -Note: These restrictions are not applicable, if you are using the **Primitive APIs** +Certain model constructs prevent AIMET from creating and analyzing the computing graph. The following table lists these potential issues and workarounds. +.. admonition NOTE:: + + These restrictions are not applicable if you are using the **Primitive APIs**. +------------------------+------------------------------+-----------------------------------+ -| Potential Issue | Description | Work Around | +| Potential Issue | Description | Workaround | +========================+==============================+===================================+ | ONNX Export | Use torch.onnx.export() | If ONNX export fails, rewrite the | | | to export your model. | specific layer so that ONNX | @@ -19,10 +21,10 @@ Note: These restrictions are not applicable, if you are using the **Primitive AP +------------------------+------------------------------+-----------------------------------+ | Slicing Operation |Some models use | Rewrite the x.view() statement | | |**torch.tensor.view()** in the| as follows: | -| |forward function as follows: | x = x.view(x.size(0), -1) | -| |x = x.view(-1, 1024) | | -| |If view function is written | | -| |as above, it causes an issue | | +| |forward function as follows: | `x = x.view(x.size(0), -1)` | +| |x = x.view(-1, 1024) If | | +| |the view function is written | | +| |this way, it causes an issue | | | |while creating the | | | |computing graph | | +------------------------+------------------------------+-----------------------------------+ @@ -35,7 +37,7 @@ Note: These restrictions are not applicable, if you are using the **Primitive AP | |align_corners=True) |align_corners=False) | +------------------------+------------------------------+-----------------------------------+ | Deconvolution operation|The deconvolution operation | There is no workaround available | -| |is used in DeepLabV3 model. | at this time. This issue will be | -| |This is currently not | addressed in a subsequent AIMET | -| |supported by AIMET | release. | +| |is used in the DeepLabV3 | at this time. This issue will be | +| |model. This is not | addressed in a subsequent AIMET | +| |supported by AIMET. | release. 
| +------------------------+------------------------------+-----------------------------------+ diff --git a/Docs/user_guide/model_quantization.rst b/Docs/user_guide/model_quantization.rst index d8ac0811bf..aa0bb60806 100644 --- a/Docs/user_guide/model_quantization.rst +++ b/Docs/user_guide/model_quantization.rst @@ -6,9 +6,9 @@ AIMET model quantization ######################## -Models are trained on floating-point hardware like CPUs and GPUs. However, when you run these models on quantized hardware with fixed-precision operations, the model parameters must be fixed-precision. For example, when running on hardware that supports 8-bit integer operations, the floating point parameters in the trained model need to be converted to 8-bit integers. For some models, reduction to 8-bit fixed-precision introduces noise that causes a loss of accuracy. +Models are trained on floating-point hardware like CPUs and GPUs. However, when you run these models on quantized hardware with fixed-precision operations, the model parameters must be fixed-precision. For example, when running on hardware that supports 8-bit integer operations, the floating point parameters in the trained model need to be converted to 8-bit integers. -AIMET provides techniques and tools to help create quantized models that minimize loss of accuracy relative to floating-point models. +For some models, reduction to 8-bit fixed-precision introduces noise that causes a loss of accuracy. AIMET provides techniques and tools to create quantized models that minimize this loss of accuracy. Use cases ========= @@ -44,6 +44,8 @@ Quantization-aware training (QAT) and fine-tuning .. image:: ../images/quant_use_case_2.PNG +_aimet-quantization-features: + AIMET quantization features =========================== @@ -88,7 +90,7 @@ Post-training quantization techniques help improve quantized model accuracy with AutoQuant Adaptive Rounding (AdaRound) BN Re-estimation - Bias Correction [Deprecated] + Bias Correction [Deprecated] :ref:`AutoQuant` AIMET provides an API that integrates the post-training quantization techniques described below. AutoQuant is recommended for PTQ. If desired, individual techniques can be invoked using standalone feature specific APIs. @@ -161,7 +163,7 @@ Before attempting quantization, ensure that models are defined according to mode 2. Apply PTQ and AutoQuant -------------------------- -Apply PTQ techniques to adjust model parameters and make the model more robust to quantization. We recommend trying AutoQuant first. AutoQuant tries various other PTQ methods and finds the best combination of methods to apply. See :ref:`AIMET quantization features`. +Apply PTQ techniques to adjust model parameters and make the model more robust to quantization. We recommend trying AutoQuant first. AutoQuant tries various other PTQ methods and finds the best combination of methods to apply. See :ref:`aimet-quantization-features`. 3. Use QAT @@ -185,9 +187,10 @@ AIMET QuantSim can export both items. The exported model type differs based on t - `.h5` and `.pb` for Keras The exact steps to export the model and encodings file depend on which AIMET Quantization features are used: + - Calling AutoQuant automatically exports the model and encodings file. - If you use QAT, you'll call `.export()` on the QuantSim object. -- - If you use lower-level PTQ techniques like CLE, you first create a QuantSim object from the modified model, then call `.export()` on the QuantSim object. 
+- If you use lower-level PTQ techniques like CLE, you first create a QuantSim object from the modified model, then call `.export()` on the QuantSim object. Debugging ========= @@ -198,4 +201,4 @@ Debugging Quantization Diagnostics -Applying AIMET Quantization features may involve some trial and error in order to find the best optimizations to apply on a particular model. If quantization accuracy does not seem to improve. see the debugging steps in the :ref:`Quantization Guidebook`. +Applying AIMET Quantization features may involve some trial and error in order to find the best optimizations to apply on a particular model. If quantization accuracy does not seem to improve, see the debugging steps in the :ref:`Quantization Diagnostics`. diff --git a/Docs/user_guide/quant_analyzer.rst b/Docs/user_guide/quant_analyzer.rst index 8e127240af..b52ea9340c 100644 --- a/Docs/user_guide/quant_analyzer.rst +++ b/Docs/user_guide/quant_analyzer.rst @@ -8,7 +8,7 @@ AIMET QuantAnalyzer Overview ======== -The QuantAnalyzer performs several analyses to identify sensitive areas and hotspots in the model. These analyses are performed automatically. To use QuantAnalyzier, you to pass in callbacks to perform forward pass and evaluation, and optionally a dataloader for MSE loss analysis. +The QuantAnalyzer performs several analyses to identify sensitive areas and hotspots in the model. These analyses are performed automatically. To use QuantAnalyzier, you pass in callbacks to perform forward passes and evaluations, and optionally a dataloader for MSE loss analysis. For each analysis, QuantAnalyzer outputs JSON and/or HTML files containing data and plots for visualization. @@ -20,12 +20,12 @@ To call the QuantAnalyzer API, you must provide the following: - A dummy input for the model that can contain random values but which must match the shape of the model's expected input - A user-defined function for passing 500-1000 representative data samples through the model for quantization calibration - A user-defined function for passing labeled data through the model for evaluation, returning an accuracy metric - - (Optional, for runing MSE loss analysis) A dataloader providing unlabeled data to be passed through the model + - (Optional, for running MSE loss analysis) A dataloader providing unlabeled data to be passed through the model Other quantization-related settings are also provided in the call to analyze a model. See :doc:`PyTorch QuantAnalyzer API Docs<../api_docs/torch_quant_analyzer>` for more about how to call the QuantAnalyzer feature. -..admonition:: NOTE +.. admonition:: NOTE Typically on quantized runtimes, batch normalization (BN) layers are folded where possible. So that you don't have to call a separate API to do so, QuantAnalyzer automatically performs Batch Norm Folding before running its analyses. Detailed analysis descriptions @@ -34,7 +34,7 @@ Detailed analysis descriptions QuantAnalyzer performs the following analyses: Sensitivity analysis to weight and activation quantization - QuantAnalyzer compares the accuracies of the original FP32 model, an activation-only quantized model, and a weight-only quantized model. This helps users determine which AIMET quantization technique(s) will be more beneficial for the model. + QuantAnalyzer compares the accuracies of the original FP32 model, an activation-only quantized model, and a weight-only quantized model. This helps determine which AIMET quantization technique(s) will be more beneficial for the model. 
For example, in situations where the model is more sensitive to activation quantization, PTQ techniques like Adaptive Rounding or Cross Layer Equalization might not be very helpful. diff --git a/Docs/user_guide/quantization_aware_training.rst b/Docs/user_guide/quantization_aware_training.rst index 345d4929ca..cdc6309539 100644 --- a/Docs/user_guide/quantization_aware_training.rst +++ b/Docs/user_guide/quantization_aware_training.rst @@ -1,7 +1,7 @@ .. _ug-quantization-aware-training: ################################# -AIMET Quantization Aware Training +AIMET quantization aware training ################################# Overview @@ -27,7 +27,7 @@ Compared to QuantSim inference, step 3 is the only addition when performing QAT. QAT modes ========= -There are two versions of QAT: without Range Learning and with Range Learning. +There are two versions of QAT: without range learning and with range learning. Without range learning In QAT without Range Learning, encoding values for activation quantizers are found once during calibration and are not updated again. @@ -41,9 +41,10 @@ Recommendations for quantization-aware training =============================================== Here are some guidelines that can improve performance and speed convergence with QAT: -* Initialization - It often helps to first apply post training quantization techniques like :ref:`AutoQuant` before applying QAT, especially if there is large drop in INT8 performance from the FP32 baseline. -* Hyper-parameters: +Initialization + - It often helps to first apply post training quantization techniques like :ref:`AutoQuant` before applying QAT, especially if there is large drop in INT8 performance from the FP32 baseline. + +Hyper-parameters - Number of epochs: 15-20 epochs are usually sufficient for convergence - Learning rate: Comparable (or one order higher) to FP32 model's final learning rate at convergence. Results in AIMET are with learning of the order 1e-6. diff --git a/Docs/user_guide/quantization_configuration.rst b/Docs/user_guide/quantization_configuration.rst index 8153227076..d585ccef65 100644 --- a/Docs/user_guide/quantization_configuration.rst +++ b/Docs/user_guide/quantization_configuration.rst @@ -1,36 +1,37 @@ .. _ug-quantsim-config: -====================================== -Quantization Simulation Configuration -====================================== +##################################### +Quantization simulation configuration +##################################### + Overview ======== -AIMET allows the configuration of quantizer placement and settings in accordance with a set of rules specified in a json configuration file, applied when the Quantization Simulation API is called. -Settings such as quantizer enablement, per channel quantization, symmetric quantization, and specifying fused ops when quantizing can be configurated. -The general use case for this file would be for users to match the quantization rules for a particular runtime they would like to simulate. +You can configure settings such as quantizer enablement, per-channel quantization, symmetric quantization, and specifying fused ops when quantizing, for example to match the quantization rules for a particular runtime you would like to simulate. -For examples on how to provide a specific configuration file to AIMET Quantization Simulation, -refer to the API docs for :doc:`PyTorch Quantsim<../api_docs/torch_quantsim>`, :doc:`TensorFlow Quantsim<../api_docs/tensorflow_quantsim>`, and :doc:`Keras Quantsim<../api_docs/keras_quantsim>`. 
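+
+For illustration, the sketch below writes out a small custom rule set programmatically. The section and key names follow the descriptions later on this page, but treat the exact schema (for example, how supergroup entries are wrapped) as an assumption and verify it against the default configuration file shipped with AIMET before relying on it.
+
+.. code-block:: python
+
+    import json
+
+    # Example rules: quantize outputs and parameters, keep parameters symmetric and
+    # per-channel, skip bias quantization, and fuse adjacent Conv + Relu pairs.
+    custom_rules = {
+        "defaults": {
+            "ops": {"is_output_quantized": "True"},
+            "params": {"is_quantized": "True", "is_symmetric": "True"},
+            "per_channel_quantization": "True",
+        },
+        "params": {"bias": {"is_quantized": "False"}},
+        "op_type": {},
+        "supergroups": [{"op_list": ["Conv", "Relu"]}],   # "op_list" wrapper is an assumption
+        "model_input": {"is_input_quantized": "True"},
+        "model_output": {},
+    }
+
+    with open("custom_quantsim_config.json", "w") as f:
+        json.dump(custom_rules, f, indent=4)
+
+    # The resulting file can then be passed to quantization simulation, for example
+    # through the config_file argument of the PyTorch QuantizationSimModel constructor.
+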
+Quantizer placement and settings are set in a JSON configuration file. The configuration is applied when the Quantization Simulation API is called. -It is advised for the user to begin with the default configuration file under +For examples on how to provide a specific configuration file to AIMET Quantization Simulation, +see :doc:`PyTorch Quantsim<../api_docs/torch_quantsim>`, :doc:`TensorFlow Quantsim<../api_docs/tensorflow_quantsim>`, and :doc:`Keras Quantsim<../api_docs/keras_quantsim>`. -|default-quantsim-config-file| +Begin with the default configuration file, `default-quantsim-config-file`. -For most users of AIMET, no additional changes to the default configuration file should be needed. +Most of the time, no changes to the default configuration file are needed. -Configuration File Structure +Configuration file structure ============================ -The configuration file contains six main sections, in increasing amounts of specificity: + +The configuration file contains six main sections, ordered from less- to more specific: .. image:: ../images/quantsim_config_file.png -Rules defined in a more general section can be overruled by subsequent rules defined in a more specific case. -For example, one may specify in "defaults" for no layers to be quantized, but then turn on quantization for specific layers in the "op_type" section. +Rules defined in a more general section are overridden by subsequent rules defined in a more specific case. +For example, you can specify in "defaults" that no layers be quantized, but then turn on quantization for specific layers in the "op_type" section. + +Modifying configuration file sections +===================================== -How to configure individual Configuration File Sections -======================================================= -When working with a new runtime with different rules, or for experimental purposes, users can refer to this section to understand how to configure individual sections in a configuration file. +Configure individual sections as described here. 1. **defaults**: @@ -39,53 +40,53 @@ When working with a new runtime with different rules, or for experimental purpos :start-after: # defaults start :end-before: # defaults end - In the defaults section, it is required to include an "ops" dictionary and a "params" dictionary (though these dictionaries may be empty). + In the defaults section, include an "ops" dictionary and a "params" dictionary (though these dictionaries can be empty). - The "ops" dictionary holds settings that will apply to all activation quantizers in the model. - In this section, the following settings are available: + The "ops" dictionary holds settings that apply to all activation quantizers in the model. + The following settings are available: - is_output_quantized: - An optional parameter. If included, it must be set to "True". - Including this setting will turn on all output activation quantizers by default. - If not specified, all activation quantizers will start off as disabled. + Optional. If included, must be "True". + Including this setting turns on all output activation quantizers by default. + If not specified, all activation quantizers are disabled to start. - For cases when the runtime quantizes input activations, we typically see this only done for certain op types. - Configuring these settings for specific op types is covered in sections further below. + In cases when the runtime quantizes input activations, this is only done for certain op types. 
+ To configure these settings for specific op types see below. - is_symmetric: - An optional parameter. If included, possible settings include "True" and "False". - A "True" setting will place all activation quantizers in symmetric mode by default. - A "False" setting, or omitting the parameter altogether, will set all activation quantizers to asymmetric mode by default. + Optional. If included, value is "True" or "False". + "True" places all activation quantizers in symmetric mode by default. + "False", or omitting the parameter, sets all activation quantizers to asymmetric mode by default. - The "params" dictionary holds settings that will apply to all parameter quantizers in the model. - In this section, the following settings are available: + The "params" dictionary holds settings that apply to all parameter quantizers in the model. + The following settings are available: - is_quantized: - An optional parameter. If included, possible settings include "True" and "False". - A "True" setting will turn on all parameter quantizers by default. - A "False" setting, or omitting the parameter altogether, will disable all parameter quantizers by default. + Optional. If included, value is "True" or "False". + "True" turns on all parameter quantizers by default. + "False", or omitting the parameter, disables all parameter quantizers by default. - is_symmetric: - An optional parameter. If included, possible settings include "True" and "False". - A "True" setting will place all parameter quantizers in symmetric mode by default. - A "False" setting, or omitting the parameter altogether, will set all parameter quantizers to asymmetric mode by default. + Optional. If included, value is "True" or "False". + "True" places all parameter quantizers in symmetric mode by default. + "False", or omitting the parameter, sets all parameter quantizers to asymmetric mode by default. - Aside from the "ops" and "params" dictionary, additional settings governing quantizers in the model are available: + Outside the "ops" and "params" dictionaries, the following additional quantizer settings are available: - strict_symmetric: - An optional parameter. If included, possible settings include "True" and "False". - When set to "True", quantizers which are configured in symmetric mode will use strict symmetric quantization. - When set to "False" or omitting the parameter altogether, quantizers which are configured in symmetric mode will not use strict symmetric quantization. + Optional. If included, value is "True" or "False". + "True" causes quantizers configured in symmetric mode to use strict symmetric quantization. + "False", or omitting the parameter, causes quantizers configured in symmetric mode to not use strict symmetric quantization. - unsigned_symmetric: - An optional parameter. If included, possible settings include "True" and "False". - When set to "True", quantizers which are configured in symmetric mode will use unsigned symmetric quantization when available. - When set to "False" or omitting the parameter altogether, quantizers which are configured in symmetric mode will not use unsigned symmetric quantization. + Optional. If included, value is "True" or "False". + "True" causes quantizers configured in symmetric mode use unsigned symmetric quantization when available. + "False", or omitting the parameter, causes quantizers configured in symmetric mode to not use unsigned symmetric quantization. - per_channel_quantization: - An optional parameter. If included, possible settings include "True" and "False". 
- When set to "True", parameter quantizers will use per channel quantization as opposed to per tensor quantization. - When set to "False" or omitting the parameter altogether, parameter quantizers will use per tensor quantization. + Optional. If included, value is "True" or "False". + "True" causes parameter quantizers to use per-channel quantization rather than per-tensor quantization. + When set to "False" or omitting the parameter, causes parameter quantizers to use per-tensor quantization. 2. **params**: @@ -95,9 +96,9 @@ When working with a new runtime with different rules, or for experimental purpos :end-before: # params end - In the params section, settings can be configured for certain types of parameters throughout the model. - For example, adding settings for "weight" will affect all parameters of type "weight" in the model. - Currently supported parameter types include: + In the params section, configure settings for parameters that apply throughout the model. + For example, adding settings for "weight" affects all parameters of type "weight" in the model. + Supported parameter types include: - weight - bias @@ -105,16 +106,16 @@ When working with a new runtime with different rules, or for experimental purpos For each parameter type, the following settings are available: - is_quantized: - An optional parameter. If included, possible settings include "True" and "False". - A "True" setting will turn on all parameter quantizers of that type. - A "False" setting, will disable all parameter quantizers of that type. - By omitting the setting, the parameter will fall back to the setting specified by the defaults section. + Optional. If included, value is "True" or "False". + "True" turns on all parameter quantizers of that type. + "False" disables all parameter quantizers of that type. + Omitting the setting causes the parameter to use the setting specified by the defaults section. - is_symmetric: - An optional parameter. If included, possible settings include "True" and "False". - A "True" setting will place all parameter quantizers of that type in symmetric mode. - A "False" setting will place all parameter quantizers of that type in asymmetric mode. - By omitting the setting, the parameter will fall back to the setting specified by the defaults section. + Optional. If included, value is "True" or "False". + "True" places all parameter quantizers of that type in symmetric mode. + "False" places all parameter quantizers of that type in asymmetric mode. + Omitting the setting causes the parameter to use the setting specified by the defaults section. 3. **op_type**: @@ -123,41 +124,39 @@ When working with a new runtime with different rules, or for experimental purpos :start-after: # op_type start :end-before: # op_type end - In the op type section, settings affecting particular op types can be specified. - The configuration file recognizes ONNX op types, and will internally map the type to a PyTorch or TensorFlow op type - depending on which framework is used. + In the op_type section, configure settings affecting particular op types. + The configuration file supports ONNX op types, and internally maps the type to a PyTorch or TensorFlow op type depending on which framework is used. For each op type, the following settings are available: - is_input_quantized: - An optional parameter. If included, it must be set to "True". - Including this setting will turn on input quantization for all ops of this op type. 
- Omitting the setting will keep input quantization disabled for all ops of this op type. + Optional. If included, must be "True". + Including this setting turns on input quantization for all ops of this op type. + Omitting the setting keeps input quantization disabled for all ops of this op type. - is_output_quantized: - An optional parameter. If included, possible settings include "True" and "False". - A "True" setting will turn on output quantization for all ops of this op type. - A "False" setting will disable output quantization for all ops of this op type. - By omitting the setting, output quantizers of this op type will fall back to the setting specified by the defaults section. + Optional. If included, value is "True" or "False". + "True" turns on output quantization for all ops of this op type. + "False" disables output quantization for all ops of this op type. + Omitting the setting causes output quantizers of this op type to fall back to the setting specified by the defaults section. - is_symmetric: - An optional parameter. If included, possible settings include "True" and "False". - A "True" setting will place all quantizers of this op type in symmetric mode. - A "False" setting will place all quantizers of this op type in asymmetric mode. - By omitting the setting, quantizers of this op type will fall back to the setting specified by the defaults section. + Optional. If included, value is "True" or "False". + "True" places all quantizers of this op type in symmetric mode. + "False" places all quantizers of this op type in asymmetric mode. + Omitting the setting causes quantizers of this op type to fall back to the setting specified by the defaults section. - per_channel_quantization: - An optional parameter. If included, possible settings include "True" and "False". - When set to "True", parameter quantizers of this op type will use per channel quantization as opposed to per tensor quantization. - When set to "False", parameter quantizers of this op type will use per tensor quantization. - By omitting the setting, parameter quantizers of this op type will fall back to the setting specified by the defaults section. + Optional. If included, value is "True" or "False". + "True" sets parameter quantizers of this op type to use per-channel quantization rather than per-tensor quantization. + "False" sets parameter quantizers of this op type to use per-tensor quantization. + Omitting the setting causes parameter quantizers of this op type to fall back to the setting specified by the defaults section. For a particular op type, settings for particular parameter types can also be specified. - For example, specifying settings for weight parameters of a Conv op type will affect only Conv weights and not weights - of Gemm op types. + For example, specifying settings for weight parameters of a Conv op type affects only Conv weights and not weights of Gemm op types. - To specify settings for param types of this op type, include a "params" dictionary under the op type. - Settings for this section follow the same convention as settings for parameter types in the preceding "params" section, however will only affect parameters for this op type. + To specify settings for param types of an op type, include a "params" dictionary under the op type. + Settings for this section follow the same convention as settings for parameter types in the "params" section, but only affect parameters for this op type. 4. 
**supergroups**: @@ -166,14 +165,14 @@ When working with a new runtime with different rules, or for experimental purpos :start-after: # supergroups start :end-before: # supergroups end - Supergroups are a sequence of operations which are fused during quantization, meaning no quantization noise is introduced between members of the supergroup. + Supergroups are a sequence of operations that are fused during quantization, meaning no quantization noise is introduced between members of the supergroup. For example, specifying ["Conv, "Relu"] as a supergroup disables quantization between any adjacent Conv and Relu ops in the model. - When searching for supergroups in the model, only sequential groups of ops with no branches in between will be matched with supergroups defined in the list. - Using ["Conv", "Relu"] as an example, if there was a Conv op in the model whose output is used by both a Relu op and a second op, the supergroup would not take effect for these Conv and Relu ops. + When searching for supergroups in the model, only sequential groups of ops with no branches in between are matched with supergroups defined in the list. + Using ["Conv", "Relu"] as an example, if there were a Conv op in the model whose output is used by both a Relu op and a second op, the supergroup would not include those Conv and Relu ops. To specify supergroups in the config file, add each entry as a list of op type strings. - The configuration file recognizes ONNX op types, and will internally map the types to PyTorch or TensorFlow op types depending on which framework is used. + The configuration file supports ONNX op types, and internally maps the type to a PyTorch or TensorFlow op type depending on which framework is used. 5. **model_input**: @@ -182,13 +181,13 @@ When working with a new runtime with different rules, or for experimental purpos :start-after: # model_input start :end-before: # model_input end - The "model_input" section is used to configure the quantization of inputs to the model. - In this section, the following setting is available: + Use the "model_input" section to configure the quantization of inputs to the model. + The following setting is available: - is_input_quantized: - An optional parameter. If included, it must be set to "True". - Including this setting will turn on quantization for input quantizers to the model. - Omitting the setting will keep input quantizers set to whatever setting they were in as a result of applying configurations from earlier sections. + Optional. If included, must be "True". + Including this setting turns on quantization for input quantizers to the model. + Omitting the setting keeps input quantizers at settings resulting from more general configurations. 6. **model_output**: @@ -197,10 +196,10 @@ When working with a new runtime with different rules, or for experimental purpos :start-after: # model_output start :end-before: # model_output end - The "model_output" section is used to configure the quantization of outputs of the model. - In this section, the following setting is available: + Use the "model_output" section to configure the quantization of outputs of the model. + The following setting is available: - is_output_quantized: - An optional parameter. If included, it must be set to "True". - Including this setting will turn on quantization for output quantizers of the model. - Omitting the setting will keep output quantizers set to whatever setting they were in as a result of applying configurations from earlier sections. + Optional. 
If included, it must be set to "True".
+      Including this setting turns on quantization for output quantizers of the model.
+      Omitting the setting keeps output quantizers at settings resulting from more general configurations.
diff --git a/Docs/user_guide/quantization_sim.rst b/Docs/user_guide/quantization_sim.rst
index f657705564..4e61b6920a 100644
--- a/Docs/user_guide/quantization_sim.rst
+++ b/Docs/user_guide/quantization_sim.rst
@@ -23,7 +23,7 @@ QuantSim workflow
 
 Following is a typical workflow for using AIMET QuantSim to simulate on-target quantized accuracy.
 
-1. Start with a pretrained floating-point FP32 model.
+1. Start with a pretrained floating-point (FP32) model.
 
 2. Use AIMET to create a simulation model. AIMET inserts quantization simulation operations into the model graph (explained in the sub-section below).
 
@@ -65,7 +65,7 @@ Min (q\ :sub:`min`\ )
 Max (q\ :sub:`max`\ )
     Numbers above these are clamped
 Delta
-    Granularity of the fixed point numbers (a function of the bit-width selected)
+    Granularity of the fixed point numbers (a function of the selected bit-width)
 Offset
     Offset from zero
 
@@ -79,10 +79,16 @@ Quantization schemes
 
 AIMET supports various techniques, also called quantization schemes, for calculating min and max values for encodings:
 
-Min-Max (also referred to as "TF" in AIMET. The name TF represents the origin of the technique and has no relation to which framework is using it.)
+**Min-Max (also referred to as "TF" in AIMET)**
+
+    (The name "TF" derives from the origin of the technique and has no relation to which framework is using it.)
+
 To cover the whole dynamic range of the tensor, the quantization parameters Min and Max are defined as the observed Min and Max during the calibration process. This approach eliminates clipping error but is sensitive to outliers since extreme values induce rounding errors.
 
-Signal-to-Quantization-Noise (SQNR; also called “TF Enhanced” in AIMET. The name TF represents the origin of the technique and has no relation to what framework is using it).
+**Signal-to-Quantization-Noise (SQNR; also called “TF Enhanced” in AIMET)**
+
+    (The name "TF Enhanced" derives from the origin of the technique and has no relation to which framework is using it.)
+
 The SQNR approach is similar to the mean square error (MSE) minimization approach. The qmin and qmax are found that minimize the total MSE between the original and the quantized tensor. Quantization noise and saturation noise are different types of errors which are weighted differently.
 
diff --git a/Docs/user_guide/visualization_compression.rst b/Docs/user_guide/visualization_compression.rst
index dbe4dbd2db..42fcb47e67 100644
--- a/Docs/user_guide/visualization_compression.rst
+++ b/Docs/user_guide/visualization_compression.rst
@@ -1,21 +1,24 @@
-===================
+###################
 AIMET Visualization
-===================
+###################
 
 Overview
 ========
+
-AIMET Visualization adds analytical capability to the AIMET tool (which helps quantize and compress ML models) through visualization. It provide more detailed insights in to AIMET features as users are able to analyze a model’s layers in terms of compressibility and also highlight potential issues when applying quantization. The tool also assists in displaying progress for computationally heavy tasks.
+AIMET Visualization adds analytical capability to the AIMET tool (which helps quantize and compress ML models) through visualization. 
It provides more detailed insights into AIMET features, enabling you to analyze a model’s layers in terms of compressibility and also highlight potential issues when applying quantization. The tool also assists in displaying progress for computationally heavy tasks.
 
 Design
 ======
+
-Given a model, a user can start a Bokeh server session and then invoke functions which will produce visualizations to help analyze and understand the model before using AIMET features from quantization and compression
+Given a model, you can start a Bokeh server session and then invoke functions that produce visualizations to help analyze and understand the model before using AIMET quantization and compression features.
 
 .. image:: ../images/vis_1.png
 
 Compression
 ===========
+
 Evaluation scores during compression are displayed in a table as they are computed and users can see the progress displayed while computing these scores. After Greedy Selection has run, the optimal compression ratios are also displayed in a graph
 
 .. image:: ../images/vis_4.png

From 4d9ff13261efe1fe7d6627f6b286162d9cc4ea10 Mon Sep 17 00:00:00 2001
From: Dave Welsch 
Date: Tue, 17 Sep 2024 15:58:19 -0700
Subject: [PATCH 3/5] Corrected TOC errors introduced by Quant UG edits.

Signed-off-by: Dave Welsch 
---
 Docs/user_guide/adaround.rst           |  9 +-
 Docs/user_guide/model_quantization.rst | 86 ++++++++-----------
 .../quantization_feature_guidebook.rst | 54 ++++--------
 3 files changed, 59 insertions(+), 90 deletions(-)

diff --git a/Docs/user_guide/adaround.rst b/Docs/user_guide/adaround.rst
index 9c05b39216..38a8ca3b48 100644
--- a/Docs/user_guide/adaround.rst
+++ b/Docs/user_guide/adaround.rst
@@ -19,8 +19,7 @@ When creating a QuantizationSimModel using AdaRounded, use the QuantizationSimMo
 AdaRound use cases
 ==================
 
-Terminology
------------
+**Terminology**
 
 The following abbreviations are used in the following use case descriptions:
 
@@ -37,8 +36,7 @@ QAT
 { }
     An optional step in the use case
 
-Recommended
------------
+**Recommended**
 
 The following sequences are recommended:
 
@@ -48,8 +46,7 @@ The following sequences are recommended:
 #. AdaRound --> QAT
    AdaRound is a post-training quantization feature, but for some models applying BNF and CLE may not help. For these models, applying AdaRound before QAT might help. AdaRound is a better weights initialization step that speeds up QAT.
 
-Not recommended
----------------
+**Not recommended**
 
 Applying bias correction (BC) either before or after AdaRound is *not* recommended.
 
diff --git a/Docs/user_guide/model_quantization.rst b/Docs/user_guide/model_quantization.rst
index aa0bb60806..bcf1ffac77 100644
--- a/Docs/user_guide/model_quantization.rst
+++ b/Docs/user_guide/model_quantization.rst
@@ -57,16 +57,12 @@ AIMET quantization features
        Quantization-Aware Training (QAT)
 
 :doc:`Quantization Simulation (QuantSim)`
------------------------------------------------------------
-
-QuantSim modifies a model by inserting quantization simulation operations, providing a first-order estimate of expected runtime accuracy on quantized hardware.
+    QuantSim modifies a model by inserting quantization simulation operations, providing a first-order estimate of expected runtime accuracy on quantized hardware.
 
 :ref:`Quantization-Aware Training (QAT)`
--------------------------------------------------------------------------
-
-QAT enables fine-tuning of QuantSim model parameters by taking quantization into account.
+    QAT enables fine-tuning of QuantSim model parameters by taking quantization into account. 
-Two modes of QAT are supported: + Two modes of QAT are supported: Regular QAT Fine-tuning of model parameters. Trainable parameters such as module weights, biases, etc. can be @@ -77,38 +73,37 @@ Two modes of QAT are supported: parameters for activation quantizers are also updated during each training step. :hideitem:`Post-Training Quantization` - +-------------------------------------- Post-training quantization (PTQ) techniques -------------------------------------------- - -Post-training quantization techniques help improve quantized model accuracy without needing to re-train. + Post-training quantization techniques help improve quantized model accuracy without needing to re-train. -.. toctree:: - :titlesonly: - :hidden: + .. toctree:: + :titlesonly: + :hidden: - AutoQuant - Adaptive Rounding (AdaRound) - BN Re-estimation - Bias Correction [Deprecated] + AutoQuant + Adaptive Rounding (AdaRound) + Cross-Layer Equalization + BN Re-estimation + Bias Correction [Deprecated] -:ref:`AutoQuant` - AIMET provides an API that integrates the post-training quantization techniques described below. AutoQuant is recommended for PTQ. If desired, individual techniques can be invoked using standalone feature specific APIs. + :ref:`AutoQuant` + AIMET provides an API that integrates the post-training quantization techniques described below. AutoQuant is recommended for PTQ. If desired, individual techniques can be invoked using standalone feature specific APIs. -:ref:`Adaptive rounding (AdaRound)` - Determines optimal rounding for weight tensors to improve quantized performance. + :ref:`Adaptive rounding (AdaRound)` + Determines optimal rounding for weight tensors to improve quantized performance. -Cross-layer equalization - Equalizes weight ranges in consecutive layers. Implementation is variant-specific; see the API for your platform: - :ref:`PyTorch` - :ref:`Keras` - :ref:`ONNX` + :ref:`Cross-Layer Equalization`: + Equalizes weight ranges in consecutive layers. Implementation is variant-specific; see the API for your platform: + :ref:`PyTorch` + :ref:`Keras` + :ref:`ONNX` -:ref:`BN re-estimation` - Re-estimates Batch Norm layer statistics before folding the Batch Norm layers. + :ref:`BN re-estimation` + Re-estimates Batch Norm layer statistics before folding the Batch Norm layers. -Bias correction (Deprecated) - Bias correction is deprecated. Use :ref:`AdaRound` instead. + :ref:`Bias Correction` (Deprecated) + Bias correction is deprecated. Use :ref:`AdaRound` instead. :hideitem:`Debugging and Analysis Tools` ---------------------------------------- @@ -120,11 +115,12 @@ Bias correction (Deprecated) QuantAnalyzer Visualizations -:ref:`QuantAnalyzer`: - Automated debugging of the model to understand sensitivity to weight and/or activation quantization, individual layer sensitivity, etc. +Debugging and analysis tools + :ref:`QuantAnalyzer`: + Automated debugging of the model to understand sensitivity to weight and/or activation quantization, individual layer sensitivity, etc. -:ref:`Visualizations`: - Visualizations and histograms of weight and activation ranges. + :ref:`Visualizations`: + Visualizations and histograms of weight and activation ranges. AIMET quantization workflow =========================== @@ -133,8 +129,7 @@ This section describes the recommended workflow for quantizing a neural network. .. image:: ../images/quantization_workflow.PNG -1. Prep and validate the model ------------------------------- +**1. 
Prep and validate the model** Before attempting quantization, ensure that models are defined according to model guidelines. These guidelines depend on the ML framework (PyTorch or TensorFlow) that the model is written in. @@ -156,24 +151,17 @@ Before attempting quantization, ensure that models are defined according to mode For more information on Model Validator and Model Preparer, see :doc:`AIMET PyTorch Quantization APIs<../api_docs/torch_quantization>`. -:hideitem:`TensorFlow` --------------------- -:doc:`PyTorch Model Guidelines<../api_docs/torch_model_guidelines>` - -2. Apply PTQ and AutoQuant --------------------------- +**2. Apply PTQ and AutoQuant** Apply PTQ techniques to adjust model parameters and make the model more robust to quantization. We recommend trying AutoQuant first. AutoQuant tries various other PTQ methods and finds the best combination of methods to apply. See :ref:`aimet-quantization-features`. -3. Use QAT ----------- +**3. Use QAT** If model accuracy is still not satisfactory after PTQ/AutoQuant, use QAT to fine-tune the model. See :doc:`AIMET Quantization Features `. -4. Export models ----------------- +**4. Export models** To move the model onto the target, you need: @@ -189,8 +177,8 @@ AIMET QuantSim can export both items. The exported model type differs based on t The exact steps to export the model and encodings file depend on which AIMET Quantization features are used: - Calling AutoQuant automatically exports the model and encodings file. -- If you use QAT, you'll call `.export()` on the QuantSim object. -- If you use lower-level PTQ techniques like CLE, you first create a QuantSim object from the modified model, then call `.export()` on the QuantSim object. +- If you use QAT, call `.export()` on the QuantSim object. +- If you use lower-level PTQ techniques like CLE, first create a QuantSim object from the modified model, then call `.export()` on the QuantSim object. Debugging ========= diff --git a/Docs/user_guide/quantization_feature_guidebook.rst b/Docs/user_guide/quantization_feature_guidebook.rst index 719dd85a15..9491acf578 100644 --- a/Docs/user_guide/quantization_feature_guidebook.rst +++ b/Docs/user_guide/quantization_feature_guidebook.rst @@ -14,46 +14,30 @@ The steps are shown as a flow chart in the following figure and are described in :height: 800 :width: 700 -1. FP32 confidence check -======================== +**1. FP32 confidence check** + First, ensure that the floating-point and quantized model behave similarly in the forward pass, especially when using custom quantization pipelines. Set the quantized model bit-width to 32 bits for both weights and activation, or by-pass the quantization operation if possible, and check that the accuracy matches that of the FP32 model. -First, ensure that the floating-point and quantized model behave similarly in the forward pass, especially when using custom quantization pipelines. Set the quantized model bit-width to 32 bits for both weights and activation, or by-pass the quantization operation if possible, and check that the accuracy matches that of the FP32 model. +**2. Weights or activations quantization** + Next, identify how activation or weight quantization impacts the performance independently. Does performance recover if all weights are quantized to a higher bit-width while activations are kept in a lower bitwidth, or vice versa? This step can show the relative contribution of activations and weight quantization to the overall performance drop and point toward the appropriate solution. -2. 
Weights or activations quantization
-======================================
 
-Next, identify how activation or weight quantization impacts the performance independently. Does performance recover if all weights are quantized to a higher bit-width while activations are kept in a lower bitwidth, or vice versa? This step can show the relative contribution of activations and weight quantization to the overall performance drop and point toward the appropriate solution.
+**3. Fixing weight quantization**
+    If the previous step shows that weight quantization causes significant accuracy drop, try the following solutions:
 
-3. Fixing weight quantization
-=============================
+    1. Apply cross-layer equalization (CLE) if not already implemented, especially for models with depth-wise separable convolutions.
+    2. Try per-channel quantization. This addresses the issue of uneven per-channel weight distribution.
+    3. Apply bias correction or AdaRound if calibration data is available.
 
-If the previous step shows that weight quantization causes significant accuracy drop, try the following solutions:
+**4. Fixing activation quantization**
+    Generic CLE can lead to uneven activation distribution. To reduce the quantization error from activation quantization, try using different range setting methods or adjust CLE to take activation quantization ranges into account.
 
-1. Apply cross-layer equalization (CLE) if not already implemented, especially for models with depth-wise separable convolutions.
-2. Try per-channel quantization. This addresses the issue of uneven per-channel weight distribution.
-3. Apply bias correction or AdaRound if calibration data is available.
+**5. Doing per-layer analysis**
+    If global solutions have not restored accuracy to acceptable levels, consider each quantizer individually. Set each quantizer sequentially to the target bit-width while holding the rest of the network at 32 bits (see the inner `for` loop in the figure).
 
-4. Fixing activation quantization
-=================================
+**6. Visualizing layers**
+    If the quantization of an individual tensor leads to significant accuracy drop, try visualizing the tensor distribution at different granularities, for example per-channel, and dimensions, for example per-token or per-embedding for activations in BERT.
 
-Generic CLE can lead to uneven activation distribution. To reduce the quantization error from activation quantization, try using different range setting methods or adjust CLE to take activation quantization ranges into account.
+**7. Fixing individual quantizers**
+    The previous step (visualization) can reveal the source of a tensor's sensitivity to quantization. Some common solutions involve custom range setting for this quantizer or allowing a higher bit-width for a problematic quantizer. If the problem is fixed and the accuracy recovers, continue to the next quantizer. If not, you might have to resort to other methods, such as quantization-aware training (QAT).
 
-5. Doing per-layer analysis
-===========================
 
-If global solutions have not restored accuracy to acceptable levels, consider each quantizer individually. Set each quantizer sequentially to the target bit-width while holding the rest of the network at 32 bits (see inner `for` loop in figure.
 
-6. 
Visualizing layers -===================== - -If the quantization of an individual tensor leads to significant accuracy drop, try visualizing the tensor distribution at different granularities, for example per-channel, and dimensions,for example per-token or per-embedding for activations in BERT. - -7. Fixing individual quantizers -=============================== - -The previous step (visualization) can reveal the source of a tensor's sensitivity to quantization. Some common solutions involve custom range setting for this quantizer or allowing a higher bit-width for a problematic quantizer. If the problem is fixed and the accuracy recovers, continue to the next quantizer. If not, you might have to resort to other methods, such as quantization-aware training (QAT). - -8. Quantize the model -===================== - -After you complete these steps, quantize the complete model to the desired bit-width. If the accuracy is acceptable, this yields a final quantized model ready to use. Otherwise, consider higher bit-widths and smaller granularities or revert to more powerful quantization methods, such as quantization-aware training. +**8. Quantize the model** + After you complete these steps, quantize the complete model to the desired bit-width. If the accuracy is acceptable, this yields a final quantized model ready to use. Otherwise, consider higher bit-widths and smaller granularities or revert to more powerful quantization methods, such as quantization-aware training. From 48fa155ace48144b81f4a8c3cb01184a3a958cdc Mon Sep 17 00:00:00 2001 From: Dave Welsch Date: Thu, 26 Sep 2024 12:30:22 -0700 Subject: [PATCH 4/5] Review changes PR #3348 - Quantization UG edits. Signed-off-by: Dave Welsch --- Docs/user_guide/adaround.rst | 2 +- Docs/user_guide/bn_reestimation.rst | 8 +- Docs/user_guide/examples.rst | 76 ++++++++++--------- Docs/user_guide/model_guidelines.rst | 2 +- Docs/user_guide/model_quantization.rst | 4 +- .../post_training_quant_techniques.rst | 4 +- Docs/user_guide/quant_analyzer.rst | 6 +- .../quantization_aware_training.rst | 2 +- .../user_guide/quantization_configuration.rst | 32 ++++---- 9 files changed, 69 insertions(+), 67 deletions(-) diff --git a/Docs/user_guide/adaround.rst b/Docs/user_guide/adaround.rst index 38a8ca3b48..7ba973cc38 100644 --- a/Docs/user_guide/adaround.rst +++ b/Docs/user_guide/adaround.rst @@ -12,7 +12,7 @@ The Adaptive Rounding (AdaRound) feature uses a subset of the unlabeled training .. image:: ../images/adaround.png :width: 900px -AdaRound optimizes a loss function using the unlabelled training data to decide whether to quantize a weight to the closer or further integer value. AdaRound quantization acieves accuracy closer to the FP32 model using low bit-width integer quantization. +AdaRound optimizes a loss function using the unlabelled training data to decide whether to quantize a weight to the closer or further integer value. AdaRound quantization achieves accuracy closer to the FP32 model, while using low bit-width integer quantization. When creating a QuantizationSimModel using AdaRounded, use the QuantizationSimModel provided in the API to set and freeze parameter encodings before computing the encodings. Refer the code example in the AdaRound API. diff --git a/Docs/user_guide/bn_reestimation.rst b/Docs/user_guide/bn_reestimation.rst index f22fc610c0..73e6b21e94 100644 --- a/Docs/user_guide/bn_reestimation.rst +++ b/Docs/user_guide/bn_reestimation.rst @@ -1,14 +1,14 @@ .. 
_ug-bn-reestimation: -====================== -AIMET Batch Normal Re-estimation -====================== +############################## +AIMET Batch Norm Re-estimation +############################## Overview ======== -The Batch Normal (BN) re-estimation feature utilizes a small subset of training data to individually re-estimate the statistics of the BN layers in a model. These BN statistics are then used to adjust the quantization scale parameters of the preceeding Convolution or Linear layers. Effectively, the BN layers are folded. +The Batch Norm (BN) re-estimation feature utilizes a small subset of training data to individually re-estimate the statistics of the BN layers in a model. These BN statistics are then used to adjust the quantization scale parameters of the preceeding Convolution or Linear layers. Effectively, the BN layers are folded. The BN re-estimation feature is applied after performing Quantization Aware Training (QAT) with Range Learning, with Per Channel Quantization (PCQ) enabled. It is important *not* to fold the BN layers before performing QAT. Fold the BN layers only after QAT and the re-estimation of the BN statistics are completed. See the Workflow section below for the exact sequence of steps. diff --git a/Docs/user_guide/examples.rst b/Docs/user_guide/examples.rst index 9c1f80c051..218e52c2c7 100644 --- a/Docs/user_guide/examples.rst +++ b/Docs/user_guide/examples.rst @@ -2,33 +2,34 @@ .. image:: ../images/logo-quic-on@h68.png -============== -AIMET Examples -============== +############## +AIMET examples +############## -AIMET Examples provide reference code (in the form of *Jupyter Notebooks*) to learn how to -apply AIMET quantization and compression features. It is also a quick way to become -familiar with AIMET usage and APIs. +AIMET examples are *Jupyter Notebooks* that are intended to: -For more details on each of the features and APIs please refer: -:ref:`Links to User Guide and API Documentation` +- Familiarize you with the AIMET APIs +- Demonstrate basic usage: how to apply AIMET to a model +- Teach you how to use AIMET quantization and compression + +For more details on each of the features and APIs see :ref:`Links to User Guide and API Documentation`. Browse the notebooks ==================== -The following table has links to browsable versions of the notebooks for different features. +The following table provides links to browsable versions of the notebooks for several different AIMET features. **Model Quantization Examples** .. list-table:: - :widths: 40 12 12 12 + :widths: 32 12 12 12 :header-rows: 1 - * - Features + * - Feature - PyTorch - TensorFlow - ONNX - * - Quantsim / Quantization-Aware Training (QAT) + * - QuantSim / Quantization-Aware Training (QAT) - `Link <../Examples/torch/quantization/qat.ipynb>`_ - `Link <../Examples/tensorflow/quantization/keras/qat.ipynb>`_ - `Link <../Examples/onnx/quantization/quantsim.ipynb>`_ (no training) @@ -53,10 +54,10 @@ The following table has links to browsable versions of the notebooks for differe **Model Compression Examples** .. 
list-table:: - :widths: 40 12 12 + :widths: 40 12 :header-rows: 1 - * - Features + * - Feature - PyTorch * - Channel Pruning - `Link <../Examples/torch/compression/channel_pruning.ipynb>`_ @@ -70,41 +71,42 @@ The following table has links to browsable versions of the notebooks for differe Running the notebooks ===================== -Install Jupyter ---------------- -- Install the Jupyter metapackage as follows (pre-pend with "sudo -H" if appropriate): -``python3 -m pip install jupyter`` +To run the notebooks, follow the instructions below. + +Prerequisites +------------- -- Start the notebook server as follows (please customize the command line options if appropriate): -``jupyter notebook --ip=* --no-browser &`` +#. Install the Jupyter metapackage using the following command. (If necessary, pre-pend the command with "sudo -H".) + .. code-block:: shell -- The above command will generate and display a URL in the terminal. Copy and paste it into your browser. + python3 -m pip install jupyter +#. Start the notebook server as follows: +.. code-block:: shell + + jupyter notebook --ip=* --no-browser & +#. The command generates and displays a URL in the terminal. Copy and paste the URL into your browser. +#. Install AIMET and its dependencies using the instructions in :doc:`install`. -Download the Example notebooks and related code ------------------------------------------------- -- Clone the AIMET repo as follows to any location: +1. Download the example notebooks and related code +-------------------------------------------------- +#. Clone the AIMET repo by running the following commands: .. code-block:: shell - WORKSPACE="" mkdir $WORKSPACE && cd $WORKSPACE - # Go to https://github.com/quic/aimet/releases and identify the release tag () of the AIMET package that you're working with. + # Identify the release tag () of the AIMET package that you're working with at: https://github.com/quic/aimet/releases. git clone https://github.com/quic/aimet.git --branch - # Update the environment variable as follows: + # Update the path environment variable: export PYTHONPATH=$PYTHONPATH:${WORKSPACE}/aimet +#. The dataloader, evaluator, and trainer used in the examples is for the ImageNet dataset. + Download the ImageNet dataset from here: https://www.image-net.org/download.php -- The dataloader, evaluator, and trainer utilized in the examples is for the ImageNet dataset. - To run the example, please download the ImageNet dataset from here: https://www.image-net.org/download.php - -- Install AIMET and its dependencies using the instructions in the Installation section' - -Run the notebooks ------------------ - -- Navigate to one of the following paths under the Examples directory and launch your chosen Jupyter Notebook (`.ipynb` extension): +2. Run the notebooks +-------------------- +#. Navigate to one of the following paths under the Examples directory and launch your chosen Jupyter Notebook (`.ipynb` extension): - `Examples/torch/quantization/` - `Examples/torch/compression/` - `Examples/tensorflow/quantization/keras/` -- Follow the instructions therein to execute the code. +#. Follow the instructions in the notebook to execute the code. 
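Before launching the notebooks, it can help to confirm that the AIMET packages are importable and that the dataset location is set. The following is only a suggested sketch: the package names (``aimet_common``, ``aimet_torch``) and the ``IMAGENET_DIR`` environment variable are assumptions that depend on your install variant and where you placed the ImageNet data.

.. code-block:: python

    # Suggested environment sanity check (not part of the AIMET examples).
    import importlib.util
    import os

    # Assumed package names; adjust for TensorFlow or ONNX installs.
    for pkg in ("aimet_common", "aimet_torch"):
        found = importlib.util.find_spec(pkg) is not None
        print(f"{pkg}: {'found' if found else 'NOT found - check PYTHONPATH and the install step'}")

    # Assumed environment variable and default path for the ImageNet dataset.
    imagenet_dir = os.environ.get("IMAGENET_DIR", "/data/imagenet")
    print("ImageNet directory exists:", os.path.isdir(imagenet_dir))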
diff --git a/Docs/user_guide/model_guidelines.rst b/Docs/user_guide/model_guidelines.rst index daa953373d..13b625e05a 100644 --- a/Docs/user_guide/model_guidelines.rst +++ b/Docs/user_guide/model_guidelines.rst @@ -8,7 +8,7 @@ To implement the Cross Layer Equalization API, `aimet_torch.cross_layer_equaliz Certain model constructs prevent AIMET from creating and analyzing the computing graph. The following table lists these potential issues and workarounds. -.. admonition NOTE:: +.. note:: These restrictions are not applicable if you are using the **Primitive APIs**. diff --git a/Docs/user_guide/model_quantization.rst b/Docs/user_guide/model_quantization.rst index bcf1ffac77..6a7a062d6a 100644 --- a/Docs/user_guide/model_quantization.rst +++ b/Docs/user_guide/model_quantization.rst @@ -140,7 +140,7 @@ Before attempting quantization, ensure that models are defined according to mode PyTorch has two utilities to automate model complaince: - The Model Validator utility automates checking PyTorch model requirements - - he Model Preparer utility automates updating model definition to align with requirements + - The Model Preparer utility automates updating model definition to align with requirements In model prep and validation using PyTorch, we recommend the following flow: @@ -153,7 +153,7 @@ Before attempting quantization, ensure that models are defined according to mode **2. Apply PTQ and AutoQuant** -Apply PTQ techniques to adjust model parameters and make the model more robust to quantization. We recommend trying AutoQuant first. AutoQuant tries various other PTQ methods and finds the best combination of methods to apply. See :ref:`aimet-quantization-features`. +Apply PTQ techniques to adjust model parameters and make the model more robust to quantization. We recommend trying AutoQuant first. AutoQuant tries various other PTQ methods and finds the best combination of methods to apply. See :ref:`aimet-quantization-features`_. **3. Use QAT** diff --git a/Docs/user_guide/post_training_quant_techniques.rst b/Docs/user_guide/post_training_quant_techniques.rst index e6f7dae2d4..55910516de 100644 --- a/Docs/user_guide/post_training_quant_techniques.rst +++ b/Docs/user_guide/post_training_quant_techniques.rst @@ -9,7 +9,7 @@ AIMET post-training quantization techniques Overview ======== -Sme ML models show reduced inference accuracy when run on quantized hardware due to approximation noise. AIMET provides post-training quantization techniques that help adjust the parameters in the model such that the model becomes more quantization-friendly. AIMET post-training quantizations are designed to be applied on pre-trained ML models. These techniques are explained as part of the "Data-Free Quantization Through Weight Equalization and Bias Correction” paper at ICCV 2019 - https://arxiv.org/abs/1906.04721 +Some ML models show reduced inference accuracy when run on quantized hardware due to approximation noise. AIMET provides post-training quantization techniques that help adjust the parameters in the model such that the model becomes more quantization-friendly. AIMET post-training quantizations are designed to be applied on pre-trained ML models. These techniques are explained as part of the "Data-Free Quantization Through Weight Equalization and Bias Correction” paper at ICCV 2019 - https://arxiv.org/abs/1906.04721 User Flow @@ -78,4 +78,4 @@ FAQs References ========== -1. Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling. 
“Data-Free Quantization Through Weight Equalization and Bias Correction.” IEEE International Conference on Computer Vision (ICCV), Seoul, October 2019. \ No newline at end of file +1. Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling. “Data-Free Quantization Through Weight Equalization and Bias Correction.” IEEE International Conference on Computer Vision (ICCV), Seoul, October 2019. diff --git a/Docs/user_guide/quant_analyzer.rst b/Docs/user_guide/quant_analyzer.rst index b52ea9340c..b913d267c9 100644 --- a/Docs/user_guide/quant_analyzer.rst +++ b/Docs/user_guide/quant_analyzer.rst @@ -8,7 +8,7 @@ AIMET QuantAnalyzer Overview ======== -The QuantAnalyzer performs several analyses to identify sensitive areas and hotspots in the model. These analyses are performed automatically. To use QuantAnalyzier, you pass in callbacks to perform forward passes and evaluations, and optionally a dataloader for MSE loss analysis. +The QuantAnalyzer performs several analyses to identify sensitive areas and hotspots in the model. These analyses are performed automatically. To use QuantAnalyzer, you pass in callbacks to perform forward passes and evaluations, and optionally a dataloader for MSE loss analysis. For each analysis, QuantAnalyzer outputs JSON and/or HTML files containing data and plots for visualization. @@ -25,7 +25,7 @@ To call the QuantAnalyzer API, you must provide the following: Other quantization-related settings are also provided in the call to analyze a model. See :doc:`PyTorch QuantAnalyzer API Docs<../api_docs/torch_quant_analyzer>` for more about how to call the QuantAnalyzer feature. -.. admonition:: NOTE +.. note:: Typically on quantized runtimes, batch normalization (BN) layers are folded where possible. So that you don't have to call a separate API to do so, QuantAnalyzer automatically performs Batch Norm Folding before running its analyses. Detailed analysis descriptions @@ -67,7 +67,7 @@ Per-layer statistics histogram Under the TF Enhanced quantization scheme, encoding min/max values for each quantizer are obtained by collecting a histogram of tensor values seen at that quantizer and deleting outliers. When this quantization scheme is selected, QuantAnalyzer outputs plots for each quantizer in the model, displaying the histogram of tensor values seen at that quantizer. - These plots are available as part of the `activations_pdf` and `weights_pdf folders`, containing a separate .html plot for each quantizer. + These plots are available as part of the `activations_pdf` and `weights_pdf` folders, containing a separate .html plot for each quantizer. Per layer mean-square-error (MSE) loss (optional) QuantAnalyzer can monitor each layer's output in the original FP32 model as well as the corresponding layer output in the quantized model and calculate the MSE loss between the two. diff --git a/Docs/user_guide/quantization_aware_training.rst b/Docs/user_guide/quantization_aware_training.rst index cdc6309539..6f1548c013 100644 --- a/Docs/user_guide/quantization_aware_training.rst +++ b/Docs/user_guide/quantization_aware_training.rst @@ -7,7 +7,7 @@ AIMET quantization aware training Overview ======== -When post-training quantizatio (PTQ) doesn't sufficiently reduce quantization error, the next step is to use quantization-aware training (QAT). QAT finds more accurate solutions than PTQ by modeling the quantization noise during training. 
This higher accuracy comes at the usual cost of neural network training, including longer training times and the need for labeled data and hyperparameter search. +When post-training quantization (PTQ) doesn't sufficiently reduce quantization error, the next step is to use quantization-aware training (QAT). QAT finds better-optimized solutions than PTQ by fine-tuning the model parameters in the presence of quantization noise. This higher accuracy comes at the usual cost of neural network training, including longer training times and the need for labeled data and hyperparameter search. QAT workflow ============ diff --git a/Docs/user_guide/quantization_configuration.rst b/Docs/user_guide/quantization_configuration.rst index d585ccef65..bb73eb9ec6 100644 --- a/Docs/user_guide/quantization_configuration.rst +++ b/Docs/user_guide/quantization_configuration.rst @@ -73,20 +73,20 @@ Configure individual sections as described here. Outside the "ops" and "params" dictionaries, the following additional quantizer settings are available: - - strict_symmetric: - Optional. If included, value is "True" or "False". - "True" causes quantizers configured in symmetric mode to use strict symmetric quantization. - "False", or omitting the parameter, causes quantizers configured in symmetric mode to not use strict symmetric quantization. + - strict_symmetric: + Optional. If included, value is "True" or "False". + "True" causes quantizers configured in symmetric mode to use strict symmetric quantization. + "False", or omitting the parameter, causes quantizers configured in symmetric mode to not use strict symmetric quantization. - - unsigned_symmetric: - Optional. If included, value is "True" or "False". - "True" causes quantizers configured in symmetric mode use unsigned symmetric quantization when available. - "False", or omitting the parameter, causes quantizers configured in symmetric mode to not use unsigned symmetric quantization. + - unsigned_symmetric: + Optional. If included, value is "True" or "False". + "True" causes quantizers configured in symmetric mode use unsigned symmetric quantization when available. + "False", or omitting the parameter, causes quantizers configured in symmetric mode to not use unsigned symmetric quantization. - - per_channel_quantization: - Optional. If included, value is "True" or "False". - "True" causes parameter quantizers to use per-channel quantization rather than per-tensor quantization. - When set to "False" or omitting the parameter, causes parameter quantizers to use per-tensor quantization. + - per_channel_quantization: + Optional. If included, value is "True" or "False". + "True" causes parameter quantizers to use per-channel quantization rather than per-tensor quantization. + "False" or omitting the parameter, causes parameter quantizers to use per-tensor quantization. 2. **params**: @@ -147,10 +147,10 @@ Configure individual sections as described here. Omitting the setting causes quantizers of this op type to fall back to the setting specified by the defaults section. - per_channel_quantization: - Optional. If included, value is "True" or "False". - "True" sets parameter quantizers of this op type to use per-channel quantization rather than per-tensor quantization. - "False" sets parameter quantizers of this op type to use per-tensor quantization. - Omitting the setting causes parameter quantizers of this op type to fall back to the setting specified by the defaults section. + Optional. If included, value is "True" or "False". 
+ "True" sets parameter quantizers of this op type to use per-channel quantization rather than per-tensor quantization. + "False" sets parameter quantizers of this op type to use per-tensor quantization. + Omitting the setting causes parameter quantizers of this op type to fall back to the setting specified by the defaults section. For a particular op type, settings for particular parameter types can also be specified. For example, specifying settings for weight parameters of a Conv op type affects only Conv weights and not weights of Gemm op types. From ea12cc49771835beae5bda28441ec2656fd750a7 Mon Sep 17 00:00:00 2001 From: Dave Welsch Date: Fri, 27 Sep 2024 16:56:38 -0700 Subject: [PATCH 5/5] More review changes PR #3348 - Quantization UG edits. Signed-off-by: Dave Welsch --- Docs/user_guide/model_guidelines.rst | 79 +++++++++++-------- .../user_guide/quantization_configuration.rst | 29 +++++++ 2 files changed, 77 insertions(+), 31 deletions(-) diff --git a/Docs/user_guide/model_guidelines.rst b/Docs/user_guide/model_guidelines.rst index 13b625e05a..44ab150f5c 100644 --- a/Docs/user_guide/model_guidelines.rst +++ b/Docs/user_guide/model_guidelines.rst @@ -4,40 +4,57 @@ Model Guidelines for PyTorch ############################ -To implement the Cross Layer Equalization API, `aimet_torch.cross_layer_equalization.equalize_model()`, AIMET creates a computing graph to analyze the sequence of operations in the model. +To implement the Cross Layer Equalization API, +:code:`aimet_torch.cross_layer_equalization.equalize_model()`, AIMET creates a computing graph to analyze the sequence of operations in the model. -Certain model constructs prevent AIMET from creating and analyzing the computing graph. The following table lists these potential issues and workarounds. +Certain model constructs prevent AIMET from creating and analyzing the computing graph. The following list describes these potential issues and workarounds. .. note:: These restrictions are not applicable if you are using the **Primitive APIs**. -+------------------------+------------------------------+-----------------------------------+ -| Potential Issue | Description | Workaround | -+========================+==============================+===================================+ -| ONNX Export | Use torch.onnx.export() | If ONNX export fails, rewrite the | -| | to export your model. 
| specific layer so that ONNX | -| | Make sure ONNX export passes | export passes | -+------------------------+------------------------------+-----------------------------------+ -| Slicing Operation |Some models use | Rewrite the x.view() statement | -| |**torch.tensor.view()** in the| as follows: | -| |forward function as follows: | `x = x.view(x.size(0), -1)` | -| |x = x.view(-1, 1024) If | | -| |the view function is written | | -| |this way, it causes an issue | | -| |while creating the | | -| |computing graph | | -+------------------------+------------------------------+-----------------------------------+ -| Bilinear, upsample |Some models use the upsample |Set the align_corners parameter to | -| operation |operation in the forward |False as follows: | -| |function as: x= |x = | -| |torch.nn.functional.upsample( |torch.nn.functional.upsample(x, | -| |x, size=torch.Size([129,129]) |size=torch.Size([129, 129]), | -| |, mode = 'bilinear', |mode='bilinear', | -| |align_corners=True) |align_corners=False) | -+------------------------+------------------------------+-----------------------------------+ -| Deconvolution operation|The deconvolution operation | There is no workaround available | -| |is used in the DeepLabV3 | at this time. This issue will be | -| |model. This is not | addressed in a subsequent AIMET | -| |supported by AIMET. | release. | -+------------------------+------------------------------+-----------------------------------+ +**ONNX Export** + *Description*: Use :code:`torch.onnx.export()` to export your model. Make sure ONNX export passes. + + *Workaround*: If ONNX export fails, rewrite the specific layer so that ONNX export passes. + +**Slicing operation** + *Description*: Some models use :code:`torch.tensor.view()` in the forward function as follows: + + .. code:: python + + x = x.view(-1, 1024) + + If the view function is written this way, it causes an issue while creating the computing graph. + + *Workaround*: Rewrite the :code:`x.view()` statement as follows: + + .. code:: python + + x = x.view(x.size(0), -1) + +**Bilinear, upsample operation** + *Description*: Some models use the upsample operation in the forward function as: + + .. code:: python + + x = + torch.nn.functional.upsample(x, + size=torch.Size([129,129]), + mode='bilinear', + align_corners=True) + + *Workaround*: Set the align_corners parameter to False as follows: + + .. code:: python + + x = + torch.nn.functional.upsample(x, + size=torch.Size([129,129]), + mode='bilinear', + align_corners=False) + +**Deconvolution operation** + *Description*: The dconvolution operation is used in the DeepLabV3 model. This is not supported by AIMET. + + *Workaround*: There is no workaround available at this time. This issue will be addressed in a subsequent AIMET release. diff --git a/Docs/user_guide/quantization_configuration.rst b/Docs/user_guide/quantization_configuration.rst index bb73eb9ec6..daa0da9f43 100644 --- a/Docs/user_guide/quantization_configuration.rst +++ b/Docs/user_guide/quantization_configuration.rst @@ -55,7 +55,9 @@ Configure individual sections as described here. - is_symmetric: Optional. If included, value is "True" or "False". + "True" places all activation quantizers in symmetric mode by default. + "False", or omitting the parameter, sets all activation quantizers to asymmetric mode by default. The "params" dictionary holds settings that apply to all parameter quantizers in the model. @@ -63,29 +65,39 @@ Configure individual sections as described here. - is_quantized: Optional. 
If included, value is "True" or "False". + "True" turns on all parameter quantizers by default. + "False", or omitting the parameter, disables all parameter quantizers by default. - is_symmetric: Optional. If included, value is "True" or "False". + "True" places all parameter quantizers in symmetric mode by default. + "False", or omitting the parameter, sets all parameter quantizers to asymmetric mode by default. Outside the "ops" and "params" dictionaries, the following additional quantizer settings are available: - strict_symmetric: Optional. If included, value is "True" or "False". + "True" causes quantizers configured in symmetric mode to use strict symmetric quantization. + "False", or omitting the parameter, causes quantizers configured in symmetric mode to not use strict symmetric quantization. - unsigned_symmetric: Optional. If included, value is "True" or "False". + "True" causes quantizers configured in symmetric mode use unsigned symmetric quantization when available. + "False", or omitting the parameter, causes quantizers configured in symmetric mode to not use unsigned symmetric quantization. - per_channel_quantization: Optional. If included, value is "True" or "False". + "True" causes parameter quantizers to use per-channel quantization rather than per-tensor quantization. + "False" or omitting the parameter, causes parameter quantizers to use per-tensor quantization. 2. **params**: @@ -107,14 +119,20 @@ Configure individual sections as described here. - is_quantized: Optional. If included, value is "True" or "False". + "True" turns on all parameter quantizers of that type. + "False" disables all parameter quantizers of that type. + Omitting the setting causes the parameter to use the setting specified by the defaults section. - is_symmetric: Optional. If included, value is "True" or "False". + "True" places all parameter quantizers of that type in symmetric mode. + "False" places all parameter quantizers of that type in asymmetric mode. + Omitting the setting causes the parameter to use the setting specified by the defaults section. 3. **op_type**: @@ -131,25 +149,36 @@ Configure individual sections as described here. - is_input_quantized: Optional. If included, must be "True". + Including this setting turns on input quantization for all ops of this op type. + Omitting the setting keeps input quantization disabled for all ops of this op type. - is_output_quantized: Optional. If included, value is "True" or "False". + "True" turns on output quantization for all ops of this op type. + "False" disables output quantization for all ops of this op type. + Omitting the setting causes output quantizers of this op type to fall back to the setting specified by the defaults section. - is_symmetric: Optional. If included, value is "True" or "False". + "True" places all quantizers of this op type in symmetric mode. + "False" places all quantizers of this op type in asymmetric mode. + Omitting the setting causes quantizers of this op type to fall back to the setting specified by the defaults section. - per_channel_quantization: Optional. If included, value is "True" or "False". + "True" sets parameter quantizers of this op type to use per-channel quantization rather than per-tensor quantization. + "False" sets parameter quantizers of this op type to use per-tensor quantization. + Omitting the setting causes parameter quantizers of this op type to fall back to the setting specified by the defaults section. 
For a particular op type, settings for particular parameter types can also be specified.
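To make the op_type discussion concrete, the following sketch assembles a configuration in which the Conv op type is overridden while every other op type falls back to the defaults section, then writes it out as JSON. The section and setting names follow the descriptions above; the exact nesting of supergroup entries (the "op_list" key) and the output file name are assumptions to verify against the default configuration file shipped with AIMET.

.. code-block:: python

    # Suggested sketch of a runtime configuration with an op_type override for Conv:
    # Conv outputs are quantized and Conv weights use symmetric, per-channel quantization,
    # while all other op types use the settings from the "defaults" section.
    import json

    config = {
        "defaults": {
            "ops": {"is_output_quantized": "True"},
            "params": {"is_quantized": "True", "is_symmetric": "False"},
        },
        "params": {
            "bias": {"is_quantized": "False"},
        },
        "op_type": {
            "Conv": {
                "is_output_quantized": "True",
                "per_channel_quantization": "True",
                "params": {
                    "weight": {"is_quantized": "True", "is_symmetric": "True"},
                },
            },
        },
        "supergroups": [
            {"op_list": ["Conv", "Relu"]},  # assumed entry format for a fused Conv+Relu pair
        ],
        "model_input": {"is_input_quantized": "True"},
        "model_output": {},
    }

    with open("custom_quantsim_config.json", "w") as f:
        json.dump(config, f, indent=4)

The resulting file can then be supplied when constructing the quantization simulation model; see the QuantSim API documentation for the exact argument name.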