Quantization User Guide edits #3348

Open · wants to merge 5 commits into base: develop
Changes from 3 commits
2 changes: 1 addition & 1 deletion Docs/conf.py
@@ -112,7 +112,7 @@ def setup(app):
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
language = 'en'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
91 changes: 44 additions & 47 deletions Docs/user_guide/adaround.rst
@@ -1,84 +1,81 @@
.. _ug-adaround:


=====================
##############
AIMET AdaRound
=====================
##############

AIMET quantization features, by default, use the "nearest rounding" technique for achieving quantization.
In the following figure, a single weight value in a weight tensor is shown as an illustrative example. When using the
"nearest rounding" technique, this weight value is quantized to the nearest integer value. The Adaptive Rounding
(AdaRound) feature, uses a smaller subset of the unlabelled training data to adaptively round the weights of modules
with weights. In the following figure, the weight value is quantized to the integer value far from it. AdaRound,
optimizes a loss function using the unlabelled training data to adaptively decide whether to quantize a specific
weight to the integer value near it or away from it. Using the AdaRound quantization, a model is able to achieve an
accuracy closer to the FP32 model, while using low bit-width integer quantization.

When creating a QuantizationSimModel using the AdaRounded model, use the QuantizationSimModel provided API for
setting and freezing parameter encodings before computing the encodings. Please refer the code example in the AdaRound
API section.
By default, AIMET uses *nearest rounding* for quantization. A single weight value in a weight tensor is illustrated in the following figure. In nearest rounding, this weight value is quantized to the nearest integer value.

The Adaptive Rounding (AdaRound) feature uses a subset of the unlabeled training data to adaptively round weights. In the following figure, the weight value is quantized to the integer value far from it.

.. image:: ../images/adaround.png
:width: 900px

AdaRound Use Cases
=====================
AdaRound optimizes a loss function using the unlabeled training data to decide whether to quantize a weight to the closer or further integer value. AdaRound quantization achieves accuracy closer to that of the FP32 model while using low bit-width integer quantization.

When creating a QuantizationSimModel using an AdaRounded model, use the QuantizationSimModel API to set and freeze parameter encodings before computing the encodings, as sketched below. Refer to the code example in the AdaRound API for details.
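
As an illustration, the flow might look like the following. This is a minimal sketch assuming the PyTorch variant; ``model``, ``dummy_input``, ``data_loader``, and ``forward_pass_callback`` are user-supplied, and exact signatures may vary across AIMET releases.

.. code-block:: python

    from aimet_torch.quantsim import QuantizationSimModel
    from aimet_torch.adaround.adaround_weight import Adaround, AdaroundParameters

    # Optimize weight rounding; writes the AdaRounded parameter
    # encodings to ./adaround.encodings
    params = AdaroundParameters(data_loader=data_loader, num_batches=16)
    adarounded_model = Adaround.apply_adaround(model, dummy_input, params,
                                               path='./',
                                               filename_prefix='adaround',
                                               default_param_bw=8)

    # Create the sim from the AdaRounded model, then set and freeze the
    # AdaRounded parameter encodings *before* computing encodings
    sim = QuantizationSimModel(adarounded_model, dummy_input=dummy_input,
                               default_param_bw=8)
    sim.set_and_freeze_param_encodings(encoding_path='./adaround.encodings')
    sim.compute_encodings(forward_pass_callback, forward_pass_callback_args=None)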

AdaRound use cases
==================

**Terminology**

Common terminology
=====================
* BC - Bias Correction
* BNF - Batch Norm Folding
* CLE - Cross Layer Equalization
* HBF - High Bias Folding
* QAT - Quantization Aware Training
* { } - An optional step in the use case
The following abbreviations are used in the use case descriptions:

BC
    Bias Correction
BNF
    Batch Norm Folding
CLE
    Cross Layer Equalization
HBF
    High Bias Folding
QAT
    Quantization Aware Training
{ }
    An optional step in the use case

Use Cases
=====================
**Recommended**

The following sequences are recommended:

#. {BNF} --> {CLE} --> AdaRound
Applying BNF and CLE are optional steps before applying AdaRound. Some models benefit from applying CLE
while some don't get any benefit.
   Applying BNF and CLE before AdaRound is optional. Some models benefit from applying CLE while some don't.

#. AdaRound --> QAT
AdaRound is a post-training quantization feature. But, for some models applying BNF and CLE may not be beneficial.
For these models, QAT after AdaRound may be beneficial. AdaRound is considered as a better weights initialization
step which helps for faster QAT.
   AdaRound is a post-training quantization feature, but for some models applying BNF and CLE may not help. For these models, applying AdaRound before QAT might help. AdaRound is a better weight initialization step that speeds up QAT.

**Not recommended**

Not recommended
=====================
Applying BC either before or after AdaRound is not recommended.
Applying bias correction (BC) either before or after AdaRound is *not* recommended.

#. AdaRound --> BC

#. BC --> AdaRound


AdaRound Hyper parameters guidelines
AdaRound hyperparameter guidelines
=====================================

There are couple of hyper parameters required during AdaRound optimization and are exposed to users. But some of them
are with their default values which lead to good and stable results over many models and not recommended to change often.

Following is guideline for Hyper parameters:

#. Hyper Parameters to be changed often: number of batches (approximately 500-1000 images, if batch size of data loader
is 64, then 16 number of batches leads to 1024 images), number of iterations(default 10000)
A number of hyperparameters used during AdaRound optimization are exposed to users. The default values of some of these parameters lead to stable, good results over many models; we recommend that you not change these.

#. Hyper Parameters to be changed moderately: regularization parameter (default 0.01)
Use the following guidelines for adjusting hyperparameters with AdaRound.

#. Hyper Parameters to be changed least: beta range(default (20, 2)), warm start period (default 20%)
* Hyperparameters to be changed often

  * Number of batches (approximately 500-1000 images; if the data loader batch size is 64, then 16 batches yields 1024 images)
  * Number of iterations (default 10000)

* Hyperparameters to change with caution

  * Regularization parameter (default 0.01)

* Hyperparameters to avoid changing

  * Beta range (default (20, 2))
  * Warm start period (default 20%)
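
As an illustration, these settings map onto the AdaRound parameter object roughly as follows. This is a sketch assuming the PyTorch variant's ``AdaroundParameters``; keyword names may differ across releases.

.. code-block:: python

    from aimet_torch.adaround.adaround_weight import AdaroundParameters

    # 16 batches at a data loader batch size of 64 gives ~1024 images
    params = AdaroundParameters(data_loader=data_loader,
                                num_batches=16,                # change often
                                default_num_iterations=10000,  # change often
                                default_reg_param=0.01,        # change with caution
                                default_beta_range=(20, 2),    # avoid changing
                                default_warm_start=0.2)        # avoid changing (20%)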

AdaRound API
============

Please refer to the links below to view the AdaRound API for each AIMET variant:
See the AdaRound API variant for your platform:

- :ref:`AdaRound for PyTorch<api-torch-adaround>`
- :ref:`AdaRound for Keras<api-keras-adaround>`
43 changes: 21 additions & 22 deletions Docs/user_guide/auto_quant.rst
@@ -1,48 +1,47 @@
.. _ug-auto-quant:


===============
###############
AIMET AutoQuant
===============
###############

Overview
========
AIMET offers a suite of neural network post-training quantization techniques. Often, applying these techniques in a
specific sequence, results in better accuracy and performance. Without the AutoQuant feature, the AIMET
user needs to manually try out various combinations of AIMET quantization features. This manual process is
error-prone and often time-consuming.

The AutoQuant feature, analyzes the model, determines the sequence of AIMET quantization techniques and applies these
techniques. In addition, the user can specify the amount of accuracy drop that can be tolerated, in the AutoQuant API.
As soon as this threshold accuracy is reached, AutoQuant stops applying any additional quantization technique. In
summary, the AutoQuant feature saves time and automates the quantization of the neural networks.
AIMET offers a suite of neural network post-training quantization techniques. Often, applying these techniques in a specific sequence results in better accuracy and performance.

The AutoQuant feature analyzes the model, determines the best sequence of AIMET quantization techniques, and applies these techniques. You can specify the accuracy drop that can be tolerated in the AutoQuant API.
As soon as this threshold accuracy is reached, AutoQuant stops applying quantization techniques.

Without the AutoQuant feature, you must manually try combinations of AIMET quantization techniques. This manual process is error-prone and time-consuming.

Workflow
========

Before entering the optimization workflow, AutoQuant performs the following preparation steps:
The workflow looks like this:

1) Check the validity of the model and convert it into an AIMET quantization-friendly format (denoted as `Prepare Model` below).
2) Select the best-performing quantization scheme for the given model (denoted as `QuantScheme Selection` below)

After the prepration steps, AutoQuant mainly consists of the following three stages:
.. image:: ../images/auto_quant_v2_flowchart.png

1) BatchNorm folding
2) :ref:`Cross-Layer Equalization <ug-post-training-quantization>`
3) :ref:`AdaRound <ug-adaround>`

These techniques are applied in a best-effort manner until the model meets the allowed accuracy drop.
If applying AutoQuant fails to satisfy the evaluation goal, AutoQuant will return the model to which the best combination
of the above techniques is applied.
Before entering the optimization workflow, AutoQuant prepares the model by:

.. image:: ../images/auto_quant_v2_flowchart.png
1. Checking the validity of the model and converting the model into an AIMET quantization-friendly format (`Prepare Model`).
2. Selecting the best-performing quantization scheme for the given model (`QuantScheme Selection`).

After the preparation steps, AutoQuant proceeds to try three techniques:

1. BatchNorm folding
2. :ref:`Cross-Layer Equalization (CLE) <ug-post-training-quantization>`
3. :ref:`AdaRound <ug-adaround>`

These techniques are applied in a best-effort manner until the model meets the allowed accuracy drop.
If applying AutoQuant fails to satisfy the evaluation goal, AutoQuant returns the model that produced the best results.
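
As an illustration, a minimal AutoQuant invocation might look like the following. This sketch assumes the PyTorch variant; ``model``, ``dummy_input``, ``unlabeled_data_loader``, and ``eval_callback`` (a function that returns the model's accuracy as a float) are user-supplied.

.. code-block:: python

    from aimet_torch.auto_quant import AutoQuant

    auto_quant = AutoQuant(model,
                           dummy_input=dummy_input,
                           data_loader=unlabeled_data_loader,
                           eval_callback=eval_callback)

    # Stop as soon as quantized accuracy is within 1% of the FP32 baseline
    model, accuracy, encoding_path = auto_quant.optimize(allowed_accuracy_drop=0.01)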

AutoQuant API
=============

Please refer to the links below to view the AutoQuant API for each AIMET variant:
See the AutoQuant API for your AIMET variant:

- :ref:`AutoQuant for PyTorch<api-torch-auto-quant>`
- :ref:`AutoQuant for ONNX<api-onnx-auto-quant>`
27 changes: 11 additions & 16 deletions Docs/user_guide/bn_reestimation.rst
@@ -2,49 +2,44 @@


================================
AIMET BN Re-estimation
AIMET Batch Norm Re-estimation
================================

Overview
========

The BN Re-estimation feature utilizes a small subset of training data to individually re-estimate the statistics of the
Batch Normalization (BN) layers in a model. These BN statistics are then used to adjust the quantization scale parameters
of the preceeding Convolution or Linear layers. Effectively, the BN layers are folded.
The Batch Norm (BN) re-estimation feature utilizes a small subset of training data to individually re-estimate the statistics of the BN layers in a model. These BN statistics are then used to adjust the quantization scale parameters of the preceding Convolution or Linear layers. Effectively, the BN layers are folded.

The BN Re-estimation feature is applied after performing Quantization Aware Training (QAT) with Range Learning, with
Per Channel Quantization (PCQ) enabled. It is very important NOT to fold the BN layers before performing QAT. The BN layers are
folded ONLY after QAT and the re-estimation of the BN statistics are completed. The Workflow section below, covers
the exact sequence of steps.
The BN re-estimation feature is applied after performing Quantization Aware Training (QAT) with Range Learning, with Per Channel Quantization (PCQ) enabled. It is important *not* to fold the BN layers before performing QAT. Fold the BN layers only after QAT and the re-estimation of the BN statistics are completed. See the Workflow section below for the exact sequence of steps.

The BN Re-estimation feature is specifically recommended for the following scenarios:
The BN re-estimation feature is specifically recommended for the following scenarios:

- Low-bitwidth weight quantization (e.g., 4-bits)
- Models for which Batch Norm Folding leads to decreased performance.
- Models for which Batch Norm Folding leads to decreased performance
- Models where the main issue is weight quantization (including higher bitwidth quantization)
- Low bitwidth quantization of depthwise separable layers since their Batch Norm Statistics are affected by oscillations


Workflow
========

BN-Re-estimation requires that
BN re-estimation requires that:

1. BN layers not be folded before QAT.
2. Per Channel Quantization is enabled.

To use the BN-Re-estimation feature, the following sequence of steps must be followed in the correct order.
To use the BN re-estimation feature, the following sequence of steps must be followed in order:

1. Create the QuantizationSimModel object with Range Learning Quant Scheme
2. Perform QAT with Range Learning
3. Re-estimate the BN statistics
4. Fold the BN layers
5. Using the QuantizationSimModel, export the model and encodings.

Once the above steps are completed, the model can be run on the target for inference.
Once the steps are completed, the model can be run on the target for inference.
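
As an illustration, the sequence might look like the following for the PyTorch variant. This is a sketch; ``model``, ``dummy_input``, ``train_data_loader``, ``forward_pass_callback``, and the ``train`` loop are user-supplied, and exact signatures may vary across releases.

.. code-block:: python

    from aimet_common.defs import QuantScheme
    from aimet_torch.quantsim import QuantizationSimModel
    from aimet_torch.bn_reestimation import reestimate_bn_stats
    from aimet_torch.batch_norm_fold import fold_all_batch_norms_to_scale

    # 1. Create the QuantizationSimModel with a range-learning quant scheme
    #    (per-channel quantization is enabled via the config file, not shown)
    sim = QuantizationSimModel(model, dummy_input=dummy_input,
                               quant_scheme=QuantScheme.training_range_learning_with_tf_init)
    sim.compute_encodings(forward_pass_callback, forward_pass_callback_args=None)

    # 2. Perform QAT with range learning (an ordinary training loop on sim.model)
    train(sim.model)

    # 3. Re-estimate the BN statistics on a small subset of training data
    reestimate_bn_stats(sim.model, train_data_loader, num_batches=100)

    # 4. Fold the BN layers into the quantization scale parameters
    fold_all_batch_norms_to_scale(sim)

    # 5. Export the model and encodings
    sim.export(path='./', filename_prefix='bn_reestimated', dummy_input=dummy_input)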

The following high level call flow diagrams, enumerates the work flow for PyTorch.
The workflow is the same for TensorFlow and Keras.
The following sequence diagram shows the workflow for PyTorch.
The workflow is the same for TensorFlow and Keras.

.. image:: ../images/bn_reestimation.png
:width: 1200px
@@ -53,7 +48,7 @@ The workflow is the same for TensorFlow and Keras.
BN Re-estimation API
====================

Please refer to the links below to view the BN Re-estimation API for each AIMET variant:
See the links below to view the BN re-estimation API for each AIMET variant:

- :ref:`BN Re-estimation for PyTorch<api-torch-bn-reestimation>`
- :ref:`BN Re-estimation for Keras<api-keras-bn-reestimation>`
58 changes: 23 additions & 35 deletions Docs/user_guide/index.rst
@@ -2,76 +2,64 @@
:class: hideitem
.. _ug-index:

======================================
######################################
AI Model Efficiency Toolkit User Guide
======================================
######################################

Overview
========

AI Model Efficiency Toolkit (AIMET) is a software toolkit that enables users to quantize and compress models.
Quantization is a must for efficient edge inference using fixed-point AI accelerators.

AIMET optimizes pre-trained models (e.g., FP32 trained models) using post-training and fine-tuning techniques that
minimize accuracy loss incurred during quantization or compression.
AIMET optimizes pre-trained models (for example, FP32 trained models) using post-training and fine-tuning techniques that minimize accuracy loss incurred during quantization or compression.

AIMET currently supports PyTorch, TensorFlow, and Keras models.
AIMET supports PyTorch, TensorFlow, and Keras models.

The following diagram shows a high-level view of the AIMET workflow.

.. image:: ../images/AIMET_index_no_fine_tune.png

The above picture shows a high-level view of the workflow when using AIMET. The user will start with a trained
model in either the PyTorch, TensorFlow, or Keras training framework. This trained model is passed to AIMET using APIs
for compression and quantization. AIMET returns a compressed/quantized version of the model
that the users can fine-tune (or train further for a small number of epochs) to recover lost accuracy. Users can then
export via ONNX/meta/h5 to an on-target runtime like Qualcomm\ |reg| Neural Processing SDK.
You train a model in the PyTorch, TensorFlow, or Keras training framework, then pass the model to AIMET using its APIs for compression and quantization. AIMET returns a compressed and/or quantized version of the model that you can fine-tune (or train further for a small number of epochs) to recover lost accuracy. You can then export the model using ONNX, meta/checkpoint, or h5 to an on-target runtime like the Qualcomm\ |reg| Neural Processing SDK.

Features
========

AIMET supports two sets of model optimization techniques:

- Model Quantization: AIMET can simulate behavior of quantized HW for a given trained
model. This model can be optimized using Post-Training Quantization (PTQ) and fine-tuning (Quantization Aware Training
- QAT) techniques.

- Model Compression: AIMET supports multiple model compression techniques that allow the
user to take a trained model and remove redundancies, resulting in a smaller model that runs faster on target.
AIMET supports two model optimization techniques:

Release Information
===================
Model Quantization
    AIMET can simulate the behavior of quantized hardware for a trained model. This model can be optimized using Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) fine-tuning techniques.

For information specific to this release, please see :ref:`Release Notes <ug-release-notes>` and :ref:`Known Issues <ug-known-issues>`.
Model Compression
    AIMET supports multiple model compression techniques that remove redundancies from a trained model, resulting in a smaller model that runs faster on target.

Installation Guide
==================
More Information
================

Please visit the :ref:`AIMET Installation <ug-installation>` for more details.

Getting Started
===============

Please refer to the following documentation:
For more information about AIMET, see the following documentation:

- :ref:`Installation <ug-installation>`
- :ref:`Quantization User Guide <ug-model-quantization>`
- :ref:`Compression User Guide <ug-model-compression>`
- :ref:`API Documentation <ug-apidocs>`
- :ref:`Examples Documentation <ug-examples>`
- :ref:`Installation <ug-installation>`
- :ref:`API Documentation <ug-apidocs>`

Release Information
===================

For information specific to this release, see :ref:`Release Notes <ug-release-notes>` and :ref:`Known Issues <ug-known-issues>`.

:hideitem:`toc tree`
------------------------------------
.. toctree::
:hidden:

Installation <../install/index>
Quantization User Guide <model_quantization>
Compression User Guide <model_compression>
API Documentation<../api_docs/index>
Examples Documentation <examples>
Installation <../install/index>

|

|

| |project| is a product of |author|
| Qualcomm\ |reg| Neural Processing SDK is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.