Quantization
Since version 0.6.10, DNNLibrary has supported NNAPI QUANT8_ASYMM, which enables 8-bit integer operations (convolutions, poolings, ...) and brings a 2x+ speedup.
Unlike TensorFlow Lite, which simulates quantization during training, we publish a quant.py script that quantizes a pretrained float model to an 8-bit model. This means you don't need to retrain a model for quantization, but the accuracy will be lower than that of a retrained model.
- Prepare a dataset for collecting the statistics of each layer's outputs (usually a subset of your training dataset), and install the Python 3 package onnxruntime or onnxruntime-gpu (for example, with pip3 install onnxruntime).
- Run quant.py to quantize the model and generate a table file containing the scales and zero points of each layer (a sketch of how such values are typically derived is shown at the end of this step). The table file will be used in the next step.
usage: quant.py [-h] [--image_dir IMAGE_DIR]
[--dequantize_after DEQUANTIZE_AFTER]
[--batch_size BATCH_SIZE]
[--num_workers NUM_WORKERS]
onnx_model_path output_table
For example, if you have an ONNX model named mobilenetv2.onnx, the output filename is table.txt, and the directory of your dataset is my_dataset:
python3 quant.py --image_dir my_dataset mobilenetv2.onnx table.txt
If you want the last fully connected layer to remain in float precision (it will improve the accuracy), and the input name of this layer is abc, you can set the --dequantize_after argument:
python3 quant.py --image_dir my_dataset --dequantize_after abc mobilenetv2.onnx table.txt
After running either of the above two commands, table.txt and a quantized ONNX model named quant-mobilenetv2.onnx will be generated.
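For intuition, here is a minimal Java sketch of how an asymmetric scale and zero point are commonly derived from the observed minimum and maximum of a layer's output. This is not the actual quant.py implementation (which may use different statistics); it only illustrates what the scale and zeroPoint entries in the table file mean, and the class and method names are placeholders.

```java
// Minimal sketch: derive uint8 asymmetric quantization parameters from observed
// min/max activation statistics. Class and method names are placeholders.
public final class QuantParams {
    public final float scale;
    public final int zeroPoint;

    public QuantParams(float scale, int zeroPoint) {
        this.scale = scale;
        this.zeroPoint = zeroPoint;
    }

    // Map the observed range [min, max] onto the uint8 range [0, 255].
    public static QuantParams fromMinMax(float min, float max) {
        min = Math.min(min, 0f);   // asymmetric quantization must be able to represent 0 exactly
        max = Math.max(max, 0f);
        float scale = (max - min) / 255f;
        if (scale == 0f) {
            scale = 1f;            // degenerate all-zero range; any positive scale works
        }
        int zeroPoint = Math.round(-min / scale);
        zeroPoint = Math.max(0, Math.min(255, zeroPoint));
        return new QuantParams(scale, zeroPoint);
    }
}
```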
- Run onnx2daq:
the_path_of_onnx2daq the_quantized_onnx_model output_daq the_table_file
For example:
./tools/onnx2daq/onnx2daq quant-mobilenetv2.onnx quant-mobilenetv2.daq table.txt
The resulting daq model quant-mobilenetv2.daq is the only file you need to deploy the quantized model; table.txt is not needed anymore.
- For quantized models, the input data type should be uint8_t (for C++) or byte (for Java). You need to quantize the input manually, using the following formula (see the sketch below):
integer_value = real_value / scale + zeroPoint
scale and zeroPoint can be found in the table file.
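As an illustration, here is a minimal Java sketch of applying this formula to a float input buffer before feeding it to a quantized model. The class and method names are placeholders; scale and zeroPoint are the input's entries from your table file.

```java
// Illustrative helper (names are placeholders): quantize a float input buffer to
// uint8 values stored in Java bytes, using integer_value = real_value / scale + zeroPoint.
public class InputQuantizer {
    public static byte[] quantize(float[] input, float scale, int zeroPoint) {
        byte[] quantized = new byte[input.length];
        for (int i = 0; i < input.length; i++) {
            int q = Math.round(input[i] / scale) + zeroPoint;  // scale/zeroPoint come from the table file
            q = Math.max(0, Math.min(255, q));                 // clamp to the uint8 range [0, 255]
            quantized[i] = (byte) q;                           // Java bytes are signed, but only the bit pattern matters
        }
        return quantized;
    }
}
```

In C++ the same values simply go into a uint8_t buffer.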
- In the Java library, there are four methods for model inference, corresponding to float/int8 input/output respectively. For example, predict(float[] input) is for "float input, float output", and predict_quant8(float[] input) is for "float input, int8 output". If you use a variant with int8 output, you can map the result back to real values as shown in the sketch below. dnnlibrary-example contains all the code needed to run an int8 model; check it out if you have any trouble :)
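Assuming the int8 output is delivered as a byte[] (check dnnlibrary-example for the exact signature), a sketch of mapping it back to real values, using the output layer's scale and zeroPoint from the table file, could look like this:

```java
// Illustrative helper (names are placeholders): map uint8 output values, stored in
// Java bytes, back to reals with real_value = (integer_value - zeroPoint) * scale,
// the inverse of the input quantization formula above.
public class OutputDequantizer {
    public static float[] dequantize(byte[] output, float scale, int zeroPoint) {
        float[] real = new float[output.length];
        for (int i = 0; i < output.length; i++) {
            int q = output[i] & 0xFF;           // reinterpret the signed Java byte as 0..255
            real[i] = (q - zeroPoint) * scale;
        }
        return real;
    }
}
```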
Android Q will introduce many new features for NNAPI, such as the QUANT8_SYMM and QUANT8_SYMM_PER_CHANNEL data types. We will support these new features as soon as possible after Android Q is published.