Quantization
Since version 0.6.10, DNNLibrary has supported NNAPI QUANT8_ASYMM, which enables 8-bit integer operations (convolutions, poolings, ...) and brings a 2x+ speedup.
Unlike TensorFlow Lite, which simulates quantization during training, we publish a quant.py script that quantizes a pretrained float model to an 8-bit model. This means you don't need to retrain a model for quantization, but the accuracy will be lower than that of a retrained model.
- Prepare a dataset for collecting the statistics of each layer's outputs (usually a subset of your training dataset), and install the Python 3 package onnxruntime or onnxruntime-gpu (for example, with pip3 install onnxruntime).
- Run quant.py to quantize the model and generate a table file containing the scales and zero points of each layer (a sketch of how such values are typically derived is shown at the end of this step). The table file will be used in the next step.
usage: quant.py [-h] [--image_dir IMAGE_DIR]
[--dequantize_after DEQUANTIZE_AFTER]
[--batch_size BATCH_SIZE]
[--num_workers NUM_WORKERS]
onnx_model_path output_table
For example, if you have an ONNX model named mobilenetv2.onnx, the output filename is table.txt, and the directory of your dataset is my_dataset:
python3 quant.py --image_dir my_dataset mobilenetv2.onnx table.txt
If you want the last fully connected layer to remain in float precision (it will improve the accuracy), and the input name of this layer is abc, you can set the --dequantize_after argument:
python3 quant.py --image_dir my_dataset --dequantize_after abc mobilenetv2.onnx table.txt
After running either of the above two commands, table.txt and a quantized ONNX model named quant-mobilenetv2.onnx will be generated.
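For intuition, here is a minimal Java sketch of how an asymmetric scale and zero point are commonly derived from the observed minimum and maximum of a layer's output. This is not the actual quant.py implementation (which may use different statistics); it only illustrates what the scale and zeroPoint entries in the table file mean, and the class and method names are placeholders.

```java
// Minimal sketch: derive uint8 asymmetric quantization parameters from observed
// min/max activation statistics. Class and method names are placeholders.
public final class QuantParams {
    public final float scale;
    public final int zeroPoint;

    public QuantParams(float scale, int zeroPoint) {
        this.scale = scale;
        this.zeroPoint = zeroPoint;
    }

    // Map the observed range [min, max] onto the uint8 range [0, 255].
    public static QuantParams fromMinMax(float min, float max) {
        min = Math.min(min, 0f);   // asymmetric quantization must be able to represent 0 exactly
        max = Math.max(max, 0f);
        float scale = (max - min) / 255f;
        if (scale == 0f) {
            scale = 1f;            // degenerate all-zero range; any positive scale works
        }
        int zeroPoint = Math.round(-min / scale);
        zeroPoint = Math.max(0, Math.min(255, zeroPoint));
        return new QuantParams(scale, zeroPoint);
    }
}
```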
- Run onnx2daq:
the_path_of_onnx2daq the_quantized_onnx_model output_daq the_table_file
For example:
./tools/onnx2daq/onnx2daq quant-mobilenetv2.onnx quant-mobilenetv2.daq table.txt
The resulting daq model quant-mobilenetv2.daq is the only file you need to deploy the quantized model; table.txt is not needed anymore.
- For quantized models, the input data type should be uint8_t (for C++) or byte (for Java). You need to quantize the input manually, using the following formula (see the sketch below):
integer_value = real_value / scale + zeroPoint
scale and zeroPoint can be found in the table file.
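As an illustration, here is a minimal Java sketch of applying this formula to a float input buffer before feeding it to a quantized model. The class and method names are placeholders; scale and zeroPoint are the input's entries from your table file.

```java
// Illustrative helper (names are placeholders): quantize a float input buffer to
// uint8 values stored in Java bytes, using integer_value = real_value / scale + zeroPoint.
public class InputQuantizer {
    public static byte[] quantize(float[] input, float scale, int zeroPoint) {
        byte[] quantized = new byte[input.length];
        for (int i = 0; i < input.length; i++) {
            int q = Math.round(input[i] / scale) + zeroPoint;  // scale/zeroPoint come from the table file
            q = Math.max(0, Math.min(255, q));                 // clamp to the uint8 range [0, 255]
            quantized[i] = (byte) q;                           // Java bytes are signed, but only the bit pattern matters
        }
        return quantized;
    }
}
```

In C++ the same values simply go into a uint8_t buffer.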
- In the Java library, there are four methods for model inference, corresponding to float/int8 input/output respectively. For example, predict(float[] input) is for "float input, float output", and predict_quant8(float[] input) is for "float input, int8 output". If you use a variant with int8 output, you can map the result back to real values as shown in the sketch below. dnnlibrary-example contains all the code needed to run an int8 model; check it out if you have any trouble :)
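Assuming the int8 output is delivered as a byte[] (check dnnlibrary-example for the exact signature), a sketch of mapping it back to real values, using the output layer's scale and zeroPoint from the table file, could look like this:

```java
// Illustrative helper (names are placeholders): map uint8 output values, stored in
// Java bytes, back to reals with real_value = (integer_value - zeroPoint) * scale,
// the inverse of the input quantization formula above.
public class OutputDequantizer {
    public static float[] dequantize(byte[] output, float scale, int zeroPoint) {
        float[] real = new float[output.length];
        for (int i = 0; i < output.length; i++) {
            int q = output[i] & 0xFF;           // reinterpret the signed Java byte as 0..255
            real[i] = (q - zeroPoint) * scale;
        }
        return real;
    }
}
```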
Android Q will introduce many new features for NNAPI, such as the QUANT8_SYMM and QUANT8_SYMM_PER_CHANNEL data types. We will support these new features as soon as possible after Android Q is published.