A complex model helps improve performance, but it also introduces redundancy. This module provides functions for slimming models, covering two parts: model quantization (quantization-aware training and offline quantization) and model pruning.
Model quantization reduces full-precision parameters to fixed-point numbers to remove this redundancy, which lowers the computational cost of the model and improves inference performance. It can convert FP32 model parameters to Int8 with little or no loss of accuracy, reducing the size of the model parameters and speeding up computation. Quantized models also have a speed advantage when deployed on mobile devices.
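To make the FP32-to-Int8 idea concrete, below is a minimal, framework-independent sketch of symmetric per-tensor quantization. The abs-max scale selection here is only illustrative; PaddleSlim chooses scales with its own calibration strategies, and the actual scheme depends on the configuration.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: FP32 tensor -> Int8 tensor plus a scale."""
    scale = max(np.abs(x).max(), 1e-8) / 127.0            # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an FP32 approximation from the Int8 tensor."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)              # toy "weight" tensor
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs quantization error:", np.abs(w - w_hat).max())
```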
Model pruning removes unimportant convolution kernels from the CNN, reducing the number of model parameters and therefore the computational complexity of the model.
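As a rough illustration of filter pruning (not PaddleSlim's actual pruning implementation, which supports sensitivity analysis and several importance criteria), the sketch below ranks the filters of a single convolution layer by their L1 norm and drops the least important ones; the shapes and prune ratio are made-up example values.

```python
import numpy as np

def prune_filters_l1(weight: np.ndarray, prune_ratio: float):
    """Keep the convolution filters with the largest L1 norm.

    weight: conv kernel of shape [out_channels, in_channels, kh, kw]
    prune_ratio: fraction of output filters to remove
    """
    out_channels = weight.shape[0]
    importance = np.abs(weight).reshape(out_channels, -1).sum(axis=1)  # L1 norm per filter
    n_keep = max(1, int(out_channels * (1.0 - prune_ratio)))
    keep_idx = np.sort(np.argsort(importance)[-n_keep:])               # indices of the kept filters
    return weight[keep_idx], keep_idx

conv_w = np.random.randn(64, 32, 3, 3).astype(np.float32)  # toy conv weight
pruned_w, kept = prune_filters_l1(conv_w, prune_ratio=0.3)
print(conv_w.shape, "->", pruned_w.shape)                   # (64, 32, 3, 3) -> (44, 32, 3, 3)
```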
This tutorial introduces how to use PaddleSlim, Paddle's model compression library, to compress PaddleVideo models. PaddleSlim integrates model pruning, quantization (including quantization-aware training and offline quantization), knowledge distillation, neural architecture search, and other model compression techniques that are widely used and industry-leading. If you are interested, you can learn more about it.
Before starting this tutorial, it is recommended to first understand the PaddleVideo model training methods and [PaddleSlim](https://paddleslim.readthedocs.io/zh_CN/latest/index.html).
After training a model, if you want to further compress the model size and speed up prediction, you can use quantization or pruning to compress the model.
Model compression mainly includes five steps:
- Install PaddleSlim
- Prepare the trained model
- Model Compression
- Export the quantized inference model
- Deploy the quantized model for prediction
- PaddleSlim can be installed via pip:
python3.7 -m pip install paddleslim -i https://pypi.tuna.tsinghua.edu.cn/simple
- If you want to use the latest features of PaddleSlim, you can install it from source:
git clone https://github.com/PaddlePaddle/PaddleSlim.git
cd PaddleSlim
python3.7 setup.py install
PaddleVideo provides a series of trained models. If the model to be quantized is not among them, you need to follow the [regular training](../../docs/zh-CN/usage.md) instructions to get a trained model.
Go to the PaddleVideo root directory
cd PaddleVideo
The offline quantization code is located in `deploy/slim/quant_post_static.py`.
Quantization includes offline quantization and online quantization training (TODO). Online quantization training gives better results; it requires loading a pre-trained model, and the model can be quantized once the quantization strategy is defined.
TODO
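While the quantization-aware (online) training section of this tutorial is still TODO, the snippet below is a minimal sketch of what it typically looks like with PaddleSlim's dygraph QAT API. The config keys, model builder, weight path, and input shape are illustrative assumptions rather than part of this repository, so please check the current PaddleSlim documentation before relying on them.

```python
import paddle
from paddleslim import QAT  # dygraph quantization-aware training helper

# Illustrative quantization config; the exact keys depend on the PaddleSlim version.
quant_config = {
    'weight_quantize_type': 'channel_wise_abs_max',
    'activation_quantize_type': 'moving_average_abs_max',
}

model = build_pptsm_model()  # hypothetical helper that builds the PP-TSM network
model.set_state_dict(paddle.load('output/ppTSM/ppTSM_best.pdparams'))  # illustrative weight path

quanter = QAT(config=quant_config)
quanter.quantize(model)      # insert fake-quantization ops into the network

# ... run the regular training loop for a few epochs with a reduced learning rate ...

# Export an inference model containing the quantization information.
quanter.save_quantized_model(
    model,
    'inference/ppTSM_qat/ppTSM',
    input_spec=[paddle.static.InputSpec(shape=[None, 8, 3, 224, 224], dtype='float32')])
```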
Note: For offline quantization, you must use the inference model exported from the trained model. For how to export an inference model from a trained model, please refer to the [tutorial](../../docs/zh-CN/usage.md#5-Model Inference). Generally speaking, offline quantization causes a larger loss of model accuracy than quantization-aware training.
Taking the PP-TSM model as an example, after generating the inference model, the offline quantization steps are as follows:
# download a small amount of data for calibration
pushd ./data/k400
wget -nc https://videotag.bj.bcebos.com/Data/k400_rawframes_small.tar
tar -xf k400_rawframes_small.tar
popd
# then switch to deploy/slim
cd deploy/slim
# execute quantization script
python3.7 quant_post_static.py \
-c ../../configs/recognition/pptsm/pptsm_k400_frames_uniform_quantization.yaml \
--use_gpu=True
All quantization environment parameters except `use_gpu` are configured in the `pptsm_k400_frames_uniform_quantization.yaml` file, where `inference_model_dir` is the directory path of the inference model exported in the previous step, and `quant_output_dir` is the output directory path for the quantized model.
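For reference, `quant_post_static.py` essentially wraps PaddleSlim's post-training quantization API around a calibration data loader built from this config. The following is a simplified sketch of that call, assuming the `paddleslim.quant.quant_post_static` static-graph API; the sample generator, directory names, and file names are placeholders, so treat the exact arguments as illustrative.

```python
import numpy as np
import paddle
import paddleslim

paddle.enable_static()
place = paddle.CUDAPlace(0)  # or paddle.CPUPlace() when use_gpu is False
exe = paddle.static.Executor(place)

def sample_generator():
    """Placeholder calibration reader yielding one preprocessed clip per sample.
    In the real script the samples come from the small Kinetics-400 calibration data."""
    for _ in range(32):
        yield [np.random.rand(8, 3, 224, 224).astype('float32')]

paddleslim.quant.quant_post_static(
    executor=exe,
    model_dir='./inference/ppTSM',                        # inference_model_dir in the yaml
    quantize_model_path='./inference/ppTSM/quant_model',  # quant_output_dir in the yaml
    sample_generator=sample_generator,
    model_filename='ppTSM.pdmodel',                       # illustrative inference model file names
    params_filename='ppTSM.pdiparams',
    batch_size=1,
    batch_nums=10)
```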
After successful execution, a `__model__` file and a `__params__` file are generated in the `quant_output_dir` directory; they store the generated offline-quantized model. They are used in the same way as an ordinary inference model: you can use these two files directly for prediction and deployment without re-exporting the model.
# Use PP-TSM offline quantization model for prediction
# Go back to the PaddleVideo directory
cd ../../
# Use the quantized model to make predictions
python3.7 tools/predict.py \
--input_file data/example.avi \
--config configs/recognition/pptsm/pptsm_k400_frames_uniform.yaml \
--model_file ./inference/ppTSM/quant_model/__model__ \
--params_file ./inference/ppTSM/quant_model/__params__ \
--use_gpu=True \
--use_tensorrt=False
The output is as follows:
Current video file: data/example.avi
top-1 class: 5
top-1 score: 0.9997928738594055
TODO
TODO
The model exported in the above steps can be converted with the opt model conversion tool of PaddleLite for mobile deployment. For server-side model deployment, refer to the Serving Python deployment and Serving C++ deployment documentation.
- During quantization-aware training, it is recommended to load the pre-trained model obtained from regular training to speed up the convergence of quantization training.
- During quantization-aware training, it is recommended to set the initial learning rate to 1/20~1/10 of that used in regular training, and the number of training epochs to 1/5~1/2 of regular training. For the learning-rate schedule, it is recommended to add Warmup; other configuration options are best left unchanged.