PP-MobileSeg: Exploring Transformer Blocks for Efficient Mobile Segmentation.

Reference

Shiyu Tang, Ting Sun, Juncai Peng, Guowei Chen, Yuying Hao, Manhui Lin, Zhihong Xiao, Jiangbin You, Yi Liu. PP-MobileSeg: Explore the Fast and Accurate Semantic Segmentation Model on Mobile Devices. https://arxiv.org/abs/2304.05152

Contents

  1. Overview
  2. Performance
  3. Reproduction

Overview

With the success of transformers in computer vision, several attempts have been made to adapt transformers to mobile devices. However, their performance does not satisfy some real-world applications. Therefore, we propose PP-MobileSeg, a SOTA semantic segmentation model for mobile devices.

It is composed of three newly proposed parts: the StrideFormer backbone, the Aggregated Attention Module (AAM), and the Valid Interpolate Module (VIM):

  • With four-stage MobileNetV3 blocks as the feature extractor, StrideFormer extracts rich local features at different receptive fields with little parameter overhead, and then efficiently enriches the features from the last two stages with a global view using strided sea attention.
  • To fuse the features effectively, AAM filters the detail features with ensemble voting and adds the semantic feature to them, strengthening the semantic information as much as possible.
  • Finally, VIM upsamples the downsampled feature back to the original resolution while only interpolating the classes present in the final prediction, which cover only around 10% of the classes in the ADE20K dataset; this is a common scenario for datasets with many classes. Since this final upsampling accounts for the largest share of the model's overall latency, VIM significantly decreases inference latency (a sketch of the idea follows this list).
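For intuition, here is a minimal Paddle sketch of the VIM idea. It is an illustration under assumed shapes, not the implementation shipped in PaddleSeg, and the function name valid_interpolate is hypothetical.

import paddle
import paddle.nn.functional as F

def valid_interpolate(logits, out_size):
    """Upsample only the channels of classes that appear in the low-resolution
    prediction, then argmax over that reduced set.

    logits:   [1, C, h, w] low-resolution class logits
    out_size: (H, W) target resolution
    returns:  [1, H, W] label map in the original class ids
    """
    # Classes that survive the low-resolution argmax (e.g. ~15 of 150 on ADE20K).
    coarse_pred = paddle.argmax(logits, axis=1)                  # [1, h, w]
    valid_classes = paddle.unique(paddle.flatten(coarse_pred))   # [K], K << C

    # Interpolate only the K valid channels instead of all C of them.
    valid_logits = paddle.index_select(logits, valid_classes, axis=1)
    valid_logits = F.interpolate(valid_logits, size=out_size,
                                 mode='bilinear', align_corners=False)

    # Argmax over the reduced channel set, then map back to original class ids.
    local_pred = paddle.argmax(valid_logits, axis=1)             # [1, H, W]
    return paddle.index_select(valid_classes,
                               paddle.flatten(local_pred)).reshape(local_pred.shape)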

Extensive experiments show that PP-MobileSeg achieves a superior params-accuracy-latency tradeoff compared to other SOTA methods.

Performance

ADE20K

| Model | Backbone | Training Iters | Batch Size | Train Resolution | mIoU (%) | Latency (ms)* | Params (M) | Links |
|---|---|---|---|---|---|---|---|---|
| PP-MobileSeg-Base | StrideFormer-Base | 80000 | 32 | 512x512 | 41.57 | 265.5 | 5.62 | config \| model \| log \| vdl \| exported model |
| PP-MobileSeg-Tiny | StrideFormer-Tiny | 80000 | 32 | 512x512 | 36.39 | 215.3 | 1.61 | config \| model \| log \| vdl \| exported model |

Comparison with SOTA on ADE20K

| Model | Backbone | mIoU (%) | Latency (ms)* | Params (M) |
|---|---|---|---|---|
| LR-ASPP | MobileNetV3_large_x1_0 | 33.10 | 730.9 | 3.20 |
| MobileSeg-Base | MobileNetV3_large_x1_0 | 33.26 | 391.5 | 2.85 |
| TopFormer-Tiny | TopTransformer-Tiny | 32.46 | 490.3 | 1.41 |
| SeaFormer-Tiny | SeaFormer-Tiny | 35.00 | 459.0 | 1.61 |
| PP-MobileSeg-Tiny | StrideFormer-Tiny | 36.39 | 215.3 | 1.44 |
| TopFormer-Base | TopTransformer-Base | 38.28 | 480.6 | 5.13 |
| SeaFormer-Base | SeaFormer-Base | 40.07** | 465.4 | 8.64 |
| PP-MobileSeg-Base | StrideFormer-Base | 41.57 | 265.5 | 5.62 |

Ablation study of PP-MobileSeg-Base on ADE20K

| Model | Backbone | Train Resolution | mIoU (%) | Latency (ms)* | Params (M) | Links |
|---|---|---|---|---|---|---|
| baseline | SeaFormer-Base | 512x512 | 40.00 | 465.6 | 8.27 | model \| log \| vdl \| exported model |
| +VIM | SeaFormer-Base | 512x512 | 40.07 | 234.6 | 8.17 | model \| log \| vdl \| exported model |
| +VIM+StrideFormer | StrideFormer-Base | 512x512 | 40.98 | 235.1 | 5.54 | model \| log \| vdl \| exported model |
| +VIM+StrideFormer+AAM | StrideFormer-Base | 512x512 | 41.57 | 265.5 | 5.62 | model \| log \| vdl \| exported model |

* Note that the latency is tested with the final argmax operator using PaddleLite on a Xiaomi Mi 9 (Snapdragon 855 CPU) with a single thread and a 512x512 input shape. The model output is therefore the single-channel segmentation result rather than probability logits. Motivated by the inefficiency of this final argmax operator, which greatly increases the overall latency, we designed VIM to significantly decrease the latency.

** The accuracy is reported based on our self-trained reproduction.

Reproduction

Preparation

  • Install PaddlePaddle and the required environment based on the installation guide.
  • Install PaddleSeg based on the reference.
  • Download the ADE20K dataset and link it to PaddleSeg/data, or directly run the training script and the dataset will be downloaded automatically.
PaddleSeg/data
├── ADEChallengeData2016
│   ├── ade20k_150_embedding_42.npy
│   ├── annotations
│   ├── annotations_detectron2
│   ├── images
│   ├── objectInfo150.txt
│   └── sceneCategories.txt

Training

You can start training by passing one of the config files under PaddleSeg/configs/pp_mobileseg to tools/train.py. Details about training are in the training guide. The best model is saved under the save directory, e.g. output/pp_mobileseg_base/best_model/model.pdparams for the command below.

export CUDA_VISIBLE_DEVICES=0,1

python3 -m paddle.distributed.launch tools/train.py \
    --config configs/pp_mobileseg/pp_mobileseg_base_ade20k_512x512_80k.yml \
    --save_dir output/pp_mobileseg_base \
    --save_interval 1000 \
    --num_workers 4 \
    --log_iters 100 \
    --use_ema \
    --do_eval \
    --use_vdl

Validation

With the trained model in hand, you can verify its accuracy through evaluation. Details about evaluation are in the evaluation guide.

python -m paddle.distributed.launch tools/val.py \
       --config configs/pp_mobileseg/pp_mobileseg_base_ade20k_512x512_80k.yml \
       --model_path output/pp_mobileseg_base/best_model/model.pdparams

Deployment

We deploy the model on mobile devices for inference. To do that, we need to export the model and use PaddleLite to run inference on the device. You can also refer to the lite deploy guide for details of PaddleLite deployment.

0. Preparation

  • An Android phone with USB debugging enabled and already connected to your PC.
  • Install the adb tool.

Run the following command to make sure you are ready:

adb devices
# The following information will show if you are good to go:
List of devices attached
017QXM19C1000664    device

1. Model export

The model needs to be converted from a dynamic graph to a static graph for PaddleLite inference. In this step, you can use VIM to speed the model up: simply change model::upsample to vim in the config file. The exported model can be found under the directory given by --save_dir (output/pp_mobileseg_base below).

# --input_shape: the exported model infers one image at this fixed shape; adjust it to your dataset.
# --output_op: with VIM enabled keep none; without VIM, set it to argmax to get the final prediction rather than logits.
python tools/export.py \
      --config configs/pp_mobileseg/pp_mobileseg_base_ade20k_512x512_80k.yml \
      --save_dir output/pp_mobileseg_base \
      --input_shape 1 3 512 512 \
      --output_op none
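
Before pushing the model to a phone, you can optionally sanity-check the exported static graph on your PC with the paddle.inference Python API. This is a minimal sketch assuming the export above wrote model.pdmodel and model.pdiparams to output/pp_mobileseg_base; it is not part of the repo's tooling.

import numpy as np
import paddle.inference as paddle_infer

# Load the exported static graph and its weights.
config = paddle_infer.Config("output/pp_mobileseg_base/model.pdmodel",
                             "output/pp_mobileseg_base/model.pdiparams")
predictor = paddle_infer.create_predictor(config)

# Feed one dummy image with the exported input shape (1, 3, 512, 512).
input_handle = predictor.get_input_handle(predictor.get_input_names()[0])
input_handle.copy_from_cpu(np.random.rand(1, 3, 512, 512).astype("float32"))
predictor.run()

# With VIM (--output_op none) the output is already a single-channel label map.
output_handle = predictor.get_output_handle(predictor.get_output_names()[0])
print(output_handle.copy_to_cpu().shape)  # expect (1, 512, 512)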

2. Model inference

  • After the model is exported, download the exported files and the tool zipfile, and arrange them as shown in the following file tree.
Speed_test_dir
├── models_dir
│   ├── pp_mobileseg_base  # Files under this directory are generated by the export step
│   │   ├── model.pdmodel
│   │   ├── model.pdiparams
│   │   ├── model.pdiparams.info
│   │   └── deploy.yaml
│   ├── pp_mobileseg_tiny
│   │   ├── model.pdmodel
│   │   ├── model.pdiparams
│   │   ├── model.pdiparams.info
│   │   └── deploy.yaml
├── benchmark_bin   # The compiled PaddleLite test binary, shipped in the tool zipfile
├── image1.txt      # A txt file that stores the values of the resized and normalized image
└── gen_val_txt.py  # A script to generate image1.txt for your test image (see the sketch below)
  • Then you can test the speed of the model using the following command. The results are written to test_result.txt.
sh benchmark.sh benchmark_bin models_dir test_result.txt image1.txt
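
For reference, here is a hypothetical stand-in for gen_val_txt.py; the actual script ships in the tool zipfile and may differ. It resizes an image to the exported input size, normalizes it, and dumps the flattened float values to a txt file for benchmark_bin. The file name image1.jpg and the mean/std values are assumptions.

import numpy as np
from PIL import Image

# Resize to the exported input resolution and scale to [0, 1].
img = np.asarray(Image.open("image1.jpg").convert("RGB").resize((512, 512)),
                 dtype=np.float32) / 255.0

# Assumed normalization; match whatever your training config used.
mean = np.array([0.5, 0.5, 0.5], dtype=np.float32)
std = np.array([0.5, 0.5, 0.5], dtype=np.float32)
chw = ((img - mean) / std).transpose(2, 0, 1)  # HWC -> CHW

# One float per line, a plain-text dump of the input tensor.
np.savetxt("image1.txt", chw.reshape(-1), fmt="%.6f")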

The test result of our PP-MobileSeg-Base is as follows:

-----------------Model=MV3_4stage_AAMSx8_valid_0321 Threads=1-------------------------
Delete previous optimized model: /data/local/tmp/seg_benchmark/models_0321/MV3_4stage_AAMSx8_valid_0321/opt.nb

---------- Opt Info ----------
Load paddle model from /data/local/tmp/seg_benchmark/models_0321/MV3_4stage_AAMSx8_valid_0321/model.pdmodel and /data/local/tmp/seg_benchmark/models_0321/MV3_4stage_AAMSx8_valid_0321/model.pdiparams
Save optimized model to /data/local/tmp/seg_benchmark/models_0321/MV3_4stage_AAMSx8_valid_0321/opt.nb

---------- Device Info ----------
Brand: Xiaomi
Device: cepheus
Model: MI 9
Android Version: 9
Android API Level: 28

---------- Model Info ----------
optimized_model_file: /data/local/tmp/seg_benchmark/models_0321/MV3_4stage_AAMSx8_valid_0321/opt.nb
input_data_path: /data/local/tmp/seg_benchmark/image1_norm.txt
input_shape: 1,3,512,512
output tensor num: 1
--- output tensor 0 ---
output shape(NCHW): 1 512 512
output tensor 0 elem num: 262144
output tensor 0 mean value: 1.18468e-44
output tensor 0 standard deviation: 2.52949e-44

---------- Runtime Info ----------
benchmark_bin version: e79b4b6
threads: 1
power_mode: 0
warmup: 20
repeats: 50
result_path:

---------- Backend Info ----------
backend: arm
cpu precision: fp32

---------- Perf Info ----------
Time(unit: ms):
init  = 33.071  
first = 314.619  
min   = 265.450  
max   = 271.217  
avg   = 267.246