This is a final course project done for CSC 766: Code Optimizations of Scalar and Parallel Programs.
The hardware requirements are a Linux machine with Ubuntu installed and an Android smartphone.
Running the project on an M1 Mac poses challenges due to the Python version requirement (3.7), which is difficult to install natively on ARM platforms. While x86_64 emulation via Rosetta offers a workaround, using a Linux machine with sudo permissions is recommended for convenience.
First, clone the repository and all submodules.
git clone --recurse-submodules git@github.com:briancpark/csc766-project.git
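If you happened to clone without the submodule flag, the standard git command below should pull the submodules in afterwards (just a convenience, not a required step if the clone above succeeded):
# fetch any submodules that were skipped during the initial clone
git submodule update --init --recursive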
Next, you'll need to install the dependencies as required by MACE.
CMake version 3.11.3 or higher is required to build MACE with CMake. This is only needed if the Bazel build doesn't work for you.
sudo apt install cmake
If for some reason you don't have CMake installed, fear not! You can use Bazel to build the project instead. This project was actually developed in a Bazel environment, and the commands below assume one. Please be logged in as root (via sudo bash) and run the commands below.
export BAZEL_VERSION=0.13.1
mkdir /bazel && \
cd /bazel && \
wget https://github.com/bazelbuild/bazel/releases/download/$BAZEL_VERSION/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \
chmod +x bazel-*.sh && \
./bazel-$BAZEL_VERSION-installer-linux-x86_64.sh && \
cd / && \
rm -f /bazel/bazel-$BAZEL_VERSION-installer-linux-x86_64.sh
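To confirm the installer worked, ask Bazel for its version; it should report 0.13.1:
# sanity check the Bazel installation
bazel version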
After installation has completed, add source /usr/local/lib/bazel/bin/bazel-complete.bash to your ~/.bashrc, for example as shown below.
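This appends the completion script to your shell configuration; editing ~/.bashrc by hand works just as well:
# enable Bazel tab completion in future shells
echo 'source /usr/local/lib/bazel/bin/bazel-complete.bash' >> ~/.bashrc
source ~/.bashrc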
MACE requires the Android NDK to be installed. Please follow the instructions here to install the NDK. Under the Bazel environment it is a strict requirement to use an NDK version below r16c, or else there may be compilation errors.
cd /opt/ && \
wget -q https://dl.google.com/android/repository/android-ndk-r15c-linux-x86_64.zip && \
unzip -q android-ndk-r15c-linux-x86_64.zip && \
rm -f android-ndk-r15c-linux-x86_64.zip
After installing successfully, please add the following to ~/.bashrc.
export ANDROID_NDK_VERSION=r15c
export ANDROID_NDK=/opt/android-ndk-${ANDROID_NDK_VERSION}
export ANDROID_NDK_HOME=${ANDROID_NDK}
# add to PATH
export PATH=${PATH}:${ANDROID_NDK_HOME}
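A quick sanity check that the NDK variables are picked up (assuming the exports above went into ~/.bashrc):
# reload the shell config and confirm the NDK path resolves
source ~/.bashrc
echo $ANDROID_NDK_HOME
ls $ANDROID_NDK_HOME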
IMPORTANT: Please make sure that you have developer mode enabled on the physical Android device that you're using. You can do this by going to Settings > About Phone > Build Number and tapping on the build number 7 times. After that, you should see a message that says You are now a developer! Go back to Settings and you should see a new entry called Developer Options. Enable this option.
In addition, you need ncurses, which may not be installed by default on Ubuntu.
sudo apt-get install libncurses5
When connecting an Android device, you may need to toggle on USB Debugging in Developer Options. Once USB Debugging is on and the device is connected over USB, approve the connection prompt on the phone. Running adb devices should show your device, and adb shell will give you a shell on the device.
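For reference, a healthy connection looks roughly like the output below; the serial number is just a placeholder for whatever your device reports:
adb devices
# List of devices attached
# 0123456789ABCDEF    device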
Please create a Conda environment with Python 3.7 installed. If you don't have Conda installed, please follow the instructions here.
conda create -n csc766 python=3.7
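# activate the new environment so the following pip installs land in it
conda activate csc766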
pip3 install --upgrade pip
Install all the requirements for MACE, which are pinned in requirements.txt. It is crucial that you install the exact versions of the packages listed in the file; different ONNX versions can produce slightly different outputs, which can cause the MACE conversion to fail.
pip3 install -r requirements.txt
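A quick way to confirm the pinned packages were picked up is to import the onnx package and print its version (the exact version string depends on requirements.txt):
# verify that the pinned ONNX package resolved correctly
python3 -c "import onnx; print(onnx.__version__)"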
NOTE: If you are trying to replicate the sample model examples that MACE provides in the MACE Model Zoo, you'll need to create a separate Conda environment with Python 3.6 installed. You MUST use Python 3.6 for those examples: MACE requires a specific version of TensorFlow that only works on older versions of Python, and the graph API that MACE uses is deprecated in newer TensorFlow versions. For the purposes of this project, however, Python 3.7 is fine because we use the ONNX backend rather than the TensorFlow backend.
Finally, we come to the MACE installation. Please follow the instructions here to install MACE.
Clone the fork of the MACE repository that I've created. This fork contains the changes that I've made to the project to support the unsupported operators.
git clone git@github.com:briancpark/mace.git
The task of this project is to add support for the operators needed by the ShuffleNet and RegNet DNNs.
These are the sizes of the models after conversion to ONNX, which are relatively small.
11M regnet.onnx
5.3M shufflenet_opt_pt-sim.onnx
The models are available in the onnx_models directory.
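To double-check that the models are in place, you can list the directory; the sizes should roughly match the numbers above:
ls -lh onnx_models/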
The project was evaluated on a Xiaomi Mi 11 Lite. Released in April 2021, the Mi 11 Lite has a Qualcomm SM7150 Snapdragon 732G (8 nm) processor. The CPU is an octa-core (2x2.3 GHz Kryo 470 Gold & 6x1.8 GHz Kryo 470 Silver) processor. The GPU is an Adreno 618. The device has 6GB of RAM and 128GB of storage. The device has a 6.55" AMOLED display with a resolution of 1080x2400 pixels.
The OS is Android 11 (RKQ1.200826.002), with MIUI 12.5.1. The device has a 4,250 mAh battery, a 48MP main camera, an 8MP ultra-wide camera, a 5MP macro camera, a 2MP depth camera, and a 16MP front-facing camera. [Source](https://www.gsmarena.com/xiaomi_mi_11_lite-10665.php)
First, you will need to optimize the ONNX files as follows:
python tools/onnx_optimizer.py ../onnx_models/regnet.onnx ../onnx_models/regnet_opt.onnx
python tools/onnx_optimizer.py ../onnx_models/shufflenet.onnx ../onnx_models/shufflenet_opt.onnx
Once that is done, compute the SHA256 checksums to make sure the files match. You will later need the checksums for the configuration files, but I've already included them in the repository, so the following commands are just for reference.
sha256sum /home/bcpark/csc766-project/onnx_models/regnet_opt.onnx
sha256sum /home/bcpark/csc766-project/onnx_models/shufflenet_opt.onnx
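If you do regenerate the models, the freshly computed hashes need to match the ones recorded in the deployment configs. As a hedged example (assuming the checksum fields in the YAML files contain the string sha256), you can compare them with:
# compare the computed hash against whatever checksum the config records
sha256sum onnx_models/regnet_opt.onnx
grep -i sha256 deployment_config/regnet.yml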
IMPORTANT: When compiling and running the two models, please make sure to check out the correct branch containing the changes for each model. The master branch is just a copy of the original MACE repository and does not contain the changes for the unsupported operators. regnet-gpu-optimized contains the changes for RegNet, and shufflenet-gpu-optimized contains the changes for ShuffleNet.
cd mace
git checkout regnet-gpu-optimized
# Run RegNet related commands
git checkout shufflenet-gpu-optimized
# Run ShuffleNet related commands
To compile and run RegNet:
python tools/converter.py convert --config=../deployment_config/regnet.yml --debug_mode --vlog_level=3 && \
python tools/converter.py run --config=../deployment_config/regnet.yml --debug_mode --vlog_level=3
To compile and run ShuffleNet:
python tools/converter.py convert --config=../deployment_config/shufflenet.yml --debug_mode --vlog_level=3 && \
python tools/converter.py run --config=../deployment_config/shufflenet.yml --debug_mode --vlog_level=3
To verify model correctness, you can run the following command:
python tools/converter.py run --config=../deployment_config/regnet.yml --validate
python tools/converter.py run --config=../deployment_config/shufflenet.yml --validate
A correct model will output the following in green. This is tested on synthetic data. Similarity is cosine similarity, where values closer to 1.0 are better. SQNR is the signal-to-quantization-noise ratio, where higher is better. Pixel accuracy is the percentage of pixels that are the same between the two models, where values closer to 1.0 are better. You can learn more here.
output MACE VS ONNX similarity: 0.9999781457219321 , sqnr: 22823.096153305403 , pixel_accuracy: 1.0
******************************************
Similarity Test Passed
******************************************
An incorrect model will output the following in red:
output MACE VS ONNX similarity: 0.017589891526221674 , sqnr: 0.10534350168590623 , pixel_accuracy: 0.0
ERROR: [] /home/bcpark/csc766-project/mace/tools/validate.py:131: ******************************************
Similarity Test Failed
******************************************
You can validate the model layer by layer, but this can give some false positives since the model is cross-validated against the ONNX model and some MACE ops are fused or optimized.
python tools/converter.py run --config=../deployment_config/regnet.yml --validate --layers :
python tools/converter.py run --config=../deployment_config/shufflenet.yml --validate --layers :
To benchmark the model, you can run the following command:
# RegNet
python tools/converter.py run --config=../deployment_config/regnet.yml --benchmark --round=1000 --gpu_priority_hint=3
# ShuffleNet
python tools/converter.py run --config=../deployment_config/shufflenet.yml --benchmark --round=1000 --gpu_priority_hint=3
To benchmark a specific OP (TODO: this doesn't work yet for the ops I've implemented):
python tools/bazel_adb_run.py --target="//test/ccbenchmark:mace_cc_benchmark" --run_target=True --args="--filter=.*CONV.*"
The files are uploaded to the /data/local/tmp/mace_run/ directory on the device.
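You can inspect what got pushed with adb, for example:
# list the artifacts MACE pushed to the device
adb shell ls -lh /data/local/tmp/mace_run/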
After doing this project, it turns out that not all ops are fully supported for every hardware target. Look here for a comprehensive list of supported operators.
This is for my own reference, for when I want to debug against a fully implemented model.
# Convert the model
python tools/converter.py convert --config=../mace-models/mobilenet-v2/mobilenet-v2.yml
# Run the model
python tools/converter.py run --config=../mace-models/mobilenet-v2/mobilenet-v2.yml
# Test model run time
python tools/converter.py run --config=../mace-models/mobilenet-v2/mobilenet-v2.yml --round=100
# Validate the correctness by comparing the results against the
# original model and framework, measured with cosine distance for similarity.
python tools/converter.py run --config=../mace-models/mobilenet-v2/mobilenet-v2.yml --validate
The diff stat for each branch is as follows.
For regnet-gpu-optimized, the OpenCL implementation lies in mace/ops/opencl/cl/group_conv2d.cl.
git diff --stat origin
mace/core/net/serial_net.cc | 45 +++---
mace/core/net_optimizer.cc | 5 +-
mace/ops/arm/base/group_conv2d.cc | 277 ++++++++++++++++++++++++++++++++++
mace/ops/arm/base/group_conv2d.h | 132 ++++++++++++++++
mace/ops/arm/base/group_conv2d_3x3.cc | 36 +++++
mace/ops/arm/base/group_conv2d_3x3.h | 59 ++++++++
mace/ops/arm/base/group_conv2d_mxn.h | 87 +++++++++++
mace/ops/arm/fp32/group_conv2d_3x3.cc | 401 +++++++++++++++++++++++++++++++++++++++++++++++++
mace/ops/delegator/group_conv2d.h | 96 ++++++++++++
mace/ops/group_conv2d.cc | 263 ++++++++++++++++++++++++++++++++
mace/ops/group_conv2d.h | 103 +++++++++++++
mace/ops/opencl/buffer/group_conv2d.h | 96 ++++++++++++
mace/ops/opencl/cl/group_conv2d.cl | 169 +++++++++++++++++++++
mace/ops/opencl/group_conv2d.h | 57 +++++++
mace/ops/opencl/image/group_conv2d.cc | 130 ++++++++++++++++
mace/ops/opencl/image/group_conv2d.h | 102 +++++++++++++
mace/ops/opencl/image/group_conv2d_3x3.cc | 194 ++++++++++++++++++++++++
mace/ops/opencl/image/group_conv2d_general.cc | 204 +++++++++++++++++++++++++
mace/ops/ref/group_conv2d.cc | 134 +++++++++++++++++
mace/ops/registry/op_delegators_registry.cc | 8 +-
mace/ops/registry/ops_registry.cc | 2 +
mace/utils/statistics.cc | 96 ++++++------
repository/opencl-kernel/opencl_kernel_configure.bzl | 2 +
tools/device.py | 5 +
tools/python/transform/base_converter.py | 2 +
tools/python/transform/onnx_converter.py | 17 ++-
tools/python/transform/transformer.py | 1 +
27 files changed, 2642 insertions(+), 81 deletions(-)
For shufflenet-gpu-optimized, the OpenCL implementation lies in mace/ops/opencl/cl/channel_shuffle.cl.
git diff --stat origin
mace/core/net_def_adapter.cc | 416 ++++++++++++++++++++++++++++---------------------------------
mace/core/net_optimizer.cc | 5 +-
mace/ops/channel_shuffle.cc | 26 ++--
mace/ops/opencl/cl/channel_shuffle.cl | 46 +++----
mace/ops/opencl/image/channel_shuffle.cc | 16 ++-
tools/device.py | 5 +
tools/python/transform/transformer.py | 12 +-
7 files changed, 244 insertions(+), 282 deletions(-)
Some more detailed debugging notes and results obtained are shown in DEVELOPMENT.md.