Replies: 3 comments 5 replies
-
Yes, and we can even say this is expected. There are many related issues in TensorFlow, like "INT TFLITE very much slower than FLOAT TFLITE" (#21698).
-
-
Have you tried …
-
When I run inference with an EfficientNet model I trained, a single image takes 0.26 s. But when I run inference with a quantized version of the same EfficientNet model, it takes 26 s. I thought that because the model is quantized it would be doing integer operations rather than float operations, so it should be faster. Do you know what I could do to speed up inference on a quantized model? This is the code I'm using:
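(The code snippet itself did not survive in this copy of the thread.) Independent of the exact code, a useful first diagnostic is to time `interpreter.invoke()` in isolation for both the float and the quantized interpreter, excluding one-time setup such as interpreter construction and `allocate_tensors()`. A minimal stdlib timing helper, as a sketch; the `tf.lite.Interpreter` calls in the docstring are the usual TFLite Python API, and the model path and thread count are placeholders, not taken from the post:

```python
import time

def mean_invoke_time(invoke, warmup=3, runs=10):
    """Average wall-clock seconds per call of `invoke`.

    `invoke` would typically be `interpreter.invoke` from a loaded
    tf.lite.Interpreter, e.g. (placeholder path and thread count):

        interpreter = tf.lite.Interpreter(model_path="model_quant.tflite",
                                          num_threads=4)
        interpreter.allocate_tensors()
        t = mean_invoke_time(interpreter.invoke)
    """
    for _ in range(warmup):
        invoke()  # warm-up calls exclude one-time setup cost
    start = time.perf_counter()
    for _ in range(runs):
        invoke()
    return (time.perf_counter() - start) / runs
```

Timing both interpreters with the same helper shows whether the 100x gap is really inside `invoke()` (e.g. a slow reference int8 kernel, as in #21698) or in code around it, such as rebuilding the interpreter or calling `allocate_tensors()` for every image.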