Speedup of INT8/XNOR on Tensor Cores far less than claimed #2365
Comments
@AlexeyAB I am also fairly interested in this; I was looking at using low-precision inference for a real-time embedded object-detection system. I would love to know why the above results are not as good as the theoretical ones.
It seems like the real solution is just to use NVIDIA's trt-yolo app based on TensorRT. I can't comment on the accuracy, but the speed was significantly better.
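For reference, TensorRT gets its reduced-precision speedup by explicitly requesting FP16/INT8 kernels at engine-build time. A minimal sketch of that step, assuming the TensorRT 5-era C++ API that was current when this issue was filed (network construction, weight parsing, and the INT8 calibrator are omitted; `myCalibrator` is a hypothetical object):

```cpp
// Minimal sketch (TensorRT 5-era API): where trt-yolo-style apps request
// reduced-precision kernels. These builder flags were later replaced by
// IBuilderConfig::setFlag in newer TensorRT releases.
#include <NvInfer.h>
#include <cstdio>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
};

int main() {
    Logger logger;
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);

    // FP16 kernels (Tensor Cores on Volta/Turing when layer shapes allow):
    builder->setFp16Mode(true);

    // INT8 kernels additionally need calibration data:
    // builder->setInt8Mode(true);
    // builder->setInt8Calibrator(&myCalibrator);  // hypothetical calibrator

    // ... define/parse the network, then builder->buildCudaEngine(network) ...
    builder->destroy();
    return 0;
}
```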
@JC-13 Hi,
Try checking GPU usage during detection: it looks like your CPU simply can't capture more than 205-230 frames per second from a video file. The post-processing on the CPU is also still not optimal. So try testing both repos with a single image instead.
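One way to check whether video decoding is the ceiling is to time capture alone, with no network at all. A minimal sketch assuming OpenCV and a placeholder file name `video.mp4`:

```cpp
// Minimal sketch: measure how fast the CPU alone can pull frames from a video
// file with OpenCV, with no inference at all. If this tops out near the FPS
// reported by darknet, the decoder/CPU is the bottleneck, not the network.
#include <opencv2/opencv.hpp>
#include <chrono>
#include <cstdio>

int main() {
    cv::VideoCapture cap("video.mp4");  // hypothetical path
    if (!cap.isOpened()) return 1;

    cv::Mat frame;
    int frames = 0;
    auto t0 = std::chrono::steady_clock::now();
    while (cap.read(frame)) ++frames;
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("decoded %d frames in %.2f s = %.1f FPS\n",
                frames, sec, frames / sec);
    return 0;
}
```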
The 2080 Ti (Turing) is CC 7.5 and the Xavier (Volta) is CC 7.2 (not 7.3).
Also, I didn't optimize INT8 for Tensor Cores, because there is a bug in cuDNN that has to be bypassed in an uncommon way (apparently TensorRT does this): #407 (comment)
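For context on why that path is uncommon: cuDNN's INT8 convolutions only run through one restrictive descriptor combination, which doesn't match darknet's usual NCHW float pipeline. A minimal sketch of the required setup, assuming the cuDNN 7-era API (error checking omitted; the 3x3/pad-1/stride-1 dimensions are placeholders):

```cpp
// Minimal sketch (cuDNN 7-era API, error checks omitted): INT8 convolutions in
// cuDNN require the vectorized NCHW_VECT_C layout with INT8x4 data, INT32
// accumulation, and the IMPLICIT_PRECOMP_GEMM algorithm -- none of which match
// darknet's NCHW float path, so data must be repacked around each layer.
#include <cudnn.h>

void setup_int8_conv(int n, int c, int h, int w, int k) {
    cudnnTensorDescriptor_t x_desc, y_desc;
    cudnnFilterDescriptor_t w_desc;
    cudnnConvolutionDescriptor_t conv_desc;

    cudnnCreateTensorDescriptor(&x_desc);
    cudnnCreateTensorDescriptor(&y_desc);
    cudnnCreateFilterDescriptor(&w_desc);
    cudnnCreateConvolutionDescriptor(&conv_desc);

    // Channels are packed four at a time: c must be a multiple of 4.
    cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, n, c, h, w);
    cudnnSetFilter4dDescriptor(w_desc, CUDNN_DATA_INT8x4,
                               CUDNN_TENSOR_NCHW_VECT_C, k, c, 3, 3);

    // INT8 inputs must accumulate into INT32.
    cudnnSetConvolution2dDescriptor(conv_desc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_INT32);

    cudnnSetTensor4dDescriptor(y_desc, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, n, k, h, w);

    // Only this algorithm supports the INT8x4 path:
    // cudnnConvolutionForward(handle, &alpha, x_desc, x, w_desc, wts, conv_desc,
    //     CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
    //     workspace, workspace_size, &beta, y_desc, y);
}
```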
The commit from Feb 12, 2019 was used. Test commands:
There is still room for optimization. Used: line 27 at commit 3d9c853.
This file was used to train the XNOR model: https://drive.google.com/open?id=1IT-vvyxRLlxY5g9rJp_G2U3TXYphjBv8
@AlexeyAB
@LukeAI Hi, I can't share yolov3-spp_xnor_obj.weights |
Thank you very much! Are these for OpenImages? Presumably trained at 448x448? Would that transfer OK to 608x608?
@LukeAI It is trained on ImageNet (137 GB, ~1,300,000 images). You can use it for training on the OpenImages dataset at 608x608.
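Transfer training with such a pretrained file follows the usual darknet pattern, something along the lines of `./darknet detector train openimages.data yolov3-spp_xnor.cfg yolov3-spp_xnor.conv.weights`, where all three file names are placeholders rather than files named in this thread.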
Original issue (@JC-13):
I have been testing the speed of my custom-trained yolov3-tiny with 4 classes on a 2080 Ti (Turing) and a Xavier (Volta). However, XNOR and INT8 both give a <10% speedup compared to normal FP32. All testing has been done with the same 1080p video file as input.
The repo was git-pulled yesterday (07/02/19).
Makefile (Xavier): GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=0 OPENMP=0 LIBSO=0
Makefile (2080 Ti): GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=1 OPENMP=1 LIBSO=0
I have also enabled the Makefile lines for CC=7.5 and CC=7.3 respectively.
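As noted in the reply above, Xavier is actually CC 7.2, not 7.3, so the relevant Makefile entry would be the `compute_72`/`sm_72` gencode line. The compute capability can be confirmed at runtime; a minimal sketch using the CUDA runtime API:

```cpp
// Minimal sketch: print the compute capability of each visible GPU so the
// Makefile ARCH/gencode flags can be matched exactly (e.g. sm_75 for a
// 2080 Ti, sm_72 for Xavier -- there is no sm_73).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("GPU %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```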
Any ideas why there is not a significant speedup by using mixed precision?