
Speedup of INT8/XNOR on Tensor Cores far less than claimed #2365

Closed
JC-13 opened this issue Feb 8, 2019 · 8 comments

Comments

@JC-13

JC-13 commented Feb 8, 2019

I have been testing the speed of my custom-trained yolov3-tiny with 4 classes on a 2080 Ti (Turing) and a Xavier (Volta). However, using XNOR or INT8 gives less than a 10% speedup compared to normal FP32. All testing has been done with the same 1080p video file as input.
The repo was git-pulled yesterday (07/02/19).

[screenshot: benchmark results]

MAKEFILE-Xavier GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=0 OPENMP=0 LIBSO=0
MAKEFILE-2080ti GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=1 OPENMP=1 LIBSO=0
I have also included the Makefile ARCH lines for CC=7.5 and CC=7.3 respectively.

Any ideas why there is no significant speedup from using mixed precision?

@njgre6

njgre6 commented Feb 11, 2019

@AlexeyAB I am also quite interested in this; I am looking at using low-precision inference for a real-time embedded object detection system. I would love to know why the results above fall so far short of the theoretical speedup.

@JC-13
Author

JC-13 commented Feb 11, 2019

It seems the real solution is just to use NVIDIA's trt-yolo-app, which is based on TensorRT. I can't comment on the accuracy, but the speed was significantly better:
[screenshot: trt-yolo-app benchmark results]
Note: I used a 544x544 input because trt-yolo-app only accepts square inputs and can't handle video.

@AlexeyAB
Owner

@JC-13 Hi,

All testing has been done with the same 1080p video file as input.
The repo was git-pulled yesterday (07/02/19).
...
Note: I used a 544x544 input because trt-yolo-app only accepts square inputs and can't handle video.

Try checking GPU usage during detection; it looks like your CPU simply can't capture more than 205-230 frames per second from the video file. There is also still non-optimal post-processing on the CPU. So try testing both repos with a single image.
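For example (a hedged sketch; exact tool availability depends on your setup), GPU utilization can be watched in a second terminal while detection runs:

nvidia-smi dmon -s u     # per-second GPU utilization on the 2080 Ti machine
sudo tegrastats          # GPU/CPU load on the Jetson Xavier (provided by JetPack)

If the GPU sits well below 100% while FPS stays capped around 205-230, the bottleneck is video capture and CPU post-processing rather than the network itself.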

  1. Update your code from this GitHub repo (the last couple of commits).

  2. Train a full (not Tiny) XNOR-net model at 608x608 or 544x544 using this cfg-file, yolov3-spp_xnor_obj.cfg.txt, and this pre-trained file: https://drive.google.com/file/d/1d4CkgR--7bEEN0kWy-osR3kjLVFDIrnl/view?usp=sharing

  3. Then test it on your image, and divide 1000 ms by the measured time in ms to get the FPS (see the worked example after the command below):

darknet.exe detector test data/obj.data yolov3-spp_xnor_obj.cfg backup/yolov3-spp_xnor_obj_last.weights -thresh 0.15 image2.jpg
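For example, if darknet reports 13.5 ms per image, that is 1000 / 13.5 ≈ 74 FPS.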


MAKEFILE-Xavier GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=0 OPENMP=0 LIBSO=0
MAKEFILE-2080ti GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=1 OPENMP=1 LIBSO=0
I have also included the Makefile ARCH lines for CC=7.5 and CC=7.3 respectively.

The 2080 Ti (Turing) is CC 7.5, and the Xavier (Volta) is CC 7.2 (not 7.3).

It supports CUDA 10 with a compute capability of sm_72.
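For reference, the matching ARCH lines in the stock Makefile look roughly like this (a sketch assuming the usual darknet Makefile layout; leave only the line for your GPU un-commented):

# GeForce RTX 2080 Ti (Turing, CC 7.5)
ARCH= -gencode arch=compute_75,code=[sm_75,compute_75]
# Jetson Xavier (Volta, CC 7.2)
# ARCH= -gencode arch=compute_72,code=[sm_72,compute_72]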


Also, I haven't optimized INT8 for Tensor Cores, because there is a bug in cuDNN that has to be bypassed in a non-standard way (apparently TensorRT does this): #407 (comment)

@AlexeyAB
Owner

AlexeyAB commented Feb 12, 2019

The commits from Feb 12, 2019 are used.
Network resolution is 608x608 in both cases.

Test commands:

  • darknet.exe detector test data/coco.data cfg/yolov3-spp.cfg yolov3-spp.weights dog.jpg

  • darknet.exe detector test data/obj.data yolov3-spp_xnor_obj.cfg backup/yolov3-spp_xnor_obj_last.weights image2.jpg

| Model | RTX 2070, CUDNN_HALF=0 (ms) | RTX 2070, CUDNN_HALF=1 (ms) | Speedup (X times) |
| --- | --- | --- | --- |
| yolov3-spp.cfg 608x608, Float 32/16-bit precision | 40.9 | 27.2 | 1.5x |
| yolov3-spp_xnor_obj.cfg.txt 608x608, CC 7.5 (Tensor Cores for XNOR), Bit-1 precision | 13.5 | 13.2 | 1.0x |
| Speedup (X times) | 3.0x | 2.0x | - |
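The millisecond values above are taken from darknet's console output. A minimal sketch of collecting them on Linux, assuming the usual "Predicted in ... milli-seconds." line that the test command prints:

# run the same image three times and keep only the timing line
for i in 1 2 3; do ./darknet detector test data/coco.data cfg/yolov3-spp.cfg yolov3-spp.weights dog.jpg -dont_show | grep "milli-seconds"; done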

There is still room for optimization.

Used:
CUDA 10.0, cuDNN 7.4.2, OpenCV 3.2.0, Windows 7 x64, MSVS 2015
nVidia GPU GeForce RTX 2070 CC7.5 (Turing, TU106) - 7.5 Tflops-SP (Tensor Cores 59.7 Tflops-HP)
If CUDNN_HALF=1 is set, then Tensor Cores are used for float layers; otherwise they are not used for floats.
Tensor Cores are used for XNOR in either case, provided the GPU has CC >= 7.3 and this line is un-commented:

# ARCH= -gencode arch=compute_75,code=[sm_75,compute_75]


This file was used to train the XNOR model: https://drive.google.com/open?id=1IT-vvyxRLlxY5g9rJp_G2U3TXYphjBv8

XNOR-net training process:
[training chart: chart_yolov3-spp_xnor_obj]

@LukeAI

LukeAI commented Mar 5, 2019

@AlexeyAB
Thanks very much for providing the pre-trained feature-extractor weights above. Were they trained on OpenImages?
Would you be kind enough to also share your final yolov3-spp_xnor_obj.weights?
I want to train my own using the above .cfg and pre-trained weights, but would like to compare to yours as a reference.

@AlexeyAB
Owner

AlexeyAB commented Mar 5, 2019

@LukeAI Hi,

I can't share yolov3-spp_xnor_obj.weights.
But I can share new pre-trained weights: https://drive.google.com/open?id=1IT-vvyxRLlxY5g9rJp_G2U3TXYphjBv8
They should give you a better mAP for your training.

@LukeAI

LukeAI commented Mar 6, 2019

Thank you very much! Are these for OpenImages? Presumably trained at 448x448? Would that transfer OK to 608x608?

@AlexeyAB
Owner

AlexeyAB commented Mar 6, 2019

@LukeAI It is trained on ImageNet (137 GB, ~1,300,000 images, ILSVRC2012_img_train.tar): https://github.com/AlexeyAB/darknet/blob/master/scripts/get_imagenet_train.sh

You can use it for training on the OpenImages dataset at 608x608.
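A minimal sketch of such a training run (the pre-trained file name below is just a placeholder for whatever the file downloaded from the Google Drive link above is called; data/obj.data and the cfg are assumed from the earlier commands in this thread):

darknet.exe detector train data/obj.data yolov3-spp_xnor_obj.cfg <pretrained-weights-from-link> -map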
