Speedup of INT8/XNOR on Tensor Cores far less than claimed #2365
Comments
@AlexeyAB I am also fairly interested in this; I was looking at using low-precision inference for a real-time embedded object-detection system. I would love to know why the above results are not as good as the theoretical ones.
It seems like the real solution is just to use NVIDIA's trt-yolo app based on TensorRT. I can't comment on the accuracy, but the speed was significantly better.
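For reference, TensorRT gets its reduced-precision speedup by explicitly requesting FP16/INT8 kernels at engine-build time. A minimal sketch of that step, assuming the TensorRT 5-era C++ API that was current when this issue was filed (network construction, weight parsing, and the INT8 calibrator are omitted; `myCalibrator` is a hypothetical object):

```cpp
// Minimal sketch (TensorRT 5-era API): where trt-yolo-style apps request
// reduced-precision kernels. These builder flags were later replaced by
// IBuilderConfig::setFlag in newer TensorRT releases.
#include <NvInfer.h>
#include <cstdio>

class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
};

int main() {
    Logger logger;
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);

    // FP16 kernels (Tensor Cores on Volta/Turing when layer shapes allow):
    builder->setFp16Mode(true);

    // INT8 kernels additionally need calibration data:
    // builder->setInt8Mode(true);
    // builder->setInt8Calibrator(&myCalibrator);  // hypothetical calibrator

    // ... define/parse the network, then builder->buildCudaEngine(network) ...
    builder->destroy();
    return 0;
}
```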
@JC-13 Hi,
Try checking GPU usage during detection: it looks like your CPU simply can't capture more than 205-230 frames per second from a video file. The post-processing on the CPU is also still not optimal. So try testing both repos with a single image instead.
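One way to check whether video decoding is the ceiling is to time capture alone, with no network at all. A minimal sketch assuming OpenCV and a placeholder file name `video.mp4`:

```cpp
// Minimal sketch: measure how fast the CPU alone can pull frames from a video
// file with OpenCV, with no inference at all. If this tops out near the FPS
// reported by darknet, the decoder/CPU is the bottleneck, not the network.
#include <opencv2/opencv.hpp>
#include <chrono>
#include <cstdio>

int main() {
    cv::VideoCapture cap("video.mp4");  // hypothetical path
    if (!cap.isOpened()) return 1;

    cv::Mat frame;
    int frames = 0;
    auto t0 = std::chrono::steady_clock::now();
    while (cap.read(frame)) ++frames;
    auto t1 = std::chrono::steady_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    std::printf("decoded %d frames in %.2f s = %.1f FPS\n",
                frames, sec, frames / sec);
    return 0;
}
```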
The 2080 Ti (Turing) is CC 7.5 and the Xavier (Volta) is CC 7.2 (not 7.3).
Also, I didn't optimize INT8 for Tensor Cores, because there is a bug in cuDNN that has to be bypassed in an uncommon way (apparently TensorRT does this): #407 (comment)
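For context on why that path is uncommon: cuDNN's INT8 convolutions only run through one restrictive descriptor combination, which doesn't match darknet's usual NCHW float pipeline. A minimal sketch of the required setup, assuming the cuDNN 7-era API (error checking omitted; the 3x3/pad-1/stride-1 dimensions are placeholders):

```cpp
// Minimal sketch (cuDNN 7-era API, error checks omitted): INT8 convolutions in
// cuDNN require the vectorized NCHW_VECT_C layout with INT8x4 data, INT32
// accumulation, and the IMPLICIT_PRECOMP_GEMM algorithm -- none of which match
// darknet's NCHW float path, so data must be repacked around each layer.
#include <cudnn.h>

void setup_int8_conv(int n, int c, int h, int w, int k) {
    cudnnTensorDescriptor_t x_desc, y_desc;
    cudnnFilterDescriptor_t w_desc;
    cudnnConvolutionDescriptor_t conv_desc;

    cudnnCreateTensorDescriptor(&x_desc);
    cudnnCreateTensorDescriptor(&y_desc);
    cudnnCreateFilterDescriptor(&w_desc);
    cudnnCreateConvolutionDescriptor(&conv_desc);

    // Channels are packed four at a time: c must be a multiple of 4.
    cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, n, c, h, w);
    cudnnSetFilter4dDescriptor(w_desc, CUDNN_DATA_INT8x4,
                               CUDNN_TENSOR_NCHW_VECT_C, k, c, 3, 3);

    // INT8 inputs must accumulate into INT32.
    cudnnSetConvolution2dDescriptor(conv_desc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_INT32);

    cudnnSetTensor4dDescriptor(y_desc, CUDNN_TENSOR_NCHW_VECT_C,
                               CUDNN_DATA_INT8x4, n, k, h, w);

    // Only this algorithm supports the INT8x4 path:
    // cudnnConvolutionForward(handle, &alpha, x_desc, x, w_desc, wts, conv_desc,
    //     CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
    //     workspace, workspace_size, &beta, y_desc, y);
}
```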
The commit from Feb 12, 2019 was used. Test commands:
There is still room for optimization. Used: line 27 at commit 3d9c853.
This file was used to train the XNOR model: https://drive.google.com/open?id=1IT-vvyxRLlxY5g9rJp_G2U3TXYphjBv8
@AlexeyAB
@LukeAI Hi, I can't share yolov3-spp_xnor_obj.weights |
Thank you very much! Are these for OpenImages? Presumably trained at 448x448? Would that transfer OK to 608x608?
@LukeAI It is trained on ImageNet (137 GB, ~1,300,000 images). You can use it for training on the OpenImages dataset at 608x608.
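Transfer training with such a pretrained file follows the usual darknet pattern, something along the lines of `./darknet detector train openimages.data yolov3-spp_xnor.cfg yolov3-spp_xnor.conv.weights`, where all three file names are placeholders rather than files named in this thread.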
Original issue (@JC-13):
I have been testing the speed of my custom-trained yolov3-tiny with 4 classes on a 2080 Ti (Turing) and a Xavier (Volta). However, XNOR and INT8 both give a <10% speedup compared to normal FP32. All testing has been done with the same 1080p video file as input.
The repo was git-pulled yesterday (07/02/19).
Makefile (Xavier): GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=0 OPENMP=0 LIBSO=0
Makefile (2080 Ti): GPU=1 CUDNN=1 CUDNN_HALF=1 OPENCV=1 AVX=1 OPENMP=1 LIBSO=0
I have also enabled the Makefile lines for CC=7.5 and CC=7.3 respectively.
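As noted in the reply above, Xavier is actually CC 7.2, not 7.3, so the relevant Makefile entry would be the `compute_72`/`sm_72` gencode line. The compute capability can be confirmed at runtime; a minimal sketch using the CUDA runtime API:

```cpp
// Minimal sketch: print the compute capability of each visible GPU so the
// Makefile ARCH/gencode flags can be matched exactly (e.g. sm_75 for a
// 2080 Ti, sm_72 for Xavier -- there is no sm_73).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("GPU %d: %s, compute capability %d.%d\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```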
Any ideas why there is not a significant speedup by using mixed precision?