Results P1: Inference speed GPU vs. CPU #61

Closed
2 tasks
stark-t opened this issue Dec 14, 2022 · 12 comments

@stark-t
Owner

stark-t commented Dec 14, 2022

  • Add table in Paper (@stark-t )
  • Inference speed for yolov5n, yolov5s, yolov7t on GPU and CPU at a confidence threshold of 0.3 and an IoU threshold of 0.1 (@valentinitnelav )
@stark-t stark-t added this to the Paper1 milestone Dec 14, 2022
@valentinitnelav
Collaborator

valentinitnelav commented Dec 20, 2022

It looks like exporting the weights to ONNX or OpenVINO format and running detection with the exported model might give up to a 3x CPU speedup. Similarly, exporting to TensorRT might give up to a 5x GPU speedup:
ultralytics/yolov5#6736 (comment)

Then there is the option of half-precision (FP16) inference, but I do not understand whether it speeds up inference only on a GPU or on a CPU as well.

EDIT1: Forgot to add the question: should I invest time in trying these options, or just go with a simple detect script with the confidence threshold set to 0.3 and the IoU threshold set to 0.1, run on the test dataset on both a GPU and a CPU?

EDIT2: Actually, I just realised that YOLOv5 has a benchmarks.py (either in the root folder or in utils), but YOLOv7 has none. Moreover, there might be problems converting YOLOv7 weights to other formats (e.g. WongKinYiu/yolov7#1269). So, if we do not get the same support for YOLOv7 as for YOLOv5, I will stay with the simpler approach.

@valentinitnelav
Collaborator

Hi @stark-t , I just realised that YOLOv7 and YOLOv5 differ in the maximum number of detections per image, max_det. YOLOv5 lets the user adjust this in detect.py (default 1000), whereas YOLOv7 does not expose it and sets max_det = 300 internally in utils/general.py

I think that for a fair comparison, I need to rerun the detect.py of YOLOv5 with --max-det 300. I do not think this will change the results though. What are your thoughts about this? This affects #54
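
To illustrate why the cap is unlikely to change the results: as far as I understand, max_det only truncates the list of boxes that survive NMS, so it matters only for images with more than max_det detections. A minimal sketch, assuming detections are plain (confidence, box) tuples; `cap_detections` is a hypothetical helper for illustration, not a function from either repository.

```python
from typing import List, Tuple

# (confidence, (x1, y1, x2, y2)); this layout is an assumption for illustration.
Detection = Tuple[float, Tuple[float, float, float, float]]


def cap_detections(detections: List[Detection], max_det: int) -> List[Detection]:
    """Keep at most max_det detections, highest confidence first."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    return ranked[:max_det]


# Hypothetical image with five detections that survived NMS.
dets = [
    (0.35, (0.0, 0.0, 10.0, 10.0)),
    (0.90, (5.0, 5.0, 20.0, 20.0)),
    (0.62, (30.0, 30.0, 50.0, 50.0)),
    (0.41, (60.0, 10.0, 80.0, 40.0)),
    (0.77, (15.0, 40.0, 35.0, 70.0)),
]

capped_300 = cap_detections(dets, max_det=300)    # cap not reached, all kept
capped_1000 = cap_detections(dets, max_det=1000)  # same result
capped_3 = cap_detections(dets, max_det=3)        # cap reached, lowest two dropped

print(len(capped_300), len(capped_1000), len(capped_3))
```

With only a handful of detections per image, 300 and 1000 give identical output; a difference would appear only for very crowded images.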

@valentinitnelav
Collaborator

valentinitnelav commented Dec 20, 2022

For YOLOv5, the GPU detect speed can be taken from the *.err files obtained from running the scripts yolov5_detect_n_640_rtx.sh & yolov5_detect_s_640_rtx.sh. These scripts run on a GPU looping through various values of conf and IoU.

The results for GPU are:

  • YOLOv5 nano, Job 3273403, file 3273403.err contains (search for "results_at_conf_0.3_iou_0.1" and "conf_thres=0.3, iou_thres=0.1"):
detect: weights=['/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/yolov5/runs/train/3219882_yolov5_n_img640_b8_e300_hyp_custom/weights/best.pt'], source=/home/sc.uni-leipzig.de/sv127qyji/datasets/P1_Data_sampled/test/images, data=data/coco128.yaml, imgsz=[640, 640], conf_thres=0.3, iou_thres=0.1, max_det=1000, device=, view_img=False, save_txt=True, save_conf=True, save_crop=False, nosave=True, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect/job_3273403_loop_detect_on_3219882_yolov5_n_img640_b8_e300_hyp_custom, name=results_at_conf_0.3_iou_0.1, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False
Unknown option: -C
usage: git [--version] [--help] [-c name=value]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]
YOLOv5 🚀 2022-7-11 Python-3.9.6 torch-1.11.0+cu102 CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 11019MiB)

Fusing layers... 
Model summary: 213 layers, 1769989 parameters, 0 gradients, 4.2 GFLOPs
.
.
.
Speed: 0.3ms pre-process, 9.4ms inference, 1.2ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/job_3273403_loop_detect_on_3219882_yolov5_n_img640_b8_e300_hyp_custom/results_at_conf_0.3_iou_0.1
1538 labels saved to runs/detect/job_3273403_loop_detect_on_3219882_yolov5_n_img640_b8_e300_hyp_custom/results_at_conf_0.3_iou_0.1/labels

Email notification with run time: 2022-09-02T16:46:25: Slurm Job_id=3273403 Name=detect_yolov5_gpu Ended, Run time 02:01:26, COMPLETED, ExitCode 0. Warning! This run time refers to the entire cluster job, which loops over all conf and IoU values, not to a single detection run.

  • YOLOv5 small, Job 3273410, file 3273410.err contains:
detect: weights=['/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/yolov5/runs/train/3219884_yolov5_s_img640_b8_e300_hyp_custom/weights/best.pt'], source=/home/sc.uni-leipzig.de/sv127qyji/datasets/P1_Data_sampled/test/images, data=data/coco128.yaml, imgsz=[640, 640], conf_thres=0.3, iou_thres=0.1, max_det=1000, device=, view_img=False, save_txt=True, save_conf=True, save_crop=False, nosave=True, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect/job_3273410_loop_detect_on_3219884_yolov5_s_img640_b8_e300_hyp_custom, name=results_at_conf_0.3_iou_0.1, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False
Unknown option: -C
usage: git [--version] [--help] [-c name=value]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p|--paginate|--no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]
YOLOv5 🚀 2022-7-11 Python-3.9.6 torch-1.11.0+cu102 CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 11019MiB)

Fusing layers... 
Model summary: 213 layers, 7031701 parameters, 0 gradients, 15.8 GFLOPs
.
.
.
Speed: 0.3ms pre-process, 9.5ms inference, 1.2ms NMS per image at shape (1, 3, 640, 640)
Results saved to runs/detect/job_3273410_loop_detect_on_3219884_yolov5_s_img640_b8_e300_hyp_custom/results_at_conf_0.3_iou_0.1
1626 labels saved to runs/detect/job_3273410_loop_detect_on_3219884_yolov5_s_img640_b8_e300_hyp_custom/results_at_conf_0.3_iou_0.1/labels

Email notification with run time: 2022-09-02T16:53:29: Slurm Job_id=3273410 Name=detect_yolov5_gpu Ended, Run time 02:01:14, COMPLETED, ExitCode 0. Warning! This run time refers to the entire cluster job, which loops over all conf and IoU values, not to a single detection run.
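
Since the "Speed:" summary has to be searched for by hand in each *.err file, a small parser may help. This is only a sketch that assumes the line format is exactly as printed in the two logs above; `parse_speed` and the key names are made up for illustration.

```python
import re

# Pattern for the YOLOv5 summary line as it appears in the logs above.
SPEED_RE = re.compile(
    r"Speed:\s*([\d.]+)ms pre-process,\s*([\d.]+)ms inference,\s*([\d.]+)ms NMS"
)


def parse_speed(line: str) -> dict:
    """Extract the per-image timings (in ms) from a YOLOv5 'Speed:' line."""
    match = SPEED_RE.search(line)
    if match is None:
        raise ValueError("no 'Speed:' summary found in line")
    pre, inference, nms = (float(value) for value in match.groups())
    return {
        "pre_ms": pre,
        "inference_ms": inference,
        "nms_ms": nms,
        "total_ms": pre + inference + nms,
    }


nano = parse_speed(
    "Speed: 0.3ms pre-process, 9.4ms inference, 1.2ms NMS per image at shape (1, 3, 640, 640)"
)
small = parse_speed(
    "Speed: 0.3ms pre-process, 9.5ms inference, 1.2ms NMS per image at shape (1, 3, 640, 640)"
)
print(nano)
print(small)
```

Note that these per-image values cover only pre-processing, inference and NMS; they do not include model loading or writing the label files.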

@valentinitnelav
Collaborator

YOLOv7 writes the inference speed information to the *.log files.
However, the information is less detailed than for YOLOv5: after listing the time needed for each image, it prints only the total time at the very end (see below).
The results obtained from running the script yolov7_detect_tiny_640_rtx.sh are:

YOLOv7 tiny, Job 191860, file 191860.log contains (search for "conf_thres=0.3, iou_thres=0.1"):

Namespace(weights=['/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/yolov7/runs/train/191623_yolov7_tiny_img640_b8_e300_hyp_custom/weights/best.pt'], source='/home/sc.uni-leipzig.de/sv127qyji/datasets/P1_Data_sampled/test/images', img_size=640, conf_thres=0.3, iou_thres=0.1, device='', view_img=False, save_txt=True, save_conf=True, nosave=True, classes=None, agnostic_nms=False, augment=False, update=False, project='runs/detect/job_191860_loop_detect_on_191623_yolov7_tiny_img640_b8_e300_hyp_custom', name='results_at_conf_0.3_iou_0.1', exist_ok=False, no_trace=False)
Fusing layers... 
 Convert model to Traced-model... 
 traced_script_module saved! 
 model is traced! 
.
.
.
Done. (100.852s)

The 191860.err file contains this info:

Model Summary: 200 layers, 6025525 parameters, 0 gradients, 13.1 GFLOPS
YOLOR 🚀 v0.1-115-g072f76c torch 1.11.0+cu102 CUDA:0 (NVIDIA GeForce RTX 2080 Ti, 11019.5625MB)

Unfortunately, due to cluster updates, I did not receive the email notification with the total run time of cluster job 191860 (this was fixed later by IT).
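
Because YOLOv7 reports only a total at the end, a per-image figure has to be derived from it. A rough sketch follows, assuming the final line always looks like "Done. (100.852s)" and using the 1680 test images mentioned elsewhere in this thread. The per-image line in the sample text is an assumed format, included only to show that the pattern ignores it; `total_seconds` is a hypothetical helper.

```python
import re

# Matches only a total such as "Done. (100.852s)". A value given in "ms"
# does not match, because the pattern requires "s)" right after the number.
TOTAL_RE = re.compile(r"Done\. \((\d+(?:\.\d+)?)s\)")


def total_seconds(log_text: str) -> float:
    """Return the last total run time (in seconds) found in a YOLOv7 log."""
    matches = TOTAL_RE.findall(log_text)
    if not matches:
        raise ValueError("no 'Done. (…s)' total found in log text")
    return float(matches[-1])


sample_log = "\n".join(
    [
        "Fusing layers...",
        "Done. (12.3ms) Inference, (1.1ms) NMS",  # assumed per-image line format
        "Done. (100.852s)",
    ]
)

N_IMAGES = 1680  # size of the test dataset (210 * 8)
total = total_seconds(sample_log)
per_image = total / N_IMAGES
print(f"{total:.3f} s total, {per_image:.4f} s/img")
```

Unlike the YOLOv5 "Speed:" line, this total covers the whole detection loop (including writing the label files), so the two numbers are not directly comparable.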

@valentinitnelav
Collaborator

Note also the parameter counts for each model:

  • YOLOv5 nano
Model summary: 213 layers, 1769989 parameters, 0 gradients, 4.2 GFLOPs
  • YOLOv5 small
Model summary: 213 layers, 7031701 parameters, 0 gradients, 15.8 GFLOPs
  • YOLOv7 tiny
Model Summary: 200 layers, 6025525 parameters, 0 gradients, 13.1 GFLOPS

valentinitnelav added a commit that referenced this issue Dec 21, 2022
@valentinitnelav
Collaborator

Here are some first results for YOLOv5 nano CPU vs GPU:

YOLOv5 nano CPU; 5 iterations with detect.py over the test dataset (210*8=1680 images); values in seconds:
296.905122231
297.216807024
297.304537163
296.560027893
296.715367913
average = 296.9404 sec

Roughly, that means:
1680 img = 296.9404 sec
1 img = 296.9404/1680 = 0.1768 sec/img average CPU time for inference


YOLOv5 nano GPU; 5 iterations with detect.py over the test dataset (210*8=1680 images); values in seconds:
89.853897407
90.387605006
90.272118806
89.911479118
90.141063244
average = 90.1132 sec

Roughly, that means:
1680 img = 90.1132 sec
1 img = 90.1132/1680 = 0.05364 sec/img average GPU time for inference
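
To avoid slips when averaging by hand, the calculation can be scripted. A minimal sketch using the ten run times listed above; `per_image_seconds` is just an illustrative helper name.

```python
from statistics import mean

N_IMAGES = 210 * 8  # 1680 images in the test dataset

# Run times in seconds, copied from the five CPU and five GPU iterations above.
cpu_runs = [296.905122231, 297.216807024, 297.304537163, 296.560027893, 296.715367913]
gpu_runs = [89.853897407, 90.387605006, 90.272118806, 89.911479118, 90.141063244]


def per_image_seconds(run_times, n_images=N_IMAGES):
    """Average the run times, then divide by the number of images."""
    return mean(run_times) / n_images


cpu_per_img = per_image_seconds(cpu_runs)
gpu_per_img = per_image_seconds(gpu_runs)

print(f"CPU: mean {mean(cpu_runs):.4f} s, {cpu_per_img:.5f} s/img")
print(f"GPU: mean {mean(gpu_runs):.4f} s, {gpu_per_img:.5f} s/img")
print(f"CPU/GPU ratio: {cpu_per_img / gpu_per_img:.2f}")
```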

@stark-t
Owner Author

stark-t commented Dec 23, 2022

New best thresholds for the three models, to be used for the speed test (@valentinitnelav ):

  • YOLOv5n: confidence 0.2 iou 0.5
  • YOLOv5s: confidence 0.3 iou 0.6
  • YOLOv7t: confidence 0.1 iou 0.3

valentinitnelav added a commit that referenced this issue Dec 23, 2022
Include the threshold values for conf and IoU from #61 (comment)
@valentinitnelav
Collaborator

I have the script now, but for some reason I can only get access to CPUs, not GPUs. The CPU jobs are running, so I will get those results. I think the issue is on the cluster side, because yesterday I could still get GPUs. The GPU runs will have to wait until the cluster is available again.

@stark-t
Owner Author

stark-t commented Dec 25, 2022

OK, no problem.

@valentinitnelav
Collaborator

valentinitnelav commented Jan 9, 2023

CPU detection time results for 5 iterations for each model. These were run on the test dataset.
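
The actual speed-test jobs are shell scripts submitted to the cluster. Purely as an illustration of the idea (run the same detection command several times and record the wall-clock time of each run), here is a hypothetical Python equivalent. `time_command` is not part of the repository, and the demo uses a trivial placeholder command so that it runs anywhere.

```python
import subprocess
import sys
import time
from typing import List, Sequence


def time_command(cmd: Sequence[str], n_iter: int = 5) -> List[float]:
    """Run cmd n_iter times; return the wall-clock duration of each run in seconds."""
    durations = []
    for _ in range(n_iter):
        start = time.perf_counter()
        # Discard the command's output; check=True raises if the command fails.
        subprocess.run(
            cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        durations.append(time.perf_counter() - start)
    return durations


# Placeholder command (starts Python and exits immediately).
demo = time_command([sys.executable, "-c", "pass"], n_iter=2)
print(demo)
```

For a real run, the command would be the detect.py call with the chosen thresholds, e.g. `["python", "detect.py", "--weights", "best.pt", "--source", "test/images", "--conf-thres", "0.2", "--iou-thres", "0.5", "--device", "cpu"]`, where the weight and image paths are placeholders. Timing the whole process this way includes model loading and file output, not just inference.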

CPU time, YOLOv5 nano

Job ID 883890, which ran the script yolov5_detect_n_640_cpu_speed_test.sh

Time results extracted from the file job_883890_yolov5_nano_cpu_results_at_0.2_iou_0.5.txt (in PAI/detectors/yolov5/runs/detect/detect_speed_jobs on the cluster):

mean(c(310.145018077,
       317.129052547,
       320.828178548,
       321.441730295,
       320.760053670))
# [1] 318.0608

That means:
1680 img = 318.0608 sec on average
1 img = 318.0608/1680 = 0.1893219 sec/img average time for inference (detection)

CPU time, YOLOv5 small

Job ID 883892, which ran the script yolov5_detect_s_640_cpu_speed_test.sh

Time results extracted from the file job_883892_yolov5_small_cpu_results_at_0.3_iou_0.6.txt (in PAI/detectors/yolov5/runs/detect/detect_speed_jobs on the cluster):

mean(c(823.321014627,
       815.641425552,
       803.712151891,
       810.795042564,
       806.412835093))
# [1] 811.9765

That means:
1680 img = 811.9765 sec on average
1 img = 811.9765/1680 = 0.4833193 sec/img average time for inference (detection)

CPU time, YOLOv7 tiny

Job ID 883893, which ran the script yolov7_detect_tiny_640_cpu_speed_test.sh

Time results extracted from the file job_883893_yolov7_tiny_cpu_results_at_0.1_iou_0.3.txt (in PAI/detectors/yolov7/runs/detect/detect_speed_jobs on the cluster):

mean(c(687.549698531,
       683.905298020,
       678.930599692,
       674.795639732,
       674.587100398))
# [1] 679.9537

That means:
1680 img = 679.9537 sec on average
1 img = 679.9537/1680 = 0.4047343 sec/img average time for inference (detection)
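
For the table in the paper, the three CPU results can be put side by side, together with throughput in images per second. A short sketch using the mean run times reported above; the dictionary layout and the `summarize` helper are illustrative only.

```python
N_IMAGES = 1680  # images in the test dataset

# Mean CPU run times (seconds) over 5 iterations, as reported above.
cpu_mean_seconds = {
    "YOLOv5 nano": 318.0608,
    "YOLOv5 small": 811.9765,
    "YOLOv7 tiny": 679.9537,
}


def summarize(mean_seconds, n_images=N_IMAGES):
    """Convert total run times into seconds per image and images per second."""
    return {
        model: {"sec_per_img": total / n_images, "img_per_sec": n_images / total}
        for model, total in mean_seconds.items()
    }


cpu_summary = summarize(cpu_mean_seconds)
for model, row in cpu_summary.items():
    print(f"{model:<13} {row['sec_per_img']:.4f} s/img  {row['img_per_sec']:.2f} img/s")
```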

@valentinitnelav
Collaborator

valentinitnelav commented Jan 9, 2023

GPU results.

Note that the first iteration can take several times longer than the other iterations (about 232 s vs. 93 s for YOLOv5 nano and about 320 s vs. 93 s for YOLOv5 small in the results below). Perhaps some GPU "warm-up" is taking place? Possibly related to ultralytics/yolov5#5806

To work around this, I ran 6 iterations and dropped the result of the first one. See commit b120ad9

GPU time, YOLOv5 nano

Job ID 1268065, which ran the script yolov5_detect_n_640_gpu_rtx_speed_test.sh

Total run time for 6 iterations: 2023-01-09T18:11:01: Slurm Job_id=1268065 Name=detect_speed Ended, Run time 00:11:40, COMPLETED, ExitCode 0

Time results extracted from the file job_1268065_yolov5_nano_gpurtx_results_at_0.2_iou_0.5.txt (in PAI/detectors/yolov5/runs/detect/detect_speed_jobs on the cluster):

231.833884053 # this will be dropped
93.032718445
92.409938917
93.398442854
92.190620597
92.786358883

# Average the last 5
mean(c(93.032718445,
       92.409938917,
       93.398442854,
       92.190620597,
       92.786358883))
# [1] 92.76362

That means:
1680 img = 92.76362 sec on average
1 img = 92.76362/1680 = 0.05521644 sec/img average time for inference (detection)

It is a bit strange that this is so similar to the YOLOv5 small result.
This might be because I do not use the exact same GPU or node: for each job I get a GPU from whichever node is available on the cluster (possibly the same one), although it is always an RTX 2080 Ti with 11 GB of memory.
I have now run this several times and I get very similar results for YOLOv5 nano and small each time.
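
The warm-up handling (run six iterations, drop the first, average the rest) can be sketched in Python as follows, using the YOLOv5 nano GPU timings from above. `mean_after_warmup` is a hypothetical helper, not code from the repository.

```python
from statistics import mean


def mean_after_warmup(times, n_warmup=1):
    """Average the run times after discarding the first n_warmup iterations."""
    if len(times) <= n_warmup:
        raise ValueError("need more iterations than warm-up runs")
    return mean(times[n_warmup:])


# YOLOv5 nano GPU run times in seconds (job 1268065); the first is the warm-up run.
nano_gpu = [
    231.833884053,
    93.032718445,
    92.409938917,
    93.398442854,
    92.190620597,
    92.786358883,
]

steady = mean_after_warmup(nano_gpu)
naive = mean(nano_gpu)  # for comparison: average that includes the warm-up run
print(f"without warm-up: {steady:.5f} s, with warm-up: {naive:.5f} s")
```

Including the first iteration would inflate the average by roughly a quarter here, which is why it is dropped.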

GPU time, YOLOv5 small

Job ID 1268064, which ran the script yolov5_detect_s_640_gpu_rtx_speed_test.sh

Total run time for 6 iterations: 2023-01-09T18:11:03: Slurm Job_id=1268064 Name=detect_speed Ended, Run time 00:13:32, COMPLETED, ExitCode 0

Time results extracted from the file job_1268064_yolov5_small_gpurtx_results_at_0.3_iou_0.6.txt (in PAI/detectors/yolov5/runs/detect/detect_speed_jobs on the cluster):

319.840739573 # this will be dropped
93.036818359
93.183861865
92.507176553
92.128141101
94.300841453

# Average the last 5
mean(c(93.036818359,
       93.183861865,
       92.507176553,
       92.128141101,
       94.300841453))
# [1] 93.03137

That means:
1680 img = 93.03137 sec on average
1 img = 93.03137/1680 = 0.05537582 sec/img average time for inference (detection)

GPU time, YOLOv7 tiny

Job ID 1268059, which ran the script yolov7_detect_tiny_640_gpu_rtx_speed_test.sh

Total run time for 6 iterations: 2023-01-09T18:03:11: Slurm Job_id=1268059 Name=detect_speed Ended, Run time 00:12:16, COMPLETED, ExitCode 0

Time results extracted from the file job_1268059_yolov7_tiny_gpurtx_results_at_0.1_iou_0.3.txt (in PAI/detectors/yolov7/runs/detect/detect_speed_jobs on the cluster):

115.640858734 # this will be dropped; usually this one takes longer (see Job ID 1268047 with 455 sec & 1268032 with 289 sec)
117.910907591
115.626886750
119.962117953
135.876173902
126.866752428

# Average the last 5
mean(c(117.910907591,
       115.626886750,
       119.962117953,
       135.876173902,
       126.866752428))
# [1] 123.2486

That means:
1680 img = 123.2486 sec on average
1 img = 123.2486/1680 = 0.07336226 sec/img average time for inference (detection)
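
Putting the CPU and GPU means from this thread together gives the speedup per model. A small sketch with the values hard-coded from the results above; the data layout and the `speedup` helper are only illustrative.

```python
N_IMAGES = 1680  # images in the test dataset

# Mean run times in seconds (GPU means exclude the first, warm-up iteration).
mean_seconds = {
    "YOLOv5 nano": {"cpu": 318.0608, "gpu": 92.76362},
    "YOLOv5 small": {"cpu": 811.9765, "gpu": 93.03137},
    "YOLOv7 tiny": {"cpu": 679.9537, "gpu": 123.2486},
}


def speedup(row):
    """How many times faster the GPU run is compared with the CPU run."""
    return row["cpu"] / row["gpu"]


speedups = {model: speedup(row) for model, row in mean_seconds.items()}
for model, row in mean_seconds.items():
    print(
        f"{model:<13} CPU {row['cpu'] / N_IMAGES:.4f} s/img  "
        f"GPU {row['gpu'] / N_IMAGES:.4f} s/img  speedup {speedups[model]:.1f}x"
    )
```

Keep in mind that the CPU and GPU jobs ran on different nodes and that the timings cover the whole detect run, so these ratios are indicative rather than a pure inference benchmark.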

@valentinitnelav
Collaborator

I'll close this issue now. I have put the results in the Overleaf manuscript (Table 2).
