Unexpected Performance: Single-Threaded Faster than Multi-Threaded in Point Cloud Alignment #145
Hi everyone, I wanted to share an update on the performance issue I was experiencing with the multi-threaded versions of the point cloud alignment algorithms. Initially, I was using the maximum thread count supported by my CPU (32 threads), but this setup actually ran slower than the single-threaded implementations. When I reduced the number of threads to 8, the processing times for the multi-threaded versions improved dramatically and became what one would expect: faster than the single-threaded versions. Here are the updated results:
$ rosrun fast_gicp gicp_align 251370668.pcd 251371071.pcd
target:17047[pts] source:17334[pts]
--- pcl_gicp ---
single:114.265[msec] 100times:11190.8[msec] fitness_score:0.204892
--- pcl_ndt ---
single:40.3903[msec] 100times:4108.75[msec] fitness_score:0.229616
--- fgicp_st ---
single:103.508[msec] 100times:10122.8[msec] 100times_reuse:6677.71[msec] fitness_score:0.204376
--- fgicp_mt ---
single:22.2643[msec] 100times:2076.86[msec] 100times_reuse:1322.39[msec] fitness_score:0.204384
--- vgicp_st ---
single:76.7637[msec] 100times:7601.88[msec] 100times_reuse:4227.26[msec] fitness_score:0.205022
--- vgicp_mt ---
single:16.8928[msec] 100times:1723.56[msec] 100times_reuse:964.225[msec] fitness_score:0.205022
--- ndt_cuda (P2D) ---
single:17.818[msec] 100times:1747.58[msec] 100times_reuse:1329.59[msec] fitness_score:0.197216
--- ndt_cuda (D2D) ---
single:13.9255[msec] 100times:1415.41[msec] 100times_reuse:1161.17[msec] fitness_score:0.199983
--- vgicp_cuda (parallel_kdtree) ---
single:36.8168[msec] 100times:2271.8[msec] 100times_reuse:1713.19[msec] fitness_score:0.205017
--- vgicp_cuda (gpu_bruteforce) ---
single:55.5222[msec] 100times:2822.75[msec] 100times_reuse:2615.85[msec] fitness_score:0.249594
--- vgicp_cuda (gpu_rbf_kernel) ---
single:14.8914[msec] 100times:1403.59[msec] 100times_reuse:941.221[msec] fitness_score:0.204766
It appears that using the maximum thread count was creating a bottleneck, possibly due to context-switching overhead or resource contention. Using a reduced thread count that better matches the CPU's capabilities and the nature of the workload seems to be the key to optimal performance.
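To put the posted timings in perspective, the speedup of each multi-threaded variant over its single-threaded counterpart can be worked out from the `single` column above. This is only a rough sketch: the numbers are copied from the output in this comment, the thread count of 8 is the one reported here, and the helper names `speedup` and `parallel_efficiency` are made up for illustration.

```python
# Timings in msec, copied from the "single" column of the output above.
SINGLE_MS = {
    "fgicp_st": 103.508,
    "fgicp_mt": 22.2643,
    "vgicp_st": 76.7637,
    "vgicp_mt": 16.8928,
}

NUM_THREADS = 8  # thread count reported in this comment


def speedup(st_ms: float, mt_ms: float) -> float:
    """How many times faster the multi-threaded run is than the single-threaded one."""
    return st_ms / mt_ms


def parallel_efficiency(st_ms: float, mt_ms: float, threads: int) -> float:
    """Speedup divided by thread count; 1.0 would mean perfect scaling."""
    return speedup(st_ms, mt_ms) / threads


for name in ("fgicp", "vgicp"):
    st = SINGLE_MS[f"{name}_st"]
    mt = SINGLE_MS[f"{name}_mt"]
    print(
        f"{name}: speedup {speedup(st, mt):.2f}x, "
        f"efficiency {parallel_efficiency(st, mt, NUM_THREADS):.0%}"
    )
```

With these figures the multi-threaded variants come out roughly 4.5x to 4.6x faster on 8 threads, which is well short of linear scaling but clearly in the expected direction.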
Thanks for the helpful information. I will mention this in the README.
Hello there! Thanks for your great work!
I ran into an issue when I deployed it on my PC. Can anyone help me take a look? Thanks!
Description
I have observed an unexpected performance behavior while using fast_gicp_mt. Specifically, the single-threaded versions of certain point cloud alignment algorithms such as GICP and NDT are outperforming their multi-threaded counterparts. This was observed while aligning two point clouds of sizes 17047 and 17334 points.
Environment
The repo is deployed using docker.
OS: Ubuntu 20.04 + ROS Noetic
GPU: RTX 4090 32GB
CPU: i9-13900KF
RAM: 32GB
I deployed the repo on WSL using Docker.
Details
The execution times for various algorithms were recorded, and it was noted that single-threaded implementations were consistently faster than multi-threaded ones. Below are some of the results obtained:
Expected Behavior:
Typically, one would expect the multi-threaded implementations to be faster or at least as fast as the single-threaded ones, especially when dealing with large datasets.
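One way to check this expectation on a given machine is to time the same workload at several thread counts instead of assuming the maximum is best. The sketch below is a generic harness and not part of fast_gicp: `candidate_thread_counts`, `sweep`, and `fastest` are hypothetical helpers, and `run` stands for whatever callable performs one alignment with the given number of threads.

```python
import os
import time
from typing import Callable, Dict, List, Optional


def candidate_thread_counts(logical_cpus: Optional[int] = None) -> List[int]:
    """Powers of two below the logical CPU count, followed by the count itself."""
    n = logical_cpus or os.cpu_count() or 1
    counts = []
    t = 1
    while t < n:
        counts.append(t)
        t *= 2
    counts.append(n)
    return counts


def sweep(run: Callable[[int], None], counts: List[int], repeats: int = 5) -> Dict[int, float]:
    """Call run(threads) `repeats` times per count; keep the best wall-clock time in msec."""
    results: Dict[int, float] = {}
    for threads in counts:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(threads)  # assumed to block until the work is finished
            best = min(best, (time.perf_counter() - start) * 1000.0)
        results[threads] = best
    return results


def fastest(results: Dict[int, float]) -> int:
    """Thread count with the lowest recorded time."""
    return min(results, key=results.get)
```

For a CPU that reports 32 logical threads, `candidate_thread_counts(32)` yields 1, 2, 4, 8, 16 and 32, so a single sweep covers both the single-threaded baseline and the maximum thread count.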