
Faster organized search #4496

Merged: 3 commits, Nov 22, 2020

Conversation

mvieth
Member

@mvieth mvieth commented Nov 8, 2020

See commit messages.

Using a vector (ordered by distance) instead of a priority queue makes the nearest k search faster:

  • with a vector, it is possible to reserve() the needed space, which is not possible with a priority queue
  • adding a new element to the priority queue is done with push(). In contrast, inserting a new element into the ordered vector is done by first determining the correct place with upper_bound(), and then calling emplace()
  • removing the element with the largest distance (to ensure there are at most k entries) is done with pop() for the priority queue (logarithmic complexity), and with pop_back() for the vector (constant complexity)
  • in the end, when k_indices and k_sqr_distances are filled with the info from results, the vector is faster, since the pop() of the priority queue (logarithmic complexity) can be avoided
  • also: caching the size instead of calling size() over and over

To compare the speed of this new implementation with the current one, I measured the time that NormalEstimation needs because it spends most of its time searching the neighbors of each point. I used the datasets milk_cartoon_all_small_clorox.pcd and table_scene_mug_stereo_textured.pcd, each with k=50 and k=100. With this new organized search, NormalEstimation needs only about 80% of the time it took before, so I estimate the nearest k search takes only about three quarters of the time it took before.
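The vector-based bookkeeping described above can be sketched as follows (a minimal stand-alone illustration with made-up names, not the actual PCL code):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// One candidate neighbor: (squared distance, point index).
using Entry = std::pair<float, std::size_t>;

// Keep the k nearest candidates in a vector sorted by distance.
void addCandidate(std::vector<Entry>& results, std::size_t k,
                  float sqr_dist, std::size_t index)
{
  if (results.size() == k && sqr_dist >= results.back().first)
    return; // worse than the current k-th nearest, ignore
  // Find the insertion point in O(log k) ...
  const auto it = std::upper_bound(
      results.begin(), results.end(), sqr_dist,
      [](float d, const Entry& e) { return d < e.first; });
  // ... and insert; the shift is linear in the elements behind it.
  results.emplace(it, sqr_dist, index);
  if (results.size() > k)
    results.pop_back(); // drop the farthest entry in O(1)
}
```

At the end, results is already sorted by distance, so copying into k_indices and k_sqr_distances is a plain linear pass.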

@mvieth mvieth added the labels changelog: enhancement and module: search on Nov 8, 2020
@mvieth mvieth marked this pull request as ready for review November 9, 2020 18:50
@kunaltyagi kunaltyagi added the label needs: code review on Nov 10, 2020
@mvieth mvieth requested a review from larshg November 20, 2020 10:55
@kunaltyagi kunaltyagi merged commit 0a425d3 into PointCloudLibrary:master Nov 22, 2020
@suat-gedikli

Hey guys,

sorry, but the change from a heap to a vector does not make much sense.
When doing a knn search on organized point clouds, we might add many more elements to the queue than k. Even if we look orthogonally at a plane and can get the nearest points right away, the complexity is higher.
Since we only need to find the k nearest neighbors, we are only interested in the k-th nearest distance, to figure out whether a candidate point should be within the knn or not. So we do not need a fully sorted list within the algorithm. At the end we can pop from the queue, fill a result vector, and return that.
Here is the math:
inserting a new point results in one pop and one push on the queue, both done in O(log(k)) operations, whereas with the vector the upper_bound method needs O(log(k)) but emplace needs O(k) -> clear winner is the heap

So when measuring the speed, it should also be done using a large k AND also complex clouds. A single measurement that uses a small k will not give us the right data to make a call.
When k is large and/or the cloud is complex (e.g. a tree canopy), we might end up pushing lots of points into the queue, and in this case the heap will be much faster.
E.g. with k=100 and a complex point cloud (or a query point close to a hole in a tree canopy that shows the ground), we will add those ground points first and increase our radius until we see more of the tree canopy and find closer points. We stop once the k-th nearest point is closer than the current radius on our organized cloud. In this case we might have added thousands of points to the queue but kept only 100.
The vector will maintain the order within the loop even for points that will be thrown away anyway. All we need to know is the currently furthest point that is within the knn. At the end we use pop to get a sorted order in O(k log(k)) (since this is heap sort!).
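The heap-based bookkeeping argued for here can be sketched like this (a simplified stand-alone illustration with made-up names, not the actual pre-PR code):

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

using Entry = std::pair<float, std::size_t>; // (squared distance, index)

// Max-heap ordered by distance: top() is the current k-th nearest.
using Queue = std::priority_queue<Entry>;

void addCandidate(Queue& queue, std::size_t k,
                  float sqr_dist, std::size_t index)
{
  if (queue.size() < k) {
    queue.emplace(sqr_dist, index);   // O(log k)
  } else if (sqr_dist < queue.top().first) {
    queue.pop();                      // O(log k)
    queue.emplace(sqr_dist, index);   // O(log k)
  }
  // Only queue.top() is ever inspected inside the search loop; a fully
  // sorted result is produced at the end by popping (heap sort, O(k log k)).
}
```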
Overall I suggest that we revert these changes and first implement a few speed tests, including large k and corner cases, before optimizing for a single test. Also, it might be that the speed-up comes from the removed size() calls alone.

Also there are other options to make this faster.

  1. implement a new heap algorithm that operates on a pre-allocated vector (like make_heap)
  2. add a replace_top operation for the new heap, so we do not use pop and then push -> a single re-heap call
  3. maintain a mask of points that we already touched, so when we increase the search radius in every iteration we can skip the points we already touched.
  4. alternatively to 3), we could have a list of point indices that is concentrically ordered around our query point, and we consume it until the point radius is larger than the k-th nearest.
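Options 1 and 2 can already be approximated with the standard heap algorithms on a reserve()-d vector; a dedicated replace_top would save one of the two sift passes below (a hypothetical sketch with made-up names, not existing PCL code):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Entry = std::pair<float, std::size_t>; // (squared distance, index)

// Max-heap stored in a pre-allocated vector (option 1).
void addCandidate(std::vector<Entry>& heap, std::size_t k,
                  float sqr_dist, std::size_t index)
{
  if (heap.size() < k) {
    heap.emplace_back(sqr_dist, index);
    std::push_heap(heap.begin(), heap.end()); // sift up, O(log k)
  } else if (sqr_dist < heap.front().first) {
    // "replace top" emulated as pop_heap + overwrite + push_heap;
    // a dedicated replace_top (option 2) would need only one sift.
    std::pop_heap(heap.begin(), heap.end());  // move old top to the back
    heap.back() = {sqr_dist, index};
    std::push_heap(heap.begin(), heap.end()); // sift the new entry up
  }
}
```

A final std::sort_heap then yields the neighbors in ascending distance order.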

@kunaltyagi
Member

kunaltyagi commented Nov 24, 2020

@suat-gedikli I've amended the PR message with the reasoning posted by @mvieth

I'm adding @koide3 because he's been throwing multiple benchmarks quickly and will be able to help us here.

One thing to note is that the notation you've used is for asymptotic upper bound. The actual performance might be dictated by the constants and lower order variables, particularly at smaller values of n (or k as used by you).

Here's a small demonstration of the speed difference for lookups alone: http://rrowniak.com/performance/stdvector-vs-sorted-stdvector-vs-stdset/

Conjecture and disclaimer:
Of course, emplace might be overshadowing (since I've not referred to that at all), but it should be noted that size() is 1 lookup, 2 memory accesses, and 1 subtraction in most implementations (and 1 lookup in others). The memory is also likely to be cached (though 1 cache line is used for this call). I don't foresee size() impacting the test results a lot. Of course, I might be wrong here (hence why I've called in the benchmark cavalry)

@koide3
Contributor

koide3 commented Nov 25, 2020

I did some benchmarks on OrganizedNeighbor using three datasets, and confirmed that the new implementation is faster than the old one when k < 1024. As suggested by @suat-gedikli, the priority_queue version can be faster when k is large. But I think we don't usually use such a large k, and the new one is beneficial to most people.

# finding k-neighbors of each point

*** freiburg_rgbd (640x480) ***
std::vector(commit 0a425d3f)
k time[msec]
2 41.7397
4 53.9061
8 89.42
16 172.528
32 337.585
64 723.442
128 1522.8
256 3481.55
512 7876.66
1024 18125.7
2048 44713.4
4096 112448

std::priority_queue(commit 208e5b42)
k time[msec]
2 46.3858
4 63.8204
8 109.299
16 209.505
32 445.059
64 914.419
128 1934.89
256 4100.36
512 8875.05
1024 19309
2048 41945
4096 88879.8

*** milk_cartoon (640x480) ***
std::vector(commit 0a425d3f)
k time[msec]
2 43.6693
4 59.4599
8 97.0264
16 184.696
32 385.24
64 820.887
128 1852.2
256 4136.84
512 9097.66
1024 20314
2048 47882.5
4096 120033

std::priority_queue(commit 208e5b42)
k time[msec]
2 48.3028
4 69.5141
8 111.994
16 229.91
32 481.509
64 1008.75
128 2166.33
256 4540.19
512 9645.04
1024 20833.1
2048 44430
4096 92578.8

*** table_scene_mug (640x480) ***
std::vector(commit 0a425d3f)
k time[msec]
2 38.1171
4 47.178
8 66.3219
16 133.251
32 271.767
64 603.78
128 1337.19
256 3189.95
512 6824.28
1024 15418.8
2048 36701
4096 93127.3

std::priority_queue(commit 208e5b42)
k time[msec]
2 41.1509
4 55.7241
8 80.5526
16 171.158
32 359.523
64 767.775
128 1646.25
256 3643
512 7612.73
1024 16465.8
2048 35810.8
4096 75988.7

@mvieth
Member Author

mvieth commented Nov 25, 2020

@suat-gedikli Thank you for your interest in this!

At the end we can pop from the queue and fill a result-vector and return that.

As I wrote in my explanation, the pop() of the priority queue is rather costly, while reading from a vector is (almost) free.

inserting a new point results in 1 pop and one push in the queue, both done in O(log(k)) operations, whereas with the vector the upper_bound method needs O(log(k)) but em place needs O(k) -> clear winner is the heap

The complexity of emplace is linear in the distance between the insertion position and the end of the vector, so this k is not the same as the other k-s. The insertion probably happens more often near the end of the vector, which helps.

  1. implement new heap algorithm that operates on pre-allocated vector (like make_heap)
  2. add a replace_top operation for the new heap, so we do not use pop and then push -> single call of re-heap

Sure, that is something we could try, but to me it seems rather complex, since that would be a totally new container/container adapter.

  3. maintain a mask of points that we already touched so when we increase the search radius in every iteration we can skip the points we already touched.
  4. alternatively to 3) we could have a list of point indices that is concentrically centered around our query point and we do consume until the point radius is larger than the k-nearest.

Also interesting, but that would work with both the priority queue and the sorted vector, right? (although it might play better to the strengths and weaknesses of one or the other)

Overall I suggest that we revert these changes and first implement a few speed-tests including large k and corner cases before optimizing for a single test.

I am not generally opposed to reverting this, but would only really support it if there was clear evidence that the new version is slower. If you have specific benchmarks in mind, please run them and post the results.

Thank you @koide3 for the benchmarks. Based on your results, we could even think about switching between the two implementations based on k (if not too complicated/too much overhead). I share your opinion that the search is mostly used with k < 1024.
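Such a switch could look roughly like this (purely hypothetical: the function names and the threshold are made up for illustration, stubbed out here, and the right cutoff would need benchmarking per platform):

```cpp
#include <cstddef>
#include <string>

// Stub stand-ins for the two real implementations (hypothetical names).
std::string nearestKSortedVector(std::size_t) { return "sorted_vector"; }
std::string nearestKPriorityQueue(std::size_t) { return "priority_queue"; }

// koide3's numbers suggest the crossover lies somewhere near k = 1024,
// so a dispatcher could pick the winning implementation per query:
constexpr std::size_t kHeapThreshold = 1024;

std::string nearestKSearch(std::size_t k)
{
  return (k < kHeapThreshold) ? nearestKSortedVector(k)
                              : nearestKPriorityQueue(k);
}
```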

Here are a few more benchmarks I got for different OS while working on #4506:
100 nearest neighbours, table_scene_mug_stereo_textured.pcd

OS priority queue sorted vector ratio
Ubuntu 20.04 4536 ms 2311 ms 0.51
macOS 10.15 2083 ms 1622 ms 0.78
macOS 10.14 2055 ms 1625 ms 0.79
Ubuntu 18.04 Clang 2663 ms 2161 ms 0.81
Windows x86 3250 ms 4281 ms 1.32
Windows x64 3344 ms 2906 ms 0.87

Notes: it seems like on Windows x86, the CPU was different (2394 MHz for priority queue, 2095 MHz for sorted vector), and for Windows x64 as well (2295 MHz for priority queue, 2095 MHz for sorted vector). I am not totally sure why the sorted vector is that much better on Ubuntu 20.04; I think the load average was higher when the priority queue benchmark ran.

@mvieth
Member Author

mvieth commented Nov 29, 2020

Another supplement: Boost has several implementations of priority queues and heaps with much richer functionality than the STL priority queue. They (partly) support reserving memory, update/increase/decrease operations as a replacement for calling pop then push, and iterating through the heap without popping the elements. So that is something we could try out. Maybe that would be even faster than the sorted-vector implementation.
