
Faster organized search #4496

Merged: 3 commits, Nov 22, 2020

Conversation

mvieth
Member

@mvieth mvieth commented Nov 8, 2020

See commit messages.

Using a vector (ordered by distance) instead of a priority queue makes the nearest k search faster:

  • with a vector, it is possible to reserve() the needed space, which is not possible with a priority queue
  • adding a new element to the priority queue is done with push(). In contrast, inserting a new element into the ordered vector is done by first determining the correct place with upper_bound(), and then calling emplace()
  • removing the element with the largest distance (to ensure there are at most k entries) is done with pop() for the priority queue (logarithmic complexity), and with pop_back() for the vector (constant complexity)
  • in the end, when k_indices and k_sqr_distances are filled with the info from results, the vector is faster, since the pop() of the priority queue (logarithmic complexity) can be avoided
  • also: caching the size instead of calling size() over and over

To compare the speed of this new implementation with the current one, I measured the time that NormalEstimation needs because it spends most of its time searching the neighbors of each point. I used the datasets milk_cartoon_all_small_clorox.pcd and table_scene_mug_stereo_textured.pcd, each with k=50 and k=100. With this new organized search, NormalEstimation needs only about 80% of the time it took before, so I estimate the nearest k search takes only about three quarters of the time it took before.
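The vector-based bookkeeping described above can be sketched as follows (a minimal stand-alone illustration with made-up names, not the actual PCL code):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// One candidate neighbor: (squared distance, point index).
using Entry = std::pair<float, std::size_t>;

// Keep the k nearest candidates in a vector sorted by distance.
void addCandidate(std::vector<Entry>& results, std::size_t k,
                  float sqr_dist, std::size_t index)
{
  if (results.size() == k && sqr_dist >= results.back().first)
    return; // worse than the current k-th nearest, ignore
  // Find the insertion point in O(log k) ...
  const auto it = std::upper_bound(
      results.begin(), results.end(), sqr_dist,
      [](float d, const Entry& e) { return d < e.first; });
  // ... and insert; the shift is linear in the elements behind it.
  results.emplace(it, sqr_dist, index);
  if (results.size() > k)
    results.pop_back(); // drop the farthest entry in O(1)
}
```

At the end, results is already sorted by distance, so copying into k_indices and k_sqr_distances is a plain linear pass.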

@mvieth mvieth added the labels changelog: enhancement and module: search on Nov 8, 2020
@mvieth mvieth marked this pull request as ready for review November 9, 2020 18:50
@kunaltyagi kunaltyagi added the label needs: code review on Nov 10, 2020
@mvieth mvieth requested a review from larshg November 20, 2020 10:55
@kunaltyagi kunaltyagi merged commit 0a425d3 into PointCloudLibrary:master Nov 22, 2020
@suat-gedikli

Hey guys,

sorry, but the change from a heap to a vector does not make much sense.
When doing a knn search on organized point clouds, we might add many more elements to the queue than k. Even if we look orthogonally at a plane and can get the nearest points right away, the complexity is higher.
Since we only need to find the k nearest neighbors, we are only interested in the k-th nearest distance, to figure out whether a candidate point should be within the knn or not. So we do not need a fully sorted list within the algorithm. At the end we can pop from the queue, fill a result vector, and return that.
Here is the math:
inserting a new point results in one pop and one push on the queue, both done in O(log(k)) operations, whereas with the vector the upper_bound method needs O(log(k)) but emplace needs O(k) -> clear winner is the heap

So when measuring the speed, it should also be done using a large k AND also complex clouds. A single measurement that uses a small k will not give us the right data to make a call.
When k is large and/or the cloud is complex (e.g. a tree canopy), we might end up pushing lots of points into the queue, and in this case the heap will be much faster.
E.g. with k=100 and a complex point cloud (or a query point close to a hole in a tree canopy that shows the ground), we will add those ground points first and increase our radius until we see more of the tree canopy and find closer points. We stop once the k-th nearest point is closer than the current radius on our organized cloud. In this case we might have added thousands of points to the queue but kept only 100.
The vector will maintain the order within the loop even for points that will be thrown away anyway. All we need to know is the currently furthest point that is within the knn. At the end we use pop to get a sorted order in O(k log(k)) (since this is heap sort!).
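The heap-based bookkeeping argued for here can be sketched like this (a simplified stand-alone illustration with made-up names, not the actual pre-PR code):

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

using Entry = std::pair<float, std::size_t>; // (squared distance, index)

// Max-heap ordered by distance: top() is the current k-th nearest.
using Queue = std::priority_queue<Entry>;

void addCandidate(Queue& queue, std::size_t k,
                  float sqr_dist, std::size_t index)
{
  if (queue.size() < k) {
    queue.emplace(sqr_dist, index);   // O(log k)
  } else if (sqr_dist < queue.top().first) {
    queue.pop();                      // O(log k)
    queue.emplace(sqr_dist, index);   // O(log k)
  }
  // Only queue.top() is ever inspected inside the search loop; a fully
  // sorted result is produced at the end by popping (heap sort, O(k log k)).
}
```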
Overall I suggest that we revert these changes and first implement a few speed tests, including large k and corner cases, before optimizing for a single test. Also, it might be that the speed-up comes from the removed size() calls alone.

Also there are other options to make this faster.

  1. implement a new heap algorithm that operates on a pre-allocated vector (like make_heap)
  2. add a replace_top operation for the new heap, so we do not use pop and then push -> a single re-heap call
  3. maintain a mask of points that we already touched, so when we increase the search radius in every iteration we can skip the points we already touched.
  4. alternatively to 3), we could have a list of point indices that is concentrically ordered around our query point, and we consume it until the point radius is larger than the k-th nearest.
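Options 1 and 2 can already be approximated with the standard heap algorithms on a reserve()-d vector; a dedicated replace_top would save one of the two sift passes below (a hypothetical sketch with made-up names, not existing PCL code):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using Entry = std::pair<float, std::size_t>; // (squared distance, index)

// Max-heap stored in a pre-allocated vector (option 1).
void addCandidate(std::vector<Entry>& heap, std::size_t k,
                  float sqr_dist, std::size_t index)
{
  if (heap.size() < k) {
    heap.emplace_back(sqr_dist, index);
    std::push_heap(heap.begin(), heap.end()); // sift up, O(log k)
  } else if (sqr_dist < heap.front().first) {
    // "replace top" emulated as pop_heap + overwrite + push_heap;
    // a dedicated replace_top (option 2) would need only one sift.
    std::pop_heap(heap.begin(), heap.end());  // move old top to the back
    heap.back() = {sqr_dist, index};
    std::push_heap(heap.begin(), heap.end()); // sift the new entry up
  }
}
```

A final std::sort_heap then yields the neighbors in ascending distance order.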

@kunaltyagi
Member

kunaltyagi commented Nov 24, 2020

@suat-gedikli I've amended the PR message with the reasoning posted by @mvieth

I'm adding @koide3 because he's been throwing multiple benchmarks quickly and will be able to help us here.

One thing to note is that the notation you've used is for asymptotic upper bound. The actual performance might be dictated by the constants and lower order variables, particularly at smaller values of n (or k as used by you).

Here's a small demonstration of the speed difference for lookups alone: http://rrowniak.com/performance/stdvector-vs-sorted-stdvector-vs-stdset/

Conjecture and disclaimer:
Of course, emplace might be overshadowing (since I've not referred to that at all), but it should be noted that size() is 1 lookup, 2 memory accesses, and 1 subtraction in most implementations (and 1 lookup in others). The memory is also likely to be cached (though 1 cache line is used for this call). I don't foresee size() impacting the test results a lot. Of course, I might be wrong here (hence why I've called in the benchmark cavalry)

@koide3
Contributor

koide3 commented Nov 25, 2020

I did some benchmarks on OrganizedNeighbor using three datasets, and confirmed that the new implementation is faster than the old one when k < 1024. As suggested by @suat-gedikli, the priority_queue version can be faster when k is large. But I think we don't usually use such a large k, and the new one is beneficial to most people.

# finding k-neighbors of each point

*** freiburg_rgbd (640x480) ***
std::vector(commit 0a425d3f)
k time[msec]
2 41.7397
4 53.9061
8 89.42
16 172.528
32 337.585
64 723.442
128 1522.8
256 3481.55
512 7876.66
1024 18125.7
2048 44713.4
4096 112448

std::priority_queue(commit 208e5b42)
k time[msec]
2 46.3858
4 63.8204
8 109.299
16 209.505
32 445.059
64 914.419
128 1934.89
256 4100.36
512 8875.05
1024 19309
2048 41945
4096 88879.8

*** milk_cartoon (640x480) ***
std::vector(commit 0a425d3f)
k time[msec]
2 43.6693
4 59.4599
8 97.0264
16 184.696
32 385.24
64 820.887
128 1852.2
256 4136.84
512 9097.66
1024 20314
2048 47882.5
4096 120033

std::priority_queue(commit 208e5b42)
k time[msec]
2 48.3028
4 69.5141
8 111.994
16 229.91
32 481.509
64 1008.75
128 2166.33
256 4540.19
512 9645.04
1024 20833.1
2048 44430
4096 92578.8

*** table_scene_mug (640x480) ***
std::vector(commit 0a425d3f)
k time[msec]
2 38.1171
4 47.178
8 66.3219
16 133.251
32 271.767
64 603.78
128 1337.19
256 3189.95
512 6824.28
1024 15418.8
2048 36701
4096 93127.3

std::priority_queue(commit 208e5b42)
k time[msec]
2 41.1509
4 55.7241
8 80.5526
16 171.158
32 359.523
64 767.775
128 1646.25
256 3643
512 7612.73
1024 16465.8
2048 35810.8
4096 75988.7

@mvieth
Member Author

mvieth commented Nov 25, 2020

@suat-gedikli Thank you for your interest in this!

At the end we can pop from the queue and fill a result-vector and return that.

As I wrote in my explanation, the pop() of the priority queue is rather costly, while reading from a vector is (almost) free.

inserting a new point results in 1 pop and one push in the queue, both done in O(log(k)) operations, whereas with the vector the upper_bound method needs O(log(k)) but em place needs O(k) -> clear winner is the heap

The complexity of emplace is linear in the distance between the insertion position and the end of the vector, so this k is not the same as the other k-s. The insertion probably happens more often near the end of the vector, which helps.

  1. implement new heap algorithm that operates on pre-allocated vector (like make_heap)
  2. add a replace_top operation for the new heap, so we do not use pop and then push -> single call of re-heap

Sure, that is something we could try, but to me it seems rather complex, since that would be a totally new container/container adapter.

  3. maintain a mask of points that we already touched so when we increase the search radius in every iteration we can skip the points we already touched.
  4. alternatively to 3) we could have a list of point indices that is concentrically centered around our query point and we do consume until the point radius is larger than the k-nearest.

Also interesting, but that would work with both the priority queue and the sorted vector, right? (although it might play better to the strengths and weaknesses of one or the other)

Overall I suggest that we revert these changes and first implement a few speed-tests including large k and corner cases before optimizing for a single test.

I am not generally opposed to reverting this, but would only really support it if there was clear evidence that the new version is slower. If you have specific benchmarks in mind, please run them and post the results.

Thank you @koide3 for the benchmarks. Based on your results, we could even think about switching between the two implementations based on k (if not too complicated/too much overhead). I share your opinion that the search is mostly used with k < 1024.
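Such a switch could look roughly like this (purely hypothetical: the function names and the threshold are made up for illustration, stubbed out here, and the right cutoff would need benchmarking per platform):

```cpp
#include <cstddef>
#include <string>

// Stub stand-ins for the two real implementations (hypothetical names).
std::string nearestKSortedVector(std::size_t) { return "sorted_vector"; }
std::string nearestKPriorityQueue(std::size_t) { return "priority_queue"; }

// koide3's numbers suggest the crossover lies somewhere near k = 1024,
// so a dispatcher could pick the winning implementation per query:
constexpr std::size_t kHeapThreshold = 1024;

std::string nearestKSearch(std::size_t k)
{
  return (k < kHeapThreshold) ? nearestKSortedVector(k)
                              : nearestKPriorityQueue(k);
}
```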

Here are a few more benchmarks I got for different OS while working on #4506:
100 nearest neighbours, table_scene_mug_stereo_textured.pcd

OS priority queue sorted vector ratio
Ubuntu 20.04 4536 ms 2311 ms 0.51
macOS 10.15 2083 ms 1622 ms 0.78
macOS 10.14 2055 ms 1625 ms 0.79
Ubuntu 18.04 Clang 2663 ms 2161 ms 0.81
Windows x86 3250 ms 4281 ms 1.32
Windows x64 3344 ms 2906 ms 0.87

Notes: it seems like on Windows x86, the CPU was different (2394 MHz for priority queue, 2095 MHz for sorted vector), and for Windows x64 as well (2295 MHz for priority queue, 2095 MHz for sorted vector). I am not totally sure why the sorted vector is that much better on Ubuntu 20.04; I think the load average was higher when the priority queue benchmark ran.

@mvieth
Member Author

mvieth commented Nov 29, 2020

Another supplement: Boost has several implementations of priority queues and heaps with much richer functionality than the STL priority queue. They (partly) support reserving memory, update/increase/decrease operations as a replacement for calling pop then push, and iterating through the heap without popping the elements. So that is something we could try out. Maybe that would be even faster than the sorted-vector implementation.
