[gpu] Fix octree radiusSearch #4146
Conversation
@haritha-j Can you please search and link the issues this PR solves? (If there are any other than #3583)
gpu/octree/src/cuda/radius_search.cu (Outdated)

```cuda
//broadcast warp_radius
if (active_lane == laneId)
```
I feel like this needs sync threads before load.
I believe the `__any_sync` primitive in line 258 will perform the synchronization here, and since we're getting the radius here using essentially the same method used to get the query below, I don't think additional synchronization is necessary. But I'm still learning about this as well, so I definitely can't say for certain.
> since we're getting the radius here using essentially the same method used to get the query below, I don't think additional synchronization is necessary

This assumes the original isn't buggy. Code can have bugs and still give correct behavior (usually when UB is involved).
> `__any_sync` primitive in line 258 will perform the synchronization here

`__any_sync` isn't a sync primitive, but a test primitive. From the CUDA docs:

> `__any_sync(unsigned mask, predicate)`:
> Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.
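
As a rough illustration (a hypothetical kernel, not the PCL code; assumes a launch of one full 32-thread warp), `__any_sync` is a warp-wide vote, typically used as a loop condition:

```cuda
// Hypothetical kernel: __any_sync is a vote, not a barrier.
// The loop keeps running as long as ANY lane in the mask has work left.
__global__ void countdown(int* out)
{
    const unsigned mask = 0xFFFFFFFFu;        // all 32 lanes participate
    int remaining = threadIdx.x % 4;          // per-lane "work" counter
    while (__any_sync(mask, remaining > 0)) { // warp-wide vote each iteration
        if (remaining > 0)
            --remaining;
    }
    out[threadIdx.x] = remaining;             // all lanes exit together, at 0
}
```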
If this actually resolves the issue, I'll approve this, but I feel something stinky is going on.
> `__any_sync` isn't a sync primitive, but a test primitive. From CUDA docs:

I think when CUDA 9.0 came out, they changed these primitives (like `__any` or `__ballot`) to their sync variants (`__any_sync`, `__ballot_sync`). From their guide here:

> All the participating threads must be synchronized for the collective operation to work correctly. Therefore, these primitives first synchronize the threads if they are not already synchronized.
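
For illustration (a hypothetical helper, not code from this PR), the CUDA 9.0 migration looks roughly like this:

```cuda
// Hypothetical helper showing the pre/post CUDA 9.0 warp-vote API.
__device__ int warp_any(int pred)
{
    const unsigned mask = __activemask();  // lanes currently active in the warp
    // Pre-CUDA 9.0 (deprecated): return __any(pred);  // implicit lockstep assumption
    return __any_sync(mask, pred);         // CUDA 9.0+: explicit participation mask
}
```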
The threads are synced for testing the primitive. But after that, any operation (not just a shared mem load/store) will unsync them. The distance from the sync point isn't large, so it's possible the unsynchronized load doesn't impact the value (or impacts it only a little, since the algorithm is iterative). I'm basing my conclusions only on the details in the CUDA programming guide (§5.4.2) and experience with CPU atomics.

Again, I'm making noise based on theory and limited knowledge.
> Wouldn't this be a better approach than the previous method of using the `per_warp_buffer` and not being assured of synchronization?

Yes, and your and @Cazadorro's concerns are totally valid. This section looks bug-filled.
So it needed a dive into Nvidia's memory architecture: we're using deprecated functionality of Nvidia where volatile memory was treated as a replacement for atomics, since they lacked atomic intrinsics. Since then, they've updated their drivers and language model to add `std::atomic` explicitly, and don't recommend depending on `volatile` doing the trick.

The original impl makes sense after this fact, because I've used such tricks in low-level code before. But this creates bi-directional memory barriers on every operation and is a performance bottleneck. The proposed `__shfl_sync` is a much better method since it only inserts a one-directional barrier.
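
As a rough sketch of the two patterns (identifiers like `active_lane` and `per_warp_buffer` come from the surrounding diff; the rest is illustrative, not the actual PCL kernel):

```cuda
// Hypothetical sketch: broadcast one lane's radius to the whole warp.
__device__ float broadcast_radius(float radius, int active_lane)
{
    // Old pattern (deprecated): volatile shared memory abused as a broadcast.
    //   __shared__ volatile float per_warp_buffer[WARPS_PER_BLOCK];
    //   if (active_lane == laneId)
    //       per_warp_buffer[warp_id] = radius;  // active lane stores,
    //   return per_warp_buffer[warp_id];        // others load; ordering is shaky

    // New pattern: a single one-directional warp shuffle, no shared memory.
    return __shfl_sync(0xFFFFFFFFu, radius, active_lane);
}
```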
So should I proceed with that method for just `radius`, or should we also use the same method to get `query`? Or would it be better to leave it as is?
> So it needed a dive into Nvidia's memory architecture: we're using deprecated functionality of Nvidia where volatile memory was treated as a replacement for atomics, since they lacked atomic intrinsics.

Can you provide a source for this? The search results I'm getting, even within the context of CUDA, simply discuss the traditional `volatile` definition.
Found this presentation as a digestible source. CUDA didn't have a "nice" memory model prior to 2017-2018 (not their fault; everyone was neglecting that to some degree. Even the C++11 memory model had bugs till C++17, which got retroactively fixed).

TL;DR: Lack of emphasis on memory models resulted in people misusing volatile as atomics, whereas atomics need memory fences (implicitly added by all good implementations like the stdlib). This observation is visible both in the presentation I linked to and in other source material arguing for the use of `std::atomic` or asm intrinsics over `volatile`.
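
For a concrete contrast (a hypothetical sketch, not code from this PR; assumes libcu++ from CUDA 10.2+ and Volta-or-newer independent thread scheduling):

```cuda
#include <cuda/atomic>

// volatile only prevents the compiler from caching the value; cuda::atomic
// additionally gives defined ordering via acquire/release fences.
__global__ void publish(cuda::atomic<int, cuda::thread_scope_device>* flag,
                        int* payload)
{
    if (threadIdx.x == 0) {
        payload[0] = 42;                                   // write the data first
        flag->store(1, cuda::std::memory_order_release);   // then publish it (fence)
    } else {
        while (flag->load(cuda::std::memory_order_acquire) == 0)
            ;                                              // spin until published
        int v = payload[0];                                // guaranteed to observe 42
        (void)v;
    }
}
```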
@haritha-j Reducing the use of `volatile` would give better results, and that should be the direction we take.
@kunaltyagi There don't seem to be any other issues linked to radius search. There is one at #3649 on approximate nearest search which seems similar; however, there's no directly analogous fix.
Force-pushed from d248841 to 7272c09
This does not fix the somewhat fishy use of `volatile`.
I have run the test and it passes here as well.
Good to merge from my side if CI is green and after @haritha-j confirms that this is fine.

I've built and run the corresponding tests locally; things are green.

@haritha-j Shall we merge?
No API/ABI/behavioural changes. Possibility of shipping in 1.11.1
He added more code. Let me have a quick look.
Please merge at your convenience 😄 🚀
I'm ignoring the questionable use of signed integer types for this PR.
gpu/octree/src/cuda/radius_search.cu (Outdated)

```diff
  {
- PointType q = queries.data[query_index];
+ const PointType q = queries.data[query_index];
```
const ref please
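
i.e., presumably something along these lines (my reading of the request, not a committed change):

```cuda
// Bind by const reference instead of copying the point:
const PointType& q = queries.data[query_index];
```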
gpu/octree/src/cuda/radius_search.cu (Outdated)

```diff
  {
- PointType q = queries[queries_indices[query_index]];
+ const PointType q = queries[queries_indices[query_index]];
```
same
Fix an issue in GPU radiusSearch() with multiple radii which resulted in inconsistent/incorrect output. Fixes #3583