[gpu] Fix octree radiusSearch #4146
Conversation
@haritha-j Can you please search and link the issues this PR solves? (If there are any other than #3583)
gpu/octree/src/cuda/radius_search.cu (Outdated)

```cuda
//broadcast warp_radius
if (active_lane == laneId)
```
I feel like this needs sync threads before load.
I believe the `__any_sync` primitive in line 258 will perform the synchronization here, and since we're getting the radius here using essentially the same method used to get the query below, I don't think additional synchronization is necessary. But I'm still learning about this as well, so I definitely can't say for certain.
> since we're getting the radius here using essentially the same method used to get the query below, I don't think additional synchronization is necessary

This assumes the original isn't buggy. Code can have bugs and still give correct behavior (usually when UB is involved).
> `__any_sync` primitive in line 258 will perform the synchronization here

`__any_sync` isn't a sync primitive, but a test primitive. From the CUDA docs:

> `__any_sync(unsigned mask, predicate)`:
> Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.
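
As a rough illustration (a hypothetical kernel, not the PCL code; assumes a launch of one full 32-thread warp), `__any_sync` is a warp-wide vote, typically used as a loop condition:

```cuda
// Hypothetical kernel: __any_sync is a vote, not a barrier.
// The loop keeps running as long as ANY lane in the mask has work left.
__global__ void countdown(int* out)
{
    const unsigned mask = 0xFFFFFFFFu;        // all 32 lanes participate
    int remaining = threadIdx.x % 4;          // per-lane "work" counter
    while (__any_sync(mask, remaining > 0)) { // warp-wide vote each iteration
        if (remaining > 0)
            --remaining;
    }
    out[threadIdx.x] = remaining;             // all lanes exit together, at 0
}
```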
If this actually resolves the issue, I'll approve this, but I feel something stinky is going on.
> `__any_sync` isn't a sync primitive, but a test primitive. From CUDA docs:

I think when CUDA 9.0 came out, they changed these primitives (like `__any` or `__ballot`) to their sync variants (`__any_sync`, `__ballot_sync`). From their guide here:

> All the participating threads must be synchronized for the collective operation to work correctly. Therefore, these primitives first synchronize the threads if they are not already synchronized.
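
For illustration (a hypothetical helper, not code from this PR), the CUDA 9.0 migration looks roughly like this:

```cuda
// Hypothetical helper showing the pre/post CUDA 9.0 warp-vote API.
__device__ int warp_any(int pred)
{
    const unsigned mask = __activemask();  // lanes currently active in the warp
    // Pre-CUDA 9.0 (deprecated): return __any(pred);  // implicit lockstep assumption
    return __any_sync(mask, pred);         // CUDA 9.0+: explicit participation mask
}
```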
The threads are synced for testing the primitive. But after that, any operation (not just a shared mem load/store) will unsync them. The distance from the sync point isn't large, so it's possible the unsynchronized load doesn't impact the value (or impacts it only a little, since the algorithm is iterative). I'm basing my conclusions only on the details in the CUDA programming guide (§5.4.2) and experience with CPU atomics.

Again, I'm making noise based on theory and limited knowledge.
> Wouldn't this be a better approach than the previous method of using the `per_warp_buffer` and not being assured of synchronization?

Yes, and your and @Cazadorro's concerns are totally valid. This section looks bug-filled.
So it needed a dive into Nvidia's memory architecture: we're using deprecated functionality of Nvidia where volatile memory was treated as a replacement for atomics, since they lacked atomic intrinsics. Since then, they've updated their drivers and language model to add `std::atomic` explicitly, and don't recommend depending on `volatile` doing the trick.

The original impl makes sense after this fact, because I've used such tricks in low-level code before. But this creates bi-directional memory barriers on every operation and is a performance bottleneck. The proposed `__shfl_sync` is a much better method since it only inserts a one-directional barrier.
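
As a rough sketch of the two patterns (identifiers like `active_lane` and `per_warp_buffer` come from the surrounding diff; the rest is illustrative, not the actual PCL kernel):

```cuda
// Hypothetical sketch: broadcast one lane's radius to the whole warp.
__device__ float broadcast_radius(float radius, int active_lane)
{
    // Old pattern (deprecated): volatile shared memory abused as a broadcast.
    //   __shared__ volatile float per_warp_buffer[WARPS_PER_BLOCK];
    //   if (active_lane == laneId)
    //       per_warp_buffer[warp_id] = radius;  // active lane stores,
    //   return per_warp_buffer[warp_id];        // others load; ordering is shaky

    // New pattern: a single one-directional warp shuffle, no shared memory.
    return __shfl_sync(0xFFFFFFFFu, radius, active_lane);
}
```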
So should I proceed with that method for just `radius`, or should we also use the same method to get `query`? Or would it be better to leave it as is?
> So it needed a dive into Nvidia's memory architecture: we're using deprecated functionality of Nvidia where volatile memory was treated as a replacement for atomics, since they lacked atomic intrinsics.

Can you provide a source for this? The search results I'm getting, even within the context of CUDA, simply discuss the traditional `volatile` definition.
Found this presentation as a digestible source. CUDA didn't have a "nice" memory model prior to 2017-2018 (not their fault; everyone was neglecting that to some degree. Even the C++11 memory model had bugs till C++17, which got retroactively fixed).

TL;DR: Lack of emphasis on memory models resulted in people misusing volatile as atomics, whereas atomics need memory fences (implicitly added by all good implementations like the stdlib). This observation is visible both in the presentation I linked to and in other source material arguing for the use of `std::atomic` or asm intrinsics over `volatile`.
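
For a concrete contrast (a hypothetical sketch, not code from this PR; assumes libcu++ from CUDA 10.2+ and Volta-or-newer independent thread scheduling):

```cuda
#include <cuda/atomic>

// volatile only prevents the compiler from caching the value; cuda::atomic
// additionally gives defined ordering via acquire/release fences.
__global__ void publish(cuda::atomic<int, cuda::thread_scope_device>* flag,
                        int* payload)
{
    if (threadIdx.x == 0) {
        payload[0] = 42;                                   // write the data first
        flag->store(1, cuda::std::memory_order_release);   // then publish it (fence)
    } else {
        while (flag->load(cuda::std::memory_order_acquire) == 0)
            ;                                              // spin until published
        int v = payload[0];                                // guaranteed to observe 42
        (void)v;
    }
}
```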
@haritha-j Reducing the use of `volatile` would give better results, and that should be the direction we take.
@kunaltyagi There don't seem to be any other issues linked to radius search. There is one at #3649 on approximate nearest search which seems similar; however, there's no directly analogous fix.
Force-pushed from d248841 to 7272c09
This does not fix the somewhat fishy use of `volatile`.
I have run the test and it passes here as well.
Good to merge from my side if CI is green and after @haritha-j confirms that this is fine.

I've built and run the corresponding tests locally; things are green.

@haritha-j Shall we merge?
No API/ABI/behavioural changes. Possibility of shipping in 1.11.1
He added more code. Let me have a quick look.
Please merge at your convenience 😄 🚀
I'm ignoring the questionable use of signed integer types for this PR.
gpu/octree/src/cuda/radius_search.cu (Outdated)

```diff
  {
- PointType q = queries.data[query_index];
+ const PointType q = queries.data[query_index];
```
const ref please
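
i.e., presumably something along these lines (my reading of the request, not a committed change):

```cuda
// Bind by const reference instead of copying the point:
const PointType& q = queries.data[query_index];
```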
gpu/octree/src/cuda/radius_search.cu (Outdated)

```diff
  {
- PointType q = queries[queries_indices[query_index]];
+ const PointType q = queries[queries_indices[query_index]];
```
same
Fix an issue in GPU radiusSearch() with multiple radii which resulted in inconsistent/incorrect output. Fixes #3583