
[gpu] Fix octree radiusSearch #4146

Merged: 6 commits into PointCloudLibrary:master, Aug 14, 2020

Conversation

haritha-j (Contributor):

Fix issue in GPU radiusSearch() for multiple radii that resulted in inconsistent/incorrect output. Fixes #3583

@kunaltyagi kunaltyagi added changelog: fix Meta-information for changelog generation module: gpu needs: code review Specify why not closed/merged yet labels May 26, 2020
@kunaltyagi kunaltyagi changed the title fix gpu radiusSearch [gpu] Fix octree radiusSearch May 26, 2020
kunaltyagi (Member) commented May 26, 2020:

@haritha-j Can you please search and link the issues this PR solves? (If there are any other than 3583)


```cuda
// broadcast warp_radius
if (active_lane == laneId)
```
Member:

I feel like this needs a thread sync before the load.

Contributor Author:

I believe the __any_sync primitive in line 258 will perform the synchronization here, and since we're getting the radius here using essentially the same method used to get the query below, I don't think additional synchronization is necessary. But I'm still learning about this as well, so I definitely can't say for certain.

Member:

since we're getting the radius here using essentially the same method used to get the query below, I don't think additional synchronization is necessary

This assumes the original isn't buggy. Code can have bugs and still give correct behavior (usually when UB happens).

__any_sync primitive in line 258 will perform the synchronization here

__any_sync isn't a sync primitive, but a test primitive. From CUDA docs:

__any_sync(unsigned mask, predicate):
    Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.

If this actually resolves the issue, I'll approve this, but I feel something stinky is going on.
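To make the distinction concrete, here is a minimal standalone sketch (hypothetical kernel and names, not the PCL code): __any_sync converges the lanes named in the mask to evaluate a predicate and returns the vote result, but it does not keep the warp converged for whatever loads come after it.

```cuda
// Hypothetical sketch, not the PCL kernel: __any_sync is a warp *vote*.
// The lanes in the mask are converged to evaluate the predicate, the
// vote result is returned to all of them, and execution may diverge
// again immediately afterwards.
__global__ void voteDemo(const float* dist, float radius, int* any_hit)
{
    const unsigned full_mask = 0xFFFFFFFFu;
    const int lane = threadIdx.x & 31;

    // Non-zero in every lane if the predicate held in at least one lane.
    const int hit = __any_sync(full_mask, dist[lane] < radius);

    if (lane == 0)
        *any_hit = hit;
}
```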

haritha-j (Contributor Author) May 26, 2020:

__any_sync isn't a sync primitive, but a test primitive. From CUDA docs:

I think when CUDA 9.0 came out, they changed these primitives (like __any or __ballot) to their sync variants (__any_sync, __ballot_sync). From their guide:
All the participating threads must be synchronized for the collective operation to work correctly. Therefore, these primitives first synchronize the threads if they are not already synchronized.

kunaltyagi (Member) May 27, 2020:

The threads are synced for testing the primitive. But after that any operation (not just a shared mem load/store) will unsync them. The distance isn't large from sync point, so it's a possibility the sync-load doesn't impact value (or impacts only a little since the algorithm is iterative). I'm basing my conclusions only on the details in CUDA programming guide (§5.4.2) and experience with CPU atomics.

Again, I'm making noise based on theory and limited knowledge.

Member:

Wouldn't this be a better approach than the previous method of using the `per_warp_buffer` and not being assured of synchronization?

Yes, and your and @Cazadorro's concerns are totally valid. This section looks bug-filled.

Member:

So it needed a dive into Nvidia's memory architecture: we're using deprecated Nvidia functionality where volatile memory was treated as a replacement for atomics, since they lacked atomic intrinsics. Since then, they've updated their drivers and language model to add std::atomic explicitly, and don't recommend relying on volatile to do the trick.

The original impl makes sense in light of this fact; I've used such tricks in low-level code before. But it creates bi-directional memory barriers on every operation and is a performance bottleneck. The proposed __shfl_sync is a much better method since it only inserts a one-directional barrier.
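A minimal sketch of the pattern being proposed (hypothetical function and parameter names, not the PCL code): instead of one lane publishing the radius through volatile shared memory and the rest reading it back, __shfl_sync broadcasts the value register-to-register and synchronizes the participating lanes itself.

```cuda
// Hypothetical sketch of the broadcast discussed above, not the PCL code.
// Each lane passes in its own candidate value; every lane receives the
// value held by `active_lane`. No shared memory or volatile is needed,
// and the intrinsic synchronizes the lanes named in the mask.
__device__ float broadcastRadius(float my_radius, int active_lane)
{
    const unsigned full_mask = 0xFFFFFFFFu;
    return __shfl_sync(full_mask, my_radius, active_lane);
}
```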

Contributor Author:

So should I proceed with that method for just radius or should we also use the same method to get query? Or would it be better to leave it as is?

Member:

So it needed a dive into Nvidia's memory architecture: we're using deprecated Nvidia functionality where volatile memory was treated as a replacement for atomics, since they lacked atomic intrinsics.

Can you provide a source for this? The search results I'm getting even within the context of CUDA simply discuss the traditional volatile definition.

kunaltyagi (Member) May 31, 2020:

Found this presentation as a digestible source. CUDA didn't have a "nice" memory model prior to 2017-2018 (not their fault; everyone was neglecting that to some degree. Even the C++11 memory model had bugs till C++17 that got retroactively fixed).

TLDR: lack of emphasis on memory models resulted in people misusing volatile as atomics, whereas atomics need memory fences (implicitly added by all good implementations, like the stdlib). This observation is visible both in the presentation I linked to and in other source material arguing for std::atomic or asm intrinsics over volatile.

@haritha-j Reducing the use of volatile would give better results and that should be the direction we take.
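As a hedged sketch of the contrast described above (names hypothetical, using libcu++'s cuda::atomic, which ships with recent CUDA toolkits): volatile only forces the compiler to re-issue the memory access, while an atomic carries the inter-thread ordering semantics this pattern actually needs.

```cuda
#include <cuda/atomic>  // libcu++, available in recent CUDA toolkits

// Deprecated trick: volatile prevents the value from being cached in a
// register, but gives no inter-thread ordering guarantees.
__device__ volatile int ready_old;

// Recommended: an explicit atomic whose loads/stores carry memory-order
// semantics; the implementation inserts any fences that are required.
__device__ cuda::atomic<int, cuda::thread_scope_device> ready_new;

__device__ void publish(int* payload)
{
    *payload = 42;
    // Release store: the payload write above becomes visible to any
    // thread that observes ready_new == 1 with an acquire load.
    ready_new.store(1, cuda::std::memory_order_release);
}
```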

haritha-j (Contributor Author):

@haritha-j Can you please search and link the issues this PR solves? (If there are any other than 3583)

@kunaltyagi There don't seem to be any other issues linked to radius search. There is one at #3649 on approx nearest search which seems similar; however, there's no directly analogous fix.

haritha-j (Contributor Author):

This does not fix the somewhat fishy use of volatile yet; it just aims to address the inaccuracy in the results. This PR passes the gpu_octree_radius_search test on my local machine.

@haritha-j haritha-j added the priority: gsoc Reason for prioritization label Aug 5, 2020
larshg (Contributor) left a comment:

I have run the test and it passes here as well.

@SergioRAgostinho SergioRAgostinho removed the needs: code review Specify why not closed/merged yet label Aug 6, 2020
SergioRAgostinho (Member) left a comment:

Good to merge from my side if CI is green and after @haritha-j confirms that this is fine.

I've built and run the corresponding tests locally; things are green.

kunaltyagi (Member):

@haritha-j Shall we merge?

@kunaltyagi kunaltyagi added the needs: author reply Specify why not closed/merged yet label Aug 7, 2020
kunaltyagi (Member):

No API/ABI/behavioural changes. Possibility of shipping in 1.11.1

SergioRAgostinho (Member):

He added more code. Let me have a quick look.

kunaltyagi (Member):

Please merge at your convenience 😄 🚀

SergioRAgostinho (Member) left a comment:

I'm ignoring questionable use of signed integer types for this PR.

```diff
 {
-  PointType q = queries.data[query_index];
+  const PointType q = queries.data[query_index];
```
Member:

const ref please

```diff
 {
-  PointType q = queries[queries_indices[query_index]];
+  const PointType q = queries[queries_indices[query_index]];
```
Member:

same

gpu/octree/src/cuda/radius_search.cu (two resolved review threads)
@SergioRAgostinho SergioRAgostinho removed the needs: author reply Specify why not closed/merged yet label Aug 14, 2020
@SergioRAgostinho SergioRAgostinho merged commit 3dcfd6c into PointCloudLibrary:master Aug 14, 2020

Successfully merging this pull request may close these issues.

Inaccurate results with gpu octree search example
4 participants