
HDBSCAN bug on A100 #4024

Merged: 4 commits merged into rapidsai:branch-21.08 on Jul 6, 2021

Conversation

@divyegala (Member) commented Jul 1, 2021

While this issue only appeared on an A100, it could have appeared on any other GPU. In the kernel `condense_hierarchy_kernel` (cpp/src/hdbscan/detail/kernels/condense.cuh), we launch a thread for every node of a binary tree on the GPU. The problem that then occurs is:

  1. Each node marks itself out of the frontier: `frontier[node] = false;`
  2. Every node that is not a leaf marks its left and right children into the frontier: `frontier[left_child] = true;` (and likewise for the right child)

This is UB: a node's thread marks that node out of the frontier, while the thread for its (non-leaf) parent may simultaneously be marking it back into the frontier.

Edit: Dropped __threadfence() solution as it wasn't fully correct. Using a next_frontier array instead to keep track of the frontier for the next BFS iteration.
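
To make the race and the fix concrete, here is a minimal sketch of the `next_frontier` idea; the kernel name, parameters, and child encoding below are illustrative placeholders, not the actual cuML kernel:

```cuda
// Sketch: each BFS level reads only `frontier` and writes only `next_frontier`,
// so a thread never races with its parent's write to the same frontier entry.
// A node "removes itself" implicitly because next_frontier starts zeroed.
__global__ void condense_step_sketch(const bool* frontier,
                                     bool* next_frontier,
                                     const int* left_children,
                                     const int* right_children,
                                     int n_nodes) {
  int node = blockIdx.x * blockDim.x + threadIdx.x;
  if (node >= n_nodes || !frontier[node]) return;

  int left  = left_children[node];   // -1 encodes "no child" in this sketch
  int right = right_children[node];
  if (left >= 0)  next_frontier[left]  = true;
  if (right >= 0) next_frontier[right] = true;
}
```

Between levels, the host swaps the two buffers and clears the one that will serve as `next_frontier` for the next launch.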

@divyegala requested review from a team as code owners on July 1, 2021 20:36
@github-actions bot added labels: CUDA/C++, Cython/Python on Jul 1, 2021
@divyegala added labels: 3 - Ready for Review, bug, CUDA / C++, non-breaking and removed labels: CUDA/C++, Cython/Python on Jul 1, 2021
@divyegala mentioned this pull request on Jul 1, 2021
@github-actions bot added labels: CUDA/C++, Cython/Python on Jul 1, 2021
@cjnolet (Member) commented Jul 1, 2021

> This is UB: a node's thread marks that node out of the frontier, while the thread for its (non-leaf) parent may simultaneously be marking it back into the frontier.

@divyegala, I'm a little bit confused by this. Since this is a BFS over each level of a tree, each node is visited only once, with a separate kernel launch per level. This means a child will never be visited in the same kernel launch as its parent, and the threads all act independently of one another within each launch. The cudaStreamSynchronize after each level should also guarantee the subsequent launch sees the proper state of the frontier. I'm assuming you noticed this change fixed the test, but I'm still trying to figure out the root cause.

EDIT: Never mind, I just remembered that n_nodes threads are always being launched, so there is definitely a potential race condition here! I think we may need to create a temporary array for this intermediate state, unfortunately. Even with the thread fence, it's possible for a node's thread to run AFTER that node has been marked by its parent. Fortunately, the frontier is a very small array. (This case is more likely to present itself as the number of data samples grows.)

@divyegala (Member, Author) commented:

@cjnolet that is true; I forgot that there is a read in the if conditional as well. In that case, we can get away without creating temporary arrays by doing a grid sync:

namespace cg = cooperative_groups;
auto grid = cg::this_grid();
...
frontier[node] = false;
grid.sync();

@cjnolet (Member) commented Jul 1, 2021

@divyegala,

I'm not familiar with the cooperative groups grid sync (and I'm still out of office, but I just wanted to chime in on this thread to make sure this problem is actually solved). We just need to make sure it's not possible for any thread (which could be scheduled in different blocks at different times, depending on the number of data samples) to read the frontier in the same kernel after its parent marks its children in the frontier.

According to this blog: "It also provides host-side APIs to launch grids whose threads are all guaranteed to be executing concurrently to enable synchronization across thread blocks." If that can make the guarantee above, then we should be fine using it. Unless it can cause a potential deadlock (which I doubt), I'm not at all concerned about the perf impact here because the condensing step isn't a bottleneck.
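
For reference, here is a minimal sketch of the cooperative launch that the blog describes; the kernel body, names, and launch parameters are placeholders for illustration, not the actual cuML code:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Placeholder kernel (not the cuML kernel): each thread clears its frontier
// entry, then waits on a grid-wide sync before any thread reads the frontier.
__global__ void frontier_sync_sketch(bool* frontier, int n_nodes) {
  int node = blockIdx.x * blockDim.x + threadIdx.x;
  if (node < n_nodes) frontier[node] = false;
  cg::this_grid().sync();  // only valid under a cooperative launch
  // ... reads of `frontier` for this BFS level would go here ...
}

// Host-side cooperative launch: every block must be resident on the device
// at the same time, otherwise the launch fails.
void launch_sketch(bool* frontier, int n_nodes, cudaStream_t stream) {
  constexpr int tpb = 256;
  dim3 grid((n_nodes + tpb - 1) / tpb), block(tpb);
  void* args[] = {&frontier, &n_nodes};
  cudaLaunchCooperativeKernel((void*)frontier_sync_sketch, grid, block, args, 0, stream);
}
```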

@divyegala (Member, Author) commented:

@cjnolet you may be right here. For the cooperative groups grid sync to work, I think we need to be able to guarantee that all threads can fit on the device at the same time, which I don't think we can. Let me think a little more on this, otherwise, I'll go ahead and use the intermediate array solution.
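
As a sanity check on that concern, one could compare the number of blocks the kernel needs against how many can be resident on the device at once; this sketch assumes the placeholder `frontier_sync_sketch` kernel from above and is not part of the PR:

```cuda
// Sketch: is a cooperative (grid.sync) launch even feasible for n_nodes threads?
bool grid_sync_feasible(int n_nodes, int tpb = 256) {
  int device = 0, num_sms = 0, blocks_per_sm = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
  // Max resident blocks of this kernel per SM at the chosen block size.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, frontier_sync_sketch, tpb, 0);
  int needed_blocks = (n_nodes + tpb - 1) / tpb;
  return needed_blocks <= blocks_per_sm * num_sms;
}
```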

@dantegd added label: 4 - Waiting on Author and removed label: 3 - Ready for Review on Jul 1, 2021
@divyegala added label: 4 - Waiting on Reviewer and removed label: 4 - Waiting on Author on Jul 2, 2021
@codecov-commenter commented:
Codecov Report

No coverage uploaded for pull request base (branch-21.08@033a21f).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #4024   +/-   ##
===============================================
  Coverage                ?   85.46%           
===============================================
  Files                   ?      230           
  Lines                   ?    18133           
  Branches                ?        0           
===============================================
  Hits                    ?    15498           
  Misses                  ?     2635           
  Partials                ?        0           
Flag | Coverage Δ
---- | ----------
dask | 48.14% <0.00%> (?)
non-dask | 77.75% <0.00%> (?)

Flags with carried forward coverage won't be shown.


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 033a21f...c3bcf34.

@divyegala (Member, Author) commented:

@cjnolet just tagging you for when you are back in office: I implemented the intermediate frontier solution. It would be great to get your review on this.
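
For context, a rough sketch of what a level-by-level driver with an intermediate frontier could look like, reusing the hypothetical `condense_step_sketch` kernel from above; the function, loop bound, and swap details are assumptions, not the exact implementation in this PR:

```cuda
#include <utility>  // std::swap

// Sketch: level-by-level BFS with a double-buffered frontier. Each launch reads
// `frontier` and writes only `next_frontier`; the buffers are swapped between
// levels, so a node never races with its parent within the same kernel.
void condense_bfs_sketch(bool* frontier, bool* next_frontier,
                         const int* left_children, const int* right_children,
                         int n_nodes, int n_levels, cudaStream_t stream) {
  constexpr int tpb = 256;
  int blocks = (n_nodes + tpb - 1) / tpb;
  for (int level = 0; level < n_levels; ++level) {
    condense_step_sketch<<<blocks, tpb, 0, stream>>>(
        frontier, next_frontier, left_children, right_children, n_nodes);
    cudaStreamSynchronize(stream);        // per-level sync, as in the discussion above
    std::swap(frontier, next_frontier);   // next level reads what this level produced
    cudaMemsetAsync(next_frontier, 0, n_nodes * sizeof(bool), stream);
  }
}
```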

@dantegd requested a review from cjnolet on July 2, 2021 17:49
@cjnolet (Member) left a review:


LGTM

@cjnolet (Member) commented Jul 6, 2021

@gpucibot merge

@rapids-bot merged commit 388f9d3 into rapidsai:branch-21.08 on Jul 6, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4024
Labels: 4 - Waiting on Reviewer, bug, CUDA / C++, CUDA/C++, Cython / Python, non-breaking