
HDBSCAN bug on A100 #4024

Merged: 4 commits merged into rapidsai:branch-21.08 on Jul 6, 2021

Conversation

@divyegala (Member) commented Jul 1, 2021

While this issue only appeared on an A100, it could have appeared on any other GPU. In the kernel `condense_hierarchy_kernel` (cpp/src/hdbscan/detail/kernels/condense.cuh), we launch a thread for every node of a binary tree on the GPU. The problem that then occurs is:

  1. Each node marks itself out of the frontier: `frontier[node] = false;`
  2. Every node that is not a leaf marks its left and right children into the frontier: `frontier[left_child] = true;` (and likewise for the right child)

This is UB: a node's thread marks that node out of the frontier, while the thread for its (non-leaf) parent may simultaneously be marking it back into the frontier.

Edit: Dropped __threadfence() solution as it wasn't fully correct. Using a next_frontier array instead to keep track of the frontier for the next BFS iteration.
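
To make the race and the fix concrete, here is a minimal sketch of the `next_frontier` idea; the kernel name, parameters, and child encoding below are illustrative placeholders, not the actual cuML kernel:

```cuda
// Sketch: each BFS level reads only `frontier` and writes only `next_frontier`,
// so a thread never races with its parent's write to the same frontier entry.
// A node "removes itself" implicitly because next_frontier starts zeroed.
__global__ void condense_step_sketch(const bool* frontier,
                                     bool* next_frontier,
                                     const int* left_children,
                                     const int* right_children,
                                     int n_nodes) {
  int node = blockIdx.x * blockDim.x + threadIdx.x;
  if (node >= n_nodes || !frontier[node]) return;

  int left  = left_children[node];   // -1 encodes "no child" in this sketch
  int right = right_children[node];
  if (left >= 0)  next_frontier[left]  = true;
  if (right >= 0) next_frontier[right] = true;
}
```

Between levels, the host swaps the two buffers and clears the one that will serve as `next_frontier` for the next launch.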

@divyegala requested review from a team as code owners on July 1, 2021 20:36
@github-actions bot added labels: CUDA/C++, Cython/Python on Jul 1, 2021
@divyegala added labels: 3 - Ready for Review, bug, CUDA / C++, non-breaking and removed labels: CUDA/C++, Cython/Python on Jul 1, 2021
@divyegala mentioned this pull request on Jul 1, 2021
@github-actions bot added labels: CUDA/C++, Cython/Python on Jul 1, 2021
@cjnolet (Member) commented Jul 1, 2021

> This is UB: a node's thread marks that node out of the frontier, while the thread for its (non-leaf) parent may simultaneously be marking it back into the frontier.

@divyegala, I'm a little bit confused by this. Since this is a BFS over each level of a tree, each node is visited only once, with a separate kernel launch per level. This means a child will never be visited in the same kernel launch as its parent, and the threads all act independently of one another within each launch. The cudaStreamSynchronize after each level should also guarantee the subsequent launch sees the proper state of the frontier. I'm assuming you noticed this change fixed the test, but I'm still trying to figure out the root cause.

EDIT: Never mind, I just remembered that n_nodes threads are always being launched, so there is definitely a potential race condition here! I think we may need to create a temporary array for this intermediate state, unfortunately. Even with the thread fence, it's possible for a node's thread to run AFTER that node has been marked by its parent. Fortunately, the frontier is a very small array. (This case is more likely to present itself as the number of data samples grows.)

@divyegala (Member, Author) commented:

@cjnolet that is true; I forgot that there is a read in the if conditional as well. In that case, we can get away without creating temporary arrays by doing a grid sync:

namespace cg = cooperative_groups;
auto grid = cg::this_grid();
...
frontier[node] = false;
grid.sync();

@cjnolet (Member) commented Jul 1, 2021

@divyegala,

I'm not familiar with the cooperative groups grid sync (and I'm still out of office, but I just wanted to chime in on this thread to make sure this problem is actually solved). We just need to make sure it's not possible for any thread (which could be scheduled in different blocks at different times, depending on the number of data samples) to read the frontier in the same kernel after its parent marks its children in the frontier.

According to this blog: "It also provides host-side APIs to launch grids whose threads are all guaranteed to be executing concurrently to enable synchronization across thread blocks." If that can make the guarantee above, then we should be fine using it. Unless it can cause a potential deadlock (which I doubt), I'm not at all concerned about the perf impact here because the condensing step isn't a bottleneck.
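
For reference, here is a minimal sketch of the cooperative launch that the blog describes; the kernel body, names, and launch parameters are placeholders for illustration, not the actual cuML code:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Placeholder kernel (not the cuML kernel): each thread clears its frontier
// entry, then waits on a grid-wide sync before any thread reads the frontier.
__global__ void frontier_sync_sketch(bool* frontier, int n_nodes) {
  int node = blockIdx.x * blockDim.x + threadIdx.x;
  if (node < n_nodes) frontier[node] = false;
  cg::this_grid().sync();  // only valid under a cooperative launch
  // ... reads of `frontier` for this BFS level would go here ...
}

// Host-side cooperative launch: every block must be resident on the device
// at the same time, otherwise the launch fails.
void launch_sketch(bool* frontier, int n_nodes, cudaStream_t stream) {
  constexpr int tpb = 256;
  dim3 grid((n_nodes + tpb - 1) / tpb), block(tpb);
  void* args[] = {&frontier, &n_nodes};
  cudaLaunchCooperativeKernel((void*)frontier_sync_sketch, grid, block, args, 0, stream);
}
```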

@divyegala (Member, Author) commented:

@cjnolet you may be right here. For the cooperative groups grid sync to work, I think we need to be able to guarantee that all threads can fit on the device at the same time, which I don't think we can. Let me think a little more on this, otherwise, I'll go ahead and use the intermediate array solution.
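
As a sanity check on that concern, one could compare the number of blocks the kernel needs against how many can be resident on the device at once; this sketch assumes the placeholder `frontier_sync_sketch` kernel from above and is not part of the PR:

```cuda
// Sketch: is a cooperative (grid.sync) launch even feasible for n_nodes threads?
bool grid_sync_feasible(int n_nodes, int tpb = 256) {
  int device = 0, num_sms = 0, blocks_per_sm = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
  // Max resident blocks of this kernel per SM at the chosen block size.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, frontier_sync_sketch, tpb, 0);
  int needed_blocks = (n_nodes + tpb - 1) / tpb;
  return needed_blocks <= blocks_per_sm * num_sms;
}
```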

@dantegd added label: 4 - Waiting on Author and removed label: 3 - Ready for Review on Jul 1, 2021
@divyegala added label: 4 - Waiting on Reviewer and removed label: 4 - Waiting on Author on Jul 2, 2021
@codecov-commenter commented:
Codecov Report

No coverage uploaded for pull request base (branch-21.08@033a21f).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #4024   +/-   ##
===============================================
  Coverage                ?   85.46%           
===============================================
  Files                   ?      230           
  Lines                   ?    18133           
  Branches                ?        0           
===============================================
  Hits                    ?    15498           
  Misses                  ?     2635           
  Partials                ?        0           
Flag | Coverage Δ
---- | ----------
dask | 48.14% <0.00%> (?)
non-dask | 77.75% <0.00%> (?)

Flags with carried forward coverage won't be shown.


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 033a21f...c3bcf34.

@divyegala (Member, Author) commented:

@cjnolet just tagging you for when you are back in office: I implemented the intermediate frontier solution. It would be great to get your review on this.
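
For context, a rough sketch of what a level-by-level driver with an intermediate frontier could look like, reusing the hypothetical `condense_step_sketch` kernel from above; the function, loop bound, and swap details are assumptions, not the exact implementation in this PR:

```cuda
#include <utility>  // std::swap

// Sketch: level-by-level BFS with a double-buffered frontier. Each launch reads
// `frontier` and writes only `next_frontier`; the buffers are swapped between
// levels, so a node never races with its parent within the same kernel.
void condense_bfs_sketch(bool* frontier, bool* next_frontier,
                         const int* left_children, const int* right_children,
                         int n_nodes, int n_levels, cudaStream_t stream) {
  constexpr int tpb = 256;
  int blocks = (n_nodes + tpb - 1) / tpb;
  for (int level = 0; level < n_levels; ++level) {
    condense_step_sketch<<<blocks, tpb, 0, stream>>>(
        frontier, next_frontier, left_children, right_children, n_nodes);
    cudaStreamSynchronize(stream);        // per-level sync, as in the discussion above
    std::swap(frontier, next_frontier);   // next level reads what this level produced
    cudaMemsetAsync(next_frontier, 0, n_nodes * sizeof(bool), stream);
  }
}
```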

@dantegd requested a review from cjnolet on July 2, 2021 17:49
@cjnolet (Member) left a review:


LGTM

@cjnolet (Member) commented Jul 6, 2021

@gpucibot merge

@rapids-bot merged commit 388f9d3 into rapidsai:branch-21.08 on Jul 6, 2021
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4024
Labels: 4 - Waiting on Reviewer, bug, CUDA / C++, CUDA/C++, Cython / Python, non-breaking