-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MST optimization to guarantee the connectivity of CAGRA graphs #237
Add MST optimization to guarantee the connectivity of CAGRA graphs #237
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Akira for porting the PR to cuVS! The original PR rapidsai/raft#2218 was already reviewed, and I see that most of the issues are addressed here. I have a marked a few smaller issues below.
There is some scope for improving code readability by relying more on mdspan, and adding a helper function to check duplicates (e.g. here). @anaruse I could easily apply the mdspan changes on the PR, if you are fine with that.
Thanks for the review, Tamas. I have fixed the points you pointed out, and it would be helpful if you could make the mdspan related changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Akira for the update. The PR looks good to me.
Before this is merged can we please have a GitHub issue created to expose this through the public APIs? We already have a use-case for this API in HDBSCAN and single linkage clustering. |
Also, given that both of you have made changes to the PR, I’d like to have a 3rd person do a quick look over. I’ll take a look tomorrow. |
Just noticed this issue #256 was created, so disregard my first comment. |
Thanks Corey, more reviews are welcome! Note that I have only added a merge commit to this PR to bring it up to date with base branch. I planned to add mdspan usage related refactoring, but I decided to postpone that in a follow up (added a note in #256). This PR is already useful in its current form: it helps to exclude/debug one issue that can lead to low recall. That is why I would prefer to get it merged even with some pending refactoring. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. It would be nice to have a little more detail in the description of the guarantee_connectivity
param, but i don't think that needs to hold up this PR.
/ok to test |
/ok to test |
/ok to test |
/merge |
…apidsai#237) This PR allows us to guarantee the connectivity of the CAGRA search graph using approximate MST. It has been empirically shown that the graph indexes generated by CAGRA for search provide comparable search accuracy to other libraries, but reachability from any node to all nodes is not guaranteed. In fact, it has been confirmed that the number of strongly connected components (SCC) of graph indexes created by CAGRA is not 1 in some 100M scale datasets. This problem can be alleviated by increasing the number of degrees in the search graph, but this would increase the size of the graph index. It is desirable to address this problem without increasing the number of degrees of the search graph. Prior study has shown that this can be solved by using a Minimum Spanning Tree (MST)-like approach, but in general, MST calculation takes a long time. However, what is needed here is not an exact MST, but, for example, an approximate MST in which the total number of edges is not necessarily minimum. Such an approximate MST could be computed quickly on GPUs. This PR contains implementation to create a approximate MST on the GPU at high speed based on the above policy and use it to guarantee the connectivity of the search graph. This functionality is not always required, so it is considered an opt-in feature. A member variable named `guarantee_connectivity` is added to `index_params`, so set this variable to `true` if you wish to use this featgure. > cuvs::neighbors::cagra::index_params index_params; > index_params.guarantee_connectivity = true; > auto index = cuvs::neighbors::cagra::build(res, index_params, dataset_view); Authors: - Akira Naruse (https://github.com/anaruse) - Tamas Bela Feher (https://github.com/tfeher) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Tamas Bela Feher (https://github.com/tfeher) - Corey J. Nolet (https://github.com/cjnolet) URL: rapidsai#237
This PR allows us to guarantee the connectivity of the CAGRA search graph using approximate MST.
It has been empirically shown that the graph indexes generated by CAGRA for search provide comparable search accuracy to other libraries, but reachability from any node to all nodes is not guaranteed. In fact, it has been confirmed that the number of strongly connected components (SCC) of graph indexes created by CAGRA is not 1 in some 100M scale datasets.
This problem can be alleviated by increasing the number of degrees in the search graph, but this would increase the size of the graph index. It is desirable to address this problem without increasing the number of degrees of the search graph.
Prior study has shown that this can be solved by using a Minimum Spanning Tree (MST)-like approach, but in general, MST calculation takes a long time. However, what is needed here is not an exact MST, but, for example, an approximate MST in which the total number of edges is not necessarily minimum. Such an approximate MST could be computed quickly on GPUs.
This PR contains implementation to create a approximate MST on the GPU at high speed based on the above policy and use it to guarantee the connectivity of the search graph.
This functionality is not always required, so it is considered an opt-in feature. A member variable named
guarantee_connectivity
is added toindex_params
, so set this variable totrue
if you wish to use this featgure.