Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix dual-tree traversal #67

Merged
merged 1 commit into from
Feb 21, 2024
Merged

fix dual-tree traversal #67

merged 1 commit into from
Feb 21, 2024

Conversation

azizkayumov
Copy link
Contributor

This PR fixes the bug that's causing HDbscan with boruvka: true to produce a heavier MST than HDbscan with boruvka: false.

The order of dual-tree traversal between query and reference trees seems to be misplaced, please check the Python HDBSCAN code here.

@azizkayumov
Copy link
Contributor Author

Here are the steps to reproduce:

  1. Download the Wine dataset.
  2. Extract winequality-white.csv and put in the project folder.
  3. Replace all ; with , and remove the header info (as required by this lib).
  4. Add println! to print the MST weight (this line seems to be a good place for simplicity).
  5. Change the example code to disable Boruvka: boruvka = true => boruvka = false
  6. Run the example code:
    cargo run --example hdbscan winequality-white.csv
    This prints the following (which is the ground truth exact MST):
weight: 26787.419129474838
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 8
# of events clustered: 1564
# of outliers: 3334
  1. Change the example code to enable Boruvka: boruvka = false => boruvka = true:
  2. Run the example code again:
    cargo run --example hdbscan winequality-white.csv
    This will output:
weight: 27096.304131959016
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 5
# of events clustered: 4658
# of outliers: 240

As you can see boruvka = true generates a wrong MST, so the clustering results are also affected.
This PR should fix the bug and print the following:

weight: 26788.24022864788
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 8
# of events clustered: 1566
# of outliers: 3332

There is still a minor difference in the MST weights, I suppose this should be helpful for future readers.

@msk msk requested a review from minshao February 14, 2024 17:57
@azizkayumov
Copy link
Contributor Author

The small difference between Boruvka MST and Prim's MST looks to be caused by lower bounding the query node.
A quick modification to the bound function as follows should be sufficient:

    #[inline]
    fn bound(&self, parent: usize) -> A {
        let left = 2 * parent + 1;
        let right = left + 1;

        let upper = if self.bounds[left] > self.bounds[right] {
            self.bounds[left]
        } else {
            self.bounds[right]
        };

        upper
        // not using the lower bound
    }

Then, the clustering output should be:

weight: 26787.419129474893
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 2
# of events clustered: 4643
# of outliers: 255

Compared to the clustering output of Prims (ground truth):

weight: 26787.419129474838
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 8
# of events clustered: 1564
# of outliers: 3334

Now, both Boruvka and Prims compute MSTs with the same weight, but their clustering output are different, it seems either of these clustering results are acceptable (or MST condensing and cluster stability calculation might be messing with the clustering output of Boruvka?).
I didn't push the aforementioned change to the bound function as it was reported to increase the running time of Boruvka (will open a new issue).

@msk msk merged commit c182936 into petabi:main Feb 21, 2024
1 of 7 checks passed
msk added a commit that referenced this pull request Feb 21, 2024
msk added a commit that referenced this pull request Feb 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants