fix dual-tree traversal #67

azizkayumov · 2024-02-14T06:56:09Z

This PR fixes the bug that's causing HDbscan with boruvka: true to produce a heavier MST than HDbscan with boruvka: false.

The order of the dual-tree traversal between the query and reference trees appears to be wrong; please compare it with the Python HDBSCAN code here.

azizkayumov · 2024-02-14T13:23:09Z

Here are the steps to reproduce:

Download the Wine dataset.
Extract winequality-white.csv and put it in the project folder.
Replace all ; with , and remove the header row (as required by this library).
Add a println! to print the MST weight (this line seems to be a good place for simplicity).
Change the example code to disable Boruvka: boruvka = true => boruvka = false.
Run the example code:
cargo run --example hdbscan winequality-white.csv
This prints the following (which is the ground truth exact MST):

weight: 26787.419129474838
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 8
# of events clustered: 1564
# of outliers: 3334

Change the example code to enable Boruvka: boruvka = false => boruvka = true.
Run the example code again:
cargo run --example hdbscan winequality-white.csv
This will output:

weight: 27096.304131959016
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 5
# of events clustered: 4658
# of outliers: 240

As you can see, boruvka = true produces a heavier MST (27096.30 vs. 26787.42), which cannot be a minimum spanning tree, so the clustering results are affected as well.
With this PR applied, boruvka = true should print the following:

weight: 26788.24022864788
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 8
# of events clustered: 1566
# of outliers: 3332

There is still a minor difference in the MST weights (about 0.82). I am noting it here in case it is helpful for future readers.
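For anyone repeating the reproduction, the preprocessing step (replace ; with , and drop the header row) can be sketched in Rust as below. This is only a minimal sketch: `normalize_csv` is a hypothetical helper name, not part of this library, and it assumes the header is exactly the first line of the file.

```rust
/// Hypothetical helper (not part of this library): drops the first line,
/// which is assumed to be the header, skips blank lines, and replaces
/// every `;` separator with `,`.
fn normalize_csv(raw: &str) -> String {
    raw.lines()
        .skip(1) // assumption: the header is exactly the first line
        .filter(|line| !line.trim().is_empty())
        .map(|line| line.replace(';', ","))
        .collect::<Vec<_>>()
        .join("\n")
}
```

It could be applied to the extracted file by reading it with `std::fs::read_to_string`, passing the contents through `normalize_csv`, and writing the result back with `std::fs::write`.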

azizkayumov · 2024-02-15T07:41:01Z

The small difference between the Boruvka MST and Prim's MST appears to be caused by lower-bounding the query node.
A quick modification to the bound function, as follows, should be sufficient:

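    #[inline]
    fn bound(&self, parent: usize) -> A {
        // Children of `parent` in the implicit binary-tree layout.
        let left = 2 * parent + 1;
        let right = left + 1;

        // Return the larger of the two child bounds (the upper bound);
        // the lower bound is intentionally not used.
        if self.bounds[left] > self.bounds[right] {
            self.bounds[left]
        } else {
            self.bounds[right]
        }
    }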

Then, the clustering output should be:

weight: 26787.419129474893
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 2
# of events clustered: 4643
# of outliers: 255

For comparison, the clustering output of Prim's (ground truth):

weight: 26787.419129474838
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 8
# of events clustered: 1564
# of outliers: 3334

Now both Boruvka and Prim's compute MSTs with the same weight (up to floating-point rounding), but their clustering outputs differ. Either clustering result may be acceptable, or the MST condensing and cluster stability calculation might be distorting the clustering output of Boruvka.
I didn't push the aforementioned change to the bound function because it was reported to increase the running time of Boruvka (I will open a new issue).
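To check a Boruvka result against a reference independently of the library, the comparison used in this thread can be sketched as a dense O(n^2) Prim's algorithm plus a relative-tolerance check. This is a sketch under assumptions: `prim_mst_weight` and `approx_eq` are hypothetical names, and the `dist` closure stands in for whatever metric the MST is built on (HDBSCAN uses mutual reachability distance, which is not computed here).

```rust
/// Hypothetical reference: total weight of the MST of the complete graph on
/// `n` vertices, computed with the dense O(n^2) variant of Prim's algorithm.
/// `dist(i, j)` is assumed to be symmetric and non-negative.
fn prim_mst_weight(n: usize, dist: impl Fn(usize, usize) -> f64) -> f64 {
    if n == 0 {
        return 0.0;
    }
    let mut in_tree = vec![false; n];
    // best[v] = cheapest known edge connecting v to the current tree.
    let mut best = vec![f64::INFINITY; n];
    best[0] = 0.0;
    let mut total = 0.0;
    for _ in 0..n {
        // Pick the cheapest vertex that is not yet in the tree.
        let mut u = usize::MAX;
        for v in 0..n {
            if !in_tree[v] && (u == usize::MAX || best[v] < best[u]) {
                u = v;
            }
        }
        in_tree[u] = true;
        total += best[u];
        // Relax the edges from the newly added vertex.
        for v in 0..n {
            if !in_tree[v] {
                let d = dist(u, v);
                if d < best[v] {
                    best[v] = d;
                }
            }
        }
    }
    total
}

/// Relative-tolerance comparison, so that weights differing only by
/// floating-point summation order are treated as equal.
fn approx_eq(a: f64, b: f64, rel_tol: f64) -> bool {
    (a - b).abs() <= rel_tol * a.abs().max(b.abs())
}
```

With a relative tolerance of 1e-9, the two weights reported above (26787.419129474893 and 26787.419129474838) compare as equal, while the 26788.24022864788 from the earlier comment does not.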

fix dual-tree traversal

c182936

msk requested a review from minshao February 14, 2024 17:57

minshao approved these changes Feb 20, 2024

msk merged commit c182936 into petabi:main Feb 21, 2024
1 of 7 checks passed

msk added a commit that referenced this pull request Feb 21, 2024

Update CHANGELOG for PR #67

8e9a6a5

msk added a commit that referenced this pull request Feb 21, 2024

Update CHANGELOG for PR #67

389190b

azizkayumov mentioned this pull request Feb 27, 2024

inconsistent Boruvka MST #69

Open
