Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inconsistent Boruvka MST #69

Open
azizkayumov opened this issue Feb 27, 2024 · 0 comments
Open

inconsistent Boruvka MST #69

azizkayumov opened this issue Feb 27, 2024 · 0 comments

Comments

@azizkayumov
Copy link
Contributor

As discussed in #67, there is a small difference between MSTs computed by Boruvka and Prim's algorithms in HDBSCAN.
Here are the steps to reproduce:

  1. Download the Wine dataset.
  2. Extract winequality-white.csv and put in the project folder.
  3. Replace all ; with , and remove the header info (as required by this lib).
  4. Add println! to print the MST weight (this line seems to be a good place for simplicity):
        let mut weight = A::zero();
        for (_, _, w) in &mst {
            weight += *w;
        }
        println!("weight: {:?}", weight);
  1. Change the example code to disable Boruvka: boruvka = true => boruvka = false
  2. Run the example code:
    cargo run --example hdbscan winequality-white.csv
    This prints the following (which is the ground truth exact MST):
weight: 26787.419129474838
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 8
# of events clustered: 1564
# of outliers: 3334
  1. Change the example code to enable Boruvka: boruvka = false => boruvka = true:
  2. Run the example code again:
    cargo run --example hdbscan winequality-white.csv
    This will output:
weight: 26788.24022864788
========= Report =========
# of events processed: 4898
# of features provided: 12
# of clusters: 8
# of events clustered: 1566
# of outliers: 3332

A quick fix would be not to use the lower bound function as suggested here, but this may increase the running time of Boruvka.

Interestingly, Python HDBSCAN does not have this bug if we run the same experiment (with the same configurations). It may be possible that Ball-tree implementation might be causing this (e.g. erroneous lower bounding between tree nodes due to loss of precision?).

@azizkayumov azizkayumov changed the title inconsistent Boruvka and Prim's MSTs inconsistent Boruvka MST Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant