Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect single linkage trees when using Boruvka ball trees for data with duplicate entries #404

Open
GregDemand opened this issue Jul 17, 2020 · 1 comment

Comments

@GregDemand
Copy link
Contributor

GregDemand commented Jul 17, 2020

The borvuka_balltree algorithm produces incorrect single linkage trees for some data sets with duplicate entries. The incorrect single linkage trees result in incorrect clusterings for these data sets.

It appears that the issue is mostly caused by a typo in BallTreeBoruvkaAlgorithm._compute_bounds().

As a note, I have been able to replicate the issue on two linux machines and failed to replicate the issue on a MacOS machine.

The issue can be seen using the following code snippet on the provided sample data (boruvka_testing_data.tar.gz) with duplicate entries:

import pandas as pd
import hdbscan

data = pd.read_csv('boruvka_testing_data.csv', index_col=0)

metric='euclidean'
min_cluster_size=50

clusterer = hdbscan.HDBSCAN(metric=metric, algorithm='generic', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_prims = hdbscan.HDBSCAN(metric=metric, algorithm='prims_balltree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_boruvka = hdbscan.HDBSCAN(metric=metric, algorithm='boruvka_balltree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_prims_kdtree = hdbscan.HDBSCAN(metric=metric, algorithm='prims_kdtree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)
clusterer_boruvka_kdtree = hdbscan.HDBSCAN(metric=metric, algorithm='boruvka_kdtree', min_cluster_size=min_cluster_size, approx_min_span_tree=False)

clusterer.fit(data)
clusterer_prims.fit(data)
clusterer_boruvka.fit(data)
clusterer_prims_kdtree.fit(data)
clusterer_boruvka_kdtree.fit(data)

print('Generic:', sum(clusterer.single_linkage_tree_.to_pandas()['distance']))
print('Prims balltree:', sum(clusterer_prims.single_linkage_tree_.to_pandas()['distance']))
print('Boruvka balltree:', sum(clusterer_boruvka.single_linkage_tree_.to_pandas()['distance']))
print('Prims kdtree:', sum(clusterer_prims_kdtree.single_linkage_tree_.to_pandas()['distance']))
print('Borvuka kdtree:', sum(clusterer_boruvka_kdtree.single_linkage_tree_.to_pandas()['distance']))

Using code from master produces the following output:

Generic: 913166.8450209984
Prims balltree: 900935.9500860934
Boruvka balltree: 1675822.769750423
Prims kdtree: 900935.9500860934
Borvuka kdtree: 913166.8450209984 

The sum of distances in the single_linkage tree for the Borvuka ball tree algorithm is almost twice as large as the correct value.

Using the linked pull request #394 (which also incorporates the fix for #393) produces:

Generic: 913166.8450209984
Prims balltree: 913166.8450209984
Boruvka balltree: 913166.8450209984
Prims kdtree: 913166.8450209984
Borvuka kdtree: 913166.8450209984

As a note, while this pull request produces a dramatic improvement in the results, it does not completely fix the problem. Running the same code with min_cluster_size=5 and the linked pull request, produces the output,

Generic: 218923.7797794972
Prims balltree: 218923.7797794972
Boruvka balltree: 219679.3560692969
Prims kdtree: 218923.7797794972
Borvuka kdtree: 219072.49151237102

which contains discrepancies for both the boruvka_balltree and boruvka_kdtree algorithms. These discrepancies
are on the order of 0.5% and should have a much smaller effect on the clustering.

@GregDemand
Copy link
Contributor Author

I've tracked down and fixed the discrepancy for the boruvka_kdtree algorithm and isolated the issue for the boruvka_balltree.

For the boruvka_kdtree algorithm, the discrepancy was caused by adding a reduced distance to the non-reduced distance when calculating the new_lower_bound. I have corrected this in my pull request.

For the boruvka_kdtree algorithm, the issue is also in the new_lower_bound, but I'm unsure of precisely what the issue is. If only the new_upper_bound is used, i.e. replace

new_bound = min(new_upper_bound, new_lower_bound + 2 * node1_info.radius)
with
new_bound=new_upper_bound,

I get the correct answer, but the algorithm would be less performant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant