Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

min_sample is off by one for prims_balltree, prims_kdtree, boruvka_balltree and sparse precomputed #393

Closed
GregDemand opened this issue Jun 12, 2020 · 1 comment · Fixed by #394

Comments

@GregDemand
Copy link
Contributor

GregDemand commented Jun 12, 2020

The prims_balltree, prims_kdtre, boruvka_balltree and sparse precomputed algorithms have min_sample size off by one, presumably due to how the kth nearest neighbour is counted.

This can be seen using the following code snippet,

import hdbscan
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.sparse import csr_matrix

X= np.array([
    [1],
    [2],
    [4],
    [9],
    [16],
    [25]
])

for algorithm in 'generic', 'boruvka_kdtree', 'boruvka_balltree', 'prims_kdtree', 'prims_balltree':
    for min_samples in [1,2,3,4]:
        clusterer = hdbscan.HDBSCAN(algorithm=algorithm, min_samples=min_samples, approx_min_span_tree=False, min_cluster_size=2, gen_min_span_tree=False).fit(X)
        print('{}, min_samples {}: {}'.format(algorithm, min_samples, sum(clusterer.single_linkage_tree_.to_pandas()['distance'])))
for min_samples in [1,2,3,4]:
    clusterer = hdbscan.HDBSCAN(min_samples=min_samples, metric='precomputed', approx_min_span_tree=False, min_cluster_size=2, gen_min_span_tree=False).fit(pairwise_distances(X))
    print('precomputed, min_samples {}: {}'.format(min_samples, sum(clusterer.single_linkage_tree_.to_pandas()['distance'])))
for min_samples in [1,2,3,4]:
    clusterer = hdbscan.HDBSCAN(min_samples=min_samples, metric='precomputed', approx_min_span_tree=False, min_cluster_size=2, gen_min_span_tree=False).fit(csr_matrix(pairwise_distances(X)))
    print('sparse precomputed, min_samples {}: {}'.format(min_samples, sum(clusterer.single_linkage_tree_.to_pandas()['distance'])))

which outputs

generic, min_samples 1: 24.0
generic, min_samples 2: 38.0
generic, min_samples 3: 55.0
generic, min_samples 4: 78.0
boruvka_kdtree, min_samples 1: 24.0
boruvka_kdtree, min_samples 2: 38.0
boruvka_kdtree, min_samples 3: 55.0
boruvka_kdtree, min_samples 4: 78.0
boruvka_balltree, min_samples 1: 24.0
boruvka_balltree, min_samples 2: 24.0
boruvka_balltree, min_samples 3: 38.0
boruvka_balltree, min_samples 4: 55.0
prims_kdtree, min_samples 1: 24.0
prims_kdtree, min_samples 2: 24.0
prims_kdtree, min_samples 3: 38.0
prims_kdtree, min_samples 4: 55.0
prims_balltree, min_samples 1: 24.0
prims_balltree, min_samples 2: 24.0
prims_balltree, min_samples 3: 38.0
prims_balltree, min_samples 4: 55.0
precomputed, min_samples 1: 24.0
precomputed, min_samples 2: 38.0
precomputed, min_samples 3: 55.0
precomputed, min_samples 4: 78.0
sparse precomputed, min_samples 1: 38.0
sparse precomputed, min_samples 2: 55.0
sparse precomputed, min_samples 3: 78.0
sparse precomputed, min_samples 4: 108.0

For prims_kdtree and prims_balltree, this can be fixed by changing min_samples to min_samples + 1 in the tree.query() call.

For boruvka_balltree I think that it should be fixed by changing min_samples to min_samples + 1 in _compute_bounds(). As a comment, this makes the _compute_bounds() function of BallTreeBoruvkaAlgorithm better match that of KDTreeBoruvkaAlgorithm.

For sparse precomputed it should be fixed by changing min_samples to min_samples - 1 in sparse_mutual_reachability().

I also noticed that min_samples is changed when the match_reference_implementation flag is set. If the comparison to the reference implementation was done using one of the changed algorithms then that code may also need to be changed.

I've linked a pull request with these suggested changes (excluding the possible change to match_reference_implementation).

@yossibiton
Copy link

I'm not sure about sparse_mutual_reachability, for the case where the sparse distances matrix contain the self loop.
So when it looks at : core_distance[i] = sorted_row_data[min_points], if you take off the self loop it actually looks at min_points neighbors (and not min_points+1), as it should be.

My assumption is that min_points refer to the number of neighbors without the data point itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants