min_sample is off by one for prims_balltree, prims_kdtree, boruvka_balltree and sparse precomputed #393

GregDemand · 2020-06-12T12:06:20Z

The prims_balltree, prims_kdtre, boruvka_balltree and sparse precomputed algorithms have min_sample size off by one, presumably due to how the kth nearest neighbour is counted.

This can be seen using the following code snippet,

import hdbscan
import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.sparse import csr_matrix

X= np.array([
    [1],
    [2],
    [4],
    [9],
    [16],
    [25]
])

for algorithm in 'generic', 'boruvka_kdtree', 'boruvka_balltree', 'prims_kdtree', 'prims_balltree':
    for min_samples in [1,2,3,4]:
        clusterer = hdbscan.HDBSCAN(algorithm=algorithm, min_samples=min_samples, approx_min_span_tree=False, min_cluster_size=2, gen_min_span_tree=False).fit(X)
        print('{}, min_samples {}: {}'.format(algorithm, min_samples, sum(clusterer.single_linkage_tree_.to_pandas()['distance'])))
for min_samples in [1,2,3,4]:
    clusterer = hdbscan.HDBSCAN(min_samples=min_samples, metric='precomputed', approx_min_span_tree=False, min_cluster_size=2, gen_min_span_tree=False).fit(pairwise_distances(X))
    print('precomputed, min_samples {}: {}'.format(min_samples, sum(clusterer.single_linkage_tree_.to_pandas()['distance'])))
for min_samples in [1,2,3,4]:
    clusterer = hdbscan.HDBSCAN(min_samples=min_samples, metric='precomputed', approx_min_span_tree=False, min_cluster_size=2, gen_min_span_tree=False).fit(csr_matrix(pairwise_distances(X)))
    print('sparse precomputed, min_samples {}: {}'.format(min_samples, sum(clusterer.single_linkage_tree_.to_pandas()['distance'])))

which outputs

generic, min_samples 1: 24.0
generic, min_samples 2: 38.0
generic, min_samples 3: 55.0
generic, min_samples 4: 78.0
boruvka_kdtree, min_samples 1: 24.0
boruvka_kdtree, min_samples 2: 38.0
boruvka_kdtree, min_samples 3: 55.0
boruvka_kdtree, min_samples 4: 78.0
boruvka_balltree, min_samples 1: 24.0
boruvka_balltree, min_samples 2: 24.0
boruvka_balltree, min_samples 3: 38.0
boruvka_balltree, min_samples 4: 55.0
prims_kdtree, min_samples 1: 24.0
prims_kdtree, min_samples 2: 24.0
prims_kdtree, min_samples 3: 38.0
prims_kdtree, min_samples 4: 55.0
prims_balltree, min_samples 1: 24.0
prims_balltree, min_samples 2: 24.0
prims_balltree, min_samples 3: 38.0
prims_balltree, min_samples 4: 55.0
precomputed, min_samples 1: 24.0
precomputed, min_samples 2: 38.0
precomputed, min_samples 3: 55.0
precomputed, min_samples 4: 78.0
sparse precomputed, min_samples 1: 38.0
sparse precomputed, min_samples 2: 55.0
sparse precomputed, min_samples 3: 78.0
sparse precomputed, min_samples 4: 108.0

For prims_kdtree and prims_balltree, this can be fixed by changing min_samples to min_samples + 1 in the tree.query() call.

For boruvka_balltree I think that it should be fixed by changing min_samples to min_samples + 1 in _compute_bounds(). As a comment, this makes the _compute_bounds() function of BallTreeBoruvkaAlgorithm better match that of KDTreeBoruvkaAlgorithm.

For sparse precomputed it should be fixed by changing min_samples to min_samples - 1 in sparse_mutual_reachability().

I also noticed that min_samples is changed when the match_reference_implementation flag is set. If the comparison to the reference implementation was done using one of the changed algorithms then that code may also need to be changed.

I've linked a pull request with these suggested changes (excluding the possible change to match_reference_implementation).

The text was updated successfully, but these errors were encountered:

yossibiton · 2020-12-07T13:48:43Z

I'm not sure about sparse_mutual_reachability, for the case where the sparse distances matrix contain the self loop.
So when it looks at : core_distance[i] = sorted_row_data[min_points], if you take off the self loop it actually looks at min_points neighbors (and not min_points+1), as it should be.

My assumption is that min_points refer to the number of neighbors without the data point itself.

GregDemand mentioned this issue Jun 12, 2020

Fixed off by one errors in min_samples for multiple algorithms #394

Merged

GregDemand mentioned this issue Jul 17, 2020

Incorrect single linkage trees when using Boruvka ball trees for data with duplicate entries #404

Open

lmcinnes closed this as completed in #394 Nov 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

min_sample is off by one for prims_balltree, prims_kdtree, boruvka_balltree and sparse precomputed #393

min_sample is off by one for prims_balltree, prims_kdtree, boruvka_balltree and sparse precomputed #393

GregDemand commented Jun 12, 2020 •

edited

Loading

yossibiton commented Dec 7, 2020

min_sample is off by one for prims_balltree, prims_kdtree, boruvka_balltree and sparse precomputed #393

min_sample is off by one for prims_balltree, prims_kdtree, boruvka_balltree and sparse precomputed #393

Comments

GregDemand commented Jun 12, 2020 • edited Loading

yossibiton commented Dec 7, 2020

GregDemand commented Jun 12, 2020 •

edited

Loading