The boruvka_balltree algorithm produces incorrect single linkage trees for some data sets with duplicate entries. The incorrect single linkage trees result in incorrect clusterings for these data sets.
It appears that the issue is mostly caused by a typo in BallTreeBoruvkaAlgorithm._compute_bounds().
As a note, I have been able to replicate the issue on two Linux machines but could not replicate it on a macOS machine.
The issue can be seen using the following code snippet on the provided sample data (boruvka_testing_data.tar.gz) with duplicate entries:
As a note, while this pull request dramatically improves the results, it does not completely fix the problem. Running the same code with min_cluster_size=5 and the linked pull request produces output that still contains discrepancies for both the boruvka_balltree and boruvka_kdtree algorithms. These discrepancies are on the order of 0.5% and should have a much smaller effect on the clustering.
I've tracked down and fixed the discrepancy for the boruvka_kdtree algorithm and isolated the issue for the boruvka_balltree.
For the boruvka_kdtree algorithm, the discrepancy was caused by adding a reduced distance to the non-reduced distance when calculating the new_lower_bound. I have corrected this in my pull request.
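To illustrate why mixing the two kinds of distance corrupts a bound, here is a minimal sketch. It assumes the Euclidean metric, where the reduced distance is the squared distance (the convention used by scikit-learn style KD-trees). The helper names `dist`, `rdist` and `rdist_to_dist`, and the `radius` value, are hypothetical and are not taken from the hdbscan source.

```python
import math

def dist(a, b):
    """True Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rdist(a, b):
    """Reduced distance: squared Euclidean distance (square root skipped)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rdist_to_dist(r):
    """Convert a reduced (squared) distance back to a true distance."""
    return math.sqrt(r)

p, q = (0.0, 0.0), (3.0, 4.0)
radius = 2.0  # hypothetical node radius, expressed as a true distance

# Wrong: adds a squared distance to a true distance, so the units disagree.
mixed = rdist(p, q) + radius

# Consistent: convert to a true distance first, then add.
consistent = rdist_to_dist(rdist(p, q)) + radius

print(mixed, consistent)
```

Because the two quantities are on different scales, the mixed sum is not a valid bound in either scale, which is the general kind of error described above.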
For the boruvka_balltree algorithm, the issue is also in the new_lower_bound, but I'm unsure of precisely what it is. If only the new_upper_bound is used, i.e. replacing
new_bound = min(new_upper_bound, new_lower_bound + 2 * node1_info.radius)
with
new_bound = new_upper_bound
I get the correct answer, but the algorithm would be less performant.
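The performance trade-off can be sketched as follows. This is only an illustration under stated assumptions: the function names and the numbers are hypothetical, and the pruning rule (skip a node pair whose minimum distance exceeds the bound) is the usual dual-tree convention rather than code copied from hdbscan.

```python
def combined_bound(new_upper_bound, new_lower_bound, radius):
    """Bound as written in the existing code: the tighter of two candidates."""
    return min(new_upper_bound, new_lower_bound + 2 * radius)

def upper_only_bound(new_upper_bound):
    """Workaround: ignore the lower-bound term entirely."""
    return new_upper_bound

def can_prune(node_pair_min_dist, bound):
    """Assumed pruning rule: skip a node pair that cannot beat the bound."""
    return node_pair_min_dist > bound

# Hypothetical values chosen so the two bounds differ.
upper, lower, radius = 10.0, 3.0, 1.0
node_pair_min_dist = 7.0

tight = combined_bound(upper, lower, radius)
loose = upper_only_bound(upper)

print(tight, can_prune(node_pair_min_dist, tight))
print(loose, can_prune(node_pair_min_dist, loose))
```

The upper-only bound is never smaller than the combined one, so it can only prune the same node pairs or fewer. That keeps the result correct as long as the upper bound itself is valid, but it means more node pairs are visited.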
Using code from master produces the following output:
The sum of distances in the single_linkage tree for the Boruvka ball tree algorithm is almost twice as large as the correct value.
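One way to obtain a reference value for such a comparison is a brute-force minimum spanning tree over mutual reachability distances. The sketch below is not the snippet from this issue and does not call hdbscan. The function names, the toy data, and the core-distance convention (distance to the k-th nearest other point) are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def core_distances(points, k):
    """Distance from each point to its k-th nearest other point (assumed convention)."""
    cores = []
    for i, p in enumerate(points):
        d = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        cores.append(d[k - 1])
    return cores

def mst_total_weight(points, k):
    """Prim's algorithm on the complete mutual reachability graph, O(n^2)."""
    n = len(points)
    core = core_distances(points, k)

    def mreach(i, j):
        return max(core[i], core[j], euclidean(points[i], points[j]))

    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge connecting each point to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                w = mreach(u, v)
                if w < best[v]:
                    best[v] = w
    return total

# Toy data with duplicate entries, the situation described in this issue.
points = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
reference = mst_total_weight(points, k=1)
print(reference)
```

Every minimum spanning tree of the same graph has the same total weight, so any exact algorithm should agree with this total up to floating-point error, even when ties caused by duplicates change which edges are chosen.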
Using the linked pull request #394 (which also incorporates the fix for #393) produces: