Well balanced hierarchical graph clustering.
It's based on the Paris algorithm, with just a tiny modification to make the trees more balanced. It's useful for example when you try to choose some cluster by traversing the tree top-down. It operates on networkx graphs.
pip install krakow
from krakow import krakow
dendrogram = krakow(Graph)
Graph
must be a networkx graph, and dendrogram
is in a linkage matrix format. If you want to traverse the tree from top to down, it's useful to convert this linkage matrix into a tree with scipy.cluster.hierarchy.to_tree
.
You can easily visualize dendrograms:
from krakow.utils import create_dendrogram
create_dendrogram(dendrogram)
Balance can be adjusted with the alpha
parameter. On default it's set to 2 (can be seen in the image above). When set to 1, the algorithm is identical to the Paris algorithm. The tree is clearly less balanced then:
dendrogram = krakow(Graph, alpha=1)
create_dendrogram(dendrogram)
Any value >= 1 can be set, but enforcing a very high balance can harm the clustering quality.
The only difference from the Paris algorithm, is the definition of distance between clusters. In Paris (see equation 2 in their paper), it's:
Here, it's defined as:
(with both alpha and beta >= 1)
In effect, big clusters are treated as having more distance between them, so small clusters will be merged first. This leads to a more balanced tree.
TODO: prove that for alpha, beta >= 1 algorithm is still reducible (which is necessary for correctness).
We can also experiment with other distance definitions (as long as they are reducible), and check which give the best balance while preserving clustering quality (clustering quality measured by Dasgupta's cost).