A python implementation for computing the Hopkins' statistic (Lawson and Jurs (1990)) for measuring clustering tendency of data.
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
However, clustering algorithms will locate and specify clusters in data even if none are present. It is therefore appropriate to measure the clustering tendency or randomness of a data set before subjecting it to a clustering algorithm.
To measure cluster tendency is to measure to what degree clusters exist in the data to be clustered, and may be performed as an initial test, before attempting clustering.
Hopkins’ statistic is a simple measure of clustering tendency. It is based on the difference between the distance from a real point to its nearest neighbor, U, and the distance from a randomly chosen point within thedata space to the nearest real data point, W.
- Let X be the set of n data points.
- Consider a random sample (without replacement) of m<<n data points with members . (Lawson and Jurs (1990)) suggest choosing 5% of the data points so that the nearest-neighbor distances will be independent and thus approximate a Beta distribution.
- Generate a set Y of m uniformly randomly distributed data points.
- Define two distance measures,
- if the data is d dimensional, then the Hopkins statistic is defined as:
If X were uniformly distributed, then and would be close to each other, and thus H would be about 0.5. However, if clusters are present in X, then the distances for artificial points would be substantially larger than for the real ones in expectation, and thus the value of H will increase .
A value for H higher than 0.75 indicates a clustering tendency at the 90% confidence level.
The null and the alternative hypotheses are defined as follow:
- Null hypothesis: the data set X is uniformly distributed (i.e., no meaningful clusters)
- Alternative hypothesis: the data set X is not uniformly distributed (i.e., contains meaningful clusters)
Therefore, we can interpret Hopkins' statistic in the following manner:
-
If the value is between {0.01, ...,0.3}, the data is regularly spaced.
-
If the value is around 0.5, it is random.
-
If the value is between {0.7, ..., 0.99}, it has a high tendency to cluster.
Hopkins' Statistic was calculated for the iris dataset which averaged at about 0.83 (>0.75). Thus, we reject the null hypothesis, and conclude the iris dataset is significantly a clusterable data.
- https://www.datanovia.com/en/lessons/assessing-clustering-tendency/#statistical-methods
- https://pubs.acs.org/doi/pdf/10.1021/ci00065a010
- https://datascience.stackexchange.com/questions/14142/cluster-tendency-using-hopkins-statistic-implementation-in-python
- https://matevzkunaver.wordpress.com/2017/06/20/hopkins-test-for-cluster-tendency/