This project contains an implementation of Lloyd's k-means algorithm, which clusters data into K distinct clusters.
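For readers unfamiliar with Lloyd's algorithm, here is a minimal NumPy sketch of its two alternating steps (assign each point to the nearest centroid, then move each centroid to the mean of its points). The function names `assign` and `lloyd`, the `max_iter` parameter, and the empty-cluster handling are illustrative assumptions and are not taken from k_means.py.

```python
import numpy as np

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

def lloyd(points, init_centroids, max_iter=100):
    """Hypothetical helper: run Lloyd's iterations from the given initial centroids."""
    points = np.asarray(points, dtype=float)
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest centroid.
        labels = assign(points, centroids)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for k in range(len(centroids)):
            members = points[labels == k]
            if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
                new_centroids[k] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)
```

The quality of the final clustering depends on `init_centroids`, which is why the choice of initialization method matters.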
Lloyd's algorithm is run with two different methods for initializing the cluster centroids: D2 sampling (green line plots) and Metropolis-Hastings sampling (red line plots). The experiment is conducted with three values of K: 10, 100, and 500. For Metropolis-Hastings initialization, the experiment also varies the length of the Markov chain, which is shown on the x-axis of the plots.
The results show that D2 sampling initialization consistently outperforms Metropolis-Hastings initialization. Performance is shown on the y-axis as the sum of squared distances of samples from their respective cluster centroids, so smaller values indicate better performance.
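The two initialization schemes can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this repository: the function names are hypothetical, the Metropolis-Hastings variant assumes a uniform proposal over the data points and a separate chain for each new centroid, and the data is assumed to contain at least K distinct points.

```python
import numpy as np

def sq_dist_to_nearest(x, centers):
    """Squared Euclidean distance from x (shape (d,) or (n, d)) to its nearest center."""
    diffs = np.atleast_2d(x)[:, None, :] - np.asarray(centers)[None, :, :]
    return (diffs ** 2).sum(axis=2).min(axis=1)

def d2_sampling_init(points, k, rng):
    """Draw each new center with probability proportional to its squared
    distance from the nearest center chosen so far."""
    centers = [points[rng.integers(len(points))]]  # first center: uniform
    for _ in range(k - 1):
        d2 = sq_dist_to_nearest(points, centers)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centers)

def mh_init(points, k, chain_length, rng):
    """Approximate D2 sampling with one Metropolis-Hastings chain per center.
    Assumption: candidates are proposed uniformly at random from the data."""
    centers = [points[rng.integers(len(points))]]  # first center: uniform
    for _ in range(k - 1):
        x = points[rng.integers(len(points))]
        dx = sq_dist_to_nearest(x, centers)[0]
        for _step in range(chain_length - 1):
            y = points[rng.integers(len(points))]
            dy = sq_dist_to_nearest(y, centers)[0]
            # Accept y with probability min(1, dy / dx); always accept if dx == 0.
            if dx == 0 or dy / dx > rng.random():
                x, dx = y, dy
        centers.append(x)  # the final state of the chain becomes the new center
    return np.array(centers)
```

In this sketch, D2 sampling computes distances for every point each time a center is added, while the Metropolis-Hastings variant only evaluates `chain_length` candidates per center. A longer chain follows the D2 distribution more closely at a higher cost.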
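As a rough illustration of the metric, the sum of squared distances could be computed as below. The name `kmeans_cost` is hypothetical, and the sketch assumes each sample belongs to its nearest centroid, which is how Lloyd's algorithm assigns samples.

```python
import numpy as np

def kmeans_cost(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Pairwise squared distances, shape (num_points, num_centroids).
    sq_dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dists.min(axis=1).sum())

# Three points, two centroids: squared distances are 1, 1, and 0.
print(kmeans_cost([[0, 0], [0, 2], [10, 0]], [[0, 1], [10, 0]]))  # 2.0
```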
To run the code, follow these steps:
- Clone this repository locally
- Change directory to the cloned repository
- Set the desired value of K by modifying the NUM_CLUSTERS variable in utils.py
- Run the Python script k_means.py
Once the script completes, a plot like the ones above will be generated.
Note: You might have to install the necessary dependencies like NumPy, Matplotlib, etc.
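One quick way to check which dependencies are missing before running the script is sketched below. The package list is an assumption based on the two libraries named in the note; the project may need others.

```python
import importlib.util

# Assumed package list; extend it if the script reports other missing modules.
required = ["numpy", "matplotlib"]

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed dependencies are installed.")
```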