This project is a C implementation of the k-means clustering algorithm, parallelized across multiple threads with OpenMP. It uses silhouette coefficients to find an optimal number of clusters.
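To illustrate how the assignment step of k-means lends itself to OpenMP, here is a minimal sketch. The function names (`sq_dist`, `assign_clusters`) and the row-major array layout are assumptions made for this example and are not taken from the project's source.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Squared Euclidean distance between two points of dimension dim. */
static double sq_dist(const double *a, const double *b, size_t dim)
{
    double sum = 0.0;
    for (size_t d = 0; d < dim; d++) {
        double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

/*
 * Hypothetical helper: assign each of the n points (row-major, n x dim)
 * to its nearest centroid (row-major, k x dim).
 * Returns the number of points whose label changed.
 */
long assign_clusters(const double *points, long n, size_t dim,
                     const double *centroids, int k, int *labels)
{
    assert(n >= 0 && k > 0 && dim > 0);
    long changed = 0;

    /* Each iteration writes only labels[i], so iterations are independent;
       the reduction combines the per-thread change counters. */
    #pragma omp parallel for reduction(+:changed)
    for (long i = 0; i < n; i++) {
        int best = 0;
        double best_dist = DBL_MAX;
        for (int c = 0; c < k; c++) {
            double dist = sq_dist(&points[i * dim], &centroids[c * dim], dim);
            if (dist < best_dist) {
                best_dist = dist;
                best = c;
            }
        }
        if (labels[i] != best) {
            labels[i] = best;
            changed++;
        }
    }
    return changed;
}
```

With GCC, the pragma takes effect when compiling with `-fopenmp`; without that flag the loop simply runs serially. A return value of 0 means no point moved, which is a common convergence test.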
The algorithm first attempts to identify an optimal number of clusters, k, using silhouette coefficients averaged over the folds of a k-fold cross-validation. The dataset is parsed from a file and split into training and testing sets using k-fold cross-validation. Once silhouette coefficients have been calculated for a range of k values, a target k is selected and the centroids are calculated on the entire dataset.
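For reference, the silhouette coefficient of a point is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below computes the mean silhouette over a labeled dataset. The name `mean_silhouette`, the row-major layout, and the -2.0 error sentinel are assumptions for this example. It uses Manhattan distance so that it builds without linking the math library; the project itself may well use Euclidean distance.

```c
#include <assert.h>
#include <float.h>
#include <stdlib.h>

/* Manhattan (L1) distance between two points of dimension dim. */
static double l1_dist(const double *a, const double *b, size_t dim)
{
    double sum = 0.0;
    for (size_t d = 0; d < dim; d++) {
        double diff = a[d] - b[d];
        sum += diff < 0.0 ? -diff : diff;
    }
    return sum;
}

/*
 * Hypothetical helper: mean silhouette coefficient of n points (row-major,
 * n x dim) whose labels lie in [0, k). Points in singleton clusters
 * contribute 0. Returns -2.0 (outside the valid range [-1, 1]) if memory
 * allocation fails.
 */
double mean_silhouette(const double *points, size_t n, size_t dim,
                       const int *labels, int k)
{
    assert(points && labels && n > 0 && dim > 0 && k > 1);
    double *dist_sum = malloc((size_t)k * sizeof *dist_sum);
    size_t *count = malloc((size_t)k * sizeof *count);
    if (dist_sum == NULL || count == NULL) {
        free(dist_sum);
        free(count);
        return -2.0;
    }

    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        for (int c = 0; c < k; c++) {
            dist_sum[c] = 0.0;
            count[c] = 0;
        }
        /* Accumulate distances from point i to every other point, per cluster. */
        for (size_t j = 0; j < n; j++) {
            if (j == i)
                continue;
            dist_sum[labels[j]] += l1_dist(&points[i * dim], &points[j * dim], dim);
            count[labels[j]]++;
        }
        int own = labels[i];
        if (count[own] == 0)
            continue;                       /* singleton cluster: s(i) = 0 */
        double a = dist_sum[own] / (double)count[own];
        double b = DBL_MAX;
        for (int c = 0; c < k; c++) {
            if (c != own && count[c] > 0) {
                double mean = dist_sum[c] / (double)count[c];
                if (mean < b)
                    b = mean;
            }
        }
        if (b == DBL_MAX)
            continue;                       /* no other non-empty cluster */
        double denom = a > b ? a : b;
        if (denom > 0.0)
            total += (b - a) / denom;
    }
    free(dist_sum);
    free(count);
    return total / (double)n;
}
```

Values near 1 indicate compact, well-separated clusters, so comparing this score across candidate k values gives a way to pick the target k.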
This implementation can handle datasets of arbitrary dimension and length. The expected input format is comma-separated, but the delimiter can be changed with the '-d' flag. For an example dataset, see data/iris.csv.
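As a rough illustration of parsing one row with a configurable delimiter, here is a sketch built on the standard `strtok` and `strtod` functions. The name `parse_row` and its return conventions are hypothetical and not taken from the project's parser.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: parse up to max_vals numeric fields from line,
 * splitting on any character in delim. The line is modified in place.
 * Returns the number of values parsed, or -1 if a field is not a number.
 * Note: strtok treats consecutive delimiters as one, so empty fields
 * are skipped rather than reported.
 */
int parse_row(char *line, const char *delim, double *out, int max_vals)
{
    assert(line && delim && out && max_vals > 0);
    int count = 0;

    line[strcspn(line, "\r\n")] = '\0';   /* strip the trailing newline */

    for (char *tok = strtok(line, delim);
         tok != NULL && count < max_vals;
         tok = strtok(NULL, delim)) {
        char *end;
        double value = strtod(tok, &end);
        if (end == tok)
            return -1;                    /* no digits consumed */
        while (*end == ' ' || *end == '\t')
            end++;
        if (*end != '\0')
            return -1;                    /* trailing junk after the number */
        out[count++] = value;
    }
    return count;
}
```

Because the dimension is just the number of fields found on a row, the same routine works for datasets of any width, up to the caller's buffer size.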
Two output files are generated in the directory of the binary. The first, 'output_clusters.csv', is the dataset with an additional column indicating which cluster each point belongs to. The second, 'output_centroids.csv', contains the coordinates of the centroids.
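A row of the clusters file could be produced along the following lines. This is only a sketch: the name `format_cluster_row`, the placement of the cluster column last, the integer cluster index, and the `%g` number format are all assumptions, and the program's real output format may differ.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format one point followed by its cluster index as a
 * comma-separated row in buf. Returns the number of characters written
 * (excluding the terminating NUL), or -1 if buf is too small.
 */
int format_cluster_row(char *buf, size_t size, const double *point,
                       size_t dim, int label)
{
    assert(buf && point && size > 0);
    size_t used = 0;

    for (size_t d = 0; d < dim; d++) {
        int w = snprintf(buf + used, size - used, "%g,", point[d]);
        if (w < 0 || (size_t)w >= size - used)
            return -1;                    /* error or truncated output */
        used += (size_t)w;
    }
    int w = snprintf(buf + used, size - used, "%d\n", label);
    if (w < 0 || (size_t)w >= size - used)
        return -1;
    return (int)(used + (size_t)w);
}
```

Each formatted row can then be written with `fputs`. Note that `%g` prints six significant digits; `%.17g` is one option when values must round-trip exactly.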
-i [filepath]
- Input filename/path.
- Default is 'input.csv'.

- Delimiter used when parsing the input dataset file, set with the '-d' flag.
- Default delimiter is ",".

- Specify the number of clusters to identify, k. If you already know how many clusters should be identified, pass this option to bypass silhouette analysis.
- Must be a positive integer.

- Specify the minimum number of clusters to analyze during silhouette analysis.
- Must be a positive integer.
- Default is 2.

- Specify the maximum number of clusters to analyze during silhouette analysis.
- Must be a positive integer.
- Default is 10.

- Maximum number of iterations allowed in each k-means run.
- Must be a positive integer.
- Default is 100.

- Number of k-means runs executed in parallel.
- Must be a positive integer.
- Default is 100.

- Number of folds for cross-validation.
- Must be a positive integer.
- Default is 5.

- Number of threads to spread the workload across.
- Must be a positive integer.
- By default, all available threads are used.

- Randomize the dataset order. It is important that the dataset is randomized for cross-validation.

- Normalize the dataset. This is a good idea if the dataset is not already normalized.
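To show why randomizing matters for the cross-validation folds, the sketch below shuffles an index array with a Fisher-Yates shuffle and then computes the contiguous range held out for each fold. If the file were sorted by class, unshuffled contiguous folds would each contain mostly one class. The names `shuffle_indices` and `fold_range` are hypothetical, and the project may shuffle and split differently.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper: Fisher-Yates shuffle of an index array using rand().
 * Call srand() beforehand to vary the order between runs.
 */
void shuffle_indices(size_t *idx, size_t n)
{
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;    /* slight modulo bias, fine for a sketch */
        size_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/*
 * Hypothetical helper: half-open range [*begin, *end) of positions held out
 * for testing in fold f of k. The remaining positions form the training set.
 * The ranges of folds 0..k-1 cover all n positions exactly once.
 */
void fold_range(size_t n, size_t k, size_t f, size_t *begin, size_t *end)
{
    assert(k > 0 && f < k && begin && end);
    *begin = f * n / k;
    *end = (f + 1) * n / k;
}
```

For 10 points and 3 folds, the held-out ranges are [0, 3), [3, 6), and [6, 10), applied to the shuffled index array rather than to the original file order.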
Linux:
- Clone the repository
    git clone https://github.com/lmarzen/k-means-clustering.git
  or download and extract the ZIP.
- Open a terminal (or command prompt on Windows) in the src directory and run
    make
  to build the program.
- Run the program by typing
    ./kmeans
  followed by any valid arguments.
- Done.