This repository contains a C++ implementation of the K-Means clustering algorithm parallelized with OpenMP. K-Means is a popular unsupervised machine learning algorithm that groups data points into a predefined number of clusters. OpenMP spreads the per-point work across threads, which can speed up clustering considerably on multi-core processors.
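To illustrate where OpenMP fits, the assignment step of K-Means (finding the nearest centroid for each point) can be parallelized with a single `parallel for`, since every point is handled independently. The following is a minimal sketch under assumptions and is not taken from `main.cpp`: the function name `assign_clusters` and the flat row-major data layout are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical layout: `points` and `centroids` are row-major buffers
// holding `dim` features per row.
std::vector<int> assign_clusters(const std::vector<double>& points,
                                 const std::vector<double>& centroids,
                                 std::size_t dim) {
    assert(dim > 0 && points.size() % dim == 0 && centroids.size() % dim == 0);
    const std::size_t n = points.size() / dim;
    const std::size_t k = centroids.size() / dim;
    std::vector<int> labels(n, -1);

    // Each iteration writes only labels[i], so iterations are independent.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < static_cast<long>(n); ++i) {
        const std::size_t row = static_cast<std::size_t>(i);
        double best = std::numeric_limits<double>::max();
        for (std::size_t c = 0; c < k; ++c) {
            double d2 = 0.0;  // squared Euclidean distance to centroid c
            for (std::size_t j = 0; j < dim; ++j) {
                const double diff = points[row * dim + j] - centroids[c * dim + j];
                d2 += diff * diff;
            }
            if (d2 < best) {
                best = d2;
                labels[row] = static_cast<int>(c);
            }
        }
    }
    return labels;
}
```

The centroid update step needs more care because several threads would accumulate into the same centroid; per-thread partial sums or a reduction are common ways to handle that.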
This implementation requires the following dependencies:
- C++ compiler with OpenMP support (e.g., g++)
- The OpenMP library
- Clone this repository to your local machine.
- Navigate to the repository's directory.
- Run `g++ main.cpp -o kmean -fopenmp -march=native`
The program takes a CSV file as input, with one observation per row and the features separated by commas. Some sample inputs generated with `dataset_gen.py` are provided.
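A reader for this input format could look like the sketch below. It is illustrative only: the helper name `read_csv` is hypothetical, and it assumes the file has no header row and contains only numeric values.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads comma-separated numeric rows into a flat,
// row-major buffer and stores the number of features per row in `dim`.
std::vector<double> read_csv(std::istream& in, std::size_t& dim) {
    std::vector<double> data;
    std::string line;
    dim = 0;
    while (std::getline(in, line)) {
        if (line.empty()) continue;  // tolerate blank lines
        std::stringstream row(line);
        std::string cell;
        std::size_t count = 0;
        while (std::getline(row, cell, ',')) {
            data.push_back(std::stod(cell));  // throws on non-numeric text
            ++count;
        }
        if (dim == 0) {
            dim = count;  // the first row fixes the feature count
        } else if (count != dim) {
            throw std::runtime_error("inconsistent number of features in row");
        }
    }
    assert(dim == 0 || data.size() % dim == 0);
    return data;
}
```

Passing a `std::ifstream` opened on the dataset path would read a file; a `std::istringstream` works the same way for quick experiments.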
To run, use `./kmean <dataset_file_name> <number_of_clusters> <thread_number> <algorithm_type>`, where:
- `<dataset_file_name>`: path to the dataset as a CSV file
- `<number_of_clusters>`: number of clusters
- `<thread_number>`: number of threads to use
- `<algorithm_type>`: `rand` for the random initializer or `pp` for the k-means++ initializer
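The four positional arguments could be validated along these lines. This is a sketch under assumptions, not the parsing code in `main.cpp`: the `Options` struct and `parse_args` function are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical container for the four positional arguments.
struct Options {
    std::string dataset;
    int clusters;
    int threads;
    bool use_kmeans_pp;  // true for "pp", false for "rand"
};

Options parse_args(int argc, const char* const argv[]) {
    assert(argv != nullptr);
    if (argc != 5) {
        throw std::invalid_argument(
            "usage: ./kmean <dataset_file_name> <number_of_clusters> "
            "<thread_number> <algorithm_type>");
    }
    Options opt;
    opt.dataset = argv[1];
    opt.clusters = std::stoi(argv[2]);  // throws on non-numeric text
    opt.threads = std::stoi(argv[3]);
    const std::string algo = argv[4];
    if (opt.clusters < 1 || opt.threads < 1) {
        throw std::invalid_argument("cluster and thread counts must be positive");
    }
    if (algo != "rand" && algo != "pp") {
        throw std::invalid_argument("algorithm_type must be rand or pp");
    }
    opt.use_kmeans_pp = (algo == "pp");
    return opt;
}
```

One common way to apply the requested thread count is OpenMP's `omp_set_num_threads`; whether this program does so is not stated here.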
For example, to cluster `100000_3_6.csv` into 6 clusters using 16 threads and random initialization:
`./kmean 100000_3_6.csv 6 16 rand`
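For readers unfamiliar with the `pp` option: k-means++ picks the first centroid uniformly at random, then picks each further centroid with probability proportional to its squared distance from the nearest centroid chosen so far. The sketch below shows that standard procedure in sequential form. It is not the repository's code; `kmeanspp_seed`, `sq_dist`, and the row-major layout are assumptions, and it expects at least `k` distinct points.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <random>
#include <vector>

// Squared Euclidean distance between rows a and b of a row-major buffer.
static double sq_dist(const std::vector<double>& p, std::size_t a,
                      std::size_t b, std::size_t dim) {
    double d2 = 0.0;
    for (std::size_t j = 0; j < dim; ++j) {
        const double diff = p[a * dim + j] - p[b * dim + j];
        d2 += diff * diff;
    }
    return d2;
}

// Returns the indices of k points to use as initial centroids.
std::vector<std::size_t> kmeanspp_seed(const std::vector<double>& points,
                                       std::size_t dim, std::size_t k,
                                       std::mt19937& rng) {
    assert(dim > 0 && points.size() % dim == 0);
    const std::size_t n = points.size() / dim;
    assert(k >= 1 && k <= n);

    std::vector<std::size_t> chosen;
    std::uniform_int_distribution<std::size_t> first(0, n - 1);
    chosen.push_back(first(rng));  // first centroid: uniform at random

    // d2[i] = squared distance from point i to its nearest chosen centroid.
    std::vector<double> d2(n, std::numeric_limits<double>::max());
    while (chosen.size() < k) {
        const std::size_t last = chosen.back();
        for (std::size_t i = 0; i < n; ++i) {
            const double d = sq_dist(points, i, last, dim);
            if (d < d2[i]) d2[i] = d;
        }
        // Sample the next centroid with probability proportional to d2[i];
        // already chosen points have weight 0.
        std::discrete_distribution<std::size_t> pick(d2.begin(), d2.end());
        chosen.push_back(pick(rng));
    }
    return chosen;
}
```

The distance update loop is independent per point, so it is a natural candidate for the same `parallel for` treatment as the main clustering loop.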
The `dataset_gen.py` script lets you create custom datasets and visualize the result of the K-Means algorithm. To use it, run `python3 dataset_gen.py` and follow the instructions.
The result of the K-Means algorithm is shown below:
Raw Data | After kmeans |
---|---|