
Parallel Final Project

Final Project for Introduction to Parallel Computing: Parallelization of K-Means clustering algorithm.

Disclaimer: the serial submodule

The serial/ directory is a submodule of Elia Zonta's KMeans project, which we used as the starting point for this one. However, we changed many things, so do not rely on it: consider the distributed/ folder as the root. In fact, all the files described below belong to that folder.

Files in this repository

In this repository you will find the following files and folders:

  • parallel_computing_final_report.pdf: the report that explains the objectives and realization of this project.
  • setup.sh: a bash script that creates the necessary folders and initializes the headers of the measure files.
  • params.sh: a bash script that exports the variables used by the other scripts. With it, you only need to modify a single file to change the behavior of many scripts.
  • Makefile: the makefile that contains the recipes to compile the source code.
  • plot_measures.m: a GNU-Octave script that converts the CSV files in the measures/ folder into PNG plots in the plots/ folder.
  • include/: the folder containing the headers.
    • point.h: the header for point.cpp.
    • utils.h: the header for utils.cpp.
    • k_means.h: the header for all five k_means files.
  • src/: the folder with the C++ source codes.
    • point.cpp: contains the definition of the Point class.
    • utils.cpp: contains functions to read/write the files.
    • create_dataset.cpp: creates a dataset in the data/ folder. To be as fair as possible in the time comparison, it also creates the random initial centroids as files that will be read later.
    • main_serial.cpp: the main code designed to call the serial k_means.
    • main_omp.cpp: the main code designed to call the OMP versions of k_means.
    • main_mpi.cpp: the main code designed to call the MPI versions of k_means.
    • k_means_serial.cpp: the K-Means algorithm in its serial fashion.
    • k_means_omp_static.cpp: the K-Means algorithm parallelized with OMP, using static thread scheduling (a minimal scheduling sketch is shown after this list).
    • k_means_omp_dynamic.cpp: the K-Means algorithm parallelized with OMP, using dynamic thread scheduling.
    • k_means_mpi.cpp: the K-Means algorithm adapted to MPI.
    • k_means_mpi_asynch.cpp: the K-Means algorithm in which we explore the effects of MPI asynchronous communication.
    • compare_results.cpp: to ensure the correctness of the parallel approaches, this program compares all their output files with the serial one, creating a file compare.txt in the out/ folder.
  • scripts/: a folder containing the PBS files to prepare the environment before the tests.
    • compile.pbs: this file calls the Makefile to compile the source code, leading to the creation of the obj/ and bin/ folders.
    • create_dataset.pbs: it invokes create_dataset.exe to create the data and some random initial centroids in the data/ folder.
    • compare_results.pbs: a script that calls compare_results.exe to compare the parallel outputs with the serial one and produce a comparison file; everything is written to the out/ folder.
  • run/: the folder containing the PBS files to run the experiment.
    • run_serial.pbs: runs k_means_serial.exe, the serial case.
    • run_omp_static_strong.pbs: runs k_means_omp_static.exe with an increasing number of threads and a fixed number of points.
    • run_omp_static_weak.pbs: runs k_means_omp_static.exe, increasing both the number of threads and the number of points.
    • run_omp_dynamic_strong.pbs: runs k_means_omp_dynamic.exe with an increasing number of threads and a fixed number of points.
    • run_omp_dynamic_weak.pbs: runs k_means_omp_dynamic.exe, increasing both the number of threads and the number of points.
    • run_mpi_strong.pbs: runs k_means_mpi.exe with an increasing number of processors and a fixed number of points.
    • run_mpi_weak.pbs: runs k_means_mpi.exe, increasing both the number of processors and the number of points.
    • run_mpi_asynch_strong.pbs: runs k_means_mpi_asynch.exe with an increasing number of processors and a fixed number of points.
    • run_mpi_asynch_weak.pbs: runs k_means_mpi_asynch.exe, increasing both the number of processors and the number of points.
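
For reference, the only difference between the two OMP variants is the scheduling policy of the parallel loop over the points. The sketch below is a minimal illustration of that idea, not the actual code in src/: the Point struct, the helper names, and the squared-distance shortcut are assumptions made for the example.

```cpp
// Minimal sketch of the K-Means assignment step with OpenMP.
// NOT the code in src/: Point, distance2 and assign_points are assumed names.
#include <omp.h>
#include <vector>
#include <limits>

struct Point { double x, y; int cluster; };

static double distance2(const Point& a, const Point& b) {
    double dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy; // squared Euclidean distance is enough for the argmin
}

void assign_points(std::vector<Point>& points, const std::vector<Point>& centroids) {
    // schedule(static) splits the iterations into equal chunks up front;
    // the dynamic variant hands out chunks to threads at run time instead.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)points.size(); ++i) {
        double best = std::numeric_limits<double>::max();
        int best_k = 0;
        for (int k = 0; k < (int)centroids.size(); ++k) {
            double d = distance2(points[i], centroids[k]);
            if (d < best) { best = d; best_k = k; }
        }
        points[i].cluster = best_k;
    }
}
```

The dynamic version would only swap schedule(static) for schedule(dynamic), optionally with a chunk size; this is the difference the strong and weak scaling runs compare. The MPI versions follow the same overall structure with the points split among processes, and k_means_mpi_asynch.cpp relies on asynchronous (non-blocking) communication; see the report for the details of the communication pattern.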

The execution of the programs will lead to the creation and filling of the following folders:

  • obj/: intermediate folder containing object files.
  • bin/: the folder that contains the executables.
  • logs/: this folder stores the output (.o) and error (.e) files generated by PBS.
  • data/: a folder that stores the dataset as well as the initial random centroids.
  • out/: here is where the programs put their results.
  • measures/: the place where time measurements are stored.
  • plots/: this folder holds the plots obtained from the measures.

The results/ folder contains a copy of the folders listed above, as generated by our experiments on the UniTN HPC cluster. This way, you can run your own tests while keeping our results for comparison.

How to run the experiment

Running this experiment is very simple thanks to the scripts, which automate most of the steps.

On the UniTN HPC cluster, follow these steps:

  1. Run bash setup.sh; it will create the folders required for the next steps.
  2. If you wish to modify some hyper-parameters, edit params.sh.
  3. Submit the compilation task to the cluster: qsub scripts/compile.pbs.
  4. Create the dataset and the initial centroids, again via PBS: qsub scripts/create_dataset.pbs.
  5. Now you are ready to launch all the tests you want with qsub run/run_<filename>.pbs. The scripts in the run/ folder are independent, so you can run them in any order you wish. This will create files in the out/ and measures/ folders.
  6. To check the correctness of the parallel outputs, run qsub scripts/compare_results.pbs; it will create the file out/compare.txt.
  7. On a machine with GNU-Octave installed, you can run octave plot_measures.m to create the plots. Alternatively, you can use any plotting tool you want; in that case, just have a look at the CSV headers to understand the rationale behind them.
  8. If you want to clean your workspace, you have two options: make clean will delete all the objects, executables and logs, but it will preserve data, outputs, measurements and plots. To delete those as well, the command is make clean_everything.

Note: if you run a new simulation without cleaning and setting up the workspace again, the outputs will be overwritten, but the measures will be appended to, which may make them un-plottable.

For any additional information, read docs/report_parallel.pdf and docs/parallel_presentation.pdf.

Alex Pegoraro, Elia Zonta
