Implementation of the K-Means clustering algorithm on Hadoop

  1. Project Description
  2. Installing & Configuring Hadoop
  3. Running K-Means on Hadoop
  4. Results
  5. Team
  6. External Resources

Project Description

The aim of this project is to implement the k-means clustering algorithm on Hadoop, using synthetic data as a sample. The project was carried out for the course "Big Data Management Systems" taught by Prof. Damianos Chatziantoniou. A detailed description of the assignment can be found here.

Installing & Configuring Hadoop


1. We assume that Python3 is already installed on the system.

2. Install Hadoop on Ubuntu by following the guide "How to install Hadoop on Ubuntu 18.04 Bionic Beaver Linux".

3. Install the required Python packages listed in requirements.txt:

$ pip install -r requirements.txt

Running K-Means on Hadoop

1. Clone this repository:

$ git clone https://github.com/ChryssaNab/BDMS-AUEB.git
$ cd BDMS-AUEB/kmeans_mapreduce/src/

2. Run generateDataset.py to create the input data points:

$ python3 generateDataset.py

  • In this example, the initial centers are (-100000, -100000), (1, 1), and (100000, 100000).
  • The remaining data points are drawn around these centers from a normal distribution with a standard deviation of 5.0.
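
The generation step described above can be sketched as follows. This is a minimal illustration and not the contents of generateDataset.py: the function names, the number of points per center, the fixed seed, and the choice to write the centers themselves as the first rows are all assumptions made for the example.

```python
import csv
import random

# Centers quoted in the README.
CENTERS = [(-100000, -100000), (1, 1), (100000, 100000)]


def generate_points(centers, points_per_center, std=5.0, seed=0):
    """Return the centers followed by Gaussian points scattered around each one.

    points_per_center and seed are hypothetical parameters, not taken from the project.
    """
    rng = random.Random(seed)
    points = list(centers)
    for cx, cy in centers:
        for _ in range(points_per_center):
            # Each coordinate is drawn independently from N(center, std^2).
            points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
    return points


def write_csv(points, path):
    """Write one 'x,y' row per point."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(points)
```

Calling write_csv(generate_points(CENTERS, 1000), "data-points.csv") would produce a file with the same name as the one uploaded in the next step, although the actual row format expected by the project may differ.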

3. Upload the data to HDFS:

$ hdfs dfs -mkdir /kmeans
$ hdfs dfs -put $HADOOP_HOME/localFilePath/data-points.csv /kmeans

4. Run kMeansRunner.py to deploy k-means on Hadoop:

$ python3 kMeansRunner.py

The output of the MapReduce job is also written to HDFS, in a file named part-00000.
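
For readers unfamiliar with how k-means maps onto MapReduce, the sketch below simulates the usual formulation in plain Python: the map step assigns each point to its nearest centroid, and the reduce step averages the points of each cluster. It is an assumption about the general approach, not the project's mapper, reducer, or kMeansRunner.py; all function names and the stopping rule are hypothetical.

```python
from collections import defaultdict
from math import dist


def map_point(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, point


def reduce_cluster(points):
    """Reduce step: the new centroid is the mean of the assigned points."""
    count = len(points)
    return (sum(p[0] for p in points) / count, sum(p[1] for p in points) / count)


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    groups = defaultdict(list)
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)  # stands in for Hadoop's shuffle and sort
    # A centroid that received no points keeps its previous position.
    return [
        reduce_cluster(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iterations=20, tolerance=1e-6):
    """Repeat rounds until no centroid moves more than tolerance."""
    for _ in range(max_iterations):
        updated = kmeans_iteration(points, centroids)
        shift = max(dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift <= tolerance:
            break
    return centroids
```

On a cluster, each round would be a separate MapReduce job, with a driver script feeding the centroids from one round into the next.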