Implementation of the K-Means clustering algorithm on Hadoop

The aim of this project is to implement the k-means clustering algorithm on Hadoop, using synthetic data as a sample. The project was carried out for the course "Big Data Management Systems" taught by Prof. Damianos Chatziantoniou. A detailed description of the assignment can be found here.

Installing & Configuring Hadoop

1. We assume that Python3 is already installed on the system.

2. Install Hadoop on Ubuntu according to the following website: How to install Hadoop on Ubuntu 18.04 Bionic Beaver Linux.

3. Install the required Python packages:

$ pip install -r requirements.txt

Running K-Means on Hadoop

1. Clone this repository:

$ git clone https://github.com/ChryssaNab/BDMS-AUEB.git
$ cd BDMS-AUEB/kmeans_mapreduce/src/

2. Run generateDataset.py to create the input data points:

$ python3 generateDataset.py

In this example, the initial centers are (-100000, -100000), (1, 1), and (100000, 100000).
The remaining data points are generated around these centers, following a normal distribution with a standard deviation of 5.0.
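
As a rough illustration of this generation scheme, the following is a minimal sketch and not the actual contents of generateDataset.py. The centers and the standard deviation are taken from the description above; the function names, the number of points per center, the seed, and the two-column CSV layout are assumptions made for the example.

```python
import csv
import random

# Centers and standard deviation as quoted in this README.
CENTERS = [(-100000, -100000), (1, 1), (100000, 100000)]
STD_DEV = 5.0


def generate_points(points_per_center, seed=0):
    """Draw 2-D points from a normal distribution around each center.

    points_per_center and seed are hypothetical parameters.
    """
    rng = random.Random(seed)
    points = []
    for cx, cy in CENTERS:
        for _ in range(points_per_center):
            points.append((rng.gauss(cx, STD_DEV), rng.gauss(cy, STD_DEV)))
    # Shuffle so that points of the same cluster are not contiguous in the file.
    rng.shuffle(points)
    return points


def write_csv(points, path):
    """Write one "x,y" row per point (assumed file layout)."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(points)
```

Calling write_csv(generate_points(1000), "data-points.csv") would then produce a file with 3000 rows, suitable for the upload step that follows.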

3. Upload the data to HDFS:

$ hdfs dfs -mkdir /kmeans
$ hdfs dfs -put $HADOOP_HOME/localFilePath/data-points.csv /kmeans

4. Run kMeansRunner.py to deploy k-means on Hadoop:

$ python3 kMeansRunner.py
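
For readers unfamiliar with how k-means maps onto MapReduce, the sketch below shows the general idea in plain, in-memory Python. It is not the code in kMeansRunner.py and does not call Hadoop. The names mapper, reducer, kmeans_iteration, and kmeans, as well as the max_iterations and tolerance parameters, are hypothetical. The map step assigns each point to its nearest center, the reduce step averages the points assigned to each center, and the driver repeats both steps until the centers stop moving.

```python
from collections import defaultdict


def mapper(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    distances = [(point[0] - cx) ** 2 + (point[1] - cy) ** 2 for cx, cy in centers]
    return distances.index(min(distances)), point


def reducer(assigned_points):
    """Reduce step: the new center is the mean of the assigned points."""
    n = len(assigned_points)
    return (sum(p[0] for p in assigned_points) / n,
            sum(p[1] for p in assigned_points) / n)


def kmeans_iteration(points, centers):
    """Run one map/reduce round and return the updated centers."""
    groups = defaultdict(list)
    for point in points:
        index, value = mapper(point, centers)
        groups[index].append(value)
    # A center that received no points keeps its previous position.
    return [reducer(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))]


def kmeans(points, centers, max_iterations=20, tolerance=1e-6):
    """Driver loop: iterate until no center moves more than tolerance."""
    for _ in range(max_iterations):
        new_centers = kmeans_iteration(points, centers)
        shift = max(abs(nx - ox) + abs(ny - oy)
                    for (nx, ny), (ox, oy) in zip(new_centers, centers))
        centers = new_centers
        if shift <= tolerance:
            break
    return centers
```

On Hadoop, the map and reduce steps would run as separate distributed tasks and the driver would resubmit the job for each iteration; here they are ordinary function calls so that the logic can be followed on a single machine.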

Results

The output of the MapReduce process is stored on HDFS as well, in a file named part-00000.
