AdCluster Insights: KMeans Clustering for Advertising Data

This project uses Java with the Hadoop MapReduce framework to perform KMeans clustering on advertising performance data, categorizing keyphrases by performance metrics such as bid amount, impressions, clicks, and ad rank. The goal is to uncover underlying patterns in advertising strategies, offering insights that could guide advertisers toward more effective approaches.
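
At its core, each MapReduce pass assigns every record to its nearest centroid by Euclidean distance over the four features, then recomputes each centroid as the mean of its assigned points. Below is a minimal sketch of the assignment step in plain Java; the class and method names are illustrative, not the repository's actual code:

    // Illustrative sketch: nearest-centroid assignment over the four
    // clustering features (bid, impressions, clicks, rank).
    public final class NearestCentroid {

        // Squared Euclidean distance; taking the square root is
        // unnecessary when distances are only being compared.
        static double squaredDistance(double[] point, double[] centroid) {
            double sum = 0.0;
            for (int i = 0; i < point.length; i++) {
                double d = point[i] - centroid[i];
                sum += d * d;
            }
            return sum;
        }

        // Returns the index of the closest centroid for the given point.
        static int assign(double[] point, double[][] centroids) {
            int best = 0;
            double bestDistance = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double distance = squaredDistance(point, centroids[c]);
                if (distance < bestDistance) {
                    bestDistance = distance;
                    best = c;
                }
            }
            return best;
        }
    }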


Input Format

The input dataset should be in a tab-separated format with the following fields:

  • Day of the data record
  • Anonymized account ID of the advertiser
  • Rank of the advertisement
  • Anonymized keyphrase (a list of anonymized keywords)
  • Average bid for the keyphrase
  • Number of impressions (times the ad was shown)
  • Number of clicks (times users interacted with the ad)
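
A record in this format can be reduced to its numeric features with a simple tab split, as in the sketch below (field positions follow the list above; the helper name is hypothetical):

    // Parses one tab-separated record into the four features used for
    // clustering: bid, impressions, clicks, rank. Field positions follow
    // the input format above; the method name is illustrative.
    static double[] parseRecord(String line) {
        String[] f = line.split("\t");
        double rank        = Double.parseDouble(f[2]); // rank of the ad
        double bid         = Double.parseDouble(f[4]); // average bid
        double impressions = Double.parseDouble(f[5]); // impressions
        double clicks      = Double.parseDouble(f[6]); // clicks
        return new double[] { bid, impressions, clicks, rank };
    }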

Initial Centroids File Format

The initial centroids file should contain one centroid per line, with values comma-separated:

bid,impressions,clicks,rank
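
In a MapReduce job, each mapper typically loads this small file once before processing its split (for example in Mapper.setup(), with the file shipped via the distributed cache). A minimal parsing sketch, assuming the file is readable from the local filesystem and using illustrative names:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    // Reads initial centroids, one "bid,impressions,clicks,rank" line each.
    final class CentroidLoader {
        static List<double[]> load(Path file) throws IOException {
            List<double[]> centroids = new ArrayList<>();
            for (String line : Files.readAllLines(file)) {
                String[] parts = line.trim().split(",");
                double[] c = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    c[i] = Double.parseDouble(parts[i]);
                }
                centroids.add(c);
            }
            return centroids;
        }
    }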

Output

The MapReduce job outputs recalculated centroid values after processing the dataset. Each line in the output file represents a centroid with its updated values, formatted as follows:

centroid_id    avg_bid,impressions,clicks,rank
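
Conceptually, the reduce step produces each updated centroid by averaging all points assigned to it. Here is a hedged sketch against the Hadoop API; the class name and the comma-separated per-point value encoding are assumptions, not necessarily the repository's exact code:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Averages all points assigned to one centroid and emits the updated
    // centroid as "centroid_id <TAB> bid,impressions,clicks,rank".
    // The per-point value encoding is an assumption about the mapper output.
    public class CentroidReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable centroidId, Iterable<Text> points, Context context)
                throws IOException, InterruptedException {
            double[] sums = new double[4];
            long count = 0;
            for (Text point : points) {
                String[] dims = point.toString().split(",");
                for (int i = 0; i < sums.length; i++) {
                    sums[i] += Double.parseDouble(dims[i]);
                }
                count++;
            }
            StringBuilder updated = new StringBuilder();
            for (int i = 0; i < sums.length; i++) {
                if (i > 0) updated.append(',');
                updated.append(sums[i] / count);
            }
            context.write(centroidId, new Text(updated.toString()));
        }
    }

A driver would typically rerun the job, feeding each iteration's output back in as the next iteration's centroids file, until the centroids stop moving.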

Table of Contents

  • Prerequisites
  • Local Installation
  • Docker Deployment
  • Running the Application
  • AWS Deployment
  • Cleanup
  • Contributing
  • Acknowledgments
  • Contact
  • License

Prerequisites

Before you begin, ensure you have the following installed:

  • Java JDK 11
  • Apache Maven
  • Apache Hadoop 3.x
  • Apache Spark 3.x
  • Docker (optional for Docker deployment)
  • AWS CLI (configured for AWS deployment)

Local Installation

macOS and Ubuntu

Step 1: Install Common Utilities and Packages

  • macOS:
    brew install wget curl vim make
    # tzdata is generally not required for macOS, as timezone handling is built into the OS
  • Ubuntu:
    sudo apt-get update
    sudo apt-get install -y --no-install-recommends apt-utils wget curl vim make
    sudo apt-get install -y tzdata

Step 2: Install Java JDK 11

  • macOS:
    brew install openjdk@11
  • Ubuntu:
    sudo apt update
    sudo apt install -y openjdk-11-jdk

Step 3: Set JAVA_HOME Environment Variable

  • Add to your .bashrc or .zshrc file:
    # Ubuntu (GNU readlink):
    export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
    # macOS (Homebrew):
    export JAVA_HOME=/usr/local/opt/openjdk@11

Step 4: Install Maven

  • macOS and Ubuntu:
    brew install maven # macOS
    sudo apt install maven # Ubuntu

Step 5: Install AWS CLI

  • macOS:
    brew install awscli
  • Ubuntu:
    sudo apt-get install -y awscli

Step 6: Install Scala using Coursier (for future implementation)

  • Common for both OS:
    curl -fLo cs https://github.com/coursier/coursier/releases/latest/download/cs-x86_64-pc-linux.gz
    # On macOS, download the cs-x86_64-apple-darwin.gz artifact instead
    gunzip cs
    chmod +x cs
    ./cs setup -y
    ./cs install scala:2.12.17 scalac:2.12.17

Step 7: Install Hadoop

  • Common for both OS:
    wget https://downloads.apache.org/hadoop/common/hadoop-3.3.5/hadoop-3.3.5.tar.gz
    sudo tar -xzf hadoop-3.3.5.tar.gz -C /usr/local
    sudo mv /usr/local/hadoop-3.3.5 /usr/local/hadoop

Step 8: Install Spark (for Spark code to be added in the future)

  • Common for both OS:
      wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-without-hadoop.tgz
      sudo tar -xzf spark-3.3.2-bin-without-hadoop.tgz -C /usr/local
      sudo mv /usr/local/spark-3.3.2-bin-without-hadoop /usr/local/spark

Step 9: Set Environment Variables

  • Add the following lines to your shell configuration file (.bashrc, .zshrc, etc.):
      export HADOOP_HOME=/usr/local/hadoop
      export SPARK_HOME=/usr/local/spark
      export SCALA_HOME=$HOME/.local/share/coursier/bin
      export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin

Docker Deployment

Building the Docker Image

  • For ARM64 architecture:
      docker build -t kmeans-project .
  • For AMD64 architecture (adjust Dockerfile as needed):
      docker build -f DockerfileAMD -t kmeans-project .

Running the Container

  docker run -it --name kmeans-container kmeans-project

Accessing the Container

  docker exec -it kmeans-container bash

Running the Application

Using Makefile

Ensure your Makefile is properly set up to handle tasks from compilation to cleanup:

  • Compile the project:
    make jar
  • Run KMeans locally or within Docker:
    make run-kmeans
  • Clean up generated output files:
    make clean-local-output

AWS Deployment

Setup and Configuration

  • Configure your AWS CLI and ensure your credentials are set up:
    # Add your AWS credentials in the following locations:
    ~/.aws/config
    ~/.aws/credentials

Launch and Manage EMR Cluster

  • Create a bucket on S3:
    make make-bucket
  • Upload the dataset to S3 Bucket:
    make upload-input-aws
  • Upload the app jar to S3 Bucket:
    make upload-app-aws
  • Deploy the application on AWS EMR:
    make aws
  • Download results from AWS S3 after execution:
    make download-output-aws

Cleanup

Local and AWS Resource Management

  • Local cleanup:
    make clean-local-output
  • AWS cleanup (to avoid unnecessary charges):
    make delete-output-aws
    aws emr terminate-clusters --cluster-ids <cluster-id>

Contributing

Contributions to enhance the project are welcome; please create a branch for your changes.

Acknowledgments

  • Yahoo! for providing the Search Marketing Advertiser Bid-Impression-Click dataset: A4 - Yahoo Data Targeting User Modeling, Version 1.0 (hosted on AWS, 3.7 GB).
  • Apache Hadoop and Apache Maven communities for their open-source software.

Contact

Parag Ghorpade - GitHub Profile

Feel free to reach out for any questions or contributions to the project.

License

Distributed under the MIT License. See LICENSE for more information.