
Explorations of k-means clustering for Big Data, featuring sequential, streaming, and distributed implementations tailored for scalability and efficiency.


AbirOumghar/BigData-KMeans-Explorations


BigData-KMeans-Explorations 📊

Project Overview 🌟

This repository is part of my coursework in the M2 MLDS/AMSD master's program in Machine Learning for Data Science. It explores the k-means clustering algorithm under three processing paradigms: sequential, streaming, and distributed. The implementations reflect the challenges encountered in large-scale data analysis and the solutions applied to them, in keeping with the hands-on, problem-solving approach of the program.

Implementation Highlights

  • Sequential k-means in Python: Generates a synthetic dataset and implements k-means to cluster its data points with a minimal memory footprint.
  • Streaming k-means in Python: Adapts k-means for data streams, enabling dynamic cluster updates without the need to reprocess the entire dataset when new data arrives.
  • Distributed k-means with Apache Beam: Scales k-means to datasets too large to fit in a single machine's memory, using Apache Beam for parallel processing.
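
To make the first two paradigms concrete, here is a minimal sketch in pure Python. It is not the code from this repository: the names `sequential_kmeans`, `StreamingKMeans`, and `nearest` are hypothetical, and the streaming part assumes a simple running-mean update (each centroid moves toward a new point by 1/count), which is one common way to update clusters without reprocessing earlier data.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (ties go to the lowest index)."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def sequential_kmeans(points, k, iterations=20, seed=0):
    """Batch k-means (Lloyd's algorithm) over an in-memory list of points.

    Only per-cluster sums and counts are kept during each pass, so no
    per-point assignment array is stored.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # assumed init: k random points
    dim = len(points[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for p in points:
            i = nearest(p, centroids)
            counts[i] += 1
            for d, value in enumerate(p):
                sums[i][d] += value
        for i in range(k):
            if counts[i]:  # an empty cluster keeps its previous centroid
                centroids[i] = [s / counts[i] for s in sums[i]]
    return centroids


class StreamingKMeans:
    """Online k-means: each arriving point updates exactly one centroid."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [0] * len(self.centroids)

    def update(self, point):
        """Assign `point` to its nearest centroid and shift that centroid.

        The centroid becomes the running mean of the points assigned to it;
        the initial centroid carries no weight once a first point arrives.
        """
        i = nearest(point, self.centroids)
        self.counts[i] += 1
        rate = 1.0 / self.counts[i]
        centroid = self.centroids[i]
        for d, value in enumerate(point):
            centroid[d] += rate * (value - centroid[d])
        return i
```

For example, `sequential_kmeans(points, k=2)` returns two centroids as lists of floats, while `StreamingKMeans(initial_centroids).update(point)` returns the index of the cluster the point joined. The batch version rescans all points on every iteration, whereas the streaming version touches each point once.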

Table of Contents

  1. Installation
  2. Usage
  3. Contributing
  4. License

Installation

To set up the project locally, clone the repository and install its dependencies:

git clone https://github.com/AbirOumghar/BigData-KMeans-Explorations.git
cd BigData-KMeans-Explorations
pip install -r requirements.txt
