Project Overview 🌟
This repository is part of my coursework for the M2 MLDS/AMSD master's program in Machine Learning for Data Science. It explores the k-means clustering algorithm under three processing paradigms: sequential, streaming, and distributed. The implementations address problems that arise in large-scale data analysis, such as limited memory and continuously arriving data.
Implementation Highlights
- Sequential k-means in Python: Generates a synthetic dataset and implements k-means to cluster its data points with a minimal memory footprint.
- Streaming k-means in Python: Adapts k-means for data streams, enabling dynamic cluster updates without the need to reprocess the entire dataset when new data arrives.
- Distributed k-means with Apache Beam: Scales k-means to datasets too large to fit in a single machine's memory, using Apache Beam for parallel processing.
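
To make the first two variants concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the code in this repository: the function names (`kmeans`, `streaming_update`, `nearest`), the two-blob synthetic data, and the running-mean update rule for the streaming case are all hypothetical choices. The Apache Beam variant is not shown.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Sequential (Lloyd-style) k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    dims = len(points[0])
    for _ in range(iterations):
        # Accumulate per-cluster sums and counts instead of storing
        # the assignment of every point.
        sums = [[0.0] * dims for _ in range(k)]
        counts = [0] * k
        for p in points:
            i = nearest(p, centroids)
            counts[i] += 1
            for d, value in enumerate(p):
                sums[i][d] += value
        for i in range(k):
            if counts[i]:  # an empty cluster keeps its previous centroid
                centroids[i] = [s / counts[i] for s in sums[i]]
    return centroids


def streaming_update(centroids, counts, point):
    """Assumed streaming rule: move the nearest centroid toward the new
    point by 1/n, which keeps it equal to the running mean of its cluster."""
    i = nearest(point, centroids)
    counts[i] += 1
    step = 1.0 / counts[i]
    centroids[i] = [c + step * (x - c) for c, x in zip(centroids[i], point)]
    return i


# Hypothetical synthetic dataset: two well-separated Gaussian blobs.
rng = random.Random(42)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(100)]
centroids = kmeans(blob_a + blob_b, k=2)

# A new point arrives: update one centroid without reprocessing the dataset.
counts = [100, 100]
updated = streaming_update(centroids, counts, (10.0, 10.0))
```

The sequential version only keeps running sums and counts per cluster, so its extra memory grows with the number of clusters rather than the number of points. The streaming update touches a single centroid per incoming point, which is what avoids a full pass over earlier data.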
Installation
To set up the project locally, clone the repository and install its dependencies:
git clone https://github.com/yourusername/BigData-KMeans-Explorations.git
cd BigData-KMeans-Explorations
pip install -r requirements.txt