Biased Dataset Clustering: Greedy Preprocessing & k-Means Integration on Large Scale Data

📌 Project Overview

This project addresses the challenges of applying k-means clustering to biased datasets by implementing a parallelized greedy clustering algorithm as a preprocessing step. The algorithm shrinks the dataset by selecting representatives according to a tunable distance threshold (τ), so that k-means runs on the representatives rather than on every point. A comparison against random subsampling evaluates both computational efficiency and clustering quality.

Dataset: 118,821 data points with inherent biases (e.g., age, wealth).
Goal: Mitigate bias effects by preprocessing data into representative clusters (1%, 10%, 25% sizes) and compare methods.
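
The threshold-based selection of representatives can be sketched sequentially as follows. This is a minimal illustration and not the project's parallelized implementation: the function name `greedy_representatives`, the Euclidean metric, and the first-come processing order are all assumptions made for the example.

```python
import numpy as np


def greedy_representatives(points: np.ndarray, tau: float):
    """Pick representatives so every point lies within tau of one of them.

    Returns (rep_indices, assignment), where assignment[i] is the position
    in rep_indices of the representative that covers point i.
    """
    n = len(points)
    assignment = np.full(n, -1, dtype=np.int64)
    rep_indices = []
    for i in range(n):
        if assignment[i] != -1:
            continue  # already covered by an earlier representative
        # Point i is still uncovered, so it becomes a new representative.
        rep_id = len(rep_indices)
        rep_indices.append(i)
        # Vectorized Euclidean distance from the new representative to all points.
        dists = np.linalg.norm(points - points[i], axis=1)
        # Claim every still-uncovered point within tau (this includes i itself).
        newly_covered = (dists <= tau) & (assignment == -1)
        assignment[newly_covered] = rep_id
    return np.array(rep_indices), assignment


# Synthetic demo data (not the project's dataset).
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
reps, assign = greedy_representatives(data, tau=0.5)
print(f"kept {len(reps)} of {len(data)} points ({len(reps) / len(data):.1%})")
```

A larger τ lets each representative absorb more neighbours, so fewer representatives survive; that is the lever for steering the reduced set toward a target ratio.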

🚀 Key Features

Parallelized Greedy Clustering:
- Reduces dataset to target cluster ratios (1%, 10%, 25%) via adaptive τ tuning.
- Optimized for cache behavior and minimal memory usage.
k-Means Integration:
- Clusters representatives from preprocessing step.
- Post-processing assigns original data points to clusters.
Random Subsampling Baseline:
- Generates comparison dataset by randomly sampling equivalent proportions.
Performance Analysis:
- Metrics: Runtime, memory usage, Silhouette Score, intra/inter-cluster distances.
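
The k-means integration, the random subsampling baseline, and the quality metrics listed above could be wired together roughly as shown here. This is a sketch under stated assumptions: the helper names (`cluster_via_subset`, `random_subsample`, `mean_intra_cluster_distance`) are hypothetical, original points are assumed to be assigned to their nearest centroid, and "intra-cluster distance" is assumed to mean the average point-to-centroid distance. The project's notebooks may define these differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def cluster_via_subset(points, subset_idx, k, seed=0):
    """Fit k-means on a subset of rows, then label every original point."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    model.fit(points[subset_idx])
    # Post-processing (assumed): each original point joins its nearest centroid.
    labels = model.predict(points)
    return labels, model.cluster_centers_


def random_subsample(n_points, ratio, seed=0):
    """Baseline: draw about ratio * n_points distinct indices uniformly at random."""
    rng = np.random.default_rng(seed)
    size = max(1, int(round(ratio * n_points)))
    return rng.choice(n_points, size=size, replace=False)


def mean_intra_cluster_distance(points, labels, centers):
    """Average Euclidean distance from each point to its assigned centroid."""
    return float(np.linalg.norm(points - centers[labels], axis=1).mean())


# Synthetic demo data: three well-separated blobs (not the project's dataset).
rng = np.random.default_rng(0)
data = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(200, 2)) for c in (0.0, 5.0, 10.0)]
)
idx = random_subsample(len(data), ratio=0.10)
labels, centers = cluster_via_subset(data, idx, k=3)
print("silhouette:", silhouette_score(data, labels))
print("intra-cluster:", mean_intra_cluster_distance(data, labels, centers))
```

Passing the indices of greedy representatives instead of `idx` would give the other arm of the comparison, with both arms scored on the full dataset.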

🔍 Findings

Greedy Clustering:
- Achieved target cluster sizes (1%, 10%, 25%) with τ=100.
- Runtime: 43.19 s | Memory: 1.03 MB.
- Clustering Quality: Silhouette Score (-0.0029), Intra-cluster distance (15,289.29).
Random Subsampling:
- Runtime: 37.14 s | Memory: Negligible.
- Clustering Quality: Silhouette Score (0.1127), Intra-cluster distance (12,397.45).
Conclusion:
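- Random subsampling outperformed greedy clustering on this dataset in both runtime (37.14 s vs. 43.19 s) and clustering quality (higher Silhouette Score, lower intra-cluster distance).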
- Greedy clustering offers structured preprocessing for bias mitigation but requires tuning.
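
One way the τ tuning could be automated is a bisection search on the fraction of points kept as representatives. The sketch below is an assumption-laden illustration, not the project's tuning procedure: `count_representatives` and `tune_tau` are hypothetical names, the metric is Euclidean, and the search assumes the representative count falls as τ grows, which holds only approximately for a greedy pass.

```python
import numpy as np


def count_representatives(points, tau):
    """Number of representatives a simple sequential greedy pass selects."""
    covered = np.zeros(len(points), dtype=bool)
    count = 0
    for i in range(len(points)):
        if covered[i]:
            continue
        count += 1  # point i is uncovered, so it becomes a representative
        covered |= np.linalg.norm(points - points[i], axis=1) <= tau
    return count


def tune_tau(points, target_ratio, lo=0.0, hi=None, tol=0.005, max_iter=30):
    """Bisection search for a tau whose representative ratio is near the target."""
    if hi is None:
        # The bounding-box diagonal is an upper bound on any pairwise distance.
        hi = float(np.linalg.norm(points.max(axis=0) - points.min(axis=0)))
    tau = hi
    ratio = count_representatives(points, tau) / len(points)
    for _ in range(max_iter):
        if abs(ratio - target_ratio) <= tol:
            break
        tau = (lo + hi) / 2.0
        ratio = count_representatives(points, tau) / len(points)
        if ratio > target_ratio:
            lo = tau  # too many representatives: search larger thresholds
        else:
            hi = tau  # too few representatives: search smaller thresholds
    return tau, ratio


# Synthetic demo data (not the project's dataset).
rng = np.random.default_rng(0)
data = rng.uniform(size=(2000, 2))
tau, ratio = tune_tau(data, target_ratio=0.10)
print(f"tau={tau:.4f} keeps {ratio:.1%} of the points")
```

Because the greedy count is not strictly monotone in τ, the search may stop slightly off target; the `tol` and `max_iter` parameters bound how much effort is spent closing that gap.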

🛠 System Requirements

Dependencies

Python 3.8+
Libraries: numpy, pandas, scikit-learn (multiprocessing ships with the Python standard library and needs no separate install)

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.
