Welcome to Day 21 of the 30 Days of Data Science series! Today, we explore Clustering using the KMeans algorithm with Python's Scikit-learn library. Clustering is a fundamental unsupervised learning technique used to group similar data points together.
Because it is unsupervised, clustering does not use labeled data: it groups points purely by how similar they are to one another. It’s widely used in:
- Customer segmentation
- Document clustering
- Image segmentation
- Anomaly detection
The KMeans algorithm works by:
- Randomly initializing K cluster centroids.
- Assigning each data point to the nearest centroid.
- Updating each centroid to the mean of its assigned points.
- Repeating the assignment and update steps until convergence, that is, until the centroids stop changing.
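To make the steps above concrete, here is a minimal NumPy sketch of the loop. It is an illustration rather than what Scikit-learn does internally: the function name kmeans_sketch, the choice of random data points as starting centroids, and the handling of empty clusters are all assumptions made for this example (Scikit-learn's KMeans defaults to the smarter k-means++ initialization).

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Plain KMeans loop on an (n_samples, n_features) array (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Initialization: use k distinct data points as the starting centroids (assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment: each point goes to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points.
        # An empty cluster simply keeps its previous centroid (assumption).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence check: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Tiny example: two obvious groups along the x-axis.
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
centroids_demo, labels_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
print(labels_demo)
```

On this toy data the two left points end up in one cluster and the two right points in the other. Which cluster gets label 0 depends on the random initialization.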
KMeans tries to minimize the within-cluster sum of squares (WCSS) to ensure compact clusters.
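Written out, the WCSS objective can be expressed as below. The notation is an assumption introduced here for illustration: C_k is the set of points assigned to cluster k, and mu_k is the centroid (mean) of that cluster.

```latex
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

Scikit-learn reports this quantity for a fitted model through the inertia_ attribute.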
Before we proceed, ensure you have Scikit-learn, Matplotlib, and NumPy installed:
pip install scikit-learn matplotlib numpy
Let’s implement KMeans using Scikit-learn.
We’ll generate a sample dataset using Scikit-learn’s make_blobs function:
import numpy as np
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generating synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
# Visualizing the data
plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title("Sample Data for Clustering")
plt.show()
Explanation:
- n_samples: Number of data points.
- centers: Number of clusters (blob centers) to generate.
- cluster_std: Standard deviation of each cluster, which controls its spread.
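A quick way to see what these parameters do is to inspect the arrays that make_blobs returns. The snippet below is a small sketch using the same call as above. The variable name blob_std is introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

print(X.shape)         # (300, 2): 300 points, 2 features by default
print(np.unique(y))    # [0 1 2 3]: the blob each point was drawn from
print(np.bincount(y))  # [75 75 75 75]: points are split evenly across blobs

# Per-feature standard deviation inside one blob; it should be close to cluster_std.
blob_std = X[y == 0].std(axis=0)
print(blob_std)
```

Note that y holds the blob each point was generated from. KMeans never sees it, which is what makes the task unsupervised.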
Now, let’s apply the KMeans algorithm to cluster the data into 4 groups.
from sklearn.cluster import KMeans
# Applying KMeans
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X)
# Printing centroids
print("Cluster Centers:")
print(kmeans.cluster_centers_)
Explanation:
- n_clusters: The number of clusters.
- fit_predict(): Fits the model to the data, assigns each point to a cluster, and returns the labels.
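The fitted model exposes a few more things worth knowing about. The sketch below checks that fit_predict() returns the same labels stored in labels_, uses predict() on new data, and recomputes the WCSS by hand to compare it with inertia_. Two details are assumptions for this example: n_init=10 is passed explicitly so the behaviour does not depend on the default of your Scikit-learn version, and new_points is a made-up pair of coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# n_init=10 is set explicitly (assumption) to avoid version-dependent defaults.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# fit_predict(X) returns the same labels that fitting stores in labels_.
print(np.array_equal(y_kmeans, kmeans.labels_))

# predict() assigns unseen points to the nearest learned centroid.
new_points = np.array([[0.0, 0.0], [-7.0, -7.0]])  # hypothetical points
new_labels = kmeans.predict(new_points)
print(new_labels)

# Recompute the WCSS by hand: squared distance of each point to its own centroid.
wcss = ((X - kmeans.cluster_centers_[y_kmeans]) ** 2).sum()
print(wcss, kmeans.inertia_)
```

The two printed WCSS values should agree up to floating-point rounding.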
Let’s visualize the resulting clusters and centroids.
# Visualizing the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
# Marking centroids
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], s=200, c='red', marker='X')
plt.title("Clusters and Centroids")
plt.show()
Output:
- Data points are colored based on their cluster.
- Red X marks represent the centroids.
- Market Segmentation: Grouping customers based on purchasing behavior.
- Image Compression: Reducing colors in an image using cluster centroids.
- Document Clustering: Grouping similar text documents.
- Biological Analysis: Grouping genes with similar expression patterns.
- Apply KMeans to a custom dataset of your choice.
- Experiment with different values of n_clusters and observe the results.
- Explore the Elbow Method to determine the optimal number of clusters.
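As a starting point for the Elbow Method exercise, here is a minimal sketch: fit KMeans for several values of n_clusters and record the WCSS (inertia_) each time. The range of 1 to 8 and the explicit n_init=10 are arbitrary choices for this example, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

k_values = range(1, 9)  # candidate values of n_clusters (arbitrary range)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)  # WCSS for this value of k

for k, wcss in zip(k_values, inertias):
    print(f"k={k}: WCSS={wcss:.1f}")
```

The WCSS keeps shrinking as k grows, so the smallest value is not the goal. Look instead for the point where the improvement levels off. For this dataset the drop should flatten after k=4, and plotting k_values against inertias with plt.plot makes that bend (the "elbow") easy to spot.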
- Clustering is an essential unsupervised learning technique.
- KMeans groups data into clusters by minimizing WCSS.
- Scikit-learn provides easy-to-use tools for implementing KMeans.
- Visualizing clusters helps interpret results effectively.