Idea: Analyze a retail dataset (e.g., customer purchase data) to identify distinct customer groups using K-means clustering. The resulting segments can then inform targeted marketing strategies.
You’ll need the following libraries for data manipulation, visualization, and clustering:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
You can download a dataset like the Mall Customers Dataset or use any customer data with attributes like age, income, spending score, etc.
Here’s how you load the dataset:
data = pd.read_csv('Mall_Customers.csv')
print(data.head())
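Remove any missing values or unnecessary columns (if applicable). You should also scale the features: K-means relies on Euclidean distance, so a feature measured on a larger scale (such as income) would otherwise dominate the clustering.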
# Selecting the features we want to use for clustering
X = data[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]
# Standardizing the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
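The cleaning step mentioned above is not shown in the code. A minimal sketch, assuming a small made-up DataFrame (`toy`) with the same three column names in place of the real file, might drop incomplete rows before scaling and then confirm that each scaled column ends up with a mean near 0 and a standard deviation near 1:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for Mall_Customers.csv (values are made up)
toy = pd.DataFrame({
    'Age': [19, 21, 35, np.nan, 52, 46],
    'Annual Income (k$)': [15, 16, 40, 60, 88, 120],
    'Spending Score (1-100)': [39, 81, 50, 42, 13, 79],
})

features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']

# Drop rows that are missing any of the clustering features
toy_clean = toy.dropna(subset=features)

# Standardize: each column is shifted to mean 0 and scaled to unit variance
toy_scaled = StandardScaler().fit_transform(toy_clean[features])

col_means = toy_scaled.mean(axis=0)
col_stds = toy_scaled.std(axis=0)  # population std (ddof=0), which StandardScaler uses
print(col_means.round(6), col_stds.round(6))
```

Dropping rows is only one option; whether to drop or impute missing values depends on how much data is affected.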
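The Elbow Method helps in choosing the number of clusters (k) by plotting the within-cluster sum of squares (WCSS) against the number of clusters. Look for the "elbow", the point where the curve bends and adding more clusters reduces the WCSS only slightly.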
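wcss = []
for i in range(1, 11):
    # n_init is set explicitly so results do not depend on the scikit-learn version's default
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    # inertia_ is the WCSS of the fitted model
    wcss.append(kmeans.inertia_)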
# Plotting the Elbow Curve
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
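To make the WCSS quantity concrete, here is a minimal sketch that recomputes it by hand and compares it with scikit-learn's `inertia_`. It assumes synthetic data from `make_blobs` rather than the mall dataset, and the names `points`, `model`, and `manual_wcss` are illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration, not the mall dataset)
points, _ = make_blobs(n_samples=200, centers=4, random_state=42)

model = KMeans(n_clusters=4, init='k-means++', n_init=10, random_state=42)
labels = model.fit_predict(points)

# WCSS: squared distance from every point to the centroid of its assigned cluster
assigned_centroids = model.cluster_centers_[labels]
manual_wcss = np.sum((points - assigned_centroids) ** 2)

# The two values should agree up to floating-point error
print(manual_wcss, model.inertia_)
```

Because WCSS tends to keep shrinking as k grows, the elbow plot is about where the improvement levels off, not about the smallest value.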
Once you've picked a number of clusters (e.g., 4 or 5, depending on where the elbow appears), you can fit the K-means algorithm to the data:
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
# Fitting the model and predicting the cluster for each customer in one step
y_kmeans = kmeans.fit_predict(X_scaled)
# Adding the cluster labels to the original dataset
data['Cluster'] = y_kmeans
You can use Seaborn or Matplotlib to visualize the clusters. Here’s an example of a 2D plot using two of the features, annual income and spending score:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', hue='Cluster', data=data, palette='viridis', s=100)
plt.title('Customer Segmentation')
plt.show()
You can analyze the resulting clusters by comparing the average characteristics (like age, income, and spending score) of each cluster:
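# Grouping data by clusters to understand the characteristics
# Restrict to the clustering features so ID or non-numeric columns are not averaged
cluster_profile = data.groupby('Cluster')[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']].mean()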
print(cluster_profile)
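Two further checks can help when interpreting the clusters: how many customers fall into each one, and where the centroids sit in the original units. The sketch below assumes randomly generated data (`demo`) with the same column names, and a hypothetical choice of 3 clusters. It uses `inverse_transform` to map the centroids out of standardized space:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']

# Hypothetical random data standing in for the real customer file
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Age': rng.integers(18, 70, size=100),
    'Annual Income (k$)': rng.integers(15, 140, size=100),
    'Spending Score (1-100)': rng.integers(1, 101, size=100),
})

demo_scaler = StandardScaler()
demo_scaled = demo_scaler.fit_transform(demo[features])

demo_kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42)
demo['Cluster'] = demo_kmeans.fit_predict(demo_scaled)

# Number of customers in each cluster
cluster_sizes = demo['Cluster'].value_counts().sort_index()

# Centroids live in standardized space; map them back to the original units
centroids = pd.DataFrame(
    demo_scaler.inverse_transform(demo_kmeans.cluster_centers_),
    columns=features,
)

print(cluster_sizes)
print(centroids.round(1))
```

A very small cluster may point to outliers rather than a meaningful segment, so the sizes are worth checking before naming the groups.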