This Jupyter notebook performs customer segmentation with the K-Means clustering algorithm.
Data Gathering and Preprocessing:
- Imports necessary libraries like pandas, matplotlib, seaborn, numpy, and scikit-learn.
- Loads a customer dataset from a CSV file hosted on GitHub.
- Performs data cleaning by:
  - Removing null values in the 'Profession' column.
  - Removing inconsistent age entries (below 18).
  - Handling inconsistent values in numerical columns (e.g., removing 0 values from 'Annual Income').
- Identifies and removes outliers from numerical columns using the IQR method.
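
The cleaning and outlier steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the toy data, the 'Age' column name, the helper `remove_iqr_outliers`, and the 1.5 multiplier for the IQR fences are all assumptions.

```python
import pandas as pd

# Toy stand-in for the customer CSV; all values are made up for illustration.
df = pd.DataFrame({
    "Age": [25, 17, 40, 33, 52, 29, 45, 38],
    "Annual Income": [40000, 35000, 0, 52000, 61000, 48000, 900000, 55000],
    "Profession": ["Engineer", "Artist", "Doctor", None,
                   "Lawyer", "Artist", "Doctor", "Engineer"],
})

# Drop rows with a missing profession, ages below 18, and zero incomes.
df = df.dropna(subset=["Profession"])
df = df[df["Age"] >= 18]
df = df[df["Annual Income"] > 0]


def remove_iqr_outliers(frame, columns, k=1.5):
    """Hypothetical helper: keep rows inside [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]


clean = remove_iqr_outliers(df, ["Age", "Annual Income"])
print(clean)
```

The mask is built across all columns before filtering, so the quartiles for each column are computed on the same set of rows.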
Data Visualization:
- Creates histograms to visualize the distribution of numerical features.
- Generates a countplot to show the distribution of customers by gender.
- Uses a pairplot to explore relationships between different features.
- Creates a barplot to display the number of customers in different age groups.
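
As an illustration of the age-group barplot, the sketch below bins ages with `pandas.cut` and draws the counts with plain matplotlib (swapped in here for the seaborn call the notebook uses). The ages, bin edges, and labels are assumptions; the notebook's actual age groups are not stated.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages for illustration only.
ages = pd.Series([19, 23, 31, 35, 42, 47, 58, 64, 71], name="Age")

# Hypothetical bin edges and labels (left-inclusive bins).
bins = [18, 30, 45, 60, 100]
labels = ["18-29", "30-44", "45-59", "60+"]
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Count customers per group and keep the groups in bin order.
counts = age_group.value_counts().sort_index()

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(list(counts.index.astype(str)), counts.values)
ax.set_xlabel("Age group")
ax.set_ylabel("Number of customers")
ax.set_title("Customers by age group")
fig.tight_layout()
```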
Data Standardization:
- Standardizes the numerical features using StandardScaler to have a mean of 0 and a standard deviation of 1. This matters for K-Means because it clusters by Euclidean distance, so features on larger scales would otherwise dominate the result.
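
A minimal sketch of the standardization step, assuming two numerical columns with made-up values. The variable names `numeric` and `scaled` are illustrative, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numerical features for illustration.
numeric = pd.DataFrame({
    "Age": [25, 52, 29, 38],
    "Annual Income": [40000, 61000, 48000, 55000],
})

# fit_transform learns each column's mean and standard deviation, then rescales.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(numeric),
    columns=numeric.columns,
    index=numeric.index,
)

# StandardScaler uses the population standard deviation (ddof=0).
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

Keeping the fitted `scaler` object around is useful because new customer rows need to be transformed with the same mean and standard deviation before prediction.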
One-Hot Encoding:
- Applies one-hot encoding to the categorical features ('Gender' and 'Profession') to convert them into numerical representations suitable for the K-Means algorithm.
- Concatenates the standardized numerical features and the one-hot encoded categorical features into a single DataFrame (df4).
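
The encoding and concatenation can be sketched as below. The source does not say which encoder the notebook uses, so `pandas.get_dummies` is an assumption here, as are the sample values and every variable name except `df4`.

```python
import pandas as pd

# Made-up categorical features for illustration.
categorical = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Profession": ["Engineer", "Artist", "Doctor", "Artist"],
})

# Made-up values standing in for the already standardized numerical features.
scaled_numeric = pd.DataFrame({
    "Age": [-1.0, 0.2, 1.3, -0.5],
    "Annual Income": [0.4, -1.2, 0.9, -0.1],
})

# One indicator column per category value.
encoded = pd.get_dummies(categorical, columns=["Gender", "Profession"], dtype=float)

# Column-wise concatenation; both frames must share the same row index.
df4 = pd.concat([scaled_numeric, encoded], axis=1)
print(df4.columns.tolist())
```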
Model Training and Evaluation:
- Uses the Elbow method and the Silhouette score to determine the optimal number of clusters (k) for the K-Means algorithm. Both methods suggest k=2.
- Trains a K-Means model with the optimal number of clusters.
- Contains commented-out code that would visualize the clusters; it is not executed.
- Saves the trained K-Means model to a pickle file (customer_clustering_model.pkl).
- Loads the saved model from the pickle file and demonstrates how to use it to predict the cluster for new customer data.
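
One way the elbow and silhouette checks could look is sketched below. It runs on synthetic two-blob data from `make_blobs` rather than the notebook's df4, and the range of k values, `n_init`, and `random_state` are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the prepared feature matrix: two well-separated blobs.
X, _ = make_blobs(
    n_samples=200, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=42
)

inertias, silhouettes = {}, {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squares, for the elbow plot
    if k >= 2:  # the silhouette score is undefined for a single cluster
        silhouettes[k] = silhouette_score(X, model.labels_)

# Pick the k with the highest silhouette score, then train the final model.
best_k = max(silhouettes, key=silhouettes.get)
final_model = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)
print(best_k)
```

The elbow is read off visually from a plot of `inertias` against k, while the silhouette score gives a numeric criterion that can be maximized directly.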
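
Saving, reloading, and predicting with the model can be sketched as follows. The file name comes from the notebook; the tiny training set, the temporary directory, and the `new_customer` row are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

# Made-up, already preprocessed feature rows forming two obvious groups.
X = np.array([[-1.0, -1.2], [-0.9, -0.8], [1.1, 0.9], [1.0, 1.3]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Written to a temporary directory here so the sketch leaves no file behind.
path = os.path.join(tempfile.mkdtemp(), "customer_clustering_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

# Hypothetical new customer, already scaled and encoded like the training data.
new_customer = np.array([[0.95, 1.0]])
cluster = loaded.predict(new_customer)[0]
print(cluster)
```

New rows need the same standardization and one-hot encoding as the training data before they are passed to `predict`, and pickle files should only be loaded from trusted sources.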
Overall, the notebook walks through an end-to-end customer segmentation workflow with K-Means: data preprocessing, visualization, standardization, one-hot encoding, model training and evaluation, and saving and loading the model.