This project predicts customer churn for an online retail company using machine learning. The dataset, sourced from Kaggle, was explored through extensive exploratory data analysis (EDA), preprocessed, and then modeled with classification methods including Random Forest and XGBoost. Clustering techniques such as K-Means and DBSCAN were also used to identify customer segments.
- Introduction: The goal is to predict customer churn and perform customer segmentation to tailor promotional strategies.
- Exploratory Data Analysis (EDA): Analyzing data shape, types, correlations, imbalances, and missing values.
- Data Preprocessing: Handling missing values, outliers, encoding categorical variables, and balancing imbalanced data.
- Classification Methods: Training Random Forest, XGBoost, and Logistic Regression models, each with and without balancing the data.
- Clustering Methods: Utilizing K-Means, DBSCAN, and Hierarchical clustering techniques.
- Random Forest: Achieved high accuracy, precision, and AUC-ROC; slightly lower recall.
- XGBoost: Outperformed other classifiers in most metrics.
- Logistic Regression: Showed comparatively lower scores.
- K-Means with t-SNE: Produced the most accurate clusters of the clustering methods tried.
- DBSCAN: Produced less accurate clusters than K-Means.
- Hierarchical Clustering: Used to visualize the cluster hierarchy as a dendrogram.
- Dataset source: Kaggle.
- Libraries used: pandas, scikit-learn, xgboost, seaborn, and others.
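
The comparison of classifiers with and without balancing, and the K-Means on t-SNE step, can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the data is synthetic (`make_classification` and `make_blobs` stand in for the Kaggle dataset), balancing is done with scikit-learn's `class_weight="balanced"` option because the original balancing method is not stated, XGBoost is left out to keep the example dependent on scikit-learn only, and the function names `compare_classifiers` and `cluster_on_tsne` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, silhouette_score)
from sklearn.model_selection import train_test_split


def compare_classifiers(random_state=0):
    """Fit two classifiers with and without class weighting (hypothetical helper)."""
    # Assumption: synthetic stand-in for the churn table, about 15% positives.
    X, y = make_classification(n_samples=2000, n_features=12, n_informative=6,
                               weights=[0.85, 0.15], random_state=random_state)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state)

    results = {}
    for balanced in (False, True):
        # Assumption: class weighting is used here as the balancing technique.
        weight = "balanced" if balanced else None
        models = {
            "logreg": LogisticRegression(max_iter=1000, class_weight=weight),
            "random_forest": RandomForestClassifier(
                n_estimators=200, class_weight=weight, random_state=random_state),
        }
        for name, model in models.items():
            model.fit(X_train, y_train)
            pred = model.predict(X_test)
            proba = model.predict_proba(X_test)[:, 1]  # probability of the churn class
            results[(name, balanced)] = {
                "accuracy": accuracy_score(y_test, pred),
                "precision": precision_score(y_test, pred, zero_division=0),
                "recall": recall_score(y_test, pred),
                "auc_roc": roc_auc_score(y_test, proba),
            }
    return results


def cluster_on_tsne(n_clusters=3, random_state=0):
    """Run K-Means on a 2-D t-SNE embedding (hypothetical helper)."""
    # Assumption: synthetic blobs stand in for the preprocessed customer features.
    X, _ = make_blobs(n_samples=300, centers=n_clusters, n_features=8,
                      random_state=random_state)
    embedding = TSNE(n_components=2, perplexity=30,
                     random_state=random_state).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(embedding)
    # Silhouette score measures cluster separation in the embedded space.
    return labels, silhouette_score(embedding, labels)


if __name__ == "__main__":
    for key, metrics in compare_classifiers().items():
        print(key, {k: round(v, 3) for k, v in metrics.items()})
    labels, score = cluster_on_tsne()
    print("silhouette on t-SNE embedding:", round(score, 3))
```

Reporting recall and AUC-ROC alongside accuracy matters on imbalanced churn data, since a model that rarely predicts churn can still score high accuracy. Note that the silhouette score here is computed in the t-SNE space, so it describes separation in the embedding rather than in the original features.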