"Dinosaurs that failed to adapt went extinct. The same thing will happen to data scientists who think that training ML models inside Jupyter notebooks is enough." - Pau Labarta Bajo.
- Overview
- Objective
- Customer Churn and What it's all about
- Dataset
- MlFlow Integration
- Data Version Control (DVC)
- Azure
- Running Locally
This repository is an end-to-end machine learning project that focuses on predicting customer churn. It follows a comprehensive workflow that includes data ingestion, validation, transformation, model training, and model evaluation. The project aims to develop a predictive model that can identify customers who are likely to churn, allowing businesses to take proactive measures to retain them.
Building on the foundational end-to-end workflow used in my previous project, "Prediction of Mohs Hardness", the objective of this project is to integrate MLflow and DVC into my workflow. MLflow is a machine learning lifecycle management platform that enables tracking experiments, packaging code, and managing models. DVC (Data Version Control) is a version control system for machine learning projects that allows for efficient data and model versioning. By integrating MLflow and DVC, I aim to improve code reproducibility and maintain efficient version control of my datasets and models.
Customer churn refers to the phenomenon where customers stop doing business with a company or stop using its products or services. It is a critical metric for businesses, especially in industries with subscription-based models or recurring revenue streams.
Identifying customers who are likely to churn can help businesses take proactive measures to retain them, thereby reducing revenue loss and improving customer satisfaction.
The dataset used for this project is obtained from Kaggle. It contains the following attributes:
- Customer ID: A unique identifier for each customer
- Surname: The customer's surname or last name
- Credit Score: A numerical value representing the customer's credit score
- Geography: The country where the customer resides (France, Spain, or Germany)
- Gender: The customer's gender (Male or Female)
- Age: The customer's age
- Tenure: The number of years the customer has been with the bank
- Balance: The customer's account balance
- NumOfProducts: The number of bank products the customer uses (e.g., savings account, credit card)
- HasCrCard: Whether the customer has a credit card (1 = yes, 0 = no)
- IsActiveMember: Whether the customer is an active member (1 = yes, 0 = no)
- EstimatedSalary: The estimated salary of the customer
- Exited: Whether the customer has churned (1 = yes, 0 = no)
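For a quick look at the data, a minimal sketch of loading the dataset with pandas and separating the `Exited` target is shown below. The exact column names (e.g., `CustomerId`) and the choice of columns to drop are assumptions about the Kaggle file, not a copy of the project's preprocessing code.

```python
import pandas as pd

# Load the raw dataset (file name matches the output of the data ingestion stage)
df = pd.read_csv("Churn_Modelling.csv")

# Identifier-like columns such as the customer ID and surname carry no
# predictive signal, so they are commonly dropped before modelling.
# errors="ignore" keeps this robust if the column names differ slightly.
features = df.drop(columns=["CustomerId", "Surname", "Exited"], errors="ignore")
target = df["Exited"]  # 1 = churned, 0 = retained

print(features.dtypes)
print(target.value_counts(normalize=True))  # approximate churn rate
```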
To integrate MLflow into the project, I used Dagshub as my remote tracking server, where I can easily log and compare experiments and track the performance of my model.
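As an illustration, a minimal sketch of logging a run against a Dagshub-hosted MLflow tracking server might look like the following. The tracking URI placeholders, experiment name, and the parameter and metric values are illustrative assumptions, not the project's actual configuration.

```python
import mlflow

# Placeholder Dagshub tracking URI; in practice this (and the access
# credentials) would come from configuration or environment variables
mlflow.set_tracking_uri("https://dagshub.com/<username>/<repo>.mlflow")
mlflow.set_experiment("customer-churn")

with mlflow.start_run():
    # Hypothetical hyperparameters and evaluation metrics
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("roc_auc", 0.86)
    mlflow.log_metric("f1_score", 0.61)
```

Runs logged this way can then be compared side by side in the Dagshub/MLflow experiments UI.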
To integrate Data Version Control (DVC) into the project, I defined a YAML file that specifies the different stages of the pipeline. Each stage has a command (`cmd`) that runs a Python script, dependencies (`deps`) that are required for the script to execute, and outputs (`outs`) that are generated by the script. Additionally, some stages have parameters (`params`) and metrics (`metrics`) that are used for model training and evaluation, respectively.
Here is an overview of the stages defined in the YAML file (a minimal sketch of such a file follows the stage list):
- `data_ingestion`: Runs the `stage_01_data_ingestion.py` script, which is responsible for ingesting the data. The dependencies include the script itself, the `data_ingestion.py` component, and the `config.yaml` file; the output is the `Churn_Modelling.csv` CSV file.
- `data_validation`: Runs the `stage_02_data_validation.py` script, which validates the ingested data. The dependencies include the script, the `data_validation.py` component, the output CSV file from the previous stage, the `config.yaml` file, and the `schema.yaml` file. The output is a `status.txt` file indicating the status of the validation.
- `data_transformation`: Runs the `stage_03_data_transformation.py` script, which transforms the validated data. The dependencies include the script, the `data_transformation.py` component, the `status.txt` file from the previous stage, and the `config.yaml` file. The outputs include a preprocessor joblib file and train and test CSV files.
- `model_training`: Runs the `stage_04_model_trainer.py` script, which trains a machine learning model. The dependencies include the script, the `model_trainer.py` component, the train CSV file from the previous stage, and the `config.yaml` file. The parameters for model training are specified in the YAML file. The output is a trained model joblib file.
- `model_evaluation`: Runs the `stage_05_model_evaluation.py` script, which evaluates the trained model. The dependencies include the script, the `model_evaluation.py` component, the test CSV file from the previous stage, the trained model joblib file, and the `config.yaml` file. The metrics generated during the evaluation are stored in a `metrics.json` file.
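To make the structure concrete, here is a minimal sketch of what such a `dvc.yaml` could look like. The file paths, directory layout, and parameter names below are illustrative assumptions and cover only a subset of the stages; they do not reproduce the project's actual file.

```yaml
stages:
  data_ingestion:
    cmd: python src/pipeline/stage_01_data_ingestion.py
    deps:
      - src/pipeline/stage_01_data_ingestion.py
      - src/components/data_ingestion.py
      - config/config.yaml
    outs:
      - artifacts/data_ingestion/Churn_Modelling.csv

  model_training:
    cmd: python src/pipeline/stage_04_model_trainer.py
    deps:
      - src/pipeline/stage_04_model_trainer.py
      - src/components/model_trainer.py
      - artifacts/data_transformation/train.csv
      - config/config.yaml
    params:
      - n_estimators
      - max_depth
    outs:
      - artifacts/model_trainer/model.joblib

  model_evaluation:
    cmd: python src/pipeline/stage_05_model_evaluation.py
    deps:
      - src/pipeline/stage_05_model_evaluation.py
      - src/components/model_evaluation.py
      - artifacts/data_transformation/test.csv
      - artifacts/model_trainer/model.joblib
      - config/config.yaml
    metrics:
      - metrics.json:
          cache: false
```

With the stages defined, `dvc repro` re-runs only the stages whose dependencies have changed, and `dvc dag` prints the resulting pipeline graph.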
The last step is to actually deploy the project. However, I could not deploy it to Azure because my student Azure subscription has expired. If you have an Azure subscription, you can follow the steps below to deploy the project:
- Create an Azure Machine Learning workspace.
- Set up the necessary resources such as compute instances, storage accounts, and container registries.
- Build a Docker image of the project.
- Deploy the Docker image to Azure Container Instances or Azure Kubernetes Service.
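For the Docker step, a minimal Dockerfile sketch is shown below. It assumes the repository's `requirements.txt` and `app.py` (as used in the local setup instructions) and that the app serves on port 5000; inside a container the Flask app would also need to bind to 0.0.0.0 rather than 127.0.0.1.

```dockerfile
# Minimal sketch of a Dockerfile for this project (illustrative, not the
# project's actual deployment configuration)
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the rest of the project
COPY . .

EXPOSE 5000
CMD ["python", "app.py"]
```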
The project was, however, deployed to Heroku instead. You can access the code here.
Clone the repository

```
git clone https://github.com/Oyebamiji-Micheal/End-to-End-Customer-Churn-Prediction-using-MLflow-and-DVC
```

Create a virtual environment

Windows (cmd)

```
cd End-to-End-Customer-Churn-Prediction-using-MLflow-and-DVC
pip install virtualenv
python -m virtualenv venv
```

or

```
python3 -m venv venv
```

macOS/Linux

```
cd End-to-End-Customer-Churn-Prediction-using-MLflow-and-DVC
pip install virtualenv
python -m virtualenv venv
```

Activate the virtual environment

Windows (cmd)

```
venv\scripts\activate
```

macOS/Linux

```
. venv/bin/activate
```

or

```
source venv/bin/activate
```

Install the requirements and run the app

Windows/macOS/Linux

```
pip install -r requirements.txt
python app.py
```

Now, open the URL http://127.0.0.1:5000/ in your browser.