
Customer Churn Prediction


An end-to-end machine learning project implementation with Azure deployment

"Dinosaurs that failed to adapt went extinct. The same thing will happen to data scientists who think that training ML models inside Jupyter notebooks is enough." - Pau Labarta Bajo.



Overview

This repository is an end-to-end machine learning project that focuses on predicting customer churn. It follows a comprehensive workflow that includes data ingestion, validation, transformation, model training, and model evaluation. The project aims to develop a predictive model that can identify customers who are likely to churn, allowing businesses to take proactive measures to retain them.

Objective

Building on the foundational end-to-end workflow used in my previous project, "Prediction of Mohs Hardness", the objective of this project is to integrate MLflow and DVC into that workflow. MLflow is a machine learning lifecycle management platform for tracking experiments, packaging code, and managing models. DVC (Data Version Control) is a version control system for machine learning projects that enables efficient data and model versioning. By integrating the two, I aim to improve code reproducibility and keep my datasets and models under efficient version control.

Customer Churn and What It's All About

Customer churn refers to the phenomenon where customers stop doing business with a company or stop using its products or services. It is a critical metric for businesses, especially in industries with subscription-based models or recurring revenue streams.

Identifying customers who are likely to churn can help businesses take proactive measures to retain them, thereby reducing revenue loss and improving customer satisfaction.

Dataset

The dataset used for this project is obtained from Kaggle. It contains the following attributes:

  • Customer ID: A unique identifier for each customer
  • Surname: The customer's surname or last name
  • Credit Score: A numerical value representing the customer's credit score
  • Geography: The country where the customer resides (France, Spain, or Germany)
  • Gender: The customer's gender (Male or Female)
  • Age: The customer's age
  • Tenure: The number of years the customer has been with the bank
  • Balance: The customer's account balance
  • NumOfProducts: The number of bank products the customer uses (e.g., savings account, credit card)
  • HasCrCard: Whether the customer has a credit card (1 = yes, 0 = no)
  • IsActiveMember: Whether the customer is an active member (1 = yes, 0 = no)
  • EstimatedSalary: The estimated salary of the customer
  • Exited: Whether the customer has churned (1 = yes, 0 = no)
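The Exited column is the prediction target. As a toy illustration (the records below are made up for the example, not taken from the Kaggle dataset), the overall churn rate is simply the mean of Exited:

```python
# Toy records mimicking the dataset schema; all values are made up.
customers = [
    {"CreditScore": 619, "Geography": "France", "Age": 42, "Exited": 1},
    {"CreditScore": 608, "Geography": "Spain", "Age": 41, "Exited": 0},
    {"CreditScore": 502, "Geography": "Germany", "Age": 42, "Exited": 1},
    {"CreditScore": 699, "Geography": "France", "Age": 39, "Exited": 0},
]

# Churn rate = fraction of customers with Exited == 1.
churn_rate = sum(c["Exited"] for c in customers) / len(customers)
print(f"Churn rate: {churn_rate:.0%}")  # → Churn rate: 50%
```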

MLflow Integration

To integrate MLflow into the project, I used DagsHub as the remote tracking server, where I can easily log and compare experiments and track the performance of my models.

Data Version Control (DVC)

To integrate Data Version Control (DVC) into the project, I defined a YAML file that specifies the different stages of the pipeline. Each stage has a command (cmd) that runs a Python script, dependencies (deps) that are required for the script to execute, and outputs (outs) that are generated by the script. Additionally, some stages have parameters (params) and metrics (metrics) that are used for model training and evaluation, respectively.

Here is an overview of the stages defined in the YAML file:

  • data_ingestion: This stage runs the stage_01_data_ingestion.py script, which is responsible for ingesting the data. The dependencies include the script itself, the data_ingestion.py component, the config.yaml file, and the output CSV file Churn_Modelling.csv.

  • data_validation: This stage runs the stage_02_data_validation.py script, which validates the ingested data. The dependencies include the script, the data_validation.py component, the output CSV file from the previous stage, the config.yaml file, and the schema.yaml file. The output is a status.txt file indicating the status of the validation.

  • data_transformation: This stage runs the stage_03_data_transformation.py script, which transforms the validated data. The dependencies include the script, the data_transformation.py component, the status.txt file from the previous stage, and the config.yaml file. The outputs include a preprocessor joblib file, and train and test CSV files.

  • model_training: This stage runs the stage_04_model_trainer.py script, which trains a machine learning model. The dependencies include the script, the model_trainer.py component, the train CSV file from the previous stage, and the config.yaml file. The parameters for the model training are specified in the YAML file. The output is a trained model joblib file.

  • model_evaluation: This stage runs the stage_05_model_evaluation.py script, which evaluates the trained model. The dependencies include the script, the model_evaluation.py component, the test CSV file from the previous stage, the trained model joblib file, and the config.yaml file. The metrics generated during the evaluation are stored in a metrics.json file.
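The stages above can be sketched as a dvc.yaml like the following. The exact file paths are assumptions made for illustration; the real repository layout may differ:

```yaml
stages:
  data_ingestion:
    cmd: python src/pipeline/stage_01_data_ingestion.py
    deps:
      - src/pipeline/stage_01_data_ingestion.py
      - src/components/data_ingestion.py
      - config/config.yaml
    outs:
      - artifacts/data_ingestion/Churn_Modelling.csv

  # data_validation and data_transformation follow the same pattern,
  # chaining each stage's outs into the next stage's deps.

  model_training:
    cmd: python src/pipeline/stage_04_model_trainer.py
    deps:
      - src/pipeline/stage_04_model_trainer.py
      - src/components/model_trainer.py
      - artifacts/data_transformation/train.csv
      - config/config.yaml
    params:
      - model_params
    outs:
      - artifacts/model_trainer/model.joblib

  model_evaluation:
    cmd: python src/pipeline/stage_05_model_evaluation.py
    deps:
      - src/pipeline/stage_05_model_evaluation.py
      - src/components/model_evaluation.py
      - artifacts/data_transformation/test.csv
      - artifacts/model_trainer/model.joblib
      - config/config.yaml
    metrics:
      - metrics.json:
          cache: false
```

With this in place, running `dvc repro` re-executes only the stages whose dependencies have changed.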

Azure

The last step is to deploy the project. However, I could not do so because my student Azure subscription has expired. If you have an Azure subscription, you can follow the steps below to deploy the project:

  1. Create an Azure Machine Learning workspace.
  2. Set up the necessary resources such as compute instances, storage accounts, and container registries.
  3. Build a Docker image of the project.
  4. Deploy the Docker image to Azure Container Instances or Azure Kubernetes Service.
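For step 3, a Dockerfile along the following lines could be used. This is a sketch, not the repository's actual Dockerfile; it assumes the Flask app in app.py listens on port 5000 (see "Running Locally" below):

```dockerfile
# Minimal image for serving the churn-prediction app.
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
```

The resulting image can then be pushed to a container registry and deployed to Azure Container Instances or Azure Kubernetes Service as described in step 4.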

However, this project was deployed to Heroku instead. You can access the code here.

Running Locally

STEP 00 - Clone the repository

git clone https://github.com/Oyebamiji-Micheal/End-to-End-Customer-Churn-Prediction-using-MLflow-and-DVC

STEP 01 - Create a virtual environment

Windows (cmd)

cd End-to-End-Customer-Churn-Prediction-using-MLflow-and-DVC
pip install virtualenv
python -m virtualenv venv

or, using the built-in venv module:

python -m venv venv

macOS/Linux

cd End-to-End-Customer-Churn-Prediction-using-MLflow-and-DVC
pip3 install virtualenv
python3 -m virtualenv venv

STEP 02 - Activate environment

Windows (cmd)

venv\Scripts\activate

macOS/Linux

. venv/bin/activate

or

source venv/bin/activate

STEP 03 - Install the Requirements

Windows/macOS/Linux

pip install -r requirements.txt

STEP 04 - Run app.py

python app.py

Then open http://127.0.0.1:5000/ in your browser.
