This project focuses on creating an end-to-end pipeline for model and drift monitoring, ensuring that a loan eligibility classification model remains accurate over time.
In real-world scenarios, machine learning models deployed in production may encounter data drift, concept drift, or model degradation. Monitoring these changes is essential to maintain the model's effectiveness. By orchestrating monitoring pipelines with Airflow and Docker, this project demonstrates practical solutions for real-time model and data tracking.
The project retrieves data from a PostgreSQL server related to loan eligibility. The dataset may contain various features and labels that influence loan decisions. Participants will work with this real-world data to build, monitor, and evaluate a machine learning model.
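As a rough illustration, the retrieval step could look like the following minimal sketch; the table name, file path, and credential keys are assumptions for illustration, not the project's actual schema (the real queries live in `dags/src/queries.py`):

```python
import json

import pandas as pd
from sqlalchemy import create_engine

# Load connection settings; the path and key names here are assumptions.
with open("dags/creds.json") as f:
    creds = json.load(f)

engine = create_engine(
    f"postgresql+psycopg2://{creds['user']}:{creds['password']}"
    f"@{creds['host']}:{creds.get('port', 5432)}/{creds['database']}"
)

# 'loan_applications' is a hypothetical table name.
df = pd.read_sql("SELECT * FROM loan_applications", engine)
```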
- Language: Python
- Libraries: pandas, numpy, matplotlib, scikit-learn, deepchecks, sqlalchemy, psycopg2-binary
- Services: Airflow, Docker, PostgreSQL
The project follows a structured approach:
- Extracting data from PostgreSQL.
- Data preprocessing, including train-test splitting, encoding, imputation, rescaling, and feature engineering.
- Building and evaluating machine learning models (Random Forest and Gradient Boosting); see the sketch after this list.
- Monitoring for concept drift, data drift, and model drift.
- Orchestrating monitoring pipelines using Airflow.
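The preprocessing and modeling steps might look like the following minimal sketch, continuing from the `df` loaded in the extraction sketch above. The feature and label column names are hypothetical; the project derives its own in `dags/src/preprocess.py`:

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature lists for illustration only.
numeric = ["applicant_income", "loan_amount"]
categorical = ["education", "property_area"]

# Impute, rescale numerics; impute, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["loan_status"], test_size=0.2, random_state=42)

# Fit both candidate models and compare on the held-out split.
for name, clf in [("random_forest", RandomForestClassifier(random_state=42)),
                  ("gradient_boosting", GradientBoostingClassifier(random_state=42))]:
    model = Pipeline([("prep", preprocess), ("clf", clf)])
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```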
This project considers data privacy and security, especially when dealing with sensitive information related to loan eligibility. It emphasizes best practices for handling and protecting such data.
The project includes robust error handling and data validation mechanisms to ensure the quality and reliability of the monitoring process.
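As one illustration of such validation, a fail-fast check on the raw extract might look like this (a sketch; the column names are hypothetical):

```python
import pandas as pd

# Hypothetical required schema for the raw loan data.
REQUIRED_COLUMNS = {"loan_id", "applicant_income", "loan_amount", "loan_status"}

def validate_raw(df: pd.DataFrame) -> pd.DataFrame:
    """Fail fast before the rest of the pipeline runs on bad data."""
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    if df.empty:
        raise ValueError("No rows extracted from PostgreSQL")
    if df["loan_status"].isna().any():
        raise ValueError("Label column 'loan_status' contains nulls")
    return df
```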
The project discusses deployment considerations for monitoring systems in a production environment. It covers best practices and potential challenges in deploying machine learning models and monitoring solutions.
The data is available under `main/dags/data/raw`.
Please upload the data and provide the appropriate credentials in the `main/dags/creds.json` file.
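The exact keys depend on the ETL code; a plausible shape for this file, assuming standard PostgreSQL connection fields, is:

```json
{
  "user": "postgres",
  "password": "<your-password>",
  "host": "postgres",
  "port": 5432,
  "database": "loans"
}
```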
- Deepchecks
- Airflow
- Slack integration for alerts
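For the Slack alerts, one lightweight approach is posting to an incoming webhook (a sketch; the webhook URL is a placeholder, and the project may instead use Airflow's Slack provider):

```python
import requests

# Placeholder; create an incoming webhook in your Slack workspace.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def notify_slack(message: str) -> None:
    """Post a monitoring alert to a Slack channel via an incoming webhook."""
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    resp.raise_for_status()

notify_slack(":warning: Data drift detected in the loan eligibility pipeline.")
```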
Here we will set up our environment using Docker.
- Make sure Docker and docker-compose are set up properly.
- Create a GitHub repo and check in all the code, which can be found here.
- Clone the git repo: `git clone git@github.com:<your git address>`
- To proceed, make sure:
  - Docker has access to at least 4 GB of memory on your system.
  - Navigate to `dags/src/config.py` and ensure `RUN_LOCAL` is set to `False`.
- While in the same directory as `docker-compose.yaml`, start docker-compose by issuing this command in your terminal: `docker-compose up`. This will take a couple of minutes to boot up all containers. To check if all containers are running properly, you can run `docker ps --all`. You should see a list of all containers in `healthy` status:

```
CONTAINER ID   IMAGE                  COMMAND                  CREATED        STATUS                  PORTS                              NAMES
5dea90526ec4   apache/airflow:2.2.4   "/usr/bin/dumb-init …"   23 hours ago   Up 23 hours (healthy)   8080/tcp                           project_01_model-testing_airflow-scheduler_1
b27cf17c76d4   apache/airflow:2.2.4   "/usr/bin/dumb-init …"   23 hours ago   Up 23 hours (healthy)   8080/tcp                           project_01_model-testing_airflow-triggerer_1
b254faa326cb   apache/airflow:2.2.4   "/usr/bin/dumb-init …"   23 hours ago   Up 23 hours (healthy)   0.0.0.0:5555->5555/tcp, 8080/tcp   project_01_model-testing_flower_1
79af795c2ab2   apache/airflow:2.2.4   "/usr/bin/dumb-init …"   23 hours ago   Up 23 hours (healthy)   8080/tcp                           project_01_model-testing_airflow-worker_1
cfe8d1b18f77   apache/airflow:2.2.4   "/usr/bin/dumb-init …"   23 hours ago   Up 23 hours (healthy)   0.0.0.0:8080->8080/tcp             project_01_model-testing_airflow-webserver_1
c68fc80dbf0d   postgres:13            "docker-entrypoint.s…"   23 hours ago   Up 23 hours (healthy)   5432/tcp                           project_01_model-testing_postgres_1
5c0b9f136b75   redis:latest           "docker-entrypoint.s…"   23 hours ago   Up 23 hours (healthy)   6379/tcp                           project_01_model-testing_redis_1
```
- Delete all files under the following subdirectories. In case subdirectories do not exist (due to `.gitignore`), please create them:
  - `dags/data/raw/*`
  - `dags/data/preprocessed`
  - `dags/models`
  - `dags/results`
At the end, the directory should be structured as follows (ensure to manually create any directory that is missing):

```
├── airflow.sh
├── dags
│   ├── app.py
│   ├── credentials.json
│   ├── dag_pipeline.py
│   ├── dag_training.py
│   ├── data
│   │   ├── preprocessed
│   │   └── raw
│   ├── main.py
│   ├── models
│   │   └── deploy_report.json
│   ├── results
│   └── src
│       ├── config.py
│       ├── drifts.py
│       ├── etl.py
│       ├── helpers.py
│       ├── inference.py
│       ├── preprocess.py
│       ├── queries.py
│       └── train.py
├── docker-compose.yaml
├── jobs
├── logs
├── plugins
├── readme.md
└── requirements.txt
```
- Truncate the `mljob` table: `truncate mljob;`
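The project's DAGs (`dags/dag_pipeline.py` and `dags/dag_training.py`) cover the stages listed below. As a minimal illustration of how such stages can be chained with Airflow's TaskFlow API (a sketch only; task names and bodies are placeholder assumptions, not the project's actual DAG):

```python
from datetime import datetime

from airflow.decorators import dag, task

@dag(schedule_interval="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def loan_monitoring_pipeline():

    @task
    def gather():
        ...  # pull fresh rows from PostgreSQL (see dags/src/etl.py)

    @task
    def preprocess(raw):
        ...  # encode, impute, rescale (see dags/src/preprocess.py)

    @task
    def check_drift(data):
        ...  # run drift checks (see dags/src/drifts.py) and alert via Slack on failure

    # Chain the tasks: gather -> preprocess -> check_drift.
    check_drift(preprocess(gather()))

loan_monitoring_pipeline()
```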
- data gathering
- data preprocessing
- model training
- model evaluation
- model serving
- data integrity
- data drift
- concept drift
- comparative analysis of models
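For the data-drift and concept-drift stages above, a minimal sketch using deepchecks' tabular API is shown below. Check names have shifted across deepchecks versions, the reference and production frames are assumed to be loaded already, and the column names are hypothetical:

```python
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import TrainTestFeatureDrift, TrainTestLabelDrift

# Reference = training-time data; current = fresh production rows (both assumed loaded).
cat_features = ["education", "property_area"]  # hypothetical
ref_ds = Dataset(train_df, label="loan_status", cat_features=cat_features)
cur_ds = Dataset(prod_df, label="loan_status", cat_features=cat_features)

# Feature drift flags shifts in input distributions; label drift is one proxy for concept drift.
feature_drift = TrainTestFeatureDrift().run(train_dataset=ref_ds, test_dataset=cur_ds)
label_drift = TrainTestLabelDrift().run(train_dataset=ref_ds, test_dataset=cur_ds)

# Persist the reports next to the project's other outputs for inspection.
feature_drift.save_as_html("dags/results/feature_drift.html")
label_drift.save_as_html("dags/results/label_drift.html")
```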