My attempt landed me the next stage of interviews for a Senior DS role, which I declined to proceed with.
The bank aims to increase its customer base and revenue by identifying potential affluent customers within its Existing To Bank (ETB) segment. By upgrading these customers from normal to affluent status, the bank can offer tailored products and services to enhance satisfaction and drive revenue growth.
Focus on the Existing To Bank (ETB) customers, particularly those currently classified as normal but with the potential to become affluent.
Analyze a comprehensive dataset featuring various customer attributes and a binary label indicating 'affluent' or 'normal' status to identify potential candidates for upgrade.
The bank is poised to effectively identify and target hidden affluent customers within the ETB segment, leading to increased revenue, customer satisfaction, and a stronger market presence.
- Customer Segmentation: Utilize supervised machine learning (classification) to identify potential affluent customers within the ETB segment.
- Identification of Upgrade Candidates: Predict whether a customer should be classified as affluent or normal based on their data profile, focusing on maximizing recall for the affluent class. False positives (customers labelled normal but predicted affluent) are precisely the hidden affluent customers to target.
- Features: Customer data from the provided dataset, including demographic information, transaction history, account balances, etc.
- Labels Usage: Binary labels indicating whether each customer is affluent or normal, used for training and evaluating the classification model.
- Preprocessing and Feature Engineering: Clean and preprocess data to ensure it's suitable for classification, including handling missing values, encoding categorical variables, etc.
- Model Selection: Choose appropriate classification algorithms (e.g., XGBoost, Random Forest) to predict the target variable effectively.
- Model Evaluation: Given the best model choice, tune it with a focus on maximizing recall for the affluent class (see the modeling sketch after this list).
- Targeted Marketing: Utilize predictions from the classification model to target potential affluent customers with tailored marketing campaigns and product offerings.
- Segment Analysis: Analyze the characteristics and behaviors of predicted affluent customers to refine marketing strategies and enhance customer engagement.
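
A minimal sketch of the preprocessing and recall-focused modeling steps above, assuming hypothetical column names (`age`, `account_balance`, `occupation`, `is_affluent`, etc.) and a raw CSV path that would need to be swapped for the actual dataset schema:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names -- replace with the actual dataset schema.
NUMERIC_COLS = ["age", "account_balance", "monthly_txn_amount"]
CATEGORICAL_COLS = ["occupation", "product_holding"]
TARGET_COL = "is_affluent"  # 1 = affluent, 0 = normal

df = pd.read_csv("data/01_raw/etb_customers.csv")  # assumed path
X, y = df[NUMERIC_COLS + CATEGORICAL_COLS], df[TARGET_COL]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Impute and encode so the raw customer attributes are usable by the classifier.
preprocess = ColumnTransformer(
    [
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), NUMERIC_COLS),
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), CATEGORICAL_COLS),
    ]
)

# class_weight="balanced" nudges the model towards recall on the minority (affluent) class.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)),
])
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Affluent recall:", recall_score(y_test, y_pred, pos_label=1))
print(classification_report(y_test, y_pred))
```

`class_weight="balanced"` is one simple lever for boosting affluent-class recall; threshold tuning, covered below, is another.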
While clustering can provide valuable insights into grouping customers based on similarities in their data profiles, the decision to opt for classification instead was driven by several key factors:
- Granular Predictions: Classification provides individual predictions for each customer, enabling precise targeting and personalized strategies.
- Interpretability: Classification models offer feature importance metrics, helping understand the factors driving segmentation.
- Threshold Control: With classification, the bank can set thresholds for predicting affluent customers, aligning with strategic goals (sketched after this list).
- Model Evaluation: Clear evaluation metrics like recall allow for measuring model effectiveness and refining approaches.
- Strategy Development: Insights from classification aid in developing targeted marketing and product strategies.
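
A minimal sketch of that threshold control, continuing from the fitted `model`, `X_test`, and `y_test` in the modeling sketch above, with an assumed business recall target:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Continuing from the fitted `model`, `X_test`, `y_test` in the earlier sketch.
proba = model.predict_proba(X_test)[:, 1]  # P(affluent)

# Sweep thresholds and keep the highest one that still achieves the target recall,
# so precision is not sacrificed more than necessary. Recall is non-increasing in the
# threshold, so the last qualifying value is the largest. If none qualifies, 0.5 is kept.
TARGET_RECALL = 0.90  # assumed business target
best_threshold = 0.5
for threshold in np.arange(0.05, 0.95, 0.05):
    preds = (proba >= threshold).astype(int)
    if recall_score(y_test, preds, pos_label=1) >= TARGET_RECALL:
        best_threshold = threshold

preds = (proba >= best_threshold).astype(int)
print(f"threshold={best_threshold:.2f} "
      f"recall={recall_score(y_test, preds):.3f} "
      f"precision={precision_score(y_test, preds):.3f}")

# Customers labelled 'normal' but predicted affluent are the upgrade candidates to target.
upgrade_candidates = X_test[(preds == 1) & (y_test.values == 0)]
```

Raising the threshold trades recall for precision, so sweeping it against a recall floor keeps the campaign list as small as possible while still catching most of the hidden affluent customers.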
- Folder Structure:
  - The current folder structure mirrors Kedro's, making it easy for an MLE to port the work into an actual Kedro project.
    - `maybank/conf` contains all the yaml configurations.
    - `maybank/conf/local` is empty as no secret credentials such as S3 keys are being used.
    - `maybank/data` contains all the saved data (raw, processed), as well as the HTML for the EDA analysis.
    - `maybank/docs` contains the Sphinx documentation for the `maybank/src` folder.
    - `maybank/notebooks/main_notebook.ipynb` contains the data processing, EDA and modeling work.
    - `maybank/src/pipelines/main_pipeline.py` runs the entire pipeline end-to-end, up to the inference stage.
    - `maybank/notebooks/analysis.ipynb` contains the inferencing and analysis of results.
    - `maybank/src` contains all the scripts that the notebooks import.
    - `maybank/tests` is empty at the moment; it is for pytest integration, mirroring the `maybank/src` folder.
- Dependency Management:
  - I used `Poetry` for managing project dependencies.
  - It provides a reliable and efficient tool for dependency management.
  - Steps to install dependencies:
    ```bash
    pip install poetry

    # Inside directory of pyproject.toml
    poetry install

    # Optional to work within the virtual env that poetry automatically creates
    # Else inside the notebooks just need to activate the virtual env created, similar to any virtual env
    poetry shell
    ```
- Linting:
  - I have implemented linting with `ruff` and `flake8` to ensure code consistency and quality, handled by `Poetry`. `Ruff` is lightning fast due to its `rust` implementation.
  - ENSURE `Make` is installed first. If not, you can use the bash script.
  - Steps to run lint:
    ```bash
    # Using Make
    make lint

    # Using Bash
    bash lint.sh
    ```
- Documentation:
  - I have set up Sphinx documentation; to see my Sphinx configurations, look under `docs/config.py`.
  - To view the entire interactive HTML documentation, open `docs/html/index.html`.
  - Steps to rerun docs:
    ```bash
    cd docs

    # Using Make
    make clean
    make html
    ```
Here is a quick rundown of the approach I would take to carry this machine learning project further:

- Develop with `Kedro` and `Poetry`
- Integrate `MLflow` for experiment tracking and workflow
- Build a `Dockerfile` to package the entire project
- Set up CI/CD pipelines with `Jenkins`
- Deploy the project to `OpenShift`
- Create a `Django` REST API project to expose my deployed project endpoint
- Integrate backend calls to a front-end for bank users to quickly drop a dataset of ETB customers for inference
- Set up `OpenTelemetry` and integrate it with a UI such as `Kibana` for logging and tracing of deployed model inference API calls
- Develop with Kedro and Poetry:
  - Organize this machine learning project using `Kedro` for project structuring and workflow management.
  - Use `Poetry` for dependency management, ensuring consistent environments across different machines.
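
A minimal sketch of what the Kedro wiring might look like; the node functions here are stubs standing in for the real logic under `maybank/src`, and the dataset names (`raw_customers`, `model_input`, `classifier`) are assumed catalog entries:

```python
import pandas as pd
from kedro.pipeline import Pipeline, node
from sklearn.ensemble import RandomForestClassifier


# Stub node functions standing in for the real logic under maybank/src.
def preprocess_customers(raw_customers: pd.DataFrame) -> pd.DataFrame:
    """Minimal placeholder cleaning step."""
    return raw_customers.dropna()


def train_model(model_input: pd.DataFrame) -> RandomForestClassifier:
    """Placeholder training step; assumes a numeric feature matrix plus an `is_affluent` label."""
    X = model_input.drop(columns="is_affluent")
    y = model_input["is_affluent"]
    return RandomForestClassifier(class_weight="balanced", random_state=42).fit(X, y)


def create_pipeline(**kwargs) -> Pipeline:
    """Training pipeline; `raw_customers`, `model_input` and `classifier` are assumed catalog entries."""
    return Pipeline(
        [
            node(preprocess_customers, inputs="raw_customers", outputs="model_input", name="preprocess"),
            node(train_model, inputs="model_input", outputs="classifier", name="train"),
        ]
    )
```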
- Integrate MLflow for Experiment Tracking and Workflow:
  - Incorporate `MLflow` into this `Kedro` project for experiment tracking, model versioning, and workflow management.
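
A minimal tracking sketch, reusing the fitted `model`, `X_test`, and `y_test` from the earlier modeling example; the experiment and run names are assumptions:

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import recall_score

mlflow.set_experiment("etb-affluent-classifier")  # assumed experiment name

with mlflow.start_run(run_name="random_forest_baseline"):
    # `model`, `X_test`, `y_test` come from the earlier modeling sketch.
    mlflow.log_params({"n_estimators": 300, "class_weight": "balanced"})
    mlflow.log_metric("affluent_recall", recall_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, artifact_path="model")
```

In a Kedro setup, a plugin such as `kedro-mlflow` could handle most of this wiring instead of manual calls.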
- Build a Dockerfile to Package Entire Project:
  - Create a `Dockerfile` to package this `Kedro` project, `MLflow`, and other dependencies into a containerized environment.
- Setting up CI/CD Pipelines with Jenkins:
  - Write a `Jenkinsfile` defining the CI/CD pipeline for this project.
  - Include stages for checking out the source code, building the `Docker` image, running tests and linting, pushing the image to a container registry, and deploying to `OpenShift`.
- Deploy Project to OpenShift:
  - Set up an `OpenShift` cluster and create a project for deploying this machine learning project.
  - Prepare a deployment configuration file (`deployment.yaml`) describing how to deploy this Dockerized project in `OpenShift`.
  - Apply the deployment configuration to the `OpenShift` project.
- Create a Django REST API Project:
  - Develop a `Django` REST API project to expose endpoints for the deployed machine learning model.
  - Implement endpoints for inference, allowing users to send data for predictions.
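
A minimal sketch of such an inference endpoint with Django REST Framework; the view name, model artifact path, and payload shape are assumptions, not the actual deployed service:

```python
# views.py -- minimal inference endpoint sketch
import joblib
import pandas as pd
from rest_framework.decorators import api_view
from rest_framework.response import Response

MODEL = joblib.load("artifacts/affluent_classifier.joblib")  # assumed artifact path


@api_view(["POST"])
def predict_affluent(request):
    """Accept a JSON list of customer records and return affluent probabilities."""
    records = pd.DataFrame(request.data["customers"])
    proba = MODEL.predict_proba(records)[:, 1]
    return Response({"affluent_probability": proba.round(4).tolist()})
```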
- Integrate Backend Call to a Front-End:
  - Develop a front-end interface for bank users to interact with the `Django` REST API.
  - Allow users to upload datasets of ETB customers for quick inference using the deployed machine learning model.
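
How the front-end (or any client script) might call that endpoint; the URL and payload shape are hypothetical:

```python
import pandas as pd
import requests

df = pd.read_csv("etb_customers_to_score.csv")  # dataset dropped in by a bank user
resp = requests.post(
    "https://ml-platform.example.com/api/predict/",  # hypothetical endpoint
    json={"customers": df.to_dict(orient="records")},
    timeout=30,
)
scores = resp.json()["affluent_probability"]
```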
- Set up OpenTelemetry and Integrate with UI such as Kibana:
  - Implement `OpenTelemetry` for logging and tracing of deployed model inference API calls.
  - Integrate `OpenTelemetry` with a UI tool like `Kibana` to visualize and analyze logs and traces, enabling efficient monitoring and debugging.
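
A minimal tracing sketch, assuming an OTLP collector endpoint that forwards spans to Elasticsearch/Kibana; the service name and collector address are assumptions:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Ship spans to an OTLP collector (assumed address); the collector would forward them
# to Elasticsearch so they can be explored in Kibana.
provider = TracerProvider(resource=Resource.create({"service.name": "affluent-inference-api"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317")))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


def predict_with_tracing(model, records):
    """Wrap a model inference call in a span so each API call is traceable end-to-end."""
    with tracer.start_as_current_span("model_inference") as span:
        span.set_attribute("batch_size", len(records))
        return model.predict_proba(records)[:, 1]
```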
- PySpark Decision:
  - The decision not to use `PySpark` for a small dataset, while planning for it at deployment later, is a strategic choice balancing current efficiency with future scalability.
  - There would just be a need to refactor the `Pandas` code to `PySpark`, which is relatively simple.
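
A rough illustration of what that Pandas-to-PySpark refactor might look like for a single aggregation, using the same hypothetical columns as the earlier sketches:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etb-affluent").getOrCreate()

# pandas equivalent: df.groupby("occupation")["account_balance"].mean()
customers = spark.read.csv("data/01_raw/etb_customers.csv", header=True, inferSchema=True)
avg_balance = (
    customers.groupBy("occupation")
    .agg(F.mean("account_balance").alias("avg_account_balance"))
)
avg_balance.show()
```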