A Learning to Rank Project
Note: the business problem, the company and the data are fictitious.
The in-depth Python code explanation is available in this Jupyter Notebook.
Insuricare is an insurance company that provides health insurance to its customers, and it now wants to cross-sell a new vehicle insurance to its clients. To achieve that, Insuricare surveyed around 305 thousand customers who bought health insurance last year, asking each one whether they would be interested in buying the new insurance. The answers were stored in the company's database, alongside other customer features.
Then, the Insuricare Sales Team selected around 76 thousand new customers — people who didn't respond to the survey — to offer the new vehicle insurance to. However, due to a call limit*, Insuricare must choose how to select which clients to call:
- Either select the customers randomly, which is the baseline model previously used by the company.
- Or have the Data Science Team provide, via a Machine Learning (ML) model, an ordered list of these new customers, based on their propensity score of buying the new insurance.
* The Insuricare Sales Team would like to make 20,000 calls, but this number can be pushed to 40,000 calls.
The training data was collected from a PostgreSQL database. The initial feature descriptions are available below:
Feature | Definition |
---|---|
id | Unique ID for the customer |
gender | Gender of the customer |
age | Age of the customer |
region_code | Unique code for the region of the customer |
policy_sales_channel | Anonymized code for the customer outreach channel, i.e. different agents, over mail, over phone, in person, etc. |
driving_license | 0 : Customer does not have DL, 1 : Customer already has DL |
vehicle_age | Age of the Vehicle |
vehicle_damage | Yes : Customer's vehicle was damaged in the past. No : Customer's vehicle wasn't damaged in the past. |
previously_insured | 1 : Customer already has Vehicle Insurance, 0 : Customer doesn't have Vehicle Insurance |
annual_premium | The amount the customer needs to pay as premium per year |
vintage | Number of days the customer has been associated with the company |
response | 1 : Customer is interested in the new insurance, 0 : Customer is not interested in the new insurance |
- Cross-selling is a strategy used to sell products associated with another product already owned by the customer. In this project, health insurance and vehicle insurance are the products.
- Learning to rank is a machine learning application. In this project, customers are ranked in a list from the most likely to buy the new insurance to the least likely. This list is produced by the ML model.
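The ranking idea above can be sketched in a few lines: train a classifier on customers with known responses, score the new customers with `predict_proba`, and sort them by that score. The data and the logistic model below are illustrative stand-ins, not the project's actual pipeline.

```python
# Minimal sketch of ranking by propensity score (synthetic data for illustration).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 3))                                # surveyed customers
y_train = (X_train[:, 0] + rng.normal(size=200) > 0).astype(int)   # 1 = interested

model = LogisticRegression().fit(X_train, y_train)

X_new = rng.normal(size=(10, 3))            # new customers (no survey response)
scores = model.predict_proba(X_new)[:, 1]   # propensity of buying the insurance
ranked = np.argsort(scores)[::-1]           # most likely buyers first
print(ranked, scores[ranked])
```

The sales team would then call customers in the order given by `ranked` until the call limit is reached.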
To provide an ordered list of these new customers based on their propensity score of buying the new insurance, the following steps were performed:
- Understanding the Business Problem: understanding the main objective Insuricare was trying to achieve and planning the solution for it.
- Collecting Data: collecting data from a PostgreSQL database, as well as from Kaggle.
- Data Cleaning: checking data types and NaNs. Other tasks, such as renaming columns, dealing with outliers and changing data types, weren't necessary at this point.
- Feature Engineering: editing the original features so they could be used in the ML model.
- Exploratory Data Analysis (EDA): exploring the data to gain business knowledge, look for useful business insights and find important features for the ML model. The top business insights found are available in Section 5.
- Data Preparation: applying rescaling techniques to the data, as well as encoding methods to deal with categorical variables.
- Feature Selection: selecting the best features to use in the ML model by using Random Forest.
- Machine Learning Modeling: training classification algorithms with cross-validation. The best model was selected to be improved via Bayesian Optimization with Optuna. More information in Section 6.
- Model Evaluation: evaluating the model using two metrics, Precision at K and Recall at K, as well as two curves: the Cumulative Gains Curve and the Lift Curve.
- Results: translating the ML model into financial and business performance.
- Propensity Score List and Model Deployment: providing a full list of the 76 thousand customers sorted by propensity score, as well as a Google Sheets spreadsheet that returns propensity scores and ranks customers (used for future customers). This is the project's Data Science product, and it can be accessed from anywhere. More information in Section 7.
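The feature-selection step listed above can be sketched as follows: fit a Random Forest and rank the features by their impurity-based importances. The feature names and synthetic target here are illustrative assumptions, not the project's real data.

```python
# Hedged sketch of feature selection via Random Forest importances.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "annual_premium": rng.normal(2630, 500, n),
    "vintage": rng.integers(0, 300, n),
    "previously_insured": rng.integers(0, 2, n),
})
# Synthetic target loosely driven by two of the features, for illustration only
y = ((X["previously_insured"] == 0) & (X["age"] > 35)).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)  # features ranked from most to least informative
```

The top-ranked features would then be kept for the ML model.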
- Python 3.10.8, Pandas, Matplotlib, Seaborn and Sklearn.
- SQL and PostgreSQL.
- Jupyter Notebook and VSCode.
- Flask and Python APIs.
- Render Cloud, Google Sheets and JavaScript.
- Git and GitHub.
- Exploratory Data Analysis (EDA).
- Techniques for Feature Selection.
- Classification Algorithms (KNN Classifier, Logistic Regression, Random Forest, AdaBoost, CatBoost, XGBoost and LGBM Classifiers).
- Cross-Validation Methods, Bayesian Optimization with Optuna and Learning to Rank Performance Metrics (Precision at K, Recall at K, Cumulative Gains Curve and Lift Curve).
This was the most fundamental part of the project, since it's the ML modeling that produces the ordered list of new customers based on their propensity score of buying the new insurance. Seven models were trained using cross-validation:
- KNN Classifier
- Logistic Regression
- Random Forest Classifier
- AdaBoost Classifier
- CatBoost Classifier
- XGBoost Classifier
- Light GBM Classifier
The initial performance of all seven algorithms is displayed below:
Model | Precision at K | Recall at K |
---|---|---|
LGBM Classifier | 0.2789 +/- 0.0003 | 0.9329 +/- 0.001 |
AdaBoost Classifier | 0.2783 +/- 0.0007 | 0.9309 +/- 0.0023 |
CatBoost Classifier | 0.2783 +/- 0.0005 | 0.9311 +/- 0.0018 |
XGBoost Classifier | 0.2771 +/- 0.0006 | 0.9270 +/- 0.0022 |
Logistic Regression | 0.2748 +/- 0.0009 | 0.9193 +/- 0.0031 |
Random Forest Classifier | 0.2719 +/- 0.0005 | 0.9096 +/- 0.0016 |
KNN Classifier | 0.2392 +/- 0.0006 | 0.8001 +/- 0.0019 |
K is either 20,000 or 40,000, given our business problem.
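The cross-validated Precision at K evaluation behind the table above can be sketched as below. The dataset, the K value and the scikit-learn gradient-boosting model are illustrative stand-ins for the project's seven algorithms and its 20,000/40,000 cut-offs.

```python
# Hedged sketch: Precision at K under stratified cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

def precision_at_k(y_true, scores, k):
    """Fraction of interested customers among the K highest-scored ones."""
    top_k = np.argsort(scores)[::-1][:k]
    return y_true[top_k].mean()

# Imbalanced synthetic data, roughly mimicking the ~12% interest rate
X, y = make_classification(n_samples=2000, weights=[0.88, 0.12], random_state=1)
k = 100
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

precisions = []
for train_idx, val_idx in cv.split(X, y):
    model = GradientBoostingClassifier(random_state=1).fit(X[train_idx], y[train_idx])
    scores = model.predict_proba(X[val_idx])[:, 1]
    precisions.append(precision_at_k(y[val_idx], scores, k))

print(f"Precision at {k}: {np.mean(precisions):.4f} +/- {np.std(precisions):.4f}")
```

Reporting the mean and standard deviation across folds gives figures in the "value +/- spread" format used in the table.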
The Light GBM Classifier was chosen for hyperparameter tuning, since it's by far the fastest algorithm to train and tune, while also delivering the best results without any tuning.
LGBM's training speed compared to the other ensemble algorithms trained on this dataset:
- 4.7 times faster than CatBoost
- 7.1 times faster than XGBoost
- 30.6 times faster than AdaBoost
- 63.2 times faster than Random Forest
At first glance the models' performances don't look great, which is due to the small number of variables, many of which are categorical or binary, or simply don't carry much information.
However, for this business problem this isn't a major concern, since the goal here isn't finding the best possible prediction of whether a customer will buy the new insurance or not, but creating a score that ranks clients in an ordered list, so that the sales team can contact them in order to sell the new vehicle insurance.
After tuning LGBM's hyperparameters using Bayesian Optimization with Optuna, the model performance improved slightly on both Precision at K and Recall at K:
Model | Precision at K | Recall at K |
---|---|---|
LGBM Classifier | 0.2793 +/- 0.0005 | 0.9344 +/- 0.0017 |
As we're ranking customers in a list, there's no need to look into the more traditional classification metrics, such as accuracy, precision, recall, F1-score, the ROC-AUC curve or the confusion matrix.
Instead, ranking metrics will be used:
- Precision at K: shows the fraction of correct predictions made up to K out of all predictions.
- Recall at K: shows the fraction of correct predictions made up to K out of all true examples.
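These two metrics, as commonly defined, can be implemented in a few lines (the toy scores below are made up for illustration):

```python
# Precision at K and Recall at K for a ranked list of customers.
import numpy as np

def precision_at_k(y_true, scores, k):
    """Interested customers among the top K, divided by K."""
    top_k = np.argsort(scores)[::-1][:k]
    return y_true[top_k].sum() / k

def recall_at_k(y_true, scores, k):
    """Interested customers among the top K, divided by all interested customers."""
    top_k = np.argsort(scores)[::-1][:k]
    return y_true[top_k].sum() / y_true.sum()

# Toy example: 6 customers, 3 of them interested, scored by a model
y_true = np.array([1, 0, 1, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
print(precision_at_k(y_true, scores, 3))  # 2 hits in top 3 -> 0.666...
print(recall_at_k(y_true, scores, 3))     # 2 of 3 interested found -> 0.666...
```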
In addition, two curves can be plotted:
- The Cumulative Gains Curve, indicating what percentage of all customers interested in the new insurance is captured within a given percentage of customers, ordered by propensity score.
- The Lift Curve, which indicates how many times better the ML model is than the baseline model (the model originally used by Insuricare).
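A single point on these two curves can be computed directly: sort customers by score, measure the share of interested customers captured up to a cut-off (gain), and divide by the share that random calling would capture (lift). The synthetic data below is an illustrative assumption.

```python
# Cumulative gain and lift at one cut-off, on synthetic scored customers.
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
y = (rng.random(n) < 0.12).astype(int)       # ~12% interested, as in the project
scores = y * 0.3 + rng.random(n)             # scores correlated with interest

order = np.argsort(scores)[::-1]             # customers ranked best-first
cutoff = int(0.25 * n)                       # e.g. calling the top 25%
gain = y[order][:cutoff].sum() / y.sum()     # cumulative gain = recall at cut-off
lift = gain / 0.25                           # random calling captures 25% here
print(f"gain at 25%: {gain:.2%}, lift: {lift:.2f}x")
```

Sweeping the cut-off from 0% to 100% traces out the full Cumulative Gains and Lift Curves.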
1) By making 20,000 calls, how many interested customers can Insuricare reach with the new model?
- 20,000 calls represent 26.24% of our database. So if the sales team were to make all these calls, Insuricare would be able to contact 71.29% of the customers interested in the new vehicle insurance, since 0.7129 is our recall at 20,000.
- As seen from the Lift Curve, our LGBM model is 2.72 times better than the baseline model at 20,000 calls.
2) Increasing the number of calls to 40,000, how many interested customers can Insuricare reach with the new model?
- 40,000 calls represent 52.48% of our database. So if the sales team were to make all these calls, Insuricare would be able to contact 99.48% of the customers interested in the new vehicle insurance, since 0.9948 is our recall at 40,000.
- At 40,000 calls, our LGBM model is around 1.89 times better than the baseline model.
To explore the expected financial results of our model, let's consider a few assumptions:
- The customer database to be contacted is composed of 76,222 clients.
- We expect 12.28% of these customers to be interested in the new vehicle insurance, since that was the percentage of interested respondents in the Insuricare survey.
- The annual premium for each of these new vehicle insurance customers will be US$ 2,630. *
* The annual premium of US$ 2,630 was chosen for realism, since it's the lowest and most common value in the dataset.
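The revenue figures in the table below follow directly from these assumptions: interested clients reached multiplied by the US$ 2,630 annual premium. A quick check of the arithmetic:

```python
# Expected annual revenue = interested clients reached x annual premium.
annual_premium = 2630  # US$, per the assumption above

reached = {
    "LGBM, 20k calls": 6660,
    "LGBM, 40k calls": 9293,
    "Baseline, 20k calls": 2451,
    "Baseline, 40k calls": 4903,
}
revenue = {name: clients * annual_premium for name, clients in reached.items()}
print(revenue)  # {'LGBM, 20k calls': 17515800, 'LGBM, 40k calls': 24440590, ...}
```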
The expected financial results and comparisons are shown below:
Model | Annual Revenue - 20,000 calls | Annual Revenue - 40,000 calls | Interested clients reached - 20,000 calls | Interested clients reached - 40,000 calls |
---|---|---|---|---|
LGBM | US$ 17,515,800.00 | US$ 24,440,590.00 | 6660 | 9293 |
Baseline | US$ 6,446,130.00 | US$ 12,894,890.00 | 2451 | 4903 |
Difference | US$ 11,069,670.00 | US$ 11,545,700.00 | 4209 | 4390 |
As seen above, the LGBM model provides much better results than the baseline model, with an annual financial result around 172% better for 20,000 calls and 89% better for 40,000 calls, which matches what was shown in the Lift Curve.
The full list sorted by propensity score is available for download here. However, for new future customers it was necessary to deploy the model; in this project, Google Sheets and Render Cloud were chosen for that purpose. The idea is to make predictions easily accessible for any new data, since they can be checked from anywhere and on any electronic device, as long as an internet connection is available. The spreadsheet returns the sorted propensity score for each client in the requested dataset; all you have to do is click on the "Propensity Score" button and then on "Get Prediction".
Because the deployment was made on a free cloud (Render), it could take a few minutes for the spreadsheet to respond to the first request. Subsequent requests should respond instantly.
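The service behind the spreadsheet can be sketched as a small Flask API: it receives customer records as JSON, scores them, and returns them sorted by propensity. The `/predict` route name and the placeholder scoring rule are assumptions for illustration; the real service would load the trained LGBM pipeline instead.

```python
# Hedged sketch of a Flask prediction endpoint like the one deployed on Render.
import pandas as pd
from flask import Flask, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    records = request.get_json()          # list of customer dicts from the sheet
    df = pd.DataFrame(records)
    # Placeholder propensity: the real service would call the trained LGBM
    # pipeline here (load model, prepare features, predict_proba).
    df["score"] = df["annual_premium"].rank(pct=True)
    df = df.sort_values("score", ascending=False)
    return df.to_json(orient="records"), 200, {"Content-Type": "application/json"}

# To serve locally: app.run(host="0.0.0.0", port=5000)
```

Google Sheets then calls this endpoint via Apps Script (JavaScript) and writes the returned scores back into the spreadsheet.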
In this project the main objective was accomplished:
We provided a list of new customers ordered by their buying propensity score, as well as a spreadsheet that returns this score for future customers. The Sales Team can now focus their attention on the first 20,000 or 40,000 customers on the list, and in the future on the top K customers of a new list. In addition, three interesting and useful insights were found through Exploratory Data Analysis (EDA), along with the Expected Financial Results, so they can be properly used by Insuricare.
Going forward, this solution could be improved with a few strategies:
- Conducting more market research, so that more useful information on customers can be collected, since there was a lack of meaningful variables.
- Applying Principal Component Analysis (PCA) to the dataset.
- Trying other classification algorithms that could better capture the phenomenon.