This project was put together as part of an introductory workshop series to demonstrate an example of a data science side project.
TL;DR: The goal was to predict, based on customer purchase history, whether a customer would default on their credit. We selected an LGBM model, which achieved a score of 0.491 on the competition metric and an F1 score of 0.88 on the test set.
The project is organized as follows:
- Background on problem space
- Business problem (why is this important to businesses?)
- Data science problem (background on problem space from a data scientist's perspective)
- Data
- Where we got the data from
- Evaluation metrics
- EDA takeaways
- Feature Engineering process
- Models
- Training different models
- Model Analysis
- Performance on different splits
- Model Interpretability
- Feature importance
- Limitations
To quote Amex themselves, "credit default prediction is central to managing risk in a consumer lending business. Credit default prediction allows lenders to optimize lending decisions, which leads to a better customer experience and sound business economics." (Amex, Kaggle competition overview)
Thus we have a very well-defined business problem that is of utmost importance to all credit card companies.
From a data scientist's perspective, this is a classification problem that deals with imbalanced data, since most people do, in fact, pay off their credit cards.
The data was released for Amex's Kaggle challenge, but at first release it was a monstrous 50GB. Fortunately, the community got together to create a post-processed dataset that converted floats to integers wherever this could be done without information loss. This, along with storing the data in Parquet format, brought it down to a manageable ~5GB.
Here is a link to the post-processed data: https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format
An important note about this dataset is that we are never told exactly what each variable represents, only the category it falls into, i.e.:
- D_* = Delinquency variables
- S_* = Spend variables
- P_* = Payment variables
- B_* = Balance variables
- R_* = Risk variables
The evaluation metric used for this competition was the mean of two other metrics: the normalized Gini coefficient and the default rate captured at 4%. The normalized Gini coefficient is best explained with an image. Suppose we have the following graph:
If the orange area is "A" and the blue area is "B", the Gini coefficient is calculated as A / (A + B).
Note: the Gini coefficient is related to AUC (Area Under the ROC Curve) and can be calculated as Gini = 2 * AUC - 1.
The default rate captured at 4%, on the other hand, refers to the percentage of positive labels captured within the highest-ranked 4% of predictions.
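To make the metric concrete, here is a simplified sketch in Python. It assumes `y_true` and `y_pred` are 1-D NumPy arrays, and it omits the negative-class weighting used in the official Kaggle implementation, so its scores will not match the leaderboard exactly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def amex_metric_simplified(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean of the normalized Gini coefficient and the default rate
    captured at 4% (simplified: no negative-class weighting)."""
    # Normalized Gini via its relationship to AUC: Gini = 2 * AUC - 1
    gini = 2 * roc_auc_score(y_true, y_pred) - 1

    # Fraction of all positives that fall in the top 4% of predictions
    order = np.argsort(y_pred)[::-1]
    top_n = int(0.04 * len(y_pred))
    captured = y_true[order][:top_n].sum() / y_true.sum()

    return 0.5 * (gini + captured)
```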
It is always good to start by looking at null values. While we have 189 variables (not counting the target and customer ID), 67 of them contain null values.
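A minimal sketch of how to get that overview with pandas, assuming the post-processed Parquet file has been read into a DataFrame called `train` (the file name below is a placeholder):

```python
import pandas as pd

train = pd.read_parquet("train.parquet")  # placeholder path

feature_cols = train.columns.drop(["customer_ID", "target"], errors="ignore")
null_counts = train[feature_cols].isna().sum()
print(f"{(null_counts > 0).sum()} of {len(feature_cols)} features contain nulls")
```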
Correlations
Looking at the correlation of each feature with the target, we get the following plot:
Most features are not significantly correlated with the target. However, many features are correlated with each other.
This tells us we may want to consider removing some of these to reduce the dimensionality and improve training speed and model performance.
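The write-up doesn't specify an exact procedure, but one common approach is to drop one feature from every highly correlated pair; a sketch, where the 0.9 threshold is an arbitrary assumption:

```python
import numpy as np

# Absolute correlation matrix of the numeric features
corr = train[feature_cols].select_dtypes("number").corr().abs()

# Keep the upper triangle so each pair is considered only once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one member of every pair with correlation above the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
train_reduced = train.drop(columns=to_drop)
```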
Categorical variables. Exploring the categorical variables, note that most of them fall under the Delinquency category, which suggests they are likely quite important.
Numerical variables. There are a lot of numeric variables, so to examine them we look at summary statistics, in particular the variance of each variable.
Note that almost all variables have a very small variance; in fact, only 9 of the numeric variables have a variance greater than 1.
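A quick check of that claim, reusing the `train` DataFrame and `feature_cols` from the earlier sketch:

```python
variances = train[feature_cols].select_dtypes("number").var()
print((variances > 1).sum(), "numeric features have a variance greater than 1")
```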
The features have already been standardized. However, since we're dealing with data at the customer level, we potentially have multiple rows for each customer and we need to aggregate their data in some way. We'll aggregate numerical features by calculating their mean and categorical features by count.
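A minimal sketch of that aggregation with pandas. The categorical column list below is taken from the competition's data description; everything else (besides the IDs and the statement date) is treated as numeric:

```python
# Categorical columns as listed in the competition's data description
cat_cols = ["B_30", "B_38", "D_114", "D_116", "D_117", "D_120",
            "D_126", "D_63", "D_64", "D_66", "D_68"]
# Exclude IDs, the target, and S_2 (the statement date column)
num_cols = [c for c in train.columns
            if c not in cat_cols + ["customer_ID", "target", "S_2"]]

# One row per customer: mean of numeric features, count of categorical ones
agg = train.groupby("customer_ID").agg(
    {**{c: "mean" for c in num_cols}, **{c: "count" for c in cat_cols}}
)
```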
Using LazyClassifier (from the lazypredict package), we tried 24 different models. Here are the results.
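The LazyPredict call looks roughly like this (a sketch; `labels` is a hypothetical Series of per-customer targets aligned with `agg`'s index):

```python
from lazypredict.Supervised import LazyClassifier
from sklearn.model_selection import train_test_split

X, y = agg, labels  # aggregated features and per-customer targets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = LazyClassifier(verbose=0, ignore_warnings=True)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
print(models)  # leaderboard of classifiers with accuracy, F1, etc.
```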
Note, however, that this challenge uses a custom metric, so we re-tested the top 2 classifiers with the correct metric. LGBM had a custom metric score of 0.4910 and XGBoost had a score of 0.4850, so we chose LGBM.
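The re-test can be sketched as follows, reusing the `amex_metric_simplified` helper from earlier (default hyperparameters are an assumption):

```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

for name, model in [("LGBM", LGBMClassifier()), ("XGB", XGBClassifier())]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    print(name, amex_metric_simplified(y_test.to_numpy(), proba))
```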
Performance on different splits is typically a "model fairness" consideration, but since the features are anonymized we don't have any meaningful splits to compare. Instead, we analyze performance across the ground-truth labels by examining the confusion matrix.
Confusion matrix:
| Actual \ Predicted | Negative | Positive |
| --- | --- | --- |
| Negative | 12425 | 1120 |
| Positive | 1139 | 3318 |
So, we see the model had its struggles, failing to identify 1139 actual defaulters in the test set (false negatives). This has profound business implications: these are customers who would be extended credit despite going on to not pay back their loans.
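For reference, the matrix can be produced with scikit-learn (a sketch, assuming `model` is the fitted LGBM classifier from above):

```python
from sklearn.metrics import confusion_matrix

y_hat = model.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_hat).ravel()
print(f"TN={tn}  FP={fp}  FN={fn}  TP={tp}")
```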
Fortunately, LightGBM is a gradient-boosted tree ensemble, which works well with standard interpretability tools such as feature importance and SHAP.
Examining model with SHAP, we see the following graph:
We have only one feature that contributes hugely to the prediction of our model!
Examining this feature in a dependence plot:
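Both plots come from the shap package; a minimal sketch, assuming `model` is the fitted LGBM classifier and using "P_2" as a placeholder for the dominant feature's name:

```python
import shap

# TreeExplainer supports tree ensembles such as LightGBM
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
if isinstance(shap_values, list):  # some shap versions return one array per class
    shap_values = shap_values[1]

# Summary (beeswarm) plot of overall feature impact
shap.summary_plot(shap_values, X_test)

# Dependence plot for the dominant feature ("P_2" is a placeholder name)
shap.dependence_plot("P_2", shap_values, X_test)
```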
Examining Permutation Importance as well, we have the following plot:
As we saw earlier, there is one main feature that dominates the model's predictions.
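Permutation importance is available in scikit-learn; a quick sketch (the roc_auc scoring choice is an assumption):

```python
import pandas as pd
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=5, random_state=42
)
importances = pd.Series(result.importances_mean, index=X_test.columns)
print(importances.sort_values(ascending=False).head(10))  # top 10 features
```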
This project was limited by the fact that we only tried one way of aggregating the features for each customer, which is certainly suboptimal. For example, imagine a customer whose balance climbed sharply over their most recent statements: averaging across all of their statements would wash out that trend, even though it is likely very predictive of default. Aggregations such as the most recent value, the min/max, or the standard deviation could capture this kind of information.