Both borrowers and lenders view loans as a financial commitment. Successful loans enable borrowers to fulfill lifelong aspirations, while unsuccessful ones pose challenges for everyone involved. Our aim is to develop a machine learning model that empowers lenders to make wiser business decisions while ensuring borrowers avoid enduring financial hardships down the road.
This repository contains the code and documentation for our project, which focuses on predicting loan default risk using machine learning techniques.
By evaluating a borrower's capacity to repay, our tool helps lenders mitigate risk and fosters successful lending outcomes for both parties. To predict the likelihood of loan default, we leverage historical loan data and employ classification algorithms such as Logistic Regression and Decision Trees to build a predictive model.
If you would like to look at our in-depth analysis, please refer to `test-loan defaulter.ipynb`.
To run the project locally, follow these steps:
- Install Python 3.x.
- Install Anaconda Navigator.
- Clone this repository.
- Install the required Python libraries with `pip install -r requirements.txt`.
- **Data Preparation:** Preprocess the loan datasets to handle missing values and outliers, and perform feature engineering tasks such as encoding categorical variables, scaling numerical features, and creating new features (see the sketch after this list).
- **Model Training:** Train machine learning models on the preprocessed datasets, experimenting with different algorithms and feature combinations. Evaluate model performance using appropriate metrics.
- **Model Interpretation:** Interpret model predictions, analyze feature importance, and assess model biases and limitations. Ensure transparency and accountability in model decision-making processes.
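The following is a minimal sketch of these three steps using scikit-learn. It is illustrative rather than the repository's exact code: the target column name `loan_status` is an assumption, feature columns are inferred from their dtypes, and the hyperparameters are placeholders.

```python
# A hedged, end-to-end sketch of the pipeline described above.
# Assumptions (not taken from the repository): the target column is named
# "loan_status", and feature columns are inferred from their dtypes.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Step 1: Data preparation -- load the data and split features from the target.
df = pd.read_csv("test-loan defaulter data.csv")
X = df.drop(columns=["loan_status"])  # hypothetical target column
y = df["loan_status"]

numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# Impute missing values, scale numeric features, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore",
                                               sparse_output=False))]),
     categorical_cols),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Step 2: Model training -- fit each candidate classifier and report metrics.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}
fitted = {}
for name, model in models.items():
    clf = Pipeline([("prep", preprocess), ("model", model)])
    clf.fit(X_train, y_train)
    fitted[name] = clf
    print(name)
    print(classification_report(y_test, clf.predict(X_test)))

# Step 3: Model interpretation -- feature importances from the fitted tree.
tree = fitted["decision_tree"]
feature_names = tree.named_steps["prep"].get_feature_names_out()
importances = pd.Series(tree.named_steps["model"].feature_importances_,
                        index=feature_names).sort_values(ascending=False)
print(importances.head(10))
```

In practice, the preprocessing choices (imputation strategy, scaling, encoding) and model hyperparameters should be tuned to the dataset; the values above are only reasonable defaults.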
The dataset used in this project can be found in `test-loan defaulter data.csv`.
In our evaluation, the Logistic Regression and Decision Tree models demonstrate balanced performance across all metrics, positioning them as strong contenders. The Random Forest model follows closely behind as a reliable choice. Naive Bayes remains competitive but slightly less consistent, while Gradient Boosting shows lower overall effectiveness in this evaluation. Ultimately, selecting the best model depends on the specific priorities of the task, whether the focus is on precision, recall, accuracy, or a combination of these metrics.
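For reference, a comparison like the one summarized above can be produced with cross-validation. This sketch reuses `preprocess`, `X`, and `y` from the pipeline sketch earlier in this README; the candidate model list matches the models named above, but the metric set and hyperparameters are assumptions, not the notebook's exact setup.

```python
# A hedged sketch of the multi-model, multi-metric comparison; run after the
# earlier sketch so that preprocess, X, and y are defined.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "naive_bayes": GaussianNB(),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}
# Score every candidate on accuracy, precision, and recall with 5-fold CV.
for name, model in candidates.items():
    clf = Pipeline([("prep", preprocess), ("model", model)])
    scores = cross_validate(clf, X, y, cv=5,
                            scoring=["accuracy", "precision", "recall"])
    print(name, {k: round(v.mean(), 3) for k, v in scores.items()
                 if k.startswith("test_")})
```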