Welcome to my loan classification project! This repository contains my work on predicting loan approval outcomes using various machine learning techniques. This project is based on a dataset from Kaggle.
In this project, I aim to predict whether a loan application will be approved based on a set of features provided in the dataset. This involves several steps including data cleaning, exploratory data analysis, feature engineering, model building, and evaluation.
The dataset consists of various features related to loan applicants. Here's a quick rundown of the files:
- loan.csv: The main dataset containing applicant information and loan approval status.
- Data_Dictionary.xlsx: An Excel file that provides detailed descriptions of the features in the dataset.
To get started with the project, you'll need to set up your Python environment with the required libraries. Here's how you can do it:
Clone the repository:
git clone https://github.com/dvelkow/loan_prediction_classification.git
cd loan_prediction_classification

Install the required libraries:
pip install pandas numpy matplotlib seaborn scikit-learn xgboost
Place the dataset files (loan.csv and Data_Dictionary.xlsx) in the project directory. Then open and run the Jupyter notebook.
jupyter notebook Loan_Classification_Solution.ipynb
In the notebook, you'll find all the steps for data preprocessing, model training, and evaluation.
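The notebook presumably begins by loading the two data files; a minimal sketch of that step (only the file names come from this README, and reading the .xlsx file requires the openpyxl package):

```python
import pandas as pd

# File names as listed above; everything else here is illustrative
loans = pd.read_csv("loan.csv")
data_dict = pd.read_excel("Data_Dictionary.xlsx")

print(loans.shape)
print(loans.head())
```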
Here's a brief overview of my approach to solving the loan classification problem:
Data Preprocessing:
- Checked for and handled missing values appropriately.
- Encoded categorical variables using one-hot encoding.
- Scaled numerical features to ensure uniformity.
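To illustrate, here is a minimal sketch of these steps with pandas and scikit-learn (the fill strategy and column handling are illustrative assumptions, not necessarily what the notebook does):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("loan.csv")

# Handle missing values: fill numeric columns with the median,
# categorical columns with the mode (illustrative strategy only)
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# One-hot encode the categorical variables
# (in practice the target column would be set aside before encoding/scaling)
df = pd.get_dummies(df, columns=list(cat_cols), drop_first=True)

# Scale the numerical features so they share a common range
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```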
Exploratory Data Analysis:
- Visualized the distribution of various features.
- Identified correlations and relationships between features.
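A minimal plotting sketch along these lines, using seaborn (the "LoanAmount" column name is a placeholder, not guaranteed to exist in loan.csv):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("loan.csv")

# Distribution of a single numeric feature ("LoanAmount" is a placeholder name)
sns.histplot(df["LoanAmount"], kde=True)
plt.title("Loan amount distribution")
plt.show()

# Correlation heatmap over the numeric features
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, cmap="coolwarm")
plt.title("Feature correlations")
plt.show()
```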
Feature Engineering:
- Created new features to improve model performance.
- Selected the most relevant features based on exploratory analysis.
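For illustration, a derived feature and feature subset might look as follows (all column names here are hypothetical placeholders, not necessarily those in loan.csv):

```python
import pandas as pd

df = pd.read_csv("loan.csv")

# Derived feature: loan amount relative to applicant income
# (both column names are hypothetical placeholders)
df["loan_to_income"] = df["LoanAmount"] / (df["ApplicantIncome"] + 1)

# Narrow down to a subset of columns chosen during the exploratory analysis (illustrative list)
selected_features = ["loan_to_income", "Credit_History", "LoanAmount"]
X = df[selected_features]
```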
Model Building:
- Experimented with several algorithms including RandomForestClassifier.
- Performed hyperparameter tuning using GridSearchCV.
- Used cross-validation to ensure model stability.
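A compact sketch of such a setup, continuing from a preprocessed feature DataFrame `df` as above (the `Loan_Status` target name and the parameter grid are assumptions for illustration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# "Loan_Status" is a placeholder for the approval label column
X = df.drop(columns=["Loan_Status"])
y = df["Loan_Status"]
if y.dtype == object:  # encode text labels (e.g. "Y"/"N") as integers
    y = y.astype("category").cat.codes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 5-fold cross-validated grid search over an illustrative parameter grid
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="roc_auc",
    n_jobs=-1,
)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_
print("Best parameters:", grid.best_params_)
```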
Model Evaluation:
- Evaluated models using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC.
- Analyzed the confusion matrix and ROC curves to assess performance.
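A minimal evaluation sketch, continuing from the fitted `best_model` and the test split in the previous snippet:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix,
)

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]  # probability of the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))
print("ROC-AUC  :", roc_auc_score(y_test, y_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```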
The final model achieved an accuracy of 92% on the test dataset. For more detailed results and insights, please refer to the Jupyter notebook.