This repository contains code for cleaning and preprocessing a loan dataset with Python and pandas. The dataset is loaded into a pandas DataFrame, and a series of cleaning steps addresses duplicates, inconsistent values, missing values, and outliers so that the data is suitable for analysis.
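As a rough illustration of the loading step, the sketch below reads a small in-memory CSV into a DataFrame and inspects it. The sample rows are made up, and the real file name is not given here, so `loan_data.csv` in the comment is only a hypothetical placeholder.

```python
import io

import pandas as pd

# Made-up sample standing in for the real file. With a file on disk this
# would be something like pd.read_csv("loan_data.csv") (hypothetical name).
csv_text = """UID,Income,Age
A1,5000,34
A2,4200,-5
"""

df = pd.read_csv(io.StringIO(csv_text))

# Quick look at the size, column types, and non-null counts.
print(df.shape)
df.info()
```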
The loan dataset has 13 columns: 'UID', 'Marital_status', 'Dependents', 'Is_graduate', 'Income', 'Loan_amount', 'Term_months', 'Credit_score', 'approval_status', 'Age', 'Sex', 'Purpose', and 'Hobby'.
- UID: Unique identifier for each loan application.
- Marital_status: Marital status of the loan applicant (e.g., married, single, divorced).
- Dependents: Number of dependents the applicant has.
- Is_graduate: Indicates whether the applicant is a graduate (e.g., yes or no).
- Income: The income of the loan applicant.
- Loan_amount: The amount of the loan requested by the applicant.
- Term_months: The term or duration of the loan in months.
- Credit_score: The credit score of the applicant.
- approval_status: Indicates whether the loan was approved or not.
- Age: Age of the loan applicant.
- Sex: Gender of the loan applicant.
- Purpose: Purpose of the loan (e.g., home purchase, education, business).
- Hobby: Hobby of the loan applicant.
Duplicate rows are identified and removed from the dataset, both based on the entire row and specifically on the 'UID' column.
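A minimal sketch of the two de-duplication passes, using a made-up two-column sample. Keeping the first occurrence of each repeated 'UID' is an assumption; the project may keep a different one.

```python
import pandas as pd

# Made-up sample: one fully duplicated row (A1) and one repeated UID (A2).
df = pd.DataFrame({
    "UID": ["A1", "A1", "A2", "A2", "A3"],
    "Income": [5000, 5000, 4200, 4300, 3900],
})

# Pass 1: drop rows that are identical across every column.
df = df.drop_duplicates()

# Pass 2: drop rows that repeat a UID, keeping the first occurrence
# (an assumption for this sketch).
df = df.drop_duplicates(subset="UID", keep="first").reset_index(drop=True)

print(df)
```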
String values in the 'Marital_status' and 'Sex' columns are standardized by converting them to uppercase. Additionally, 'M' and 'F' values in the 'Sex' column are replaced with 'Male' and 'Female', respectively.
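The standardization described above could look roughly like this. The sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample with inconsistent casing.
df = pd.DataFrame({
    "Marital_status": ["married", "Single", "DIVORCED"],
    "Sex": ["m", "F", "M"],
})

# Convert both columns to uppercase.
for col in ["Marital_status", "Sex"]:
    df[col] = df[col].str.upper()

# Expand the single-letter codes in 'Sex' (exact-match replacement).
df["Sex"] = df["Sex"].replace({"M": "Male", "F": "Female"})

print(df)
```

Because the replacement matches whole values only, an entry already spelled out, such as "male", would become "MALE" after uppercasing and would not be changed by this mapping.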
Negative values in the 'Age' column are replaced with a minimum valid age of 20.
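One way to express this replacement is a boolean mask with `.loc`. The sample ages and the constant name `MIN_VALID_AGE` are illustrative assumptions.

```python
import pandas as pd

MIN_VALID_AGE = 20  # hypothetical constant name; the value comes from the text

# Made-up sample containing two invalid (negative) ages.
df = pd.DataFrame({"Age": [34, -5, 51, -27]})

# Overwrite only the rows where Age is negative.
df.loc[df["Age"] < 0, "Age"] = MIN_VALID_AGE

print(df["Age"].tolist())
```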
Missing values in the dataset are identified and addressed:
- Missing values in the 'Loan_amount', 'Term_months', and 'Age' columns are filled with each column's mean.
- Missing values in the 'Is_graduate' column are filled with 'Graduate'.
- Rows with any remaining missing values are dropped.
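The missing-value steps above can be sketched as follows. The sample data, including the 'Not Graduate' label and the gap in 'Credit_score', is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample with one gap per column.
df = pd.DataFrame({
    "Loan_amount": [100.0, np.nan, 300.0, 200.0],
    "Term_months": [12.0, 24.0, np.nan, 36.0],
    "Age": [30.0, 40.0, 50.0, np.nan],
    "Is_graduate": ["Graduate", None, "Not Graduate", "Graduate"],
    "Credit_score": [700.0, 650.0, 720.0, np.nan],
})

# Identify: count missing values per column.
print(df.isna().sum())

# Fill the numeric columns with their own mean.
for col in ["Loan_amount", "Term_months", "Age"]:
    df[col] = df[col].fillna(df[col].mean())

# Fill the categorical column with a fixed label.
df["Is_graduate"] = df["Is_graduate"].fillna("Graduate")

# Drop any row that still has a missing value (here, the Credit_score gap).
df = df.dropna().reset_index(drop=True)
```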
Categorical data types are assigned to the 'Marital_status', 'Sex', and 'Is_graduate' columns. The 'Income' column is converted to the float data type.
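A short sketch of the type conversions with `astype`, on a made-up sample where 'Income' starts out as integers:

```python
import pandas as pd

# Made-up sample; Income is stored as integers before conversion.
df = pd.DataFrame({
    "Marital_status": ["MARRIED", "SINGLE", "MARRIED"],
    "Sex": ["Male", "Female", "Female"],
    "Is_graduate": ["Graduate", "Graduate", "Not Graduate"],
    "Income": [5000, 4200, 3900],
})

# Store the low-cardinality text columns as pandas categoricals.
for col in ["Marital_status", "Sex", "Is_graduate"]:
    df[col] = df[col].astype("category")

# Convert Income to floating point.
df["Income"] = df["Income"].astype(float)

print(df.dtypes)
```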
Outliers in the 'Income' column are capped: values below the 10th percentile are raised to it, and values above the 90th percentile are lowered to it.
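Percentile capping of this kind is often written with `Series.quantile` and `Series.clip`. The sketch below assumes that approach and uses made-up income values with one extreme entry.

```python
import pandas as pd

# Made-up sample with one extreme income at the top.
df = pd.DataFrame({
    "Income": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0,
               7000.0, 8000.0, 9000.0, 10000.0, 100000.0],
})

# Compute the caps from the data itself.
lower = df["Income"].quantile(0.10)
upper = df["Income"].quantile(0.90)

# Values outside [lower, upper] are pulled to the nearest bound.
df["Income"] = df["Income"].clip(lower=lower, upper=upper)

print(lower, upper)
print(df["Income"].tolist())
```

Capping keeps every row and changes only the extreme values, which is the main difference from dropping outlier rows.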
The 'Hobby' column is dropped from the dataset.
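Dropping a column is a one-liner in pandas. The sample rows here are made up.

```python
import pandas as pd

# Made-up sample with the column to be removed.
df = pd.DataFrame({
    "UID": ["A1", "A2"],
    "Hobby": ["chess", "hiking"],
})

# Remove the 'Hobby' column and keep everything else.
df = df.drop(columns=["Hobby"])

print(df.columns.tolist())
```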
The dataset cleaning process ensures that the data is consistent, free of duplicates, and suitable for further analysis. The cleaned dataset is ready for exploration and modeling in subsequent steps.
Colab Link: You can use the following link to view and comment on the project: