durgeshgowdac/loan_data_processing

Loan Dataset Cleaning

This repository contains code for cleaning and preprocessing a loan dataset using Python and pandas. The dataset is loaded into a pandas DataFrame, and various data cleaning operations are performed to ensure the dataset is suitable for analysis.

Table of Contents

  1. Introduction
  2. Dataset Overview
  3. Data Cleaning Operations
    1. Dropping Duplicates
    2. Data Standardization
    3. Handling Incorrect Records
    4. Handling Missing Values
    5. Converting Data Types
    6. Outliers
    7. Dropping Irrelevant Columns
  4. Requirements
  5. Conclusion

Introduction

This project focuses on cleaning a loan dataset to prepare it for analysis. The dataset is loaded using pandas, and various data cleaning techniques are applied to address issues such as duplicates, inconsistent data, missing values, and outliers.

Dataset Overview

The loan dataset consists of 13 columns, including 'UID', 'Marital_status', 'Dependents', 'Is_graduate', 'Income', 'Loan_amount', 'Term_months', 'Credit_score', 'approval_status', 'Age', 'Sex', 'Purpose', and 'Hobby'.

  1. UID: Unique identifier for each loan application.
  2. Marital_status: Marital status of the loan applicant (e.g., married, single, divorced).
  3. Dependents: Number of dependents the applicant has.
  4. Is_graduate: Indicates whether the applicant is a graduate (e.g., yes or no).
  5. Income: The income of the loan applicant.
  6. Loan_amount: The amount of the loan requested by the applicant.
  7. Term_months: The term or duration of the loan in months.
  8. Credit_score: The credit score of the applicant.
  9. Approval_status: Indicates whether the loan was approved or not.
  10. Age: Age of the loan applicant.
  11. Sex: Gender of the loan applicant.
  12. Purpose: Purpose of the loan (e.g., home purchase, education, business).
  13. Hobby: Hobby of the loan applicant.
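As a minimal sketch of the loading step, the snippet below reads a CSV into a DataFrame and checks that the 13 columns listed above are present. The file name in the comment, the in-memory sample row, and the `EXPECTED_COLUMNS` helper are assumptions made for illustration; they are not taken from the project code.

```python
import io

import pandas as pd

# Column names as listed in the overview above.
EXPECTED_COLUMNS = [
    "UID", "Marital_status", "Dependents", "Is_graduate", "Income",
    "Loan_amount", "Term_months", "Credit_score", "approval_status",
    "Age", "Sex", "Purpose", "Hobby",
]

# Hypothetical one-row sample standing in for the real file, which would
# normally be read with something like pd.read_csv("loan_data.csv").
csv_text = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    "U001,married,2,yes,5000,120000,360,720,approved,35,M,home purchase,chess\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# Report any expected column that is absent from the loaded data.
missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
print(df.shape, "missing columns:", missing)
```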

Data Cleaning Operations

1. Dropping Duplicates

Duplicate rows are identified and removed from the dataset. This covers both rows that are identical across every column and rows that repeat a value in the 'UID' column.
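The deduplication step might look like the following sketch. The sample rows are made up, and keeping the first occurrence of each 'UID' is an assumption, since the README does not say which duplicate is retained.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset is not shown in this README.
df = pd.DataFrame({
    "UID": ["A1", "A1", "A2", "A2", "A3"],
    "Income": [5000, 5000, 4200, 4300, 6100],
})

# Remove rows that are identical across every column.
df = df.drop_duplicates()

# Remove rows that repeat a UID, keeping the first occurrence (an assumption).
df = df.drop_duplicates(subset="UID", keep="first").reset_index(drop=True)

print(df)
```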

2. Data Standardization

String values in the 'Marital_status' and 'Sex' columns are standardized by converting them to uppercase. Additionally, 'M' and 'F' values in the 'Sex' column are replaced with 'Male' and 'Female', respectively.
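A minimal sketch of this standardization, assuming the uppercase conversion runs before the 'M'/'F' replacement (the README does not state the order). The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample values with inconsistent casing.
df = pd.DataFrame({
    "Marital_status": ["married", "Single", "DIVORCED"],
    "Sex": ["m", "F", "M"],
})

# Convert both string columns to uppercase.
for col in ["Marital_status", "Sex"]:
    df[col] = df[col].str.upper()

# Expand the single-letter codes in 'Sex' to full labels.
df["Sex"] = df["Sex"].replace({"M": "Male", "F": "Female"})

print(df)
```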

3. Handling Incorrect Records

Negative values in the 'Age' column are replaced with a minimum valid age of 20.
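One way to express this correction is a boolean mask with `.loc`, as sketched below. The sample ages and the `MIN_VALID_AGE` constant name are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample ages, including two invalid negative entries.
df = pd.DataFrame({"Age": [34, -5, 52, -41]})

# Minimum valid age stated in the README.
MIN_VALID_AGE = 20

# Overwrite only the rows where Age is negative.
df.loc[df["Age"] < 0, "Age"] = MIN_VALID_AGE

print(df)
```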

4. Handling Missing Values

Missing values in the dataset are identified and addressed:

  • Missing values in the 'Loan_amount', 'Term_months', and 'Age' columns are filled with each column's mean.
  • Missing values in the 'Is_graduate' column are filled with 'Graduate'.
  • Rows with any remaining missing values are dropped.
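The three steps above can be sketched as follows. The sample data is invented, and the 'Not Graduate' label is a hypothetical value used only to show that existing entries are left unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one gap per column.
df = pd.DataFrame({
    "Loan_amount": [100.0, np.nan, 300.0, 200.0],
    "Term_months": [12.0, 24.0, np.nan, 36.0],
    "Age": [30.0, 40.0, 50.0, np.nan],
    "Is_graduate": ["Graduate", None, "Not Graduate", "Graduate"],
    "Income": [5000.0, 4000.0, 3000.0, np.nan],
})

# Count missing values per column before filling.
print(df.isna().sum())

# Fill the numeric columns with their own mean.
for col in ["Loan_amount", "Term_months", "Age"]:
    df[col] = df[col].fillna(df[col].mean())

# Fill missing graduate status with 'Graduate'.
df["Is_graduate"] = df["Is_graduate"].fillna("Graduate")

# Drop any row that still has a missing value (here, the row without Income).
df = df.dropna().reset_index(drop=True)

print(df)
```

Note that mean imputation is sensitive to extreme values, so the order of this step relative to outlier handling can change the filled values.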

5. Converting Data Types

Categorical data types are assigned to the 'Marital_status', 'Sex', and 'Is_graduate' columns. The 'Income' column is converted to the float data type.
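A short sketch of these conversions using `astype`. The sample values are hypothetical, and the example assumes 'Income' starts out as integers; the README does not say what type it has before conversion.

```python
import pandas as pd

# Hypothetical sample; 'Income' is assumed to be stored as integers.
df = pd.DataFrame({
    "Marital_status": ["MARRIED", "SINGLE", "MARRIED"],
    "Sex": ["Male", "Female", "Male"],
    "Is_graduate": ["Graduate", "Graduate", "Not Graduate"],
    "Income": [5000, 4200, 6100],
})

# Store the low-cardinality string columns as categoricals.
for col in ["Marital_status", "Sex", "Is_graduate"]:
    df[col] = df[col].astype("category")

# Convert 'Income' to float.
df["Income"] = df["Income"].astype(float)

print(df.dtypes)
```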

6. Outliers

Outliers in the 'Income' column are identified and capped at the 10th and 90th percentiles.
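Percentile capping of this kind can be sketched with `quantile` and `clip`. The income values below are invented to include one extreme value at each end.

```python
import pandas as pd

# Hypothetical incomes with a very low and a very high value.
df = pd.DataFrame({
    "Income": [500.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0,
               7000.0, 8000.0, 9000.0, 10000.0, 250000.0],
})

# Compute the 10th and 90th percentile bounds.
lower = df["Income"].quantile(0.10)
upper = df["Income"].quantile(0.90)

# Cap values outside the bounds; values inside are left unchanged.
df["Income"] = df["Income"].clip(lower=lower, upper=upper)

print(lower, upper)
print(df["Income"].tolist())
```

Capping keeps every row in the dataset, unlike dropping outliers, at the cost of flattening the tails of the distribution.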

7. Dropping Irrelevant Columns

The 'Hobby' column is dropped from the dataset.
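In pandas this is a single `drop` call, sketched here on hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample with the irrelevant 'Hobby' column.
df = pd.DataFrame({
    "UID": ["A1", "A2"],
    "Income": [5000.0, 4200.0],
    "Hobby": ["chess", "hiking"],
})

# Drop 'Hobby'; all other columns are kept.
df = df.drop(columns=["Hobby"])

print(df.columns.tolist())
```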

Requirements

Conclusion

The dataset cleaning process ensures that the data is consistent, free of duplicates, and suitable for further analysis. The cleaned dataset is ready for exploration and modeling in subsequent steps.

Colab Link: You can use the following link to view and comment on the project:
