
MAishatLola/Python-Data-Cleaning-EDA


Diabetes Prediction 💻


Table of Contents 📖

Project Overview

This project cleans and transforms a Kaggle diabetes prediction dataset to ensure data quality. The work covers handling missing values, identifying and correcting inconsistencies, handling outliers, and ensuring consistent data formats. It also includes an Exploratory Data Analysis (EDA) to get a sense of the distribution of the variables and the relationships between them.

[Figure: Distributions by BMI Classifications]

About Dataset

The Diabetes Prediction Dataset contains a collection of medical and demographic features (age, BMI, hypertension, etc.) associated with patients' diabetes status (positive/negative), enabling analysis and prediction of diabetes risk.

Goal of the Project

To uncover hidden patterns and prepare the data for accurate prediction models.

Data Sourcing

A Kaggle Dataset https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

Tools

Data Cleaning/Preparation

In the initial stage of data cleaning, we performed the following tasks:

  1. Data Loading and Inspection
  2. Handling missing values
  3. Handling the outliers
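
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the toy DataFrame stands in for the Kaggle CSV, the column names (`age`, `bmi`, `hypertension`) are assumed, and median imputation and the 1.5 × IQR clipping rule are common choices picked here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Kaggle CSV (column names are assumptions).
df = pd.DataFrame({
    "age": [25, 40, np.nan, 61, 33, 50],
    "bmi": [22.1, 27.5, 31.0, np.nan, 24.3, 95.0],  # 95.0 is a deliberate outlier
    "hypertension": [0, 0, 1, 1, 0, 1],
})

# 1. Data loading and inspection: shape, dtypes, and missing values per column.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# 2. Missing values: fill numeric gaps with the column median (one possible strategy).
clean = df.copy()
for col in ["age", "bmi"]:
    clean[col] = clean[col].fillna(clean[col].median())

# 3. Outliers: clip BMI to the 1.5 * IQR fences (an assumed rule, not the author's).
q1, q3 = clean["bmi"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean["bmi"] = clean["bmi"].clip(lower, upper)

print(clean)
```

Clipping keeps every row while limiting the influence of extreme values; dropping the outlying rows instead would be an equally reasonable choice depending on the analysis.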

Exploratory Data Analysis

  • Is there any relationship between the demographic features and diabetes? (Figure: Distribution by Gender)

  • What are the key metrics for diabetes prediction?
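
One way to start on these two questions is sketched below, assuming hypothetical column names (`gender`, `age`, `bmi`, `diabetes`) and a small made-up DataFrame in place of the real dataset. The numbers it prints are not findings from this project.

```python
import pandas as pd

# Made-up rows for illustration only; column names are assumptions.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "age": [30, 45, 62, 70, 28, 55],
    "bmi": [21.0, 29.5, 33.2, 31.0, 23.4, 27.8],
    "diabetes": [0, 0, 1, 1, 0, 1],
})

# Question 1: share of diabetes-positive rows within each gender group.
rate_by_gender = df.groupby("gender")["diabetes"].mean()
print(rate_by_gender)

# Question 2: linear correlation of each numeric feature with the target,
# as a rough first look at which metrics move together with diabetes status.
corr_with_target = (
    df[["age", "bmi", "diabetes"]]
    .corr()["diabetes"]
    .drop("diabetes")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Correlation only captures linear association with a binary target, so it is a starting point for choosing features rather than a measure of predictive importance.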

Data Analysis

Some interesting code/features we worked with:

import matplotlib.pyplot as plt

# diabetes_prediction is the DataFrame holding the Kaggle dataset
x = diabetes_prediction['age']
y = diabetes_prediction['hypertension']

# Scatter plot of age against the binary hypertension flag
plt.scatter(x, y)

plt.xlabel('Age', fontsize=16)
plt.ylabel('Hypertension (1 for yes, 0 for no)', fontsize=16)
plt.title('Relationship between Age and Hypertension', fontsize=20)

plt.show()

Results/Findings

The analysis results are summarized as follows:

  1. There is no significant relationship between age and hypertension among diabetic patients.
  2. There is little to no difference in the range of BMI values (difference between min and max) between males and females with hypertension.
  3. The BMI distribution across genders and hypertension groups shows minimal differences. Interestingly, no individuals with the "Other" gender classification have hypertension in this dataset.
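
A comparison like the one behind findings 2 and 3 could be computed along these lines. This is a sketch on invented rows with assumed column names (`gender`, `bmi`, `hypertension`); the values are chosen only to show the mechanics and do not reproduce the project's results.

```python
import pandas as pd

# Invented rows for illustration; column names are assumptions.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Other", "Female", "Male"],
    "bmi": [24.0, 26.0, 35.0, 36.5, 22.0, 29.0, 30.0],
    "hypertension": [1, 1, 1, 1, 0, 0, 1],
})

# BMI min, max, and range per gender, restricted to rows with hypertension.
hyper = df[df["hypertension"] == 1]
bmi_summary = hyper.groupby("gender")["bmi"].agg(["min", "max"])
bmi_summary["range"] = bmi_summary["max"] - bmi_summary["min"]
print(bmi_summary)

# Number of hypertension cases per gender, including groups with none.
hypertension_counts = df.groupby("gender")["hypertension"].sum()
print(hypertension_counts)
```

Counting cases over the full DataFrame, rather than the filtered one, is what makes a group with zero hypertension cases visible in the output.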

Recommendations

Based on the analysis conducted, these are the recommendations:

  • Collect data on other risk factors (family history, lifestyle, etc.) beyond age for predicting hypertension in diabetic patients.

Limitations

  • The dataset provides too little data on the "Other" gender category for that group to be adequately sized or representative.

References

BMI Categorization ©️ See Here

Kaggle Dataset ©️ https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
