This project predicts the likelihood of stroke from health data using a range of machine learning and deep learning models. It leverages Python, TensorFlow, and other data science libraries to implement and compare the models and improve predictive accuracy.
## Table of Contents

- Introduction
- Data Preprocessing
- Modeling and Evaluation
- Skills and Technologies Applied
- Benefits of the Project
## Introduction

Stroke is a leading cause of death and disability worldwide. This project predicts the likelihood of stroke using a Kaggle dataset of health-related attributes. We train multiple machine learning and deep learning models, including Logistic Regression, Random Forest, and a Keras Sequential network, and compare them to improve prediction accuracy.
## Data Preprocessing

- Data Cleaning: Handle missing values and standardize data formats to ensure data quality.
- Feature Engineering: Convert categorical features to numerical form with one-hot encoding and normalize numerical features.
- Data Balancing: Apply SMOTE (Synthetic Minority Over-sampling Technique) to correct the class imbalance.
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import KNNImputer
from imblearn.over_sampling import SMOTE

# Load data
data = pd.read_csv('final_project_data.csv')

# Impute missing numeric values (KNNImputer accepts numeric input only)
numeric_cols = ['age', 'avg_glucose_level', 'bmi']
imputer = KNNImputer(n_neighbors=5)
data[numeric_cols] = imputer.fit_transform(data[numeric_cols])

# One-hot encode categorical features
# (the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2)
categorical_cols = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']
encoder = OneHotEncoder(sparse_output=False)
encoded_features = pd.DataFrame(
    encoder.fit_transform(data[categorical_cols]),
    columns=encoder.get_feature_names_out(categorical_cols))

# Standardize numeric features
scaler = StandardScaler()
scaled_features = pd.DataFrame(
    scaler.fit_transform(data[numeric_cols]), columns=numeric_cols)

# Combine all features
processed_data = pd.concat([encoded_features, scaled_features], axis=1)

# Balance the dataset
smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(processed_data, data['stroke'])
```
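A quick check, not in the original script but worth adding after resampling, confirms that SMOTE produced a balanced target:

```python
# Verify the resampled class distribution: both classes should now have equal counts
print(y_resampled.value_counts())
```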
## Modeling and Evaluation

### Logistic Regression

- Implementation: Use `sklearn.linear_model.LogisticRegression` to build the model.
- Evaluation: Assess the model with accuracy, precision, recall, and F1-score (see the scoring helper after the code below).
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Train-test split
# Note: because SMOTE ran before the split, the test set contains synthetic
# samples; splitting before oversampling would avoid that leakage.
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.2, random_state=42)

# Logistic Regression model (raise the iteration cap so the solver converges)
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Predictions
y_pred = log_reg.predict(X_test)
```
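The Evaluation bullet above names four metrics, but the snippet stops at prediction. A minimal scoring sketch using the metrics already imported; `report_scores` is a helper name introduced here for illustration and reused for the later models:

```python
def report_scores(name, y_true, y_pred):
    # Illustrative helper (not in the original script): prints the four
    # metrics this project uses to evaluate every model.
    print(f"{name}: accuracy={accuracy_score(y_true, y_pred):.3f}, "
          f"precision={precision_score(y_true, y_pred):.3f}, "
          f"recall={recall_score(y_true, y_pred):.3f}, "
          f"F1={f1_score(y_true, y_pred):.3f}")

report_scores("Logistic Regression", y_test, y_pred)
```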
### Random Forest

- Implementation: Use `sklearn.ensemble.RandomForestClassifier` to build the model.
- Evaluation: Assess the model with accuracy, precision, recall, and F1-score, using the same scoring helper.
```python
from sklearn.ensemble import RandomForestClassifier

# Random Forest model
# (max_features='auto' was removed in scikit-learn 1.3; 'sqrt' is its equivalent for classifiers)
rf = RandomForestClassifier(n_estimators=500, max_depth=8, bootstrap=False, max_features='sqrt')
rf.fit(X_train, y_train)

# Predictions
y_pred_rf = rf.predict(X_test)
```
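The same illustrative helper scores the forest on the held-out split:

```python
report_scores("Random Forest", y_test, y_pred_rf)
```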
### Keras Sequential Model

- Implementation: Use TensorFlow and Keras to build a sequential neural network.
- Evaluation: Assess the model with accuracy, precision, recall, and F1-score, again via the scoring helper.
```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam

# Keras Sequential model (an explicit Input layer is the form recent Keras
# versions recommend in place of input_shape on the first Dense layer)
model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(16, activation='relu'),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer=Adam(learning_rate=0.01), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.2)

# Predictions: threshold the sigmoid output at 0.5 and flatten to 1-D
y_pred_keras = (model.predict(X_test) > 0.5).astype("int32").ravel()
```
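And again for the network's thresholded predictions:

```python
report_scores("Keras Sequential", y_test, y_pred_keras)
```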
### Hyperparameter Tuning with Optuna

- Implementation: Use Optuna to tune the Keras model's hyperparameters.
- Evaluation: Select the best configuration from the optimization results (read back from the study object as shown after the code).
```python
import optuna
# In recent Optuna releases this callback ships in the separate optuna-integration package
from optuna.integration import TFKerasPruningCallback

def objective(trial):
    model = Sequential([
        Input(shape=(X_train.shape[1],)),
        Dense(trial.suggest_int('units1', 16, 128),
              activation=trial.suggest_categorical('activation1', ['relu', 'tanh'])),
        Dense(trial.suggest_int('units2', 16, 128),
              activation=trial.suggest_categorical('activation2', ['relu', 'tanh'])),
        Dense(1, activation='sigmoid')
    ])
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
    model.compile(optimizer=Adam(learning_rate=trial.suggest_float('learning_rate', 1e-5, 1e-2, log=True)),
                  loss='binary_crossentropy', metrics=['accuracy'])
    history = model.fit(X_train, y_train, epochs=5, batch_size=32, validation_split=0.2,
                        callbacks=[TFKerasPruningCallback(trial, 'val_accuracy')], verbose=0)
    return history.history['val_accuracy'][-1]

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
```
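Once the study finishes, the winning configuration can be read back from the study object; `best_value` and `best_params` are standard Optuna attributes:

```python
# Inspect the best trial found during optimization
print("Best validation accuracy:", study.best_value)
print("Best hyperparameters:", study.best_params)
```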
## Skills and Technologies Applied

- Data Analysis: Used pandas for data manipulation and numpy for numerical computations.
- Machine Learning: Built and evaluated models with scikit-learn.
- Deep Learning: Used TensorFlow and Keras to construct and train deep learning models.
- Hyperparameter Tuning: Applied Optuna to optimize the deep learning models' hyperparameters.
- Data Cleaning: Addressed missing values and standardized data using KNNImputer, OneHotEncoder, and StandardScaler.
- Data Balancing: Used SMOTE to handle class imbalance in the dataset.
- Statistical Methods: Evaluated models using accuracy, precision, recall, and F1-score.
- Visualization: Employed matplotlib and seaborn to visualize data distributions and model performance (a representative sketch follows this list).
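The project's actual plots are not reproduced here; as a representative sketch, a seaborn heatmap of a confusion matrix is a common way to visualize model performance (assuming the `y_pred_rf` predictions from the Random Forest section):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Confusion matrix for the Random Forest predictions
cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No stroke', 'Stroke'],
            yticklabels=['No stroke', 'Stroke'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Random Forest confusion matrix')
plt.show()
```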
## Benefits of the Project

- Enhanced Predictive Accuracy: Improved stroke prediction accuracy by applying advanced machine learning and deep learning techniques.
- Application of Advanced Python Skills: Demonstrated the practical use of Python for handling real-world data, performing statistical analysis, and building predictive models.