This project demonstrates how to build a logistic regression model to classify between diabetic and non-diabetic patients using PySpark MLlib. The solution provided within the notebook is based on the Coursera Project Network: Diabetes Prediction With Pyspark MLLIB project-based course on Coursera.
-
You are welcome to use this repository as a reference or starting point for your own project.
-
If you choose to fork this repository, please ensure that you comply with the terms of the Apache License and give proper credit to the original authors.
This project aims to predict diabetes using a Logistic Regression classifier built with PySpark MLlib.
- Perform ETL activity
- Build a Logistic Regression classifier
- Evaluate the model's performance
- Persist it for future use
For this project, we will use a modified version of the Pima Indians Diabetes Database (diabetes.csv
), which is available in this repository.
The original dataset can be found here: https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
This dataset contains data exclusively from female patients. It includes various symptoms of diabetes, such as age, blood pressure, BMI, insulin levels, glucose levels, etc. The dataset also has an output class label with two possible values: 0 and 1. A value of 0 indicates that the patient is non-diabetic, while a value of 1 indicates that the patient has diabetes. Please note that this dataset and the model we train in this project are for educational purposes only and cannot be used in real-life applications.
Google Colab is a free, cloud-based service provided by Google that allows you to write and execute Python code in a Jupyter Notebook directly from your browser. It requires no setup and offers free access to computing resources, including GPUs and TPUs, which are essential for machine learning and data science tasks.
Follow the steps below to execute the provided Jupyter Notebook for this project:
-
Go to Google Colab: Open your web browser and go to Google Colab.
-
Sign In: If you aren't already signed in, use your Google account to sign in.
-
Upload Notebook:
- On the
Open notebook
page, navigate to theUpload
tab. - Browse and select the notebook you want to open.
- On the
-
Clear All Outputs:
- Click on the
Edit
menu - Select
Clear all outputs
from the dropdown menu
- Click on the
-
Start Executing Code:
- Click the "Play" button (triangle icon) on the left side of the cell or press Ctrl+Enter on your keyboard to run the cell.
- Coursera Project Network © 2024 Coursera Inc. All rights reserved.