Authors: Anna Nandar, Brian Chang, Celine Habashy, Yeji Sohn
We built models using decision trees and logistic regression algorithms to predict the presence of heart disease based on health-related features. On an unseen dataset, our models achieved an overall accuracy of 84.4%. Logistic regression demonstrated better interpretability, with high precision and recall. Some features, such as fasting blood sugar, showed lower importance than anticipated. Moving forward, we plan to explore ensemble methods like Random Forest and Gradient Boosting to improve accuracy and consider incorporating additional clinical data for deeper insights.
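As a rough illustration of how the accuracy, precision, and recall mentioned above are computed for a binary diagnosis label, here is a minimal sketch in plain Python. The function name `classification_metrics` and the toy labels are assumptions made up for this example. They are not taken from the project's code or results.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = disease)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels, invented for illustration only (not project data or results).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a medical screening setting, recall is often watched closely, because a false negative means a missed case of disease.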
The dataset used in this project comes from the Cleveland database. It was sourced from the UCI Machine Learning Repository (Detrano et al. 1989) and can be found here. The dataset includes features such as age, chest pain type, blood pressure, cholesterol, and more, alongside a binary diagnosis label (presence or absence of heart disease).
The final report can be found rendered in HTML here.
Follow the instructions below to reproduce the analysis.
- Clone this GitHub repository.
git clone https://github.com/UBC-MDS/DSCI522-2425-25-heart_disease_predictor.git
- Navigate to the root of the project folder in your IDE, where you cloned it.
- At the root of the project, in a terminal, enter:
docker-compose up
- From the docker compose logs in the terminal, open the URL that starts with
http://127.0.0.1:PORT_NUMBER/lab?token=
NOTE: Replace PORT_NUMBER with 34651 so that the URL points to the proper port inside Docker.
NOTE 2: If you are taken to an authentication screen, copy the token from the log line where you saw http://127.0.0.1:PORT_NUMBER/lab?token=...token..is..here... and paste it into the login screen's "login with token" field.
NOTE 3: If you get errors with libraries, make sure the Docker container and image are up to date. We have found that deleting the image completely from Docker Desktop is the most reliable way to ensure the latest image is pulled.
NOTE 4: If you would rather build the environment locally, you can do so using either the provided environment.yml file or conda-lock with any of the 4 provided platforms in the root of this repository.
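The port swap and token lookup described in NOTE and NOTE 2 can be sketched in Python as below. This is only an illustration: the helper name `jupyter_url_and_token`, the logged port 8888, and the token value `abc123` are hypothetical, and only the port 34651 comes from the instructions above.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit

DOCKER_PORT = 34651  # port named in the NOTE above


def jupyter_url_and_token(logged_url, port=DOCKER_PORT):
    """Return the URL with its port replaced, plus the token from the query string."""
    parts = urlsplit(logged_url)
    # Swap only the port; keep scheme, host, path and query unchanged.
    netloc = f"{parts.hostname}:{port}"
    fixed = urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
    token = parse_qs(parts.query).get("token", [""])[0]
    return fixed, token


# Hypothetical log line: port 8888 and the token value are made up.
url, token = jupyter_url_and_token("http://127.0.0.1:8888/lab?token=abc123")
print(url)    # URL to open in the browser
print(token)  # value to paste into the login screen if prompted
```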
- To run the analysis, regenerate the data, and generate the HTML and PDFs, open a terminal (in the Docker JupyterLab) and run the following commands:
make clean
(removes all files generated by the analysis)
make all
(generates all the files needed, including the report)
NOTE: See "Running individual parts of the analysis using Make" to run individual parts only.
To run the tests, at the root of the project in a terminal, enter:
python -m pytest tests/test_validate.py
python -m pytest tests/test_create_dir_if_not_exist.py
python -m pytest tests/test_load_data.py
python -m pytest tests/test_save_classification_report.py
Or if you want to run them all at once, in the root folder enter:
pytest
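For readers unfamiliar with pytest, the sketch below shows the general shape of a test such as the one in tests/test_create_dir_if_not_exist.py. It is not the project's code: the helper `create_dir_if_not_exist` and its behaviour are assumptions inferred from the file name alone, and the directory names are invented.

```python
import os
import tempfile


def create_dir_if_not_exist(path):
    """Hypothetical helper: create `path` (and parents) unless it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


def test_create_dir_if_not_exist():
    # Work inside a temporary directory so the test leaves nothing behind.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "results", "figures")
        create_dir_if_not_exist(target)
        assert os.path.isdir(target)
        # A second call on an existing directory should not raise.
        create_dir_if_not_exist(target)
        assert os.path.isdir(target)


# pytest would discover and run the test function; calling it directly also works.
test_create_dir_if_not_exist()
```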
- To make sure the Docker container is properly cleaned up, press ctrl+c in the terminal where you launched the container, then type
docker-compose rm
Docker is used to create reproducible instances of this project. The docker image used is based on the quay.io/jupyter/minimal-notebook:notebook-7.0.6 image.
Dependencies beyond this base image and the Conda dependencies listed below are specified in the Dockerfile.
Conda is also used to manage the software dependencies for this project.
All dependencies are specified in the environment.yml file.
Dependencies:
- python=3.11
- pip=24.3.1
- pandas=2.2.2
- ipykernel=6.29.5
- nb_conda_kernels=2.5.1
- scipy=1.14.1
- matplotlib=3.9.3
- scikit-learn=1.5.2
- requests=2.32.3
- seaborn=0.13.2
- ucimlrepo=0.0.7
- pandera=0.20.2
- quarto=1.5.57
- click=8.1.7
- tabulate=0.9.0
- lmodern (this is installed by the Dockerfile)
- make (this is installed by the Dockerfile)
- deepchecks=0.18.1
- pytest=8.3.4
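If you build the environment locally, a quick way to compare installed versions against the pins above is sketched below using only the standard library. The helper names `parse_pin` and `check_pins` are assumptions for this example. The check also assumes that the conda package name matches the Python distribution name, which does not hold for every entry in the list (for example python or quarto), so such entries would simply be reported as not found.

```python
from importlib import metadata

# A sample of the pins listed above; extend as needed.
PINS = ["pandas=2.2.2", "scikit-learn=1.5.2", "click=8.1.7"]


def parse_pin(pin):
    """Split a conda-style 'name=version' pin into (name, version)."""
    name, _, version = pin.partition("=")
    return name.strip(), version.strip()


def check_pins(pins):
    """Map each package name to (pinned, installed or None, exact match?)."""
    report = {}
    for name, wanted in map(parse_pin, pins):
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # not installed, or distributed under another name
        report[name] = (wanted, found, found == wanted)
    return report


for name, (wanted, found, ok) in check_pins(PINS).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: pinned {wanted}, installed {found} [{status}]")
```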
- conda (version 24.11.0 or higher)
- conda-lock (version 2.5.7 or higher)
- Docker
- Add the dependency to the environment.yml file on a new branch.
- Run
conda-lock -k explicit --file environment.yml -p linux-64
to update the conda-linux-64.lock file.
- Re-build the Docker image locally to ensure it still runs.
- Test the container locally by running it and ensuring your new dependencies are working.
- Push the changes to GitHub.
- Update your local docker-compose.yml file on your branch to use the new container image (line 3 in the docker-compose.yml file, where it starts with "image:...").
Note: Right now it will always use the latest Docker image anyway, so for Milestone 2 this step is not needed.
- Send a pull request to merge the changes into the main branch.
You may also generate individual parts one at a time:
make data
make figures
make fits
make evals
make report/heart_disease_predictor_report.html report/heart_disease_predictor_report.pdf
This project is licensed under the MIT License.
Heart disease. UCI Machine Learning Repository. (n.d.). https://archive.ics.uci.edu/dataset/45/heart+disease
Detrano, R.C., Jánosi, A., Steinbrunn, W., Pfisterer, M.E., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. The American Journal of Cardiology, 64(5), 304–310.
Van Rossum, G., & Drake, F. (2009). Python 3 Reference Manual. CreateSpace.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
The pandas development team. (2020). pandas-dev/pandas: Pandas. Zenodo. https://doi.org/10.5281/zenodo.3509134
Bantilan, N. (2020). Pandera: Statistical Data Validation of Pandas Dataframes. In M. Agarwal, C. Calloway, D. Niederhut, & D. Shupe (Eds.), Proceedings of the 19th Python in Science Conference (pp. 116–124). https://doi.org/10.25080/Majora-342d178e-010