- Project Overview
- Getting started immediately
- Datasets
- Dependencies
- Workflow
- Directory structure
- Models
- Contributors
- License
- Acknowledgments
- Contributions and Feedback
Predicting whether a patient will show up for their scheduled medical appointment is a critical task for healthcare providers as it can help optimize resource allocation and improve overall patient care. This machine learning project focuses on addressing the issue of patient no-shows in medical appointments. By harnessing the power of data and machine learning, we aim to develop a predictive model that can assist healthcare facilities in identifying patients at higher risk of no-shows.
Starting with a Kaggle dataset of medical appointment no-shows by Joni Hoppen and Aquarela Analytics, we evaluate the performance of five different classification algorithms (Logistic Regression, Decision Trees, Random Forests, XGBoost, and LightGBM) and settle on an LGBMClassifier
model as our final model. Using the trained model, we make predictions on whether a future appointment will lead to a no-show. Finally, we containerise this application, deploy it as an Elastic Beanstalk application on AWS, and provide an API to access it.
Credit: Austrian Medical Association (ÖÄK)
The application, called no-show-predictor
is located at: no-show-predictor-env.eba-hpbyckm2.eu-north-1.elasticbeanstalk.com. You can use this API to start making predictions immediately.
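For example, a prediction request can be made with a few lines of Python. This is a minimal sketch: the POST route /predict is an assumption on my part, so consult scripts/predict-test-aws.py for the exact client code.

```python
# Minimal sketch of a prediction request against the deployed service.
# Assumption: the service exposes a POST /predict route.
import requests

host = "no-show-predictor-env.eba-hpbyckm2.eu-north-1.elasticbeanstalk.com"
url = f"http://{host}/predict"

appointment = {
    'PatientId': 377511518121127.0,
    'AppointmentID': 5629448,
    'Gender': 'F',
    'ScheduledDay': '2016-04-27 13:30:56+0000',
    'AppointmentDay': '2016-06-07 00:00:00+0000',
    'Age': 54,
    'Neighbourhood': 'MARIA ORTIZ',
    'Scholarship': False,
    'Hipertension': False,
    'Diabetes': False,
    'Alcoholism': False,
    'Handcap': 0,
    'SMS_received': True,
}

print(requests.post(url, json=appointment).json())
# e.g. {'no_show': False, 'no_show_probability': ...}
```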
This project uses a Kaggle dataset of over 100,000 medical appointments characterised by 14 associated variables, including temporal details, patient information, and the outcome of the appointment -- whether the patient showed up or not -- which is the target variable of our classification task. The dataset was created by Joni Hoppen and Aquarela Analytics, and can be downloaded from:
https://www.kaggle.com/datasets/joniarroba/noshowappointments/data
Further details of the datasets used can be found here.
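For a quick first look at the raw data, something along these lines works. This is a sketch that assumes the CSV sits under data/ as in this repository's layout and that the target column is named No-show, as in the Kaggle file.

```python
# Quick inspection of the raw Kaggle CSV (path as in this repository's data/ directory).
import pandas as pd

df = pd.read_csv("data/KaggleV2-May-2016.csv")
print(df.shape)                                       # ~110k rows, 14 columns
print(df["No-show"].value_counts(normalize=True))     # share of shows vs. no-shows
```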
The project requires the following dependencies to be installed:
Conda
Docker
AWSEBCLI
To run this project locally, follow these steps:
git clone https://github.com/abhirup-ghosh/medical-appointment-no-shows.git
The easiest way to set up the environment is to use Anaconda. I used the standard Machine Learning Zoomcamp conda environment ml-zoomcamp
, which you can create, activate, and install the relevant libraries in, using the following commands in your terminal:
conda create -n ml-zoomcamp python=3.9
conda activate ml-zoomcamp
conda install numpy pandas scikit-learn seaborn jupyter xgboost pipenv flask gunicorn lightgbm
Alternatively, I have also provided a conda environment.yml
file that can be directly used to create the environment:
conda env create -f opt/environment.yml
If you are working in a Python virtual environment, I provide a list of dependencies that can be pip-installed using:
pip install -r opt/optional_requirements.txt
This notebook outlines the entire investigation and consists of the following steps [🚨 skip this step if you want to directly use the final configuration for training and/or the final model for predictions]:
- Data loading
- Data cleaning and preparation
- Exploratory data analysis
- Feature Engineering
- Feature importance
- Setting up a validation framework (see the sketch after this list)
- Model evaluation [and hyper-parameter tuning]
- Saving the best model and encoders [in the models directory]
- Preparation of the test data
- Making predictions using the saved model
- Testing Flask framework
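A minimal sketch of the validation split mentioned above is shown here; the 60/20/20 train/validation/test proportions and the target encoding are assumptions, and the full setup lives in the notebook.

```python
# Minimal validation-framework sketch (split proportions are an assumption;
# see notebooks/notebook.ipynb for the actual setup).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/KaggleV2-May-2016.csv")
df["no_show"] = (df["No-show"] == "Yes").astype(int)   # 1 = the patient did not show up

# Hold out a test set, then carve a validation set out of the remaining data (60/20/20).
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

y_train = df_train["no_show"].values
y_val = df_val["no_show"].values
X_train = df_train.drop(columns=["no_show", "No-show"])
X_val = df_val.drop(columns=["no_show", "No-show"])
```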
The final model configuration (LGBMClassifier) is coded in the scripts/train.py
script, which can be run using:
cd scripts
python train.py
The output of this script, which includes the model and the encoder/scaler transforms, can be found in: models/LGBMClassifier_tranformers_final.bin
. It has an accuracy of 0.807 and an ROC AUC of 0.797. This is the model we use to make predictions in the next steps.
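The saved .bin can be reloaded with pickle. The sketch below assumes, based on the file name, that it bundles the trained model together with the fitted encoder/scaler transforms; check scripts/train.py for how it is actually written.

```python
# Reloading the saved artefacts (the exact packaging inside the .bin is an
# assumption based on the file name; see scripts/train.py for how it is written).
import pickle

with open("models/LGBMClassifier_tranformers_final.bin", "rb") as f_in:
    artefacts = pickle.load(f_in)   # assumed to bundle the LGBM model and the fitted transforms

print(artefacts)
```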
We serve the model via a Flask app that exposes port 9696; it can be run using:
cd scripts
python predict.py
or gunicorn
as:
cd scripts
gunicorn --bind 0.0.0.0:9696 predict:app
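For orientation, a service of this kind typically follows the Flask pattern sketched below. This is an illustrative outline only, not the project's actual predict.py; the /predict route, the artefact packaging, and the DictVectorizer-style transform interface are assumptions.

```python
# Illustrative outline only -- not the project's actual predict.py.
# Assumptions: the saved .bin unpacks into (transformer, model) and the
# transformer exposes a DictVectorizer-style .transform() on a list of dicts.
import pickle

from flask import Flask, jsonify, request

# Path relative to the scripts/ directory.
with open("../models/LGBMClassifier_tranformers_final.bin", "rb") as f_in:
    transformer, model = pickle.load(f_in)

app = Flask("no-show-prediction")


@app.route("/predict", methods=["POST"])
def predict():
    appointment = request.get_json()
    X = transformer.transform([appointment])        # encode the raw appointment record
    proba = float(model.predict_proba(X)[0, 1])     # probability of a no-show
    return jsonify({"no_show": proba >= 0.5, "no_show_probability": proba})


if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=9696)
```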
We can use this service to make an example prediction for the following appointment:
test_appointment = {
'PatientId': 377511518121127.0,
'AppointmentID': 5629448,
'Gender': 'F',
'ScheduledDay': '2016-04-27 13:30:56+0000',
'AppointmentDay': '2016-06-07 00:00:00+0000',
'Age': 54,
'Neighbourhood': 'MARIA ORTIZ',
'Scholarship': False,
'Hipertension': False,
'Diabetes': False,
'Alcoholism': False,
'Handcap': 0,
'SMS_received': True
}
using the command:
cd scripts
python predict-test.py
# {'no_show': False, 'no_show_probability': 0.2880257379453167}
This gives us a no_show prediction (False or True) as well as the corresponding no-show probability.
🚨 Always remember to conda activate ml-zoomcamp whenever opening a new terminal/tab.
Build the image no-show-prediction from the Dockerfile (make sure the Docker daemon is running) using:
docker build -t no-show-prediction .
We can open a shell inside the Docker container using:
docker run -it --rm --entrypoint=bash no-show-prediction
Once the image is built, we map the container port (9696) to the localhost port (9696) and run the service using:
docker run -it --rm -p 9696:9696 no-show-prediction
We can now make a request in exactly the same way as Step 5:
cd scripts
python predict-test.py
# {'no_show': False, 'no_show_probability': 0.2880257379453167}
We provide detailed documentation on how to launch the code as an Elastic Beanstalk application here. It involves the following steps:
- Creating an AWS account
- Renting and configuring an EC2 instance
- Setting up the application environment using conda, pipenv, and docker
- Creating the elastic beanstalk application
- Launching the application
⚠️ I have now deactivated this instance because it has reached the monthly usage limit of the AWS Free Tier.
Application name: no-show-predictor
Host: no-show-predictor-env.eba-hpbyckm2.eu-north-1.elasticbeanstalk.com
API: ./scripts/predict-test-aws.py
We evaluated the performance of five different models. Their accuracies and ROC AUC scores are listed in the table below:
| Model | Accuracy | ROC AUC |
|---|---|---|
| LogisticRegression | 0.792 | 0.686 |
| DecisionTreeClassifier | 0.793 | 0.725 |
| RandomForestClassifier | 0.793 | 0.682 |
| XGBClassifier | 0.796 | 0.749 |
| LGBMClassifier ⭐ | 0.796 | 0.752 |
Our final model, LGBMClassifier, achieved an accuracy of 0.807 and an ROC AUC of 0.797.
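For reference, a comparison like this boils down to a loop over the candidate classifiers. The sketch below runs on synthetic placeholder data with default hyper-parameters; the real evaluation on the encoded appointment features is in the notebook.

```python
# Sketch of the model comparison loop, run here on synthetic placeholder data;
# the real evaluation uses the encoded appointment features from the notebook.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Placeholder data standing in for the encoded appointment features.
X, y = make_classification(n_samples=5000, n_features=20, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
    "XGBClassifier": XGBClassifier(eval_metric="logloss"),
    "LGBMClassifier": LGBMClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_val)[:, 1]
    print(f"{name:25s} accuracy={accuracy_score(y_val, proba >= 0.5):.3f} "
          f"roc_auc={roc_auc_score(y_val, proba):.3f}")
```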
./medical-appointment-no-shows
├── scripts
│   ├── train.py
│   ├── predict.py
│   ├── predict-test.py
│   ├── predict-test-aws.py
│   ├── constants.py
│   └── __pycache__
├── permissions
│   ├── aws-explorer_credentials.csv
│   └── aws-explorer_accessKeys.csv
├── opt
│   ├── optional_requirement.txt
│   └── environment.yml
├── notebooks
│   └── notebook.ipynb
├── models
│   ├── XGBClassifier_tranformers_final.bin
│   ├── XGBClassifier_final.bin
│   ├── XGBClassifier.bin
│   ├── RandomForestClassifier.bin
│   ├── LogisticRegression.bin
│   ├── LGBMClassifier_tranformers_final.bin
│   ├── LGBMClassifier.bin
│   └── DecisionTreeClassifier.bin
├── jupyter.pem
├── docs
│   └── setting-up-ec2-eb.md
├── data
│   ├── no-show-patients.jpg
│   ├── README.md
│   └── KaggleV2-May-2016.csv
├── README.md
├── Pipfile.lock
├── Pipfile
├── LICENSE
└── Dockerfile

9 directories, 28 files
Abhirup Ghosh, abhirup.ghosh.184098@gmail.com
This project is licensed under the MIT License.
We welcome contributions from the community and feedback from healthcare professionals and data scientists. Feel free to explore the project, contribute, or reach out with any questions or suggestions. Together, we can refine this model, enhance its utility in real-world healthcare settings, and work towards a healthcare system that is more efficient, patient-centered, and cost-effective.
#Classification #XGBoost #LightGBM #Conda #Pipenv #Flask #Gunicorn #Docker #AWS #ElasticBeanstalk #EC2 #API