- Project: Predict Customer Churn (ML DevOps Engineer Nanodegree, Udacity)
Project for the Udacity course.
The aim of this part of the course is to learn about good practices for writing clean code.
This data science project predicts churn in a bank.
- Raw data needs to be provided as the .csv file bank_data.csv in the DATA_FOLDER
- Folders used for inputs and outputs can be specified in constants.py:
  - DATA_FOLDER: (default ./data) raw data
  - IMG_FOLDER: (default ./images) exploratory data analysis (EDA) plots
  - MODEL_FOLDER: (default ./models) pickled models
  - RESULT_FOLDER: (default ./results) model reports, feature importance and ROC curves
  - LOG_FOLDER: (default ./logs) logs
- constants.py also allows you to specify:
  - KEEP_COLS: features used for modeling
  - RESULTS_LOG: (default ./logs/churn_library.log) the file where the progress is logged (intended to be located within the LOG_FOLDER)
  - TMP_TEST_FOLDER: (default ./tmp) folder to be used for selected tests involving file creation
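As a rough illustration, a constants.py using the defaults listed above might look like the sketch below. The folder defaults and the RESULTS_LOG path are taken from this README; the entries in KEEP_COLS are placeholders, since the actual feature list is not given here.

```python
"""Sketch of constants.py using the defaults listed in this README."""

# Input and output folders
DATA_FOLDER = "./data"        # raw data (bank_data.csv)
IMG_FOLDER = "./images"       # EDA plots
MODEL_FOLDER = "./models"     # pickled models
RESULT_FOLDER = "./results"   # model reports, feature importance, ROC curves
LOG_FOLDER = "./logs"         # logs

# Progress log, intended to be located within LOG_FOLDER
RESULTS_LOG = "./logs/churn_library.log"

# Folder for selected tests involving file creation
TMP_TEST_FOLDER = "./tmp"

# Features used for modeling.
# ASSUMPTION: placeholder names, not the project's real feature list.
KEEP_COLS = ["feature_a", "feature_b"]
```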
The following data science analysis is performed:
- Data are loaded from the DATA_FOLDER
- EDA is performed and the resulting plots are saved in the IMG_FOLDER
- Features are engineered, including encoding each categorical column as the proportion of churned customers in each category
- Data are split into train and test sets
- A cross-validated random forest and a logistic regression are trained and saved in the MODEL_FOLDER
- Predictions are generated
- Model performance is evaluated and reports are saved in the RESULT_FOLDER
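The categorical encoding step in the list above can be sketched with pandas as follows. This is a minimal illustration, not the code in churn_library.py: the function name, the "Churn" target column, the "<column>_Churn" naming scheme and the toy data are all assumptions made for the example.

```python
import pandas as pd


def encode_by_churn_rate(df, category_cols, target_col="Churn"):
    """Return a copy of df with one new '<col>_<target_col>' column per
    categorical column, holding the mean of target_col (the proportion
    of churned customers) for that row's category."""
    encoded = df.copy()
    for col in category_cols:
        encoded[f"{col}_{target_col}"] = (
            encoded.groupby(col)[target_col].transform("mean")
        )
    return encoded


# Toy data: hypothetical column names and values, for illustration only
toy = pd.DataFrame(
    {
        "Gender": ["F", "F", "M", "M"],
        "Churn": [1, 0, 0, 0],
    }
)
encoded_toy = encode_by_churn_rate(toy, ["Gender"])
```

Here each row receives the churn rate of its own category, so the new column is numeric and can be passed directly to the models.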
A log of the progress of this analysis can be found in the RESULTS_LOG file.
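For reference, progress logging of this kind could be set up with Python's standard logging module roughly as below. The helper name setup_logging, the logger name and the message format are assumptions for the sketch; only the default log path comes from this README.

```python
import logging
import os

RESULTS_LOG = "./logs/churn_library.log"  # default path from this README


def setup_logging(log_path=RESULTS_LOG):
    """Create the log folder if needed and return a logger.

    The file handler for log_path is attached only on the first call,
    so repeated calls do not duplicate log lines."""
    os.makedirs(os.path.dirname(log_path) or ".", exist_ok=True)
    logger = logging.getLogger("churn_library")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        handler = logging.FileHandler(log_path, mode="a")
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A call such as setup_logging().info("Data loaded") would then append a timestamped line to the log file.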
Make sure the raw data are found in a .csv file called bank_data.csv in the DATA_FOLDER (by default ./data/bank_data.csv)
The analysis described above is performed by running
python churn_library.py
Unit tests for all the functions in churn_library.py are performed by running
python churn_script_logging_and_tests.py