The main purpose of this project is to focus on the engineering side rather than the modelling side. We build efficient data pipelines and adhere to coding best practices using different tools, languages, and technologies such as Python, Scala, Spark, Docker, and CI/CD tools.
Note: This repo will be used for testing different technologies.
The dataset for this project is taken from Kaggle. It is a simple dataset regarding customer churn including both numeric and categorical features. It is a classification task with the target variable being binary (True/False), meaning that if a customer has left the company, the target variable is True/1, otherwise it is False/0.
For the pipeline to work, save the CSV file from Kaggle in the "data" directory (src/python/src/data/) and rename it to "telco_churn.csv".
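As a quick sanity check before starting the pipeline, a small helper can confirm that the dataset sits where the instructions above place it. This is a minimal sketch and not part of the repo: the function name dataset_ready and its repo_root parameter are hypothetical, and only the relative path comes from the text above.

```python
from pathlib import Path

# Relative location taken from the instructions above.
EXPECTED_DATASET = Path("src/python/src/data/telco_churn.csv")


def dataset_ready(repo_root="."):
    """Return True if telco_churn.csv exists at the expected location.

    `repo_root` is a hypothetical parameter: the directory the repo was cloned into.
    """
    return (Path(repo_root) / EXPECTED_DATASET).is_file()


if __name__ == "__main__":
    if dataset_ready():
        print("Dataset found.")
    else:
        print(f"Dataset missing: expected {EXPECTED_DATASET}")
```

Run from the repo root, the script reports whether the file is in place. It only checks for the file's presence and does not validate its contents.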
To run the project, you need to clone this repo and run the docker/docker-compose-shell.sh script.
This script runs the train phase, the predict phase, or both. To run only the train phase, pass the argument "train" to the script. To run only the predict phase, pass "predict". To run both, either pass "both" or run the script with no arguments.
Clone the repo
git clone https://github.com/SteliosGian/churn-engineering.git
Run the script
./docker/docker-compose-shell.sh
If the script is not executable, grant it execute permission
chmod +x docker/docker-compose-shell.sh
or run
bash docker/docker-compose-shell.sh
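The way the script's optional argument selects phases can be sketched in Python. The real logic lives in the shell script, whose internals are not shown here, so this only mirrors the documented behaviour. The names resolve_phases and PHASES_BY_ARGUMENT are hypothetical, and rejecting unknown arguments with an error is an assumption.

```python
# Documented arguments and the phases each one triggers.
PHASES_BY_ARGUMENT = {
    "train": ["train"],
    "predict": ["predict"],
    "both": ["train", "predict"],
}


def resolve_phases(args):
    """Return the phases to run for the given list of command-line arguments.

    An empty list behaves like "both", matching the documented default.
    """
    mode = args[0] if args else "both"
    if mode not in PHASES_BY_ARGUMENT:
        # Assumption: unknown arguments are rejected rather than ignored.
        raise ValueError(f"unknown argument {mode!r}; expected train, predict, or both")
    # Return a copy so callers cannot mutate the lookup table.
    return list(PHASES_BY_ARGUMENT[mode])
```

For example, resolve_phases(["train"]) yields only the train phase, while resolve_phases([]) yields train followed by predict.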
The project starts a local MLflow server running in the background, which you can access at http://127.0.0.1:5000/.
With MLflow, you can track custom metrics and hyperparameters as well as log artifacts such as plots and models.
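Logging to that server from Python could look like the sketch below. It assumes the mlflow package is installed and the server is reachable at the address above. The function name log_training_run is hypothetical, and the client parameter exists only so the sketch can be exercised without a running server. This is not necessarily how train.py does it.

```python
def log_training_run(params, metrics, artifacts=(),
                     tracking_uri="http://127.0.0.1:5000", client=None):
    """Record one run: hyperparameters, metrics, and optional artifact files.

    `client` is a hypothetical hook; when omitted, the real mlflow module is used.
    """
    if client is None:
        import mlflow  # imported lazily; requires the mlflow package
        client = mlflow

    client.set_tracking_uri(tracking_uri)
    with client.start_run():
        client.log_params(params)    # dict of hyperparameter name -> value
        client.log_metrics(metrics)  # dict of metric name -> numeric value
        for path in artifacts:
            client.log_artifact(path)  # local files such as plots or saved models
```

A call might pass a hyperparameter dict, a metrics dict, and a list of local file paths. The run then appears in the MLflow UI at the address above.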
Docker must be installed to run the project with Docker. Without it, the project can be run by executing the Python scripts (train.py and predict.py) individually.
Spark is not needed for this project because the amount of data is not that large. However, a small pipeline is created in the branch "spark" using Scala.
- Docker ☑
- Shell scripts ☑
- TravisCI ☑
- MLflow ☑
- Spark ☑
- API