This repo contains all the resources for my demo explaining how to use DVC along with other interesting tools & frameworks like PyCaret & FastAPI for data & model versioning, experimentation with ML models, & finally deploying these models quickly for inference.
This demo was presented at the DVC Office Hours on 20th Jan 2022.
Note: We will use Azure Blob Storage as our remote storage for this demo. To follow along, it is advised to either create an Azure account or use a different remote for storage.
Create a virtual environment named dvc-demo & install required packages:
python3 -m venv dvc-demo
source dvc-demo/bin/activate
pip install dvc[azure] pycaret fastapi uvicorn python-multipart
Initialize the repo with DVC tracking & create a data/ folder:
mkdir dvc-pycaret-fastapi-demo
cd dvc-pycaret-fastapi-demo
git init
dvc init
git remote add origin https://github.com/tezansahu/dvc-pycaret-fastapi-demo.git
mkdir data
We use the Heart Failure Prediction Dataset for this demo. First, we download the heart.csv file & retain ~800 rows from it in the data/ folder. (We will use the file with all the rows later; this simulates the change/increase in data that an ML workflow sees during its lifetime.)
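A minimal sketch of this trimming step, assuming pandas is available & the full heart.csv has been downloaded to the project root (the exact row count & method are up to you):

```python
# Keep roughly the first 800 rows to simulate the smaller "phase 1" dataset.
# Assumes pandas is installed and the full heart.csv sits in the project root.
import pandas as pd

df = pd.read_csv("heart.csv")
df.head(800).to_csv("data/heart.csv", index=False)
```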
Track this data/heart.csv using DVC:
dvc add data/heart.csv
git add data/heart.csv.dvc
git commit -m "add data - phase 1"
- Go to the Azure Portal & create a Storage Account (here, we name it dvcdemo)
- Within the storage account, create a Container (here, we name it demo20jan2022)
- Obtain the Connection String for the storage account (in the portal, this is under Access keys)
- Install the Azure CLI & log into Azure from within the terminal using az login
Now, we store the tracked data in Azure:
dvc remote add -d storage azure://demo20jan2022/dvcstore
dvc remote modify --local storage connection_string <connection-string>
dvc push
git push origin main
Create the notebooks/ folder using mkdir notebooks & download the notebooks/experimentation_with_pycaret.ipynb notebook from this repo into this folder.
Track this notebook with Git:
git add notebooks/
git commit -m "add ml training notebook"
Run all the cells mentioned under Phase 1 in the notebook. This covers the basics of PyCaret:
- Setting up a vanilla experiment with setup()
- Comparing various classification models with compare_models()
- Evaluating the performance of a model with evaluate_model()
- Making predictions on the held-out eval data using predict_model()
- Finalizing the model by training on the full training + eval data using finalize_model()
- Saving the model pipeline using save_model()

This will create a model.pkl file in the models/ folder.
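For reference, the Phase 1 flow condenses to roughly the following; the notebook is the authoritative version, the session_id here is an illustrative choice, & the target column name follows the Heart Failure Prediction Dataset:

```python
# Condensed sketch of the Phase 1 PyCaret flow; see the notebook for the
# authoritative cell-by-cell version.
import pandas as pd
from pycaret.classification import (
    setup, compare_models, evaluate_model,
    predict_model, finalize_model, save_model,
)

data = pd.read_csv("data/heart.csv")
setup(data=data, target="HeartDisease", session_id=42)  # illustrative session_id

best = compare_models()                  # rank candidate classifiers
evaluate_model(best)                     # interactive performance plots
predict_model(best)                      # predictions on the held-out eval split
final_model = finalize_model(best)       # retrain on training + eval data
save_model(final_model, "models/model")  # writes models/model.pkl
```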
Now, we track the ML model using DVC & store it in our remote storage
dvc add models/model.pkl
git add models/model.pkl.dvc
git commit -m "add model - phase 1"
dvc push
git push origin main
First, delete .dvc/cache/ & models/model.pkl (e.g., rm -rf .dvc/cache models/model.pkl) to simulate a production environment. Then, pull the changes from the DVC remote storage:
dvc pull
Check that the model.pkl file is now present in the models/ folder.
Now, create a server/ folder, download the server/main.py file from this repo, & place it in this folder. This RESTful API server has 2 POST endpoints:
- Inference on an individual record
- Batch inference on a CSV file
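The server/main.py in the repo is the source of truth; the sketch below only shows the general shape such a server could take (the endpoint paths, response format, & prediction column name are illustrative & may differ from the actual file):

```python
# Hypothetical sketch of a PyCaret-backed FastAPI server; the repo's
# server/main.py is authoritative. Paths & field handling are illustrative.
import pandas as pd
from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel
from pycaret.classification import load_model, predict_model

app = FastAPI()
model = load_model("../models/model")  # loads models/model.pkl

class Record(BaseModel):
    Age: int
    Sex: str
    ChestPainType: str
    RestingBP: int
    Cholesterol: int
    FastingBS: int
    RestingECG: str
    MaxHR: int
    ExerciseAngina: str
    Oldpeak: float
    ST_Slope: str

@app.post("/predict")
def predict(record: Record):
    # Inference on an individual record
    df = pd.DataFrame([record.dict()])
    preds = predict_model(model, data=df)
    return {"prediction": int(preds["Label"][0])}  # column is "prediction_label" in PyCaret 3.x

@app.post("/predict_batch")
async def predict_batch(file: UploadFile = File(...)):
    # Batch inference on an uploaded CSV file
    df = pd.read_csv(file.file)
    preds = predict_model(model, data=df)
    return {"predictions": preds["Label"].tolist()}
```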
We commit this to our repo:
git add server/
git commit -m "create basic fastapi server"
Now, we can run our local server on port 8000
cd server
uvicorn main:app --port=8000
Go to http://localhost:8000/docs & play with the endpoints present in the interactive documentation.
For the individual inference, you could use the following data:
{
"Age": 61,
"Sex": "M",
"ChestPainType": "ASY",
"RestingBP": 148,
"Cholesterol": 203,
"FastingBS": 0,
"RestingECG": "Normal",
"MaxHR": 161,
"ExerciseAngina": "N",
"Oldpeak": 0,
"ST_Slope": "Up"
}
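Equivalently, you could hit the endpoint programmatically; the path used below is an assumption, so verify it against the interactive docs first:

```python
# Post the sample record to the running server. The /predict path is
# hypothetical -- check http://localhost:8000/docs for the actual route.
import requests

record = {
    "Age": 61, "Sex": "M", "ChestPainType": "ASY", "RestingBP": 148,
    "Cholesterol": 203, "FastingBS": 0, "RestingECG": "Normal",
    "MaxHR": 161, "ExerciseAngina": "N", "Oldpeak": 0, "ST_Slope": "Up",
}
resp = requests.post("http://localhost:8000/predict", json=record)
print(resp.json())
```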
Now, we use the full heart.csv file to simulate the arrival of new data with time. We place it within the data/ folder & upload it to the DVC remote.
dvc add data/heart.csv
git add data/heart.csv.dvc
git commit -m "add data - phase 2"
dvc push
git push origin main
Now, we run the experiment in Phase 2 of the notebooks/experimentation_with_pycaret.ipynb notebook. This involves:
- Feature engineering while setting up the experiment
- Fine-tuning of models with tune_model()
- Creating an ensemble of models with blend_models()
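Condensed, the Phase 2 flow looks roughly like this; again, the notebook is authoritative, & the feature-engineering arguments shown are just illustrative examples of setup() options:

```python
# Sketch of the Phase 2 flow: tune the top models, then blend them.
import pandas as pd
from pycaret.classification import (
    setup, compare_models, tune_model, blend_models,
    finalize_model, save_model,
)

data = pd.read_csv("data/heart.csv")  # now the full dataset
setup(data=data, target="HeartDisease", normalize=True, session_id=42)  # illustrative args

top3 = compare_models(n_select=3)      # keep the 3 best candidates
tuned = [tune_model(m) for m in top3]  # hyperparameter search per model
blender = blend_models(tuned)          # voting ensemble of the tuned models
save_model(finalize_model(blender), "models/model")  # overwrites models/model.pkl
```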
The blended model is saved as models/model.pkl.
We upload it to our DVC remote.
dvc add models/model.pkl
git add models/model.pkl.dvc
git commit -m "add model - phase 2"
dvc push
git push origin main
Now, we again start the server (no code changes are required, because the model file has the same name) & perform inference.
cd server
uvicorn main:app --port=8000
With this, we demonstrate how DVC can be used in conjunction with PyCaret & FastAPI for iterating & experimenting efficiently with ML models & deploying them with minimal effort.
- Fundamentals of MLOps: A 4-blog series
- DVC Documentation
- PyCaret Documentation
- FastAPI Documentation
Created with ❤️ by Tezan Sahu