In this project, we are dealing with a multi-class classification problem where we are given a set of songs, their metadata (check out `data/README` provided by the author, or my analytics notebook `notebook/exploratory_data_analysis.ipynb`, for more information) and their genres (target label).
A simple 4-fold LightGBM model with selected & engineered features was trained (cross-validation test accuracy of ~69.6%), and a simple web service was built on top of it.
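As a rough illustration (not the project's actual training code), 4-fold cross-validated LightGBM training looks something like the sketch below; the `genre` target column name and the restriction to numeric feature columns are assumptions on my part.

```python
# Minimal sketch of 4-fold CV with LightGBM -- illustrative only.
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("data/features.csv")                    # training data in this repo
y = df["genre"]                                          # assumption: target column name
X = df.drop(columns=["genre"]).select_dtypes("number")   # assumption: numeric features only

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=4, shuffle=True, random_state=42).split(X, y):
    model = lgb.LGBMClassifier(objective="multiclass")
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    scores.append(model.score(X.iloc[val_idx], y.iloc[val_idx]))  # fold accuracy

print(f"Mean CV accuracy: {sum(scores) / len(scores):.3f}")
```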
The web service exposes the following APIs:
- `[POST] classifier/predict-batch`: Classify input data (in the form of .csv files) and persist the song's trackid, title and genre (prediction results) to the database.
- `[GET] genre/list`: Returns a list of classified genres in the database.
- `[GET] genre/title-list`: Returns a list of titles, given a genre.
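For a quick taste of the two GET endpoints, here is a hedged sketch using `requests`. The `/api/v1` prefix is inferred from the Swagger path shown in the setup steps below, and the `genre` query parameter name is a guess; check the Swagger UI for the exact signatures.

```python
# Sketch: querying the genre APIs of a locally running service.
import requests

BASE = "http://0.0.0.0:5000/api/v1"
HEADERS = {"client-id": "test", "client-secret": "test"}  # defaults from env/api-service.env

# All classified genres currently in the database
print(requests.get(f"{BASE}/genre/list", headers=HEADERS).json())

# Titles for one genre ("genre" param name is an assumption)
print(requests.get(f"{BASE}/genre/title-list", headers=HEADERS, params={"genre": "jazz"}).json())
```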
Demo video of the web service: `music_genre_classification.mov`
1. Install Docker.
2. Clone this repo and change directory to the project root:
   ```
   cd <wherever-this-repo-is>
   ```
3. Build the docker containers (web service and database):
   ```
   docker-compose build
   ```
4. Start the docker containers:
   ```
   docker-compose up
   ```
5. By default, the database is empty, so you need to use the `predict-batch` API to upload the first batch of data. The simplest way to do this is via the FastAPI Swagger UI of the web service (one of the containers you started in Step 4), which can be accessed on http://0.0.0.0:5000/docs:
   - Go to the Swagger UI.
   - Click on `[POST] /api/v1/predict-batch` -> `Try it out`.
   - Input the `client-id` and `client-secret` (`test` for both, unless you made changes to `env/api-service.env`).
   - Under `csv_file`, select either the training file `data/features.csv` or the test file `data/test.csv`.
   - Execute, and you should see some data in your database (use the other GET APIs to confirm this).
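If you would rather script the upload than click through the UI, here is a sketch of the equivalent request (the endpoint path and the `csv_file` field name are taken from the Swagger UI steps above):

```python
# Sketch: uploading the first batch of data programmatically.
import requests

with open("data/features.csv", "rb") as f:
    resp = requests.post(
        "http://0.0.0.0:5000/api/v1/predict-batch",
        headers={"client-id": "test", "client-secret": "test"},
        files={"csv_file": ("features.csv", f, "text/csv")},
    )
print(resp.status_code, resp.json())
```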
1. Install poetry and jupyter notebook (or, if you prefer to install the latter in the current environment only, run `poetry add notebook` after Step 4).
2. Clone this repo and change directory to the project root:
   ```
   cd <wherever-this-repo-is>
   ```
3. Install the project dependencies:
   ```
   poetry install
   ```
4. Activate the poetry environment:
   ```
   poetry shell
   ```
5. Install the Jupyter kernel:
   ```
   python -m ipykernel install --user --name "music-clf" --display-name "music-clf"
   ```
6. Start jupyter notebook:
   ```
   jupyter notebook
   ```
7. Navigate to the notebook folder and select the notebook to run interactively (remember to change the kernel to `music-clf`, created in Step 5).
If you intend to use the pretrained transformer to generate embeddings for the UMAP (in the analytics notebook), you need to:
1. Create a new folder `model/paraphrase_tinybert`.
2. Download the model files from the HuggingFace Model Hub.
3. Move the contents to the folder created in Step 1.
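If you have the `sentence-transformers` package installed, steps 2 and 3 can be done in one go; a sketch, assuming the checkpoint is `sentence-transformers/paraphrase-TinyBERT-L6-v2` (check the notebook for the exact model name):

```python
# Sketch: download the pretrained transformer and save it where the
# analytics notebook expects it. The checkpoint name is an assumption.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-TinyBERT-L6-v2")
model.save("model/paraphrase_tinybert")  # the folder created in Step 1
```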
To confirm that the web services work (especially if you make changes to the logic in the future), we run unit tests on the codebase. This is how you can do it:
1. Make sure the database container is up (otherwise, repeat Section 2.1).
2. Make a new `.env` file in the project root and copy the contents of `env/api-service.env` to it, with the following changes (see the sketch after this list):
   - `IS_LOCAL=true`
   - `MODEL_PATH=/change-this-to-wherever-the-repo-is/model/deployment/`
3. Change directory to the project root and run `pytest`.
4. Check that all tests passed; otherwise, figure out why.
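The resulting `.env` might look like this (only the two overridden variables are spelled out; copy everything else verbatim from `env/api-service.env`):

```
# .env (project root), copied from env/api-service.env and then modified
IS_LOCAL=true
MODEL_PATH=/change-this-to-wherever-the-repo-is/model/deployment/
# ...keep the remaining variables (e.g. POSTGRES_URI) unchanged
```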
Check out the notebook `notebook/exploratory_data_analysis.ipynb`. Every section of the notebook is organised in a similar manner:
- Observations
- Insights/Decisions
- A bunch of code & plots related to the section

In each section, I plotted the features, explored their characteristics across genres, and made notes on these observations as well as the subsequent modelling & feature engineering decisions derived from them.
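The per-genre plots follow a pattern roughly like this hypothetical sketch (not an actual cell from the notebook; the `genre` and `tempo` column names are assumptions):

```python
# Hypothetical sketch of a per-genre feature plot like those in the notebook.
import pandas as pd
import seaborn as sns

df = pd.read_csv("data/features.csv")
sns.boxplot(data=df, x="genre", y="tempo")  # one feature's distribution across genres
```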
In addition to the FastAPI Swagger UI, the OpenAPI documentation can also be found in `openapi.json`. This can be imported into Postman or Swagger Editor directly.
In this project, I attempted to modularize the code base and, as much as possible, use a similar folder structure for every part of the web service.

There are currently 3 resources (classifier, genre, healthcheck). Each resource:
- Is organised into a separate folder & added to the main app using `api_router`
- Has a `schema` folder which stores the Pydantic models for input & output validation (generated from sample responses, which are stored in `schema/response_json`)
- Has a `router.py` which stores the API logic
Something like this:
```
src/genre
├── __init__.py
├── router.py
└── schema
    ├── response_json
    │   └── genre_list.json   # Sample model of /genre/list API output
    ├── input_data.py         # Model for all /genre/ API input params
    └── genre_list.py         # Model for /genre/list API
```
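To make the wiring concrete, here is a minimal, hypothetical sketch of how such a resource plugs into the main app; the model and function names are illustrative, and the real `/genre/list` logic queries the database:

```python
# Minimal sketch of the resource wiring described above (names are illustrative).
from fastapi import APIRouter, FastAPI
from pydantic import BaseModel

# src/genre/schema/genre_list.py -- output model for /genre/list
class GenreList(BaseModel):
    genres: list[str]

# src/genre/router.py -- API logic for the genre resource
router = APIRouter(prefix="/genre", tags=["genre"])

@router.get("/list", response_model=GenreList)
def list_genres() -> GenreList:
    return GenreList(genres=["rock", "jazz"])  # stub; the real service reads the DB

# main app -- every resource router is mounted under the /api/v1 prefix
app = FastAPI()
app.include_router(router, prefix="/api/v1")
```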
These are handled using `env/*.env` files, so that we can easily switch between local/staging/production environments just by changing & deploying the right `.env` files. These can also be turned into `configmap.yaml`/`secrets.yaml` easily, if a managed Kubernetes cluster is used in staging/production.
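Inside the service, env values like these are typically collected into a single settings object. A sketch of the idea, assuming pydantic v1's `BaseSettings` (the three variables are just the ones mentioned in this README, not the full set):

```python
# Sketch: loading env/*.env values into one settings object (pydantic v1 style).
from pydantic import BaseSettings

class Settings(BaseSettings):
    is_local: bool = False
    model_path: str
    postgres_uri: str

    class Config:
        env_file = ".env"  # point this at the env file for the target environment

settings = Settings()
```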
With the `Dockerfile`, an image can be easily built and deployed on cloud services (e.g. AWS Elastic Beanstalk, using an image stored on AWS ECR), which then enables configurable automatic load-balancing & scaling.

In terms of the database, we can also easily swap out the existing one for a better-spec database (e.g. AWS RDS Postgres with multi-AZ deployment, for enhanced data durability & availability) by simply changing the `POSTGRES_URI` in `env/api-service.env`.
Lastly, in terms of scaling the web service securely in a production environment, the `client-id` and `client-secret` are required authentication headers for all the APIs (except healthcheck). These can be set to rotate on a scheduled interval, using something like AWS Secrets Manager, to better secure the APIs.
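For instance, the service could fetch the rotated credentials at startup instead of reading static values from the env file; a hypothetical sketch (the secret name and JSON layout are made up):

```python
# Hypothetical sketch: reading rotated API credentials from AWS Secrets Manager.
import json

import boto3

client = boto3.client("secretsmanager")
secret = client.get_secret_value(SecretId="music-clf/api-credentials")  # made-up name
creds = json.loads(secret["SecretString"])
CLIENT_ID, CLIENT_SECRET = creds["client-id"], creds["client-secret"]
```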