The following is an anomaly detection data pipeline on Azure Databricks. It was built to demonstrate how to construct advanced analytics pipelines on Azure Databricks, with a particular focus on the Spark MLlib library. The solution includes:
- Initial ETL data loading process into Spark SQL tables
- Model training and scoring
- Explanation of Pipelines, Transformers, and Estimators (see the pipeline sketch after this list)
- Sample custom Estimator (PCAAnomaly, sketched after this list)
- Persisting trained models
- Productionizing models through:
  - Batch inference
  - Streaming (see the scoring sketch after this list)
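To illustrate how these pieces fit together, here is a minimal sketch of a Spark MLlib pipeline of the kind the notebooks build up. The table name, column names, and model path are placeholder assumptions, not the solution's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA

# In a Databricks notebook `spark` is already defined; getOrCreate() keeps the sketch self-contained.
spark = SparkSession.builder.getOrCreate()

# Data loaded into a Spark SQL table by the ETL step (table and column names are placeholders).
df = spark.table("anomaly_db.kdd")

# Stages chained in a Pipeline:
#   VectorAssembler - a Transformer: combines raw columns into a single vector column
#   StandardScaler  - an Estimator: fit() learns scaling statistics and yields a Transformer
#   PCA             - an Estimator: fit() learns the principal components
assembler = VectorAssembler(inputCols=["duration", "src_bytes", "dst_bytes"],
                            outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
pca = PCA(k=3, inputCol="features", outputCol="pca_features")
pipeline = Pipeline(stages=[assembler, scaler, pca])

# Fitting the Pipeline (itself an Estimator) returns a PipelineModel (a Transformer).
model = pipeline.fit(df)

# Persist the trained model so it can be reloaded later for batch or streaming inference.
model.write().overwrite().save("/mnt/models/anomaly_pipeline")

# Reload and apply it: transform() appends the pipeline's output columns to the input DataFrame.
scored = PipelineModel.load("/mnt/models/anomaly_pipeline").transform(df)
```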
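The solution's `PCAAnomaly` estimator is defined in the notebooks. Purely as a hypothetical illustration of what a custom PySpark Estimator/Model pair can look like, the sketch below scores rows by PCA reconstruction error; its class internals, parameters, and column names are assumptions, not the repository's actual implementation.

```python
import numpy as np
from pyspark.ml import Estimator, Model
from pyspark.ml.feature import PCA
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType


class PCAAnomalyModel(Model, HasInputCol, HasOutputCol):
    """Scores each row by its PCA reconstruction error (higher = more anomalous)."""

    def __init__(self, pcaModel=None, inputCol=None, outputCol=None):
        super().__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)
        self.pcaModel = pcaModel

    def _transform(self, dataset):
        pc = self.pcaModel.pc.toArray()  # (num_features x k) matrix of principal components

        @F.udf(DoubleType())
        def recon_error(vec):
            x = np.array(vec.toArray())
            projected = pc.T @ x            # project into the k-dimensional subspace
            reconstructed = pc @ projected  # map back to the original feature space
            return float(np.linalg.norm(x - reconstructed))

        return dataset.withColumn(self.getOutputCol(), recon_error(F.col(self.getInputCol())))


class PCAAnomaly(Estimator, HasInputCol, HasOutputCol):
    """Fits PCA on the input column and returns a model that emits an anomaly score."""

    def __init__(self, inputCol="features", outputCol="anomaly_score", k=2):
        super().__init__()
        self._set(inputCol=inputCol, outputCol=outputCol)
        self.k = k

    def _fit(self, dataset):
        pca = PCA(k=self.k, inputCol=self.getInputCol(), outputCol="pca_features")
        return PCAAnomalyModel(pcaModel=pca.fit(dataset),
                               inputCol=self.getInputCol(),
                               outputCol=self.getOutputCol())
```

Because `PCAAnomaly` is an `Estimator`, it could replace the final `PCA` stage in the pipeline sketch above, so that `transform()` appends an `anomaly_score` column.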
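Productionizing reuses the persisted model in both modes: load the `PipelineModel` and call `transform()` on either a static DataFrame (batch inference) or a Structured Streaming DataFrame (streaming). The paths, table names, and Delta source below are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.getOrCreate()

# Load the previously persisted pipeline (path is a placeholder).
model = PipelineModel.load("/mnt/models/anomaly_pipeline")

# Batch inference: score a static table and persist the results (table names are placeholders).
batch_df = spark.table("anomaly_db.new_events")
model.transform(batch_df).write.mode("overwrite").saveAsTable("anomaly_db.scored_events")

# Streaming inference: the same model transforms a Structured Streaming DataFrame.
stream_df = spark.readStream.format("delta").load("/mnt/data/events")
query = (model.transform(stream_df)
         .writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/anomaly")
         .outputMode("append")
         .start("/mnt/data/scored_events"))
```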
- To deploy the solution, ensure you are in the root of the repository and use one of the following commands:
  - (Easiest) Using the pre-built Docker container: `docker run -it devlace/azdatabricksanomaly`
  - Build and run the container locally: `make deploy_w_docker`
  - Deploy using your local environment (see requirements below): `make deploy`
- Follow the prompts to log in to Azure and to provide the resource group name, deployment location, etc.
- When prompted for a Databricks host, enter the full URL of your Databricks workspace, e.g. `https://southeastasia.azuredatabricks.net`
- When prompted for a token, generate a new personal access token in your Databricks workspace (under User Settings) and paste it in.

To view additional make commands, run `make`.
Requirements for deploying with your local environment:
- Azure CLI 2.0+
- Python virtualenv or Anaconda
- jq tool
- See `requirements.txt` for the list of necessary Python packages (these are installed by `make requirements`).
- The following works with Windows Subsystem for Linux
- Clone this repository and change into its directory: `cd azure-databricks-anomaly`
- Create a Python environment (virtualenv or Conda). The following uses virtualenv:
  - `virtualenv .` creates a Python virtual environment to work in.
  - `source bin/activate` activates the virtual environment.
  - `make requirements` installs the Python dependencies into the virtual environment.
├── LICENSE
├── Makefile             <- Makefile with commands like `make data` or `make train`
├── README.md            <- The top-level README for developers using this project.
├── deploy               <- Deployment artifacts
│   │
│   ├── databricks       <- Deployment artifacts in relation to the Databricks workspace
│   │
│   ├── deploy.sh        <- Deployment script to deploy all Azure Resources
│   │
│   ├── azuredeploy.json <- Azure ARM template w/ .parameters file
│   │
│   └── Dockerfile       <- Dockerfile for deployment
│
├── models               <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks            <- Jupyter notebooks. Naming convention is a number (for ordering),
│                           the creator's initials, and a short `-` delimited description, e.g.
│                           `1.0-jqp-initial-data-exploration`.
│
├── references           <- Contains the powerpoint presentation, and other reference materials.
│
├── requirements.txt     <- The requirements file for reproducing the analysis environment, e.g.
│                           generated with `pip freeze > requirements.txt`
│
├── setup.py             <- makes project pip installable (`pip install -e .`) so src can be imported
Project based on the cookiecutter data science project template. #cookiecutterdatascience