- **Data Extraction**
  - Utilizes the `requests` library to fetch datasets from specified URLs.
  - Stores the extracted data in the Databricks FileStore for further processing (see the extraction sketch after this list).
- **Databricks Environment Setup**
  - Establishes a connection to the Databricks environment using environment variables (`SERVER_HOSTNAME` and `ACCESS_TOKEN`) for authentication (connection check sketched after this list).
  - Configures Databricks clusters to support PySpark workflows.
- **Data Transformation and Load**
  - Converts CSV files into Spark DataFrames for processing.
  - Transforms and stores the processed data as Delta Lake tables in the Databricks environment (load sketch after this list).
- **Query Transformation and Visualization**
  - Performs predefined Spark SQL queries to transform the data.
  - Creates visualizations from the transformed Spark DataFrames to analyze various metrics (query sketch after this list).
- **File Path Validation**
  - Implements a function that checks whether specified file paths exist in the Databricks FileStore (path check sketched after this list).
  - Verifies connectivity with the Databricks API for automated workflows.
- **Automated Job Trigger via GitHub Push**
  - Configures a GitHub Actions workflow that triggers a job run in the Databricks workspace whenever new commits are pushed to the repository (see the trigger sketch after the setup steps).
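Extraction: a minimal sketch of the extract step, assuming a placeholder dataset URL and FileStore path (the repository's actual source and target may differ).

```python
"""Extract step: fetch a CSV over HTTP and land it in the Databricks FileStore."""
import requests


def extract(url, file_path="/dbfs/FileStore/mini_project/data.csv"):
    # Placeholder path; on a Databricks cluster, /dbfs/... maps to the FileStore.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(response.content)
    return file_path
```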
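Environment setup: a sketch of the connection check, assuming `SERVER_HOSTNAME` and `ACCESS_TOKEN` are exported as environment variables; the `clusters/list` endpoint is used here only as a lightweight read-only probe, not necessarily the call the repository makes.

```python
"""Verify the Databricks workspace is reachable with the configured credentials."""
import os

import requests

SERVER_HOSTNAME = os.getenv("SERVER_HOSTNAME")  # e.g. adb-<id>.<region>.azuredatabricks.net
ACCESS_TOKEN = os.getenv("ACCESS_TOKEN")        # Databricks personal access token


def check_connection():
    # Any authenticated endpoint works as a probe; clusters/list is a cheap read.
    response = requests.get(
        f"https://{SERVER_HOSTNAME}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        timeout=30,
    )
    return response.status_code == 200
```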
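Transform and load: a sketch of reading the extracted CSV into a Spark DataFrame and persisting it as a Delta Lake table; the path and table name are placeholders.

```python
"""Transform-and-load step: CSV -> Spark DataFrame -> Delta Lake table."""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transform_load").getOrCreate()


def load(csv_path="dbfs:/FileStore/mini_project/data.csv", table_name="mini_project_delta"):
    # Placeholder path and table name; adjust to the pipeline's real dataset.
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(csv_path)
    )
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return table_name
```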
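Query and visualization: a sketch of running a Spark SQL aggregation and plotting the result; the table, columns (`category`, `value`), and chart are illustrative, not the repository's actual query.

```python
"""Query step: run a Spark SQL aggregation and plot the result."""
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query").getOrCreate()


def query_and_plot(table_name="mini_project_delta"):
    # Illustrative aggregation; swap in the pipeline's real query and columns.
    result = spark.sql(
        f"SELECT category, AVG(value) AS avg_value "
        f"FROM {table_name} GROUP BY category ORDER BY avg_value DESC"
    )
    pdf = result.toPandas()
    pdf.plot(kind="bar", x="category", y="avg_value", legend=False)
    plt.ylabel("average value")
    plt.tight_layout()
    plt.savefig("avg_value_by_category.png")
```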
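File path validation: a sketch of the path check using the DBFS `get-status` REST endpoint, which returns 200 when the path exists and 404 when it does not; the default path is a placeholder.

```python
"""File-path validation: confirm a FileStore path exists via the Databricks REST API."""
import os

import requests

SERVER_HOSTNAME = os.getenv("SERVER_HOSTNAME")
ACCESS_TOKEN = os.getenv("ACCESS_TOKEN")


def path_exists(path="/FileStore/mini_project/data.csv"):
    # Placeholder path; get-status reports whether the DBFS path exists.
    response = requests.get(
        f"https://{SERVER_HOSTNAME}/api/2.0/dbfs/get-status",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        params={"path": path},
        timeout=30,
    )
    return response.status_code == 200
```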
To prepare the Databricks environment:

- Create a Databricks workspace on Azure.
- Connect your GitHub account to the Databricks workspace.
- Set up a global init script that stores the environment variables when the cluster starts.
- Create a Databricks cluster that supports PySpark operations.
To run the pipeline:

- Set up a Databricks workspace and cluster on Azure.
- Clone this repository into your Databricks workspace.
- Configure environment variables (`SERVER_HOSTNAME` and `ACCESS_TOKEN`) for API access.
- Create a Databricks job to build and run the pipeline:
  - Extract Task: `mylib/extract.py`
  - Transform and Load Task: `mylib/transform_load.py`
  - Query and Visualization Task: `mylib/query.py`
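Automated trigger: a sketch of the script a GitHub Actions workflow could run on push to start the job above via the Databricks Jobs `run-now` endpoint; the `JOB_ID` variable and the idea of passing credentials through repository secrets are assumptions, not taken from this repository's workflow file.

```python
"""Trigger a run of the Databricks job (intended to be called from a GitHub Actions workflow on push)."""
import os

import requests

SERVER_HOSTNAME = os.getenv("SERVER_HOSTNAME")
ACCESS_TOKEN = os.getenv("ACCESS_TOKEN")
JOB_ID = os.getenv("JOB_ID")  # placeholder: the ID of the job created in the setup steps


def trigger_job_run():
    # run-now starts a new run of an existing Databricks job.
    response = requests.post(
        f"https://{SERVER_HOSTNAME}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        json={"job_id": int(JOB_ID)},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["run_id"]


if __name__ == "__main__":
    print(f"Triggered run {trigger_job_run()}")
```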