This project builds upon an earlier PySpark implementation to leverage the Databricks ecosystem. It demonstrates how to create and run a Databricks pipeline that uses PySpark to manage and analyze datasets. The primary focus is on extracting, transforming, and analyzing urbanization metrics data from FiveThirtyEight, using Databricks APIs and Python libraries.
Key features:

- **Data Extraction**
  - Utilizes the `requests` library to fetch datasets from specified URLs.
  - Stores the extracted data in the Databricks FileStore for further processing.
- **Databricks Environment Setup**
  - Establishes a connection to the Databricks environment using environment variables for authentication (`SERVER_HOSTNAME` and `ACCESS_TOKEN`); a connectivity and file-check sketch follows this feature list.
  - Configures Databricks clusters to support PySpark workflows.
- **Data Transformation and Load**
  - Converts CSV files into Spark DataFrames for processing.
  - Transforms and stores the processed data as Delta Lake tables in the Databricks environment.
- **Query Transformation and Visualization**
  - Performs predefined Spark SQL queries to transform the data.
  - Creates visualizations from the transformed Spark DataFrames to analyze various metrics.
- **File Path Validation**
  - Implements a function to check if specified file paths exist in the Databricks FileStore.
  - Verifies connectivity with the Databricks API for automated workflows.
- **Automated Job Trigger via GitHub Push**
  - Configures a GitHub workflow that triggers a job run in the Databricks workspace whenever new commits are pushed to the repository (a job-trigger sketch appears after the pipeline-run steps below).
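The following is a minimal sketch of how the environment variables and the file path check might fit together: it reads `SERVER_HOSTNAME` and `ACCESS_TOKEN` and asks the Databricks REST API (DBFS `get-status` endpoint) whether a FileStore path exists. The path and function name are illustrative assumptions, not necessarily those used in this repository.

```python
import os
import requests

# Hypothetical FileStore path, used only for illustration.
FILESTORE_PATH = "/FileStore/urbanization.csv"


def check_filestore_path(path: str) -> bool:
    """Return True if the given DBFS path exists, using the DBFS get-status endpoint."""
    host = os.environ["SERVER_HOSTNAME"]
    token = os.environ["ACCESS_TOKEN"]
    response = requests.get(
        f"https://{host}/api/2.0/dbfs/get-status",
        headers={"Authorization": f"Bearer {token}"},
        params={"path": path},
    )
    # 200 means the path exists; 404 (RESOURCE_DOES_NOT_EXIST) means it does not.
    return response.status_code == 200


if __name__ == "__main__":
    print(check_filestore_path(FILESTORE_PATH))
```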
To prepare the Databricks environment:

- Create a Databricks workspace on Azure.
- Connect your GitHub account to the Databricks workspace.
- Set up a global init script that stores the required environment variables when clusters start.
- Create a Databricks cluster that supports PySpark operations.
The pipeline is organized into the following components:

- **Data Extraction**
  - File: `mylib/extract.py`
  - Retrieves data from the source and stores it in the Databricks FileStore.
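As a rough illustration of this step, here is a minimal extraction sketch, assuming a hypothetical FiveThirtyEight CSV URL and FileStore target path; the actual logic in `mylib/extract.py` may differ:

```python
import requests


def extract(url: str, file_path: str = "dbfs:/FileStore/urbanization.csv") -> str:
    """Download a CSV over HTTP and store it in the Databricks FileStore."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # dbutils is provided automatically in Databricks notebooks and jobs;
    # this sketch assumes it runs in that context.
    dbutils.fs.put(file_path, response.text, True)
    return file_path


# Hypothetical dataset URL from the FiveThirtyEight data repository:
# extract("https://raw.githubusercontent.com/fivethirtyeight/data/master/urbanization-index/urbanization-state.csv")
```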
- **Data Transformation and Load**
  - File: `mylib/transform_load.py`
  - Converts raw data into Delta Lake tables and loads them into the Databricks environment.
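A sketch of the transform-and-load step under the same assumptions: it reads the CSV from the FileStore into a Spark DataFrame and writes it out as a Delta Lake table. The table and path names are illustrative and may not match `mylib/transform_load.py`:

```python
from pyspark.sql import SparkSession


def load(csv_path: str = "dbfs:/FileStore/urbanization.csv",
         table_name: str = "urbanization_delta") -> str:
    """Read a CSV from the FileStore and save it as a Delta Lake table."""
    spark = SparkSession.builder.appName("TransformLoad").getOrCreate()
    df = (
        spark.read
        .option("header", "true")       # first row holds column names
        .option("inferSchema", "true")  # infer numeric types instead of all strings
        .csv(csv_path)
    )
    # Overwrite so the pipeline can be re-run idempotently.
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return table_name
```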
- **Query and Visualization**
  - File: `mylib/query.py`
  - Defines SQL queries and generates visualizations using Spark.
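And a sketch of the query-and-visualization step: it runs a Spark SQL query against the Delta table and plots the aggregated result with matplotlib. The column names (`state`, `urbanindex`) are assumptions based on the FiveThirtyEight urbanization dataset and may not match the queries defined in `mylib/query.py`:

```python
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession


def query_and_visualize(table_name: str = "urbanization_delta") -> None:
    """Run a Spark SQL aggregation and plot the top states by urbanization index."""
    spark = SparkSession.builder.appName("QueryVisualize").getOrCreate()
    result = spark.sql(
        f"""
        SELECT state, AVG(urbanindex) AS avg_urbanindex
        FROM {table_name}
        GROUP BY state
        ORDER BY avg_urbanindex DESC
        LIMIT 10
        """
    )
    # The aggregated result is small, so it is safe to collect it to the driver.
    pdf = result.toPandas()
    pdf.plot(kind="barh", x="state", y="avg_urbanindex", legend=False)
    plt.xlabel("Average urbanization index")
    plt.title("Top 10 states by average urbanization index")
    plt.tight_layout()
    plt.show()
```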
To run the pipeline:

- Set up a Databricks workspace and cluster on Azure.
- Clone this repository into your Databricks workspace.
- Configure the environment variables (`SERVER_HOSTNAME` and `ACCESS_TOKEN`) for API access.
- Create a Databricks job with the following tasks to build and run the pipeline:
  - Extract Task: `mylib/extract.py`
  - Transform and Load Task: `mylib/transform_load.py`
  - Query and Visualization Task: `mylib/query.py`
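For the GitHub-push automation, the workflow can call the Databricks Jobs API to start the job created above. Below is a minimal sketch of such a trigger script; it assumes the job ID is supplied through a `JOB_ID` environment variable (for example, a repository secret), which is an assumption rather than part of this repository:

```python
import os
import requests


def trigger_job_run() -> None:
    """Start a run of an existing Databricks job via the Jobs 2.1 run-now endpoint."""
    host = os.environ["SERVER_HOSTNAME"]
    token = os.environ["ACCESS_TOKEN"]
    job_id = int(os.environ["JOB_ID"])  # assumed: ID of the pipeline job
    response = requests.post(
        f"https://{host}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": job_id},
    )
    response.raise_for_status()
    print("Triggered run:", response.json().get("run_id"))


if __name__ == "__main__":
    trigger_job_run()
```

A GitHub Actions step can then execute this script (or an equivalent `curl` call) on every push to the repository.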
- This project is designed specifically for use within the Databricks environment. Because it depends on Databricks infrastructure, some functionality cannot be replicated outside the workspace (e.g., testing data access from a GitHub Actions environment).
- A YouTube video demonstration of the pipeline implementation and data analysis is available for further insight.
Possible future enhancements:

- Expand the pipeline to handle additional datasets.
- Integrate advanced visualizations and analytics using tools like Tableau or Power BI.
- Optimize the pipeline for larger datasets and higher computational efficiency.
Feel free to explore, contribute, and reach out with suggestions or questions!