
Data Pipeline with Databricks

By Tursunai Turumbekova


This project builds on an earlier PySpark implementation and extends it to the Databricks ecosystem. It demonstrates the creation and execution of a Databricks pipeline that uses PySpark to manage and analyze datasets. The primary focus is extracting, transforming, and analyzing urbanization-index data from FiveThirtyEight using Databricks APIs and Python libraries.

Demo video

Key Features

  1. Data Extraction

    • Utilizes the requests library to fetch datasets from specified URLs.
    • Stores the extracted data in the Databricks FileStore for further processing.
  2. Databricks Environment Setup

    • Establishes a connection to the Databricks environment using environment variables for authentication (SERVER_HOSTNAME and ACCESS_TOKEN).
    • Configures Databricks clusters to support PySpark workflows.
  3. Data Transformation and Load

    • Converts CSV files into Spark DataFrames for processing.
    • Transforms and stores the processed data as Delta Lake Tables in the Databricks environment.
  4. Query Transformation and Visualization

    • Performs predefined Spark SQL queries to transform the data.
    • Creates visualizations from the transformed Spark DataFrames to analyze various metrics.
  5. File Path Validation

    • Implements a function to check whether specified file paths exist in the Databricks FileStore (see the sketch after this list).
    • Verifies connectivity with the Databricks API for automated workflows.
  6. Automated Job Trigger via GitHub Push

    • Configures a GitHub workflow to trigger a job run in the Databricks workspace whenever new commits are pushed to the repository.
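
A minimal sketch of the file path check from item 5, assuming SERVER_HOSTNAME and ACCESS_TOKEN are available as environment variables (for example, set by the cluster init script) and using an illustrative FileStore path rather than the exact one used by the pipeline:

```python
# Sketch: check whether a path exists in DBFS via the Databricks REST API.
# SERVER_HOSTNAME and ACCESS_TOKEN are assumed environment variables;
# the FileStore path below is illustrative.
import os
import requests


def dbfs_path_exists(path: str) -> bool:
    host = os.environ["SERVER_HOSTNAME"]
    token = os.environ["ACCESS_TOKEN"]
    url = f"https://{host}/api/2.0/dbfs/get-status"
    headers = {"Authorization": f"Bearer {token}"}
    # get-status returns 200 if the path exists, 404 otherwise.
    response = requests.get(url, headers=headers, params={"path": path})
    return response.status_code == 200


if __name__ == "__main__":
    print(dbfs_path_exists("/FileStore/urbanization.csv"))
```

Because the call requires a valid token and workspace hostname, a successful check also doubles as a connectivity test for automated workflows.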

Project Components

Environment Setup

  • Create a Databricks workspace on Azure.
  • Connect your GitHub account to the Databricks workspace.
  • Set up a global init script for cluster start to store environment variables.
  • Create a Databricks cluster that supports PySpark operations.

Job Run from Automated Trigger:

(Screenshot: Databricks job run triggered by a GitHub push.)
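
The GitHub workflow itself lives in the repository; the sketch below only illustrates the kind of Jobs API call such a workflow can make to start the pipeline. The job ID is a placeholder, and the credentials are assumed to come from repository secrets or environment variables.

```python
# Sketch: trigger a run of an existing Databricks job (Jobs API 2.1).
# The job ID is a placeholder; SERVER_HOSTNAME and ACCESS_TOKEN are assumed
# to be provided to the CI step as secrets.
import os
import requests


def trigger_pipeline_run(job_id: int) -> int:
    host = os.environ["SERVER_HOSTNAME"]
    token = os.environ["ACCESS_TOKEN"]
    url = f"https://{host}/api/2.1/jobs/run-now"
    headers = {"Authorization": f"Bearer {token}"}
    response = requests.post(url, headers=headers, json={"job_id": job_id})
    response.raise_for_status()
    return response.json()["run_id"]  # ID of the newly started run


if __name__ == "__main__":
    print(trigger_pipeline_run(job_id=123))  # placeholder job ID
```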

Pipeline Workflow

  1. Data Extraction

    • File: mylib/extract.py
    • Retrieves data from the source and stores it in the Databricks FileStore (sketched below).
  2. Data Transformation and Load

    • File: mylib/transform_load.py
    • Converts raw data into Delta Lake tables and loads them into the Databricks environment (sketched below).
  3. Query and Visualization

    • File: mylib/query.py
    • Defines Spark SQL queries and generates visualizations from the results (sketched below).
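
The sketches below are minimal illustrations of the three stages, not the exact contents of the mylib scripts. First, extraction: the FiveThirtyEight URL and FileStore path are assumptions, and the code relies on DBFS being mounted at /dbfs on the cluster.

```python
# Sketch of the extraction step, assumed to run on a Databricks cluster
# where DBFS is available at /dbfs. URL and target path are illustrative.
import requests

URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/master/"
    "urbanization-index/urbanization-census-tract.csv"
)
TARGET = "/dbfs/FileStore/urbanization.csv"


def extract(url: str = URL, target: str = TARGET) -> str:
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    # /dbfs/... on the cluster maps to dbfs:/... in Spark.
    with open(target, "wb") as f:
        f.write(response.content)
    return target
```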
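Next, transform and load: the table name is an assumption; the key point is reading the CSV into a Spark DataFrame and persisting it as a Delta table.

```python
# Sketch of the transform-and-load step: CSV -> Spark DataFrame -> Delta table.
# The source path and table name are illustrative.
from pyspark.sql import SparkSession


def load(csv_path: str = "dbfs:/FileStore/urbanization.csv",
         table_name: str = "urbanization_delta") -> str:
    spark = SparkSession.builder.appName("transform_load").getOrCreate()
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    # Overwrite keeps repeated pipeline runs idempotent.
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return table_name
```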
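Finally, query and visualization: the column names (state, urbanindex) are assumptions about the dataset's schema, and matplotlib stands in for whatever plotting setup the actual query script uses.

```python
# Sketch of the query-and-visualization step: aggregate with Spark SQL,
# then plot the result. Table and column names are assumptions.
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession


def query_and_plot(table_name: str = "urbanization_delta") -> None:
    spark = SparkSession.builder.appName("query").getOrCreate()
    result = spark.sql(
        f"""
        SELECT state, AVG(urbanindex) AS avg_urban_index
        FROM {table_name}
        GROUP BY state
        ORDER BY avg_urban_index DESC
        LIMIT 10
        """
    )
    pdf = result.toPandas()
    pdf.plot(kind="bar", x="state", y="avg_urban_index", legend=False)
    plt.ylabel("Average urbanization index")
    plt.tight_layout()
    plt.savefig("top10_urbanization.png")
```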

(Diagram: end-to-end pipeline workflow.)

Sample Visualizations from Query

(Screenshots: sample visualizations produced by the query step.)

Preparation Steps

  1. Set up a Databricks workspace and cluster on Azure.
  2. Clone this repository into your Databricks workspace.
  3. Configure environment variables (SERVER_HOSTNAME and ACCESS_TOKEN) for API access.
  4. Create a Databricks job to build and run the pipeline (see the sketch after this list):
    • Extract Task: mylib/extract.py
    • Transform and Load Task: mylib/transform_load.py
    • Query and Visualization Task: mylib/query.py
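
For step 4, a job with these three tasks can be created in the Databricks UI, or programmatically through the Jobs API. The sketch below is one possible programmatic version; the cluster ID and repository path are placeholders, not values from this project.

```python
# Sketch: create a three-task Databricks job via the Jobs API 2.1.
# Cluster ID and repo path are placeholders; the UI is an equivalent option.
import os
import requests

HOST = os.environ["SERVER_HOSTNAME"]
TOKEN = os.environ["ACCESS_TOKEN"]
REPO = "/Workspace/Repos/<user>/Databricks_Data_Pipeline"  # placeholder path


def task(key, script, depends_on=None):
    spec = {
        "task_key": key,
        "existing_cluster_id": "<cluster-id>",  # placeholder
        "spark_python_task": {"python_file": f"{REPO}/{script}"},
    }
    if depends_on:
        spec["depends_on"] = [{"task_key": depends_on}]
    return spec


job_spec = {
    "name": "urbanization_pipeline",
    "tasks": [
        task("extract", "mylib/extract.py"),
        task("transform_load", "mylib/transform_load.py", depends_on="extract"),
        task("query", "mylib/query.py", depends_on="transform_load"),
    ],
}

response = requests.post(
    f"https://{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
response.raise_for_status()
print("Created job:", response.json()["job_id"])
```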

Additional Notes

  • This project is designed specifically for the Databricks environment. Because it depends on Databricks infrastructure, some functionality cannot be replicated outside the workspace (e.g., testing data access in a GitHub CI environment).
  • A YouTube video demonstration of the pipeline implementation and data analysis is available for further insight.

Future Enhancements

  • Expand the pipeline to handle additional datasets.
  • Integrate advanced visualizations and analytics using tools like Tableau or Power BI.
  • Optimize the pipeline for larger datasets and higher computational efficiency.

Feel free to explore, contribute, and reach out with suggestions or questions!
