
Data Pipeline using Databricks

Key Features

  1. Data Extraction

    • Utilizes the requests library to fetch datasets from specified URLs.
    • Stores the extracted data in the Databricks FileStore for further processing (see the extraction sketch after this list).
  2. Databricks Environment Setup

    • Establishes a connection to the Databricks environment using environment variables for authentication (SERVER_HOSTNAME and ACCESS_TOKEN).
    • Configures Databricks clusters to support PySpark workflows.
  3. Data Transformation and Load

    • Converts CSV files into Spark DataFrames for processing.
    • Transforms and stores the processed data as Delta Lake tables in the Databricks environment (see the transform-and-load sketch after this list).
  4. Query Transformation and Visualization

    • Performs predefined Spark SQL queries to transform the data.
    • Creates visualizations from the transformed Spark DataFrames to analyze various metrics (see the query-and-visualization sketch after this list).
  5. File Path Validation

    • Implements a function to check if specified file paths exist in the Databricks FileStore.
    • Verifies connectivity with the Databricks API for automated workflows (see the path-check sketch after this list).
  6. Automated Job Trigger via GitHub Push

    • Configures a GitHub workflow to trigger a job run in the Databricks workspace whenever new commits are pushed to the repository (a sketch of the underlying API call appears after the Preparation Steps).
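
The sketch below illustrates the extraction step (Feature 1): download a dataset over HTTP with requests and write it into the Databricks FileStore. The URL and FileStore path are placeholders, not the values used in mylib/extract.py.

```python
import requests


def extract(
    url="https://example.com/dataset.csv",                  # placeholder URL
    file_path="dbfs:/FileStore/data_pipeline/dataset.csv",  # assumed FileStore target
):
    """Download a dataset and store it in the Databricks FileStore."""
    response = requests.get(url)
    response.raise_for_status()

    # On a Databricks cluster the dbfs:/ root is mounted at /dbfs on the driver,
    # so the file can be written with ordinary Python file I/O.
    local_path = file_path.replace("dbfs:/", "/dbfs/")
    with open(local_path, "wb") as f:
        f.write(response.content)

    return file_path
```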
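A minimal sketch of the transform-and-load step (Feature 3): read the extracted CSV into a Spark DataFrame, apply a simple transformation, and persist it as a Delta Lake table. The table name and the dropna() transformation are assumptions; the real logic lives in mylib/transform_load.py.

```python
from pyspark.sql import SparkSession


def load(
    file_path="dbfs:/FileStore/data_pipeline/dataset.csv",  # assumed FileStore source
    table_name="dataset_delta",                             # assumed Delta table name
):
    """Read a CSV from the FileStore and save it as a Delta Lake table."""
    spark = SparkSession.builder.appName("transform_load").getOrCreate()

    # Read the CSV with a header row and inferred column types.
    df = spark.read.csv(file_path, header=True, inferSchema=True)

    # Example transformation: drop rows with missing values.
    df = df.dropna()

    # Persist as a managed Delta Lake table in the Databricks metastore.
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
```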
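A minimal sketch of the query-and-visualization step (Feature 4): run a Spark SQL aggregation over the Delta table and plot the result. The column names category and value are hypothetical; the predefined queries live in mylib/query.py.

```python
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession


def query_and_visualize(table_name="dataset_delta"):
    """Run a predefined Spark SQL query and plot the aggregated result."""
    spark = SparkSession.builder.appName("query").getOrCreate()

    # Hypothetical aggregation over the Delta table.
    result = spark.sql(
        f"SELECT category, AVG(value) AS avg_value "
        f"FROM {table_name} GROUP BY category"
    )

    # The aggregated result is small, so it is safe to pull it to the driver.
    pdf = result.toPandas()
    pdf.plot(kind="bar", x="category", y="avg_value", legend=False)
    plt.ylabel("average value")
    plt.title("Average value per category")
    plt.savefig("query_visualization.png")
```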
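A minimal sketch of the file-path check (Feature 5), using the DBFS REST API endpoint /api/2.0/dbfs/get-status together with the SERVER_HOSTNAME and ACCESS_TOKEN environment variables; the helper name and default path are assumptions.

```python
import os

import requests


def check_filestore_path(path="/FileStore/data_pipeline/dataset.csv"):
    """Return True if the given DBFS path exists in the workspace."""
    host = os.environ["SERVER_HOSTNAME"]
    token = os.environ["ACCESS_TOKEN"]

    # A 200 response from get-status means the path exists; 404 means it does not.
    response = requests.get(
        f"https://{host}/api/2.0/dbfs/get-status",
        headers={"Authorization": f"Bearer {token}"},
        params={"path": path},
    )
    return response.status_code == 200
```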

Project Components

Environment Setup

  • Create a Databricks workspace on Azure.
  • Connect your GitHub account to the Databricks workspace.
  • Set up a global init script that runs at cluster start-up to store the environment variables (SERVER_HOSTNAME and ACCESS_TOKEN).
  • Create a Databricks cluster that supports PySpark operations.

Job Run Results

Pipeline job run (screenshot)

Successful job runs (screenshot)

Preparation Steps

  1. Set up a Databricks workspace and cluster on Azure.
  2. Clone this repository into your Databricks workspace.
  3. Configure environment variables (SERVER_HOSTNAME and ACCESS_TOKEN) for API access.
  4. Create a Databricks job to build and run the pipeline:
    • Extract Task: mylib/extract.py
    • Transform and Load Task: mylib/transform_load.py
    • Query and Visualization Task: mylib/query.py
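
Once the job from step 4 exists, the GitHub workflow mentioned under Key Features can trigger it on every push. The sketch below shows the kind of call the workflow would make via the Databricks Jobs API (/api/2.1/jobs/run-now); the helper name and job ID are assumptions, and the actual trigger lives in the repository's CI configuration.

```python
import os

import requests


def trigger_job_run(job_id):
    """Start a run of the pipeline job (Extract -> Transform/Load -> Query tasks)."""
    host = os.environ["SERVER_HOSTNAME"]
    token = os.environ["ACCESS_TOKEN"]

    response = requests.post(
        f"https://{host}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={"job_id": job_id},
    )
    response.raise_for_status()
    return response.json()["run_id"]
```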
