This project implements a robust ETL (Extract, Transform, Load) pipeline for processing airline safety data. Using Databricks and Python, the pipeline demonstrates a modular design that separates each ETL step into dedicated scripts.
The pipeline integrates with the Databricks File System (DBFS), providing a scalable and efficient environment for data processing. The data is transformed and stored in Delta tables, ensuring high performance for querying and analysis. The project also leverages PySpark, the Python API for Apache Spark, to handle large-scale data transformations and computations efficiently.
The pipeline fetches raw airline safety data from a remote CSV file. PySpark's integration ensures that the data is loaded and cleaned effectively, converting it into a Spark DataFrame for further processing.
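A minimal sketch of this step, assuming the FiveThirtyEight raw-CSV URL and a pandas download before handing the data to Spark (the actual `mylib/extract.py` may differ):

```python
import pandas as pd
from pyspark.sql import SparkSession

# Illustrative URL -- the real one lives in mylib/extract.py.
CSV_URL = (
    "https://raw.githubusercontent.com/fivethirtyeight/data/master/"
    "airline-safety/airline-safety.csv"
)

spark = SparkSession.builder.appName("airline-safety-etl").getOrCreate()


def extract(url: str = CSV_URL):
    """Fetch the remote CSV and return it as a Spark DataFrame."""
    pdf = pd.read_csv(url)  # download the CSV with pandas
    pdf.columns = [c.strip().lower().replace(" ", "_") for c in pdf.columns]  # tidy names
    return spark.createDataFrame(pdf)  # hand off to Spark for distributed processing


df = extract()
df.show(5)
```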
Using PySpark SQL and DataFrame operations, the pipeline performs complex transformations such as calculating totals for incidents and fatalities across different periods. PySpark ensures these transformations are distributed, making the process highly scalable and performant.
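Continuing the sketch above, the period totals might be computed like this; the input column names follow the published dataset and are assumptions about what the cleaned DataFrame contains:

```python
from pyspark.sql import functions as F

# Column names follow the published FiveThirtyEight schema; adjust if
# transform.py renames them during cleaning.
transformed_df = (
    df.withColumn("incidents_total",
                  F.col("incidents_85_99") + F.col("incidents_00_14"))
      .withColumn("fatal_accidents_total",
                  F.col("fatal_accidents_85_99") + F.col("fatal_accidents_00_14"))
      .withColumn("fatalities_total",
                  F.col("fatalities_85_99") + F.col("fatalities_00_14"))
)

transformed_df.select("airline", "incidents_total", "fatalities_total").show(5)
```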
The transformed data is saved into Delta tables, leveraging Databricks' support for Delta Lake to provide ACID transactions and efficient querying capabilities.
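A sketch of the write step; the table name is a placeholder, not the repository's actual table:

```python
# The table name is a placeholder; the project scripts may instead write to an
# explicit DBFS path with .save("dbfs:/...").
(
    transformed_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("airline_safety_transformed")
)
```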
The final stage involves filtering and querying specific columns of interest. PySpark SQL enables seamless querying of Delta tables, supporting interactive data exploration and downstream analytics.
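For example, the filtered view could be produced with a Spark SQL query against the placeholder table registered above:

```python
# Table and column names mirror the placeholders used in the sketches above.
top_airlines = spark.sql(
    """
    SELECT airline, incidents_total, fatalities_total
    FROM airline_safety_transformed
    ORDER BY fatalities_total DESC
    LIMIT 10
    """
)
top_airlines.show()
```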
The dataset contains information about incidents, accidents, and fatalities for major airlines over two periods:
- 1985-1999
- 2000-2014
It is sourced from FiveThirtyEight's Airline Safety Dataset.
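For reference, the published CSV exposes the following per-airline columns (the names used after cleaning inside the pipeline may differ slightly):

```python
# Per-airline columns in the published airline-safety.csv:
AIRLINE_SAFETY_COLUMNS = [
    "airline",                 # carrier name
    "avail_seat_km_per_week",  # available seat kilometres flown per week
    "incidents_85_99",         # incidents, 1985-1999
    "fatal_accidents_85_99",   # fatal accidents, 1985-1999
    "fatalities_85_99",        # fatalities, 1985-1999
    "incidents_00_14",         # incidents, 2000-2014
    "fatal_accidents_00_14",   # fatal accidents, 2000-2014
    "fatalities_00_14",        # fatalities, 2000-2014
]
```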
```
├── .devcontainer
│   ├── devcontainer.json
│   └── Dockerfile
├── .github
│   └── workflows
│       └── cicd.yml
├── data
│   └── airline-safety.csv
├── mylib
│   ├── __init__.py
│   ├── extract.py
│   ├── load.py
│   ├── query.py
│   └── transform.py
├── .coverage
├── .env
├── .gitignore
├── Compute_Cluster_Config.PNG
├── Compute_Cluster_Libraries.PNG
├── Compute_Cluster.PNG
├── Job_Runs_ETL_Workflow.PNG
├── Job_Runs_ETL_Workflow1.PNG
├── main.py
├── Makefile
├── query_log.md
├── README.md
├── requirements.txt
└── test_main.py
```
- ETL Pipeline: Implements a distributed ETL pipeline using PySpark for large-scale data processing.
- Modular Design: Separate Python modules for extraction, transformation, loading, and querying.
- Delta Tables: Utilizes Databricks Delta tables for optimized data storage and querying.
- PySpark Integration: Handles large datasets using PySpark's distributed processing capabilities, ensuring scalability and efficiency.
- CI/CD Integration: Automates testing and linting using GitHub Actions.
- DBFS Integration: Supports the Databricks File System (DBFS) for data storage.
Follow these steps to set up and execute the ETL pipeline on Databricks:
- Navigate to the Compute tab in Databricks.
- Create a new cluster and configure its settings as shown:
- Databricks Runtime Version: 16.0 ML (includes Apache Spark 3.5.0, Scala 2.12)
- Node Type: i3.xlarge (30.5 GB Memory, 4 Cores)
- Terminate After: 100 minutes of inactivity.
- Navigate to the Libraries tab under the cluster.
- Install the following libraries:
  - databricks-sql-connector
  - pandas
  - python-dotenv
- Refer to the setup shown below.
- Go to your Databricks User Profile (click your profile icon).
- Click on Settings > Git Integration > Linked Accounts.
- Select GitHub and follow the prompts to authenticate and link your account.
- Navigate to the Workflows section in Databricks.
- Create a new workflow for the ETL pipeline with the following tasks:
  - Task 1: Extract
    - Name: Extract
    - Script Path: Point to `extract.py` in your repository.
    - Cluster: Use the cluster configured in Step 1.
  - Task 2: Transform
    - Name: Transform
    - Script Path: Point to `transform.py` in your repository.
    - Cluster: Use the same cluster.
    - Depends On: Extract
  - Task 3: Load
    - Name: Load
    - Script Path: Point to `load.py` in your repository.
    - Cluster: Use the same cluster.
    - Depends On: Transform
  - Task 4: Query
    - Name: Query
    - Script Path: Point to `query.py` in your repository.
    - Cluster: Use the same cluster.
    - Depends On: Load
- After adding all tasks, review the dependency graph as shown below.
- Trigger the workflow and monitor progress.
- Review task logs and ensure the pipeline runs successfully.
- Save and commit changes in Databricks.
- Push the updated scripts and workflow configuration to your GitHub repository.
Execute the main script to run the complete ETL pipeline:
```bash
python main.py
```
The pipeline performs the following steps:
- Extracts data from the source and saves it to a Delta table.
- Transforms the data and creates a transformed Delta table.
- Filters the data and saves it as a new Delta table.
- Loads and displays the data, including schema and summary insights.
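A minimal sketch of how `main.py` might wire these steps together; the import paths and function names are assumptions rather than the repository's exact API:

```python
# main.py -- orchestration sketch; the actual functions exported by mylib may differ.
from mylib.extract import extract
from mylib.transform import transform
from mylib.query import query
from mylib.load import load


def main():
    extract()    # fetch the CSV and save it as a raw Delta table
    transform()  # compute period totals and save the transformed Delta table
    query()      # filter the columns of interest into a new Delta table
    load()       # read a Delta table back and display schema and summary stats


if __name__ == "__main__":
    main()
```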
To test DBFS path and other configurations, use the test script:
```bash
python test_main.py
```
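For reference, a check of that kind might look like the following sketch; the DBFS path is a placeholder and the real `test_main.py` may verify additional configuration:

```python
# test_main.py -- illustrative check only; the path below is a placeholder and the
# real tests may rely on dbutils or Spark to probe DBFS instead of the local mount.
import os

DBFS_DATA_PATH = "/dbfs/FileStore/airline-safety.csv"


def test_dbfs_path_exists():
    assert os.path.exists(DBFS_DATA_PATH), f"Missing DBFS path: {DBFS_DATA_PATH}"
```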
Run all tests with:
```bash
make test
```
Check for linting issues:
```bash
make lint
```
Format the code:
```bash
make format
```
- main.py: Orchestrates the ETL pipeline, coordinating the extraction, transformation, querying, and loading steps.
- mylib/extract.py: Downloads the airline safety dataset, cleans column names, and saves the data as a Delta table.
- mylib/transform.py: Performs transformations, including calculating totals for incidents and fatalities, then saves the transformed data as a new Delta table and exports it to CSV.
- mylib/query.py: Filters specific columns and saves the results to a new Delta table.
- mylib/load.py: Loads and displays data from a Delta table, including schema and summary statistics.
- test_main.py: Verifies the accessibility of DBFS paths and ensures proper configuration.
- make lint: Ensures code adheres to style guidelines using `ruff`.
- make test: Executes unit tests with `pytest`.
- make format: Formats code with `black`.
`.github/workflows/cicd.yml` automates the above steps on `push` or `pull_request`.