This project sets up an ETL pipeline using PySpark and Apache Airflow to extract data from a PostgreSQL database, transform it, and load it into a Railway PostgreSQL cloud database. The PySpark script handles the ETL logic, while Apache Airflow manages the workflow orchestration.
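For orientation, here is a minimal sketch of the extract-transform-load flow this kind of pipeline performs. It is illustrative only: the table names, transformation, hosts, and credentials below are assumptions, not the actual values used in pyspark_airflow_railway.py (those are configured during setup, as described further down).

```python
# Minimal PySpark ETL sketch (illustrative only -- table/column names, hosts,
# and credentials below are placeholders, not the project's actual values).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("pyspark_airflow_railway")
    # Make the bundled PostgreSQL JDBC driver from pyspark_files/ available to Spark.
    .config("spark.jars", "pyspark_files/postgresql-42.7.3.jar")
    .getOrCreate()
)

# Extract: read a table from the local/source PostgreSQL database.
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/source_db")  # assumed host/db
    .option("dbtable", "public.airbnb_listings")                  # assumed table
    .option("user", "postgres")
    .option("password", "your_password")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Transform: a trivial example transformation (drop nulls, add a load timestamp).
transformed_df = source_df.dropna().withColumn("loaded_at", F.current_timestamp())

# Load: write the result into the Railway PostgreSQL database.
(
    transformed_df.write.format("jdbc")
    .option("url", "jdbc:postgresql://<railway-host>:5432/railway")  # assumed URL
    .option("dbtable", "public.airbnb_transformed")                  # assumed table
    .option("user", "railway_user")
    .option("password", "railway_password")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save()
)

spark.stop()
```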
Project structure:

pyspark-airflow-postgres-etl/
├── airflow/
│   └── dags/
│       └── airflow_pyspark_railway.py   # Airflow DAG to trigger PySpark ETL
├── pyspark_files/
│   ├── pyspark_airflow_railway.py       # PySpark ETL script
│   └── postgresql-42.7.3.jar            # JDBC driver for PostgreSQL
├── data_files/
│   └── airbnb.csv                       # Sample data file (if used)
├── requirements.txt                     # Python dependencies for the project
├── README.md                            # Project overview and setup instructions
└── .gitignore                           # Git ignore file to exclude unnecessary files
Technologies used:

- Apache Airflow: Workflow orchestration and scheduling.
- PySpark: Data processing and transformations.
- PostgreSQL: Source and target database for the ETL process.
- Railway PostgreSQL: Cloud-hosted PostgreSQL database for storing the transformed data.
- JDBC: PostgreSQL JDBC driver used to connect Spark to both databases.
Prerequisites:

- Apache Airflow installed
- PySpark installed
- PostgreSQL database (local or cloud)
- Railway PostgreSQL cloud database
- Java Runtime Environment (JRE), required by Spark and the PostgreSQL JDBC driver
- Python 3.x
Setup:

- Clone the repository:

  git clone https://github.com/shahidmalik4/pyspark-airflow-postgres-etl.git
  cd pyspark-airflow-postgres-etl
- Install Python Dependencies:

  pip install -r requirements.txt
- Configure PostgreSQL Connection:
  - Update the jdbc_url and properties in pyspark_airflow_railway.py with your local PostgreSQL connection details.
  - Update the railway_jdbc_url and railway_properties with your Railway PostgreSQL connection details.
  - An illustrative example of what these settings can look like is sketched below, after the setup steps.
- Set up Airflow:
  - Initialize the Airflow metadata database:

    airflow db init
- Run the Project:

  Start the Airflow webserver and the scheduler (in separate terminals):

  airflow webserver -p 8080
  airflow scheduler
- Trigger the DAG:
  - Navigate to http://localhost:8080 to access the Airflow web interface.
  - Trigger the DAG to execute the ETL process. (An illustrative sketch of a DAG of this kind follows below.)
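As referenced in the configuration step above, the connection settings in pyspark_airflow_railway.py typically take a shape like the following. This is an illustrative sketch only; the hosts, database names, users, and passwords are placeholders to be replaced with your own values.

```python
# Illustrative connection settings (placeholders only -- substitute your own
# hosts, database names, users, and passwords).

# Local/source PostgreSQL
jdbc_url = "jdbc:postgresql://localhost:5432/source_db"        # assumed host/db
properties = {
    "user": "postgres",
    "password": "your_password",
    "driver": "org.postgresql.Driver",
}

# Railway PostgreSQL (target)
railway_jdbc_url = "jdbc:postgresql://<railway-host>:<port>/railway"  # assumed
railway_properties = {
    "user": "railway_user",
    "password": "railway_password",
    "driver": "org.postgresql.Driver",
}

# These pair with the standard PySpark JDBC reader/writer calls, e.g.:
# df = spark.read.jdbc(url=jdbc_url, table="public.airbnb", properties=properties)
# df.write.jdbc(url=railway_jdbc_url, table="public.airbnb_transformed",
#               mode="overwrite", properties=railway_properties)
```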
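The project's actual DAG lives in airflow/dags/airflow_pyspark_railway.py and is not reproduced here. As a rough illustration of how an Airflow DAG can hand off to a PySpark script, the sketch below uses a core BashOperator wrapping spark-submit; the DAG id, schedule, and file paths are assumptions, and the real DAG may instead use the Spark provider's SparkSubmitOperator or different settings.

```python
# Hypothetical Airflow 2.x DAG sketch: submit the PySpark ETL script via spark-submit.
# The DAG id, schedule, and file paths below are assumptions, not the project's values.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pyspark_railway_etl",        # assumed DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",          # assumed schedule
    catchup=False,
) as dag:
    # Single task: run spark-submit with the bundled PostgreSQL JDBC driver.
    run_pyspark_etl = BashOperator(
        task_id="run_pyspark_etl",
        bash_command=(
            "spark-submit "
            "--jars /path/to/pyspark_files/postgresql-42.7.3.jar "  # assumed path
            "/path/to/pyspark_files/pyspark_airflow_railway.py"     # assumed path
        ),
    )
```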