This project sets up an ETL pipeline using PySpark and Apache Airflow to extract data from a PostgreSQL database, transform it, and load it into a Railway PostgreSQL cloud database. The PySpark script handles the ETL logic, while Apache Airflow manages the workflow orchestration.
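For orientation, here is a minimal sketch of the extract-transform-load flow this kind of pipeline performs. It is illustrative only: the table names, transformation, hosts, and credentials below are assumptions, not the actual values used in pyspark_airflow_railway.py (those are configured during setup, as described further down).

```python
# Minimal PySpark ETL sketch (illustrative only -- table/column names, hosts,
# and credentials below are placeholders, not the project's actual values).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("pyspark_airflow_railway")
    # Make the bundled PostgreSQL JDBC driver from pyspark_files/ available to Spark.
    .config("spark.jars", "pyspark_files/postgresql-42.7.3.jar")
    .getOrCreate()
)

# Extract: read a table from the local/source PostgreSQL database.
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/source_db")  # assumed host/db
    .option("dbtable", "public.airbnb_listings")                  # assumed table
    .option("user", "postgres")
    .option("password", "your_password")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Transform: a trivial example transformation (drop nulls, add a load timestamp).
transformed_df = source_df.dropna().withColumn("loaded_at", F.current_timestamp())

# Load: write the result into the Railway PostgreSQL database.
(
    transformed_df.write.format("jdbc")
    .option("url", "jdbc:postgresql://<railway-host>:5432/railway")  # assumed URL
    .option("dbtable", "public.airbnb_transformed")                  # assumed table
    .option("user", "railway_user")
    .option("password", "railway_password")
    .option("driver", "org.postgresql.Driver")
    .mode("overwrite")
    .save()
)

spark.stop()
```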
Project structure:

pyspark-airflow-postgres-etl/
├── airflow/
│   └── dags/
│       └── airflow_pyspark_railway.py   # Airflow DAG to trigger PySpark ETL
├── pyspark_files/
│   ├── pyspark_airflow_railway.py       # PySpark ETL script
│   └── postgresql-42.7.3.jar            # JDBC driver for PostgreSQL
├── data_files/
│   └── airbnb.csv                       # Sample data file (if used)
├── requirements.txt                     # Python dependencies for the project
├── README.md                            # Project overview and setup instructions
└── .gitignore                           # Git ignore file to exclude unnecessary files
Technologies used:

- Apache Airflow: Workflow orchestration and scheduling.
- PySpark: Data processing and transformations.
- PostgreSQL: Source and target database for the ETL process.
- Railway PostgreSQL: Cloud-hosted PostgreSQL database for storing the transformed data.
- JDBC: PostgreSQL JDBC driver used to connect Spark to both databases.
Prerequisites:

- Apache Airflow installed
- PySpark installed
- PostgreSQL database (local or cloud)
- Railway PostgreSQL cloud database
- Java Runtime Environment (JRE), required by Spark and the PostgreSQL JDBC driver
- Python 3.x
Setup:

- Clone the repository:

  git clone https://github.com/shahidmalik4/pyspark-airflow-postgres-etl.git
  cd pyspark-airflow-postgres-etl
- Install Python Dependencies:

  pip install -r requirements.txt
- Configure PostgreSQL Connection:
  - Update the jdbc_url and properties in pyspark_airflow_railway.py with your local PostgreSQL connection details.
  - Update the railway_jdbc_url and railway_properties with your Railway PostgreSQL connection details.
  - An illustrative example of what these settings can look like is sketched below, after the setup steps.
- Set up Airflow:
  - Initialize the Airflow metadata database:

    airflow db init
- Run the Project:

  Start the Airflow webserver and the scheduler (in separate terminals):

  airflow webserver -p 8080
  airflow scheduler
- Trigger the DAG:
  - Navigate to http://localhost:8080 to access the Airflow web interface.
  - Trigger the DAG to execute the ETL process. (An illustrative sketch of a DAG of this kind follows below.)
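As referenced in the configuration step above, the connection settings in pyspark_airflow_railway.py typically take a shape like the following. This is an illustrative sketch only; the hosts, database names, users, and passwords are placeholders to be replaced with your own values.

```python
# Illustrative connection settings (placeholders only -- substitute your own
# hosts, database names, users, and passwords).

# Local/source PostgreSQL
jdbc_url = "jdbc:postgresql://localhost:5432/source_db"        # assumed host/db
properties = {
    "user": "postgres",
    "password": "your_password",
    "driver": "org.postgresql.Driver",
}

# Railway PostgreSQL (target)
railway_jdbc_url = "jdbc:postgresql://<railway-host>:<port>/railway"  # assumed
railway_properties = {
    "user": "railway_user",
    "password": "railway_password",
    "driver": "org.postgresql.Driver",
}

# These pair with the standard PySpark JDBC reader/writer calls, e.g.:
# df = spark.read.jdbc(url=jdbc_url, table="public.airbnb", properties=properties)
# df.write.jdbc(url=railway_jdbc_url, table="public.airbnb_transformed",
#               mode="overwrite", properties=railway_properties)
```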
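The project's actual DAG lives in airflow/dags/airflow_pyspark_railway.py and is not reproduced here. As a rough illustration of how an Airflow DAG can hand off to a PySpark script, the sketch below uses a core BashOperator wrapping spark-submit; the DAG id, schedule, and file paths are assumptions, and the real DAG may instead use the Spark provider's SparkSubmitOperator or different settings.

```python
# Hypothetical Airflow 2.x DAG sketch: submit the PySpark ETL script via spark-submit.
# The DAG id, schedule, and file paths below are assumptions, not the project's values.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pyspark_railway_etl",        # assumed DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",          # assumed schedule
    catchup=False,
) as dag:
    # Single task: run spark-submit with the bundled PostgreSQL JDBC driver.
    run_pyspark_etl = BashOperator(
        task_id="run_pyspark_etl",
        bash_command=(
            "spark-submit "
            "--jars /path/to/pyspark_files/postgresql-42.7.3.jar "  # assumed path
            "/path/to/pyspark_files/pyspark_airflow_railway.py"     # assumed path
        ),
    )
```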