An ETL data pipeline project that processes online job posts using Airflow, Spark, PostgreSQL and Tableau.
Job posts are fetched from the Google Jobs API for each configured job role, and the data for each role is stored as an individual JSON file in the raw folder.
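A minimal sketch of the extract step is shown below; the endpoint and parameters assume a SerpApi-style Google Jobs search and the file naming is illustrative, not necessarily the project's exact client.

```python
import json
import requests


def extract_job_posts(job_role: str, raw_dir: str, api_key: str) -> str:
    """Fetch job posts for a single role and store the raw response as JSON.

    The endpoint and parameter names are assumptions (SerpApi's google_jobs
    engine); the actual API call in the project may differ.
    """
    response = requests.get(
        "https://serpapi.com/search.json",
        params={"engine": "google_jobs", "q": job_role, "api_key": api_key},
        timeout=30,
    )
    response.raise_for_status()

    # One JSON file per job role in the raw folder.
    out_path = f"{raw_dir}/{job_role}.json"
    with open(out_path, "w") as f:
        json.dump(response.json(), f)
    return out_path
```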
The JSON files for each job role are read using PySpark and processed to extract the required columns in the required format. The cleaned data is then stored in Parquet and CSV format in the processed folder.
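A sketch of the transform step, assuming the raw files sit in the raw folder; the selected column names are placeholders that depend on the structure of the API response.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skill_graph_transform").getOrCreate()

# Read every raw JSON file produced by the extract step.
raw_df = spark.read.option("multiLine", "true").json("raw/*.json")

# Keep only the columns needed downstream; the names here are placeholders.
clean_df = (
    raw_df.select("title", "company_name", "location", "description")
          .withColumn("title", F.lower(F.trim(F.col("title"))))
          .dropDuplicates()
)

# Persist the cleaned data in both formats in the processed folder.
clean_df.write.mode("overwrite").parquet("processed/jobs_parquet")
clean_df.coalesce(1).write.mode("overwrite").option("header", "true").csv("processed/jobs_csv")
```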
The CSV files are loaded into the PostgreSQL database using the copy_expert operation of Airflow's PostgresHook.
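The load step can be expressed as a small helper around PostgresHook.copy_expert; the connection id and table name below are assumptions.

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook


def load_csv_to_postgres(csv_path: str, table: str = "job_posts") -> None:
    """Bulk-load a processed CSV into the destination table using COPY."""
    hook = PostgresHook(postgres_conn_id="postgres_default")  # connection id is an assumption
    sql = f"COPY {table} FROM STDIN WITH CSV HEADER DELIMITER ','"
    hook.copy_expert(sql, csv_path)
```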
The project uses the following two DAGs (Directed Acyclic Graphs):
Setup pipeline: runs the SQL script that creates the destination table in the Postgres database if it does not exist.
ETL pipeline: runs the Extract, Transform and Load tasks that move the data from the API ingestion layer to the data storage layer. A sketch of both DAG definitions is shown after this list.
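A minimal sketch of how the two DAGs could be wired up, assuming Airflow 2.x with the Postgres provider; the DAG ids, connection id, schedule and SQL file path are assumptions, and the task callables stand in for the extract/transform/load steps described above.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator


def extract(**_):
    ...  # fetch job posts from the API into the raw folder


def transform(**_):
    ...  # run the PySpark job that writes the processed Parquet/CSV


def load(**_):
    ...  # bulk-load the processed CSV into Postgres with copy_expert


# Setup DAG: creates the destination table if it does not already exist.
with DAG(
    dag_id="skill_graph_setup",            # dag id is an assumption
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as setup_dag:
    create_table = PostgresOperator(
        task_id="create_destination_table",
        postgres_conn_id="postgres_default",
        sql="sql/create_table.sql",        # SQL file path is illustrative
    )

# ETL DAG: extract from the API, transform with PySpark, load into Postgres.
with DAG(
    dag_id="skill_graph_etl",              # dag id is an assumption
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as etl_dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task
```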
A Tableau dashboard visualises the insights from the data stored in PostgreSQL. A PL/pgSQL function exports the Postgres table to CSV, which is then converted to an Excel file. This Excel file, stored in the reporting folder, is used as the source for the Tableau visualisation.
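The project implements this export as a database-side function; purely as an illustration of the same table-to-CSV-to-Excel flow, a Python sketch could look like the following (connection string, table name and file names are assumptions).

```python
import pandas as pd
from sqlalchemy import create_engine

# Connection details are assumptions; in the project they would come from
# Airflow connections or the config file.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/skill_db")

# Read the destination table, then write the CSV and Excel reporting files.
df = pd.read_sql_table("job_posts", engine)            # table name is illustrative
df.to_csv("reporting/job_posts.csv", index=False)
df.to_excel("reporting/job_posts.xlsx", index=False)   # requires openpyxl
```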
Link to Dashboard - https://public.tableau.com/app/profile/antony.prince.j/viz/skill_etl/Dashboard1?publish=yes
The config file under config/ defines the job roles to be queried and the paths where each stage of the pipeline stores its data during Airflow DAG execution:
```json
{
    "job_roles": [
        "data%20engineer%20india",
        "backend%20developer%20india",
        "blockchain%20developer%20india",
        "data%20scientist%20india",
        "fullstack%20developer%20india"
    ],
    "project_path": "/Users/antonyprincej/airflow/dags/skill_graph/",
    "raw": {
        "folder": "raw",
        "type": "json"
    },
    "processed": {
        "folder": "processed",
        "type": ["parquet", "csv"]
    },
    "destination": {
        "type": "sql"
    }
}
```
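During DAG execution the pipeline can load this file once and resolve the per-stage paths from it; a minimal sketch, assuming the file is named config/config.json.

```python
import json
import os


def load_config(config_path: str) -> dict:
    """Load the pipeline configuration as a dict."""
    with open(config_path) as f:
        return json.load(f)


config = load_config("config/config.json")  # file name is an assumption

# Resolve the stage folders relative to the project path.
raw_dir = os.path.join(config["project_path"], config["raw"]["folder"])
processed_dir = os.path.join(config["project_path"], config["processed"]["folder"])
job_roles = config["job_roles"]
```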