Ingest, process, analyse and recommend new data engineer jobs from the Welcome to the Jungle website.
Background
The initial problem for this project was to filter and analyse jobs based on technologies.
Data engineering can be technology-centered, but the focus of this project is to apply Platform and Pipeline concepts and best practices,
using O'Reilly Fundamentals of Data Engineering as a reference.
Below is the final Data Flow of the project.
Summary
A minimal pipeline using Python, Postgres, Airflow and Tableau.
C4 model diagrams: https://structurizr.com/workspace/79499/diagrams
Prerequisites
- Set up the environment
- Activate the venv and install the requirements
- Execute `playwright install` to download Chromium
- Install Airflow from PyPI (official instructions)
- Create the `job_market` database and export the environment variables `JOB_MARKET_DB_USER` and `JOB_MARKET_DB_PWD`
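The pipelines read the database credentials from these environment variables. A minimal connection sketch, assuming a local PostgreSQL on the default port (host, port and driver are assumptions):

```python
import os
from sqlalchemy import create_engine, text

# Credentials come from the environment variables exported above.
user = os.environ["JOB_MARKET_DB_USER"]
password = os.environ["JOB_MARKET_DB_PWD"]

# Host, port and driver are assumptions for a local setup.
engine = create_engine(f"postgresql+psycopg2://{user}:{password}@localhost:5432/job_market")

with engine.connect() as conn:
    # Simple connectivity check.
    print(conn.execute(text("SELECT 1")).scalar())
```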
Run locally
- `airflow standalone` will initialise the Airflow database, create a user, and start all components (development phase).
- The Airflow UI is at `localhost:8080`, with username `admin` and the password in `standalone_admin_password.txt`.
- In the DAGs tab, toggle on `job-market-batch` and trigger it manually if it is not running.
The aim is to enrich and refine a list of jobs considered interesting for the user's profile. We narrowed down from 5000 jobs initially to 1900 relevant ones, and then applied to a couple of matching jobs.
- Choose jobs that are attractive. Here a rating out of 10 is given:
- Request the API with a list of id(s) of the attractive jobs. The output is a job recommendation board with the most similar jobs at the top:
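A hypothetical example of such a request from Python; the endpoint URL, path and payload shape are assumptions, check the data-job-api repository for the actual contract:

```python
import requests

# Assumed local Flask address and endpoint; the real route may differ.
resp = requests.post(
    "http://localhost:5000/recommend",
    json={"ids": [1542, 1687]},  # ids of the jobs rated as attractive
    timeout=10,
)
resp.raise_for_status()

# Expected output: a ranked list of similar jobs, most similar first.
for job in resp.json():
    print(job)
```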
Remark
The feature weights influence the result. The defaults are:
- remote: 1
- title: 0.8
- stack: 0.8
- text: 0.7
- experience: 0.6
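As a sketch, these weights could be combined into a single score as a weighted average of per-feature similarities; the per-feature similarity values below are assumptions, not the actual implementation in data-job-api:

```python
# Default feature weights from the list above.
WEIGHTS = {"remote": 1.0, "title": 0.8, "stack": 0.8, "text": 0.7, "experience": 0.6}

def weighted_similarity(feature_scores: dict[str, float]) -> float:
    """Combine per-feature similarities (each in [0, 1]) into one weighted score."""
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[f] * feature_scores.get(f, 0.0) for f in WEIGHTS) / total

# Example: same remote policy and stack, partially matching title and text.
print(weighted_similarity({"remote": 1.0, "stack": 1.0, "title": 0.5, "text": 0.2, "experience": 1.0}))
```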
These tests cover the project as a whole. The individual pipelines data-job-crawler and data-job-etl have tests in their respective repositories.
Multiple errors can arise when using Airflow. The recommended order for testing is the DAG file, then individual tasks, and finally a backfill to take dependencies into account.
- DAG file: `python3 job_market_etl_dag.py`
- Import time: `time python3 job_market_etl_dag.py` (default max 30 seconds)
- DAG loading: `pytest tests/` (see the sketch after this list)
- List tasks and test them individually:
  - `airflow tasks list job-market-batch`
  - `airflow tasks test job-market-batch TASK 2022-01-01`
- Backfill: `airflow dags backfill job-market-batch --start-date 2023-01-01`
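A minimal sketch of the kind of DAG-loading check that `pytest tests/` can run; the test file name and exact assertions are assumptions:

```python
# tests/test_dag_loading.py (hypothetical file name)
from airflow.models import DagBag

def test_dags_load_without_errors():
    # Parsing the DAG folder should not raise import errors.
    dagbag = DagBag(include_examples=False)
    assert dagbag.import_errors == {}
    # The batch DAG should be registered under its dag_id.
    assert "job-market-batch" in dagbag.dags
```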
Ensure that the data flow is maintained. To do this, compare jobs on the website and find them in the successive tables before and after processing.
- Look at the latest job postings
- Query the `raw_jobs`, `processed_jobs` and `apply` tables and compare the results (see the sketch below)
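A sketch of such a comparison, assuming the table names above and a `url` column in each table (connection details as in the prerequisites):

```python
import os
from sqlalchemy import create_engine, text

engine = create_engine(
    f"postgresql+psycopg2://{os.environ['JOB_MARKET_DB_USER']}:"
    f"{os.environ['JOB_MARKET_DB_PWD']}@localhost:5432/job_market"
)

# Example posting taken from the website; replace with a real URL.
job_url = "https://www.welcometothejungle.com/en/companies/acme/jobs/data-engineer"

with engine.connect() as conn:
    # The same job should appear in each successive table of the flow.
    for table in ("raw_jobs", "processed_jobs", "apply"):
        count = conn.execute(
            text(f"SELECT count(*) FROM {table} WHERE url LIKE :u"),
            {"u": f"{job_url}%"},
        ).scalar()
        print(f"{table}: {count} row(s) for this posting")
```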
Even with a correct data flow, the data can lack relevance or accuracy. Some quality checks:
- All technologies must be present
  - Some technologies are written differently (eg. Google BigQuery, Google Big Query)
  - They have to be added manually in `config/definitions.py` from the ETL package (see the sketch after this list)
- Eliminate non data engineer jobs
  - These are about half of the jobs, despite the filters
  - Verify data engineer roles with `regex_data_engineer_roles.sql`
- Eliminate duplicate jobs
  - Some jobs are reposted multiple times with a different url each time
  - They are removed by truncating the last part of the url, see `sql/truncate_urls.sql`
- Processing errors
  - To be added to the unit test inputs
  - Some queries to spot errors are in `tests/processing_errors.sql`
- Missing values
  - Only id, url, title and company have the `NOT NULL` constraint
  - Check the other fields with `tests/missing_values.sql`
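A sketch of the kind of alias mapping kept in `config/definitions.py`; the variable name and the exact aliases are assumptions:

```python
# Hypothetical alias table: several spellings map to one canonical name.
TECHNOLOGY_ALIASES = {
    "google bigquery": "BigQuery",
    "google big query": "BigQuery",
    "big query": "BigQuery",
    "postgres": "PostgreSQL",
    "postgresql": "PostgreSQL",
}

def normalize_technology(raw: str) -> str:
    """Map a scraped technology spelling to its canonical name."""
    return TECHNOLOGY_ALIASES.get(raw.strip().lower(), raw.strip())

assert normalize_technology("Google Big Query") == "BigQuery"
```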
The Pytest coverage is not optimal. Unit testing the crawler is delicate because the web pages' DOM changes regularly; however, these errors are easy to detect.
This project comprises pipelines that can run individually. They are written as Python packages to follow Airflow's best practice for importing custom code.
This first pipeline ingests the raw data (web pages) into a Postgres table. It handles the dynamic content by using 2 successive spiders.
Repo: data-job-crawler
This is the processing pipeline, using Python and SQLAlchemy to connect to PostgreSQL.
Repo: data-job-etl
The Flask RESTful API implements a recommendation algorithm that can be used through a web UI.
Repo: data-job-api
The data sources are the web pages containing results for the latest data engineering jobs. The project started with Spotify and Welcome To The Jungle where I found my first tech job.
Characteristics:
- JavaScript-based pages require the additional scraping library Playwright (see the sketch after this list)
- The HTML will eventually change, which requires detection as well as a manual update of the XPaths
- Different websites can expose more or fewer fields (eg. remote policy), which results in null values
- Schema evolution can occur
- If a job offer appears on at least 2 scraped websites, duplicates need to be handled
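As an illustration of the first point, a JavaScript-rendered listings page can be fetched with Playwright before parsing. This is only a generic sketch, not the two-spider setup of data-job-crawler, and the URL is an example:

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Return the HTML of a page after its JavaScript has executed."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

html = fetch_rendered_html("https://www.welcometothejungle.com/en/jobs?query=data%20engineer")
print(len(html))
```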
The ingestion is made with the Scrapy framework, whose pipeline feature conforms the data to predefined fields and writes it to the database.
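A sketch of what such a Scrapy item pipeline can look like; the class name, field names and database wiring are assumptions, see data-job-crawler for the real implementation:

```python
import os
import psycopg2

class PostgresWriterPipeline:
    """Write conformed items into the raw_jobs table (hypothetical pipeline)."""

    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            dbname="job_market",
            user=os.environ["JOB_MARKET_DB_USER"],
            password=os.environ["JOB_MARKET_DB_PWD"],
            host="localhost",
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # Field names mirror the columns mentioned in the quality checks above.
        self.cur.execute(
            "INSERT INTO raw_jobs (url, title, company, text) VALUES (%s, %s, %s, %s)",
            (item.get("url"), item.get("title"), item.get("company"), item.get("text")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```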
All transformations are made with Python in a dedicated ETL pipeline, data-job-etl. The processing consists of cleaning and reformatting fields and extracting the technology names from the text field into a new column. The data is finally modelled to be loaded into a new database.
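A sketch of the technology-extraction step, assuming a canonical list of technologies; the actual list and patterns live in the ETL package:

```python
import re

# Hypothetical canonical technology list (kept in config/definitions.py in the ETL package).
TECHNOLOGIES = ["Python", "Airflow", "PostgreSQL", "BigQuery", "Spark", "dbt"]

def extract_stack(text: str) -> list[str]:
    """Return the technologies mentioned in a job description."""
    found = []
    for tech in TECHNOLOGIES:
        # Whole-word, case-insensitive match against the free-text field.
        if re.search(rf"\b{re.escape(tech)}\b", text, flags=re.IGNORECASE):
            found.append(tech)
    return found

print(extract_stack("We use Python and Airflow on top of PostgreSQL."))
# ['Python', 'Airflow', 'PostgreSQL']
```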
PostgreSQL is used for both the OLTP and OLAP stores. The data is uniformly structured, with a small read and write workload.
The raw data is stored in a single large relation. The `text` field is the biggest but should not reach the 1 GB field size limit.
Once transformed, the data is loaded into a new table without normalization. A future implementation would differentiate 2 databases with the following schemas:
Airflow is run locally with the LocalExecutor. Following best practices, the custom code is encapsulated in Python packages and then imported in the DAG.
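A minimal sketch of such a DAG; the task names and the imported callables are assumptions, the real DAG lives in `job_market_etl_dag.py`:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# In the real DAG these callables are imported from the custom packages,
# e.g. the crawler and ETL packages (module paths are assumptions).
def crawl_jobs():
    ...

def process_jobs():
    ...

with DAG(
    dag_id="job-market-batch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    crawl = PythonOperator(task_id="crawl", python_callable=crawl_jobs)
    process = PythonOperator(task_id="process", python_callable=process_jobs)
    crawl >> process  # crawling must finish before processing starts
```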
- Deploy the API with Docker
- Improve unit test coverage
- Extend the API web UI with more arguments, such as feature weights