Ingest, process, analyse and recommend new data engineer jobs from the Welcome to the Jungle website.
Background
The initial problem for this project was to filter and analyse jobs based on technologies.
Data engineering can be technology-centered, but the focus of this project is to apply Platform and Pipeline concepts and best practices,
using O'Reilly Fundamentals of Data Engineering as a reference.
Below is the final Data Flow of the project.
Summary
A minimal pipeline using Python, Postgres, Airflow and Tableau.
C4 model diagrams: https://structurizr.com/workspace/79499/diagrams
Prerequisites
- Set up the environment
- Activate the venv and install the requirements
- Execute `playwright install` to download Chromium
- Install Airflow from PyPI (official instructions)
- Create the `job_market` database and export the environment variables `JOB_MARKET_DB_USER` and `JOB_MARKET_DB_PWD`
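The pipelines read the database credentials from these environment variables. A minimal connection sketch, assuming a local PostgreSQL on the default port (host, port and driver are assumptions):

```python
import os
from sqlalchemy import create_engine, text

# Credentials come from the environment variables exported above.
user = os.environ["JOB_MARKET_DB_USER"]
password = os.environ["JOB_MARKET_DB_PWD"]

# Host, port and driver are assumptions for a local setup.
engine = create_engine(f"postgresql+psycopg2://{user}:{password}@localhost:5432/job_market")

with engine.connect() as conn:
    # Simple connectivity check.
    print(conn.execute(text("SELECT 1")).scalar())
```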
Run locally
- `airflow standalone` will initialise the Airflow database, create a user, and start all components (development phase).
- The Airflow UI is at `localhost:8080`, with username `admin` and the password in `standalone_admin_password.txt`.
- In the DAGs tab, toggle on `job-market-batch` and trigger it manually if it is not running.
The aim is to enrich and refine a list of jobs considered interesting for the user's profile. We narrowed down from 5000 jobs initially to 1900 relevant ones, and then applied to a couple of matching jobs.
- Choose jobs that are attractive. Here a rating out of 10 is given:
- Request the API with a list of id(s) of the attractive jobs. The output is a job recommendation board with the most similar jobs at the top:
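A hypothetical example of such a request from Python; the endpoint URL, path and payload shape are assumptions, check the data-job-api repository for the actual contract:

```python
import requests

# Assumed local Flask address and endpoint; the real route may differ.
resp = requests.post(
    "http://localhost:5000/recommend",
    json={"ids": [1542, 1687]},  # ids of the jobs rated as attractive
    timeout=10,
)
resp.raise_for_status()

# Expected output: a ranked list of similar jobs, most similar first.
for job in resp.json():
    print(job)
```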
Remark
The feature weights influence the result. The defaults are:
- remote: 1
- title: 0.8
- stack: 0.8
- text: 0.7
- experience: 0.6
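As a sketch, these weights could be combined into a single score as a weighted average of per-feature similarities; the per-feature similarity values below are assumptions, not the actual implementation in data-job-api:

```python
# Default feature weights from the list above.
WEIGHTS = {"remote": 1.0, "title": 0.8, "stack": 0.8, "text": 0.7, "experience": 0.6}

def weighted_similarity(feature_scores: dict[str, float]) -> float:
    """Combine per-feature similarities (each in [0, 1]) into one weighted score."""
    total = sum(WEIGHTS.values())
    return sum(WEIGHTS[f] * feature_scores.get(f, 0.0) for f in WEIGHTS) / total

# Example: same remote policy and stack, partially matching title and text.
print(weighted_similarity({"remote": 1.0, "stack": 1.0, "title": 0.5, "text": 0.2, "experience": 1.0}))
```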
These tests cover the project as a whole. The individual pipelines data-job-crawler and data-job-etl have tests in their respective repositories.
Multiple errors can arise when using Airflow. The recommended order for testing is the DAG file, then individual tasks, and finally a backfill to take dependencies into account.
- DAG file: `python3 job_market_etl_dag.py`
- Import time: `time python3 job_market_etl_dag.py` (default max 30 seconds)
- DAG loading: `pytest tests/` (see the sketch after this list)
- List tasks and test them individually:
  - `airflow tasks list job-market-batch`
  - `airflow tasks test job-market-batch TASK 2022-01-01`
- Backfill: `airflow dags backfill job-market-batch --start-date 2023-01-01`
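A minimal sketch of the kind of DAG-loading check that `pytest tests/` can run; the test file name and exact assertions are assumptions:

```python
# tests/test_dag_loading.py (hypothetical file name)
from airflow.models import DagBag

def test_dags_load_without_errors():
    # Parsing the DAG folder should not raise import errors.
    dagbag = DagBag(include_examples=False)
    assert dagbag.import_errors == {}
    # The batch DAG should be registered under its dag_id.
    assert "job-market-batch" in dagbag.dags
```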
Ensure that the data flow is maintained. To do this, compare jobs on the website and find them in the successive tables before and after processing.
- Look at the latest job postings
- Query the `raw_jobs`, `processed_jobs` and `apply` tables and compare the results (see the sketch below)
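A sketch of such a comparison, assuming the table names above and a `url` column in each table (connection details as in the prerequisites):

```python
import os
from sqlalchemy import create_engine, text

engine = create_engine(
    f"postgresql+psycopg2://{os.environ['JOB_MARKET_DB_USER']}:"
    f"{os.environ['JOB_MARKET_DB_PWD']}@localhost:5432/job_market"
)

# Example posting taken from the website; replace with a real URL.
job_url = "https://www.welcometothejungle.com/en/companies/acme/jobs/data-engineer"

with engine.connect() as conn:
    # The same job should appear in each successive table of the flow.
    for table in ("raw_jobs", "processed_jobs", "apply"):
        count = conn.execute(
            text(f"SELECT count(*) FROM {table} WHERE url LIKE :u"),
            {"u": f"{job_url}%"},
        ).scalar()
        print(f"{table}: {count} row(s) for this posting")
```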
Even with a correct data flow, the data can lack relevance or accuracy. Some quality checks:
- All technologies must be present
  - Some technologies are written differently (eg. Google BigQuery, Google Big Query)
  - They have to be added manually in `config/definitions.py` from the ETL package (see the sketch after this list)
- Eliminate non data engineer jobs
  - These are about half of the jobs, despite the filters
  - Verify data engineer roles with `regex_data_engineer_roles.sql`
- Eliminate duplicate jobs
  - Some jobs are reposted multiple times with a different url each time
  - They are removed by truncating the last part of the url, see `sql/truncate_urls.sql`
- Processing errors
  - To be added to the unit test inputs
  - Some queries to spot errors are in `tests/processing_errors.sql`
- Missing values
  - Only id, url, title and company have the `NOT NULL` constraint
  - Check the other fields with `tests/missing_values.sql`
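A sketch of the kind of alias mapping kept in `config/definitions.py`; the variable name and the exact aliases are assumptions:

```python
# Hypothetical alias table: several spellings map to one canonical name.
TECHNOLOGY_ALIASES = {
    "google bigquery": "BigQuery",
    "google big query": "BigQuery",
    "big query": "BigQuery",
    "postgres": "PostgreSQL",
    "postgresql": "PostgreSQL",
}

def normalize_technology(raw: str) -> str:
    """Map a scraped technology spelling to its canonical name."""
    return TECHNOLOGY_ALIASES.get(raw.strip().lower(), raw.strip())

assert normalize_technology("Google Big Query") == "BigQuery"
```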
The Pytest coverage is not optimal. Unit testing the crawler is delicate because the web pages' DOM changes regularly; however, these errors are easy to detect.
This project comprises pipelines that can run individually. They are written as Python packages to follow Airflow's best practice for importing custom code.
This first pipeline ingests the raw data (web pages) into a Postgres table. It handles the dynamic content by using 2 successive spiders.
Repo: data-job-crawler
This is the processing pipeline, using Python and SQLAlchemy to connect to PostgreSQL.
Repo: data-job-etl
The Flask RESTful API implements a recommendation algorithm that can be used through a web UI.
Repo: data-job-api
The data sources are the web pages containing results for the latest data engineering jobs. The project started with Spotify and Welcome To The Jungle where I found my first tech job.
Characteristics:
- JavaScript-based pages require the additional scraping library Playwright (see the sketch after this list)
- The HTML will eventually change, which requires detection as well as a manual update of the XPaths
- Different websites can expose more or fewer fields (eg. remote policy), which results in null values
- Schema evolution can occur
- If a job offer appears on at least 2 scraped websites, duplicates need to be handled
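As an illustration of the first point, a JavaScript-rendered listings page can be fetched with Playwright before parsing. This is only a generic sketch, not the two-spider setup of data-job-crawler, and the URL is an example:

```python
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Return the HTML of a page after its JavaScript has executed."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

html = fetch_rendered_html("https://www.welcometothejungle.com/en/jobs?query=data%20engineer")
print(len(html))
```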
The ingestion is made with the Scrapy framework, whose pipeline feature conforms the data to predefined fields and writes it to the database.
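A sketch of what such a Scrapy item pipeline can look like; the class name, field names and database wiring are assumptions, see data-job-crawler for the real implementation:

```python
import os
import psycopg2

class PostgresWriterPipeline:
    """Write conformed items into the raw_jobs table (hypothetical pipeline)."""

    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            dbname="job_market",
            user=os.environ["JOB_MARKET_DB_USER"],
            password=os.environ["JOB_MARKET_DB_PWD"],
            host="localhost",
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # Field names mirror the columns mentioned in the quality checks above.
        self.cur.execute(
            "INSERT INTO raw_jobs (url, title, company, text) VALUES (%s, %s, %s, %s)",
            (item.get("url"), item.get("title"), item.get("company"), item.get("text")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```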
All transformations are made with Python in a dedicated ETL pipeline, data-job-etl. The processing consists of cleaning and reformatting fields and extracting the technology names from the text field into a new column. The data is finally modelled to be loaded into a new database.
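A sketch of the technology-extraction step, assuming a canonical list of technologies; the actual list and patterns live in the ETL package:

```python
import re

# Hypothetical canonical technology list (kept in config/definitions.py in the ETL package).
TECHNOLOGIES = ["Python", "Airflow", "PostgreSQL", "BigQuery", "Spark", "dbt"]

def extract_stack(text: str) -> list[str]:
    """Return the technologies mentioned in a job description."""
    found = []
    for tech in TECHNOLOGIES:
        # Whole-word, case-insensitive match against the free-text field.
        if re.search(rf"\b{re.escape(tech)}\b", text, flags=re.IGNORECASE):
            found.append(tech)
    return found

print(extract_stack("We use Python and Airflow on top of PostgreSQL."))
# ['Python', 'Airflow', 'PostgreSQL']
```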
PostgreSQL is used for both the OLTP and OLAP stores. The data is uniformly structured, with a small read and write workload.
The raw data is stored in a single large relation. The `text` field is the biggest but should not reach the 1 GB field size limit.
Once transformed, the data is loaded into a new table without normalization. A future implementation would differentiate 2 databases with the following schemas:
Airflow is run locally with the LocalExecutor. Following best practices, the custom code is encapsulated in Python packages and then imported in the DAG.
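A minimal sketch of such a DAG; the task names and the imported callables are assumptions, the real DAG lives in `job_market_etl_dag.py`:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# In the real DAG these callables are imported from the custom packages,
# e.g. the crawler and ETL packages (module paths are assumptions).
def crawl_jobs():
    ...

def process_jobs():
    ...

with DAG(
    dag_id="job-market-batch",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    crawl = PythonOperator(task_id="crawl", python_callable=crawl_jobs)
    process = PythonOperator(task_id="process", python_callable=process_jobs)
    crawl >> process  # crawling must finish before processing starts
```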
- Deploy the API with Docker
- Improve unit test coverage
- Extend the API web UI with more arguments, such as feature weights