The purpose of this repository is to give Data Engineers the chance to complete an end-to-end Data Engineering project. Full instructions will be given on the desired architecture and the steps to take to complete each project.
The expectation for these projects is that you will do everything yourself: Bash scripts, Dockerfiles, READMEs, code, and so on.
Nothing is done for you, which forces you not to rely on others or skip
things you might not be familiar with. Growth comes with struggle.
Similar to how work is handed down on a Data Team, some of the instructions will be specific, some will be ambiguous, and the solution is generally up to you.
These projects will test a Data Engineer's abilities across multiple techs and concepts, including but not limited to:
- Docker
- Bash
- Python
- Airflow
- Async
- Data Modeling
- Postgres
- Delta Lake
- PySpark
- Parquet/CSV
- BytesIO
- Lazy Evaluation
- SQL
- Analytics
- Dashboards
- AWS Cloud
Good Data Engineers are well-rounded: they can work across multiple techs and concepts, interpret both clear and unclear directions, and design architecture to support the requirements.
In this first Data Engineering project, the goal is to set up a Data Platform
on which you can visually build a data pipeline that downloads
raw TSV data, processes it, deposits the results into
a Lake House, and then displays a Dashboard of the results.
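As a rough illustration of the shape such a pipeline might take, here is a minimal sketch using Airflow's TaskFlow API (Airflow 2.x) with PySpark and Delta Lake. The source URL, file paths, and task names are hypothetical placeholders, not part of the project spec; your own architecture, tooling choices, and error handling are up to you.

```python
# Minimal sketch of a TSV -> Lake House pipeline; all URLs and paths are placeholders.
import pendulum
import requests
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def tsv_to_lakehouse():
    @task
    def download_tsv() -> str:
        # Pull the raw TSV file and stage it on local disk.
        url = "https://example.com/data.tsv"  # hypothetical source
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        path = "/tmp/raw_data.tsv"
        with open(path, "wb") as f:
            f.write(resp.content)
        return path

    @task
    def process_and_load(path: str) -> None:
        # Read the TSV with PySpark and write it to a Delta table.
        # Assumes the delta-spark package is installed.
        from delta import configure_spark_with_delta_pip
        from pyspark.sql import SparkSession

        builder = (
            SparkSession.builder.appName("tsv_to_lakehouse")
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config(
                "spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog",
            )
        )
        spark = configure_spark_with_delta_pip(builder).getOrCreate()
        df = spark.read.csv(path, sep="\t", header=True, inferSchema=True)
        df.write.format("delta").mode("overwrite").save("/tmp/lakehouse/raw")  # placeholder path

    process_and_load(download_tsv())


tsv_to_lakehouse()
```

Note how lazy evaluation shows up here: Spark does not actually read or transform anything until the `write` action triggers the job, which is one of the concepts from the list above.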
This project tests your ability to understand high-level requirements and turn them
into technical details without much guidance.
It also tests your ability to work across the entire Data Engineering stack, from `bash`
to `Python` and `Docker`, as well as various other tools.