Data Engineering with Airflow and Kubernetes

Prerequisites

  • Install Terraform. It is used to provision the Kubernetes cluster; for development we use a local Kubernetes cluster, and Terraform can also provision a cluster in the cloud.
  • Install a local Kubernetes cluster; here we use kind.
  • Install kubectl and Helm. On macOS you can install both with brew install kubectl helm.

Setup

For each of the steps below, navigate to the corresponding terraform directory and run terraform apply.

Provision Database

Deploy Data Pipeline

  • Navigate to the Data_processing directory

  • Build the Docker image by running

    poetry export -f requirements.txt --output requirements.txt --without-hashes
    docker build -t airflow-custom:latest .
  • Add two environment variables

    KAGGLE_USERNAME=<user>
    KAGGLE_KEY=<pass>

    to data_processing/dev.env.list. These are the Kaggle API credentials; see the sketch after this list.

  • Run docker-compose up to launch an Airflow cluster locally. For this local setup you need to configure connectivity to the Postgres instance running on the host.

  • Alternatively, run airflow webserver, navigate to the Airflow connections page, and change the default Postgres connection as defined here

  • Use password q1w2e3r4

  • You can load the Docker image into the kind cluster and deploy Airflow using the Helm chart. A Terraform script is provided here

  • You also need to update the environment variables here
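
The KAGGLE_USERNAME and KAGGLE_KEY variables are the credentials that the official kaggle Python package reads from the environment. As a minimal sketch of how an ingestion task could use them (the dataset slug and download path below are hypothetical placeholders, not values from this repository):

    from kaggle.api.kaggle_api_extended import KaggleApi

    # KaggleApi picks up KAGGLE_USERNAME / KAGGLE_KEY from the environment,
    # which docker-compose loads from data_processing/dev.env.list.
    api = KaggleApi()
    api.authenticate()

    # Hypothetical dataset slug and target directory, for illustration only.
    api.dataset_download_files("some-owner/some-dataset", path="/tmp/raw", unzip=True)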

Data Pipelines

Productionization

  • How would you package the pipeline code for deployment?
    • Via a CI/CD pipeline that packages it as a Docker image
  • How would you schedule a pipeline that runs the ingestion and the transformation tasks sequentially every day?
    • Using Airflow; this is demonstrated here (see the sketch after this list)
  • How would you ensure the quality of the data generated from the pipeline?
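
For the scheduling question, here is a minimal sketch of an Airflow DAG that runs the ingestion and transformation tasks sequentially once a day. The DAG id, task ids, and callables are placeholders, not the repository's actual DAG, which is linked above:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def ingest():
        # Placeholder for the real ingestion logic (e.g. downloading the Kaggle dataset).
        ...


    def transform():
        # Placeholder for the real transformation logic.
        ...


    with DAG(
        dag_id="daily_ingest_transform",   # hypothetical DAG id
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",        # trigger one run per day
        catchup=False,
    ) as dag:
        ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)

        # The transformation only starts after ingestion completes successfully.
        ingest_task >> transform_task

The >> operator sets the task dependency, and the daily schedule interval makes the scheduler trigger one run per day, which matches the "ingestion then transformation, every day" requirement.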
