- Install Terraform. It is used to provision the Kubernetes cluster; for development we use a local Kubernetes cluster, and Terraform can also be used to provision a cluster in the cloud.
- Install a local Kubernetes cluster; here we have used kind.
- Install kubectl and helm. If you use a Mac, you can install both with `brew install kubectl helm`.
- For each of the following, navigate to the corresponding terraform directory and simply run `terraform apply`:
  - Deploy the kind cluster
  - Deploy the Kubernetes namespaces
  - Deploy the Postgres Helm chart
- Expose the running Postgres instance by running `kubectl port-forward svc/my-release-postgresql 5432:5432 -n airflow`
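To confirm the port-forward works, you can connect to the forwarded port from your machine. A minimal sketch, assuming the chart's default `postgres` user and database and the `q1w2e3r4` password used later for the Airflow connection (adjust these to your release's values):

```python
# Quick connectivity check against the port-forwarded Postgres service.
# User/database names are assumptions based on the chart defaults.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="postgres",
    user="postgres",
    password="q1w2e3r4",
)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```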
- Navigate to the `Data_processing` directory
- Build the Docker image by running `poetry export -f requirements.txt --output requirements.txt --without-hashes` followed by `docker build -t airflow-custom:latest .`
- Add two environment variables: `KAGGLE_USERNAME=<user>` and `KAGGLE_KEY=<pass>`
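These credentials are what the Kaggle client reads when the ingestion step downloads source data. A rough sketch of how they are consumed, using the official `kaggle` package; the dataset slug below is hypothetical, and the actual ingestion code in the DAG may call the API differently:

```python
# The kaggle client picks up KAGGLE_USERNAME / KAGGLE_KEY from the environment.
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads the two environment variables set above

# Hypothetical dataset slug; replace with the dataset the DAG actually ingests.
api.dataset_download_files("some-user/imdb-dataset", path="/tmp/imdb", unzip=True)
```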
- Run `docker-compose up`. This launches an Airflow cluster locally as a demonstration server; you need to configure connectivity to the Postgres instance running on the host.
- Alternatively, run `airflow webserver`, then navigate to Airflow connections and change the default Postgres connection as defined here.
- Use the password `q1w2e3r4`.
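Once the connection is configured, DAG tasks can reach Postgres through Airflow's connection machinery instead of hard-coded credentials. A minimal sketch using the Postgres provider hook; the connection id is an assumption, so use whichever id the DAGs actually reference:

```python
# Read through the configured Airflow connection rather than raw credentials.
from airflow.providers.postgres.hooks.postgres import PostgresHook

hook = PostgresHook(postgres_conn_id="postgres_default")  # assumed connection id
rows = hook.get_records("SELECT 1;")
print(rows)
```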
- You can load the Docker image into the kind cluster and also deploy Airflow using its Helm chart. A Terraform script is provided here.
- You also need to update the env here.
- All the code for the data pipelines is in `data_processing/airflow/dags/imdb`
- This code is part of an Airflow DAG
- All the required SQL can be found in `data_processing/airflow/dags/imdb/sql` (a sketch of wiring one of these files into a task follows below)
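As an illustration of how the files in that directory can be wired into the DAG, here is a hedged sketch using the Postgres operator; the task id, connection id, and SQL file name are hypothetical, and the real DAG may structure this differently:

```python
# Runs one of the files under data_processing/airflow/dags/imdb/sql.
# Relative .sql paths are resolved against the DAG's folder by Airflow's templating.
# This task would be declared inside the DAG definition.
from airflow.providers.postgres.operators.postgres import PostgresOperator

transform_titles = PostgresOperator(
    task_id="transform_titles",           # hypothetical task name
    postgres_conn_id="postgres_default",  # assumed connection id
    sql="sql/transform_titles.sql",       # hypothetical file in the sql/ directory
)
```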
- How would you package the pipeline code for deployment?
  - Using a CI/CD pipeline that builds and publishes the pipeline code as a Docker image
- How would you schedule a pipeline that runs the ingestion and the transformation tasks sequentially every day?
  - Using Airflow; this is demonstrated here (a minimal sketch follows below)
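A minimal sketch of the scheduling idea, assuming Airflow 2.x; the dag id and task callables are placeholders, since the real logic lives in the `imdb` DAG package:

```python
# Daily schedule with the two steps chained sequentially.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    ...  # download the raw data (placeholder)

def transform():
    ...  # run the SQL transformations (placeholder)

with DAG(
    dag_id="imdb_daily_pipeline",   # hypothetical dag id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",     # run once per day
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Transformation runs only after ingestion succeeds.
    ingest_task >> transform_task
```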
- How would you ensure the quality of the data generated from the pipeline?
  - Using a data validation tool like Great Expectations (a sketch follows below)
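For example, a final validation task could run a few expectations over the loaded data and fail the pipeline if they are not met. A minimal sketch using Great Expectations' classic pandas interface; the file path and column names are hypothetical, and the exact API varies across Great Expectations versions:

```python
# Validate a dataset with a few expectations; raise if any expectation fails.
import great_expectations as ge

ratings = ge.read_csv("/tmp/imdb/title_ratings.csv")  # hypothetical extract of the loaded data
ratings.expect_column_values_to_not_be_null("tconst")
ratings.expect_column_values_to_be_between("averageRating", min_value=0, max_value=10)

result = ratings.validate()
if not result.success:
    raise ValueError("Data quality checks failed")
```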