# A sample data pipeline for transforming invoice images and CSV files into BI Service Dashboards
Raw data (images and CSV) from the repo's `/k8s/object_store` directory is transformed into beautiful numbers displayed in Apache Superset.
- Invoice images are sampled from the CORDv2 dataset
- The CSV file is from Kaggle
This is a simplified data pipeline, meant to run on a single machine (e.g. your laptop). In a production environment, Airflow would act only as a scheduler, triggering jobs on a separate Spark cluster; Trino would probably not be needed there and could be replaced with Spark SQL.
## Prerequisites

- Docker Desktop (with Kubernetes and WSL2 enabled) or minikube
- Helm
- Python 3.12 (Microsoft Store)
- openssl: generates secrets for Superset and a TLS cert for Trino
- For Windows users: just install Git for Windows; openssl is bundled with the Git Bash console
- \>16 GB RAM, preferably 32 GB
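For reference, the openssl steps typically look something like the sketch below. This is a hypothetical illustration, not the repo's actual script: the variable name `SUPERSET_SECRET`, the output file names, and the CN are all assumptions — check `deploy.sh` for what is really used.

```shell
# Hypothetical sketch; names below are assumptions, not the repo's actual ones.

# Random secret suitable for Superset's SECRET_KEY
SUPERSET_SECRET="$(openssl rand -base64 42)"

# Self-signed certificate for Trino TLS
# (CN is a guessed in-cluster service name; adjust to your setup)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
    -keyout trino-key.pem -out trino-cert.pem \
    -subj "/CN=trino.everest.svc.cluster.local"
```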
## TL;DR
```shell
(cd ./k8s && ./deploy.sh)
```
Many services are of type `NodePort`; run `kubectl get svc -n everest` to get their exposed port numbers. See `defaults.sh` for the default login credentials.
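To pull out a single port instead of scanning the whole service list, `kubectl`'s jsonpath output can help. A sketch, assuming a service named `superset` exists in the `everest` namespace (the service name is a guess — use whatever names `kubectl get svc` shows):

```shell
# List all services and their ports in the everest namespace
kubectl get svc -n everest

# Print just the NodePort of one service
# (service name "superset" is an assumption)
kubectl get svc superset -n everest -o jsonpath='{.spec.ports[0].nodePort}'
```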
## Step-by-step guide