A sample data pipeline for transforming invoice images and CSV files into beautiful numbers

tcd93/invoice-data-pipeline

Invoice Data Platform

A sample data pipeline for transforming invoice images and CSV files into BI Service Dashboards

Summary

TL;DR
Data flow

Raw data (images and CSV) from the repo's /k8s/object_store is transformed into beautiful numbers displayed in Apache Superset.

  • Invoice images are sampled from the CORDv2 dataset
  • The CSV file is from Kaggle

This is a simplified data pipeline, meant to run on a single machine (e.g. your laptop). In a production environment, Airflow would act only as a scheduler that triggers jobs on a separate Spark cluster. Trino is probably not needed in that case and can be replaced with SparkSQL.

Requirements

  • Docker Desktop (with Kubernetes and WSL2 enabled) or minikube
  • Helm
  • Python 3.12 (Microsoft Store)
  • openssl: used to generate secrets for Superset and a certificate for Trino
    • For Windows users: just install Git for Windows; openssl is included in the Git Bash console
  • More than 16 GB of RAM, preferably 32 GB
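
Before deploying, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch, not part of the repository: the function names `missing_tools` and `check_requirements` are hypothetical, and the executable names (`docker`, `kubectl`, `helm`, `python`, `openssl`) are assumptions about how the tools are installed on your machine (for example, Python may be exposed as `python3`).

```shell
#!/usr/bin/env bash

# Print each command from the argument list that is not found on PATH.
# (Hypothetical helper, not part of the repository.)
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the tools named in the requirements list.
# The executable names below are assumptions; adjust them to your setup.
check_requirements() {
  local missing
  missing="$(missing_tools docker kubectl helm python openssl)"
  if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing" >&2
    return 1
  fi
  echo "All required tools found."
}
```

Source the file and call `check_requirements` from Git Bash or any bash shell; it returns a non-zero status and lists the missing commands if anything is absent. It does not check the RAM requirement or whether Kubernetes is enabled.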

Quick Start

TL;DR

(cd ./k8s && ./deploy.sh)

Many services are of type NodePort; run kubectl get svc -n everest to see their exposed port numbers. Default login credentials are listed in defaults.sh.
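
To pick out only the NodePort services and their exposed ports, the table printed by kubectl can be filtered with awk. This is a sketch under assumptions: `nodeports` is a hypothetical helper, not something the repository provides, and it relies on the default column layout of `kubectl get svc` (NAME, TYPE, CLUSTER-IP, EXTERNAL-IP, PORT(S), AGE), where a NodePort entry in PORT(S) looks like `8080:30080/TCP`.

```shell
#!/usr/bin/env bash

# Read `kubectl get svc --no-headers` output on stdin and print
# "<service name> <node port>" for every port of every NodePort service.
# (Hypothetical helper; assumes the default kubectl column layout.)
nodeports() {
  awk '$2 == "NodePort" {
    # PORT(S) may hold several comma-separated entries like 80:30081/TCP
    n = split($5, ports, ",")
    for (i = 1; i <= n; i++) {
      # Split "port:nodePort/protocol" into its three parts
      split(ports[i], part, "[:/]")
      print $1, part[2]
    }
  }'
}
```

A possible invocation is `kubectl get svc -n everest --no-headers | nodeports`. Each printed port should then be reachable on localhost when using Docker Desktop; with minikube the node address may differ.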

Step-by-step guide
