An Apache data system designed for development on a local cluster, with incremental modifications for deployment to production environments. Great for learning about cloud-native development in the Apache software ecosystem.
- Kafka: event source to ingest real-time application data
- Spark: framework for micro-batch and batch processing
- Delta: ACID-compliant storage layer on file storage
- Hive: metadata store for the Delta schemas
- Trino: analytics query engine for ad-hoc analysis
An end-to-end test of the system can be run in Kubernetes. The test:
- publishes sample data to a topic on the Kafka cluster
- ingests the data into a staging table on Delta using Structured Streaming
- performs windowed aggregations on the data and saves the results (both Spark steps are sketched after this list)
- triggers a SQL analytics query through Trino to simulate an analyst
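For orientation, here is a minimal PySpark sketch of what the ingest and windowed-aggregation steps might look like. The broker address, topic name, event schema, and Delta paths are illustrative assumptions, not the project's actual configuration.

```python
# Hypothetical sketch of the ingest and aggregation steps; requires the
# delta-spark and spark-sql-kafka packages. All names and paths are placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = (
    SparkSession.builder.appName("delta-staging-ingest")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Assumed shape of the sample events published to the topic.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("value", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Read the Kafka topic and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder broker
    .option("subscribe", "sample-events")             # placeholder topic
    .option("startingOffsets", "earliest")
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Ingest into the Delta staging table.
staging = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/data/checkpoints/staging")  # placeholder path
    .outputMode("append")
    .start("/data/delta/staging_events")                        # placeholder path
)

# Windowed aggregation over the same stream (the real job may instead read the
# staging table back as a Delta stream); the watermark allows append output.
aggregates = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"))
    .agg(F.count("*").alias("event_count"), F.avg("value").alias("avg_value"))
    .writeStream.format("delta")
    .option("checkpointLocation", "/data/checkpoints/aggregates")
    .outputMode("append")
    .start("/data/delta/windowed_aggregates")
)

spark.streams.awaitAnyTermination()
```

In the repository these steps are packaged into the spark-jobs image and launched by 04-run-spark-jobs.sh.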
The tests are triggered through GitHub Actions, although you will need to use a self-hosted runner.
Ensure Docker Desktop is running:

```sh
open -a Docker
kubectl config use-context docker-desktop
```
Build the Docker images for the Kafka producer, Spark, and Trino jobs (a sketch of the producer packaged by the first image follows the commands):

```sh
make kafka-producer-image
make spark-jobs-image
make trino-queries-image
```
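The kafka-producer image wraps the job that publishes sample data to the topic. A minimal sketch of such a producer, assuming the kafka-python client, with placeholder broker address, topic name, and event fields:

```python
# Hypothetical sample-data producer using the kafka-python client; the broker
# address, topic name, and event fields are assumptions for illustration.
import json
import random
import time
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                         # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for _ in range(100):
    event = {
        "event_id": str(uuid.uuid4()),
        "value": random.random(),
        "event_time": datetime.now(timezone.utc).isoformat(),
    }
    producer.send("sample-events", event)                   # placeholder topic
    time.sleep(0.1)

producer.flush()
producer.close()
```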
Run the system test:

```sh
cd scripts
./01-install-operators.sh
./02-deploy-kafka.sh
./03-deploy-delta-trino.sh
./04-run-spark-jobs.sh
./05-run-trino-query.sh
```
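The final script simulates the analyst's ad-hoc query. A hedged sketch using the trino Python client; the coordinator host, catalog, schema, and table name are placeholders:

```python
# Hypothetical ad-hoc query through Trino using the trino Python client; the
# coordinator host, catalog, schema, and table name are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator",  # placeholder Kubernetes service name
    port=8080,
    user="analyst",
    catalog="delta",           # placeholder catalog backed by the Hive metastore
    schema="default",
)

cur = conn.cursor()
cur.execute("SELECT * FROM windowed_aggregates LIMIT 10")  # placeholder table
for row in cur.fetchall():
    print(row)
```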
- Scale the Kafka cluster and Spark streaming jobs
- Change the batch job deployment to run on an ephemeral cluster
- Change the storage configuration to cloud storage (see the sketch after this list)
- Update the Hive metastore to PostgreSQL
- Scale the Trino cluster for more performant queries
- Set up an orchestrator such as Airflow or Prefect
- Configure security and monitoring for the application
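As a concrete example of the storage change, pointing Spark and Delta at S3-compatible object storage usually comes down to adding S3A settings when the session is built; the endpoint, credentials provider, and bucket below are assumptions:

```python
# Hypothetical Spark session configured for S3A object storage instead of local
# file storage; requires the hadoop-aws and AWS SDK jars. The endpoint,
# credentials provider, and bucket are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-on-object-storage")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.amazonaws.com")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate()
)

# Delta tables then live on the bucket rather than the local filesystem.
spark.read.format("delta").load("s3a://example-bucket/delta/staging_events").show()
```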
Check out this open-source project and company: Stackable.