Ballista: Distributed Compute with Rust, Apache Arrow, and DataFusion

Ballista is a distributed SQL query engine primarily implemented in Rust, and powered by Apache Arrow and DataFusion. It is built on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as first-class citizens without paying a penalty for serialization costs.

The foundational technologies in Ballista are:

Apache Arrow memory model and compute kernels for efficient processing of data.
DataFusion query engine for query execution.
Apache Arrow Flight protocol for efficient data transfer between processes.
Google Protocol Buffers for serializing query plans, with the intention of eventually adopting substrait.io for this purpose.

Ballista implements a similar design to Apache Spark (particularly Spark SQL), but there are some key differences:

The choice of Rust as the main execution language avoids the overhead of GC pauses.
Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still largely row-based today.
The combination of Rust and Arrow provides excellent memory efficiency. Memory usage can be 5x to 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.
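
To make the row-based versus columnar distinction above more concrete, here is a minimal sketch using only the Rust standard library. It is not Ballista or Arrow code: the `TripRow` and `TripColumns` types and the field names are hypothetical and exist only for illustration. The point it tries to show is that a columnar layout keeps each field in one contiguous array, so an aggregate over a single column scans a dense slice, which is the access pattern that vectorized (SIMD) execution and compression tend to benefit from.

```rust
/// Row-oriented layout: each record stores all of its fields together.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone)]
struct TripRow {
    passenger_count: i64,
    fare_amount: f64,
}

/// Column-oriented layout: each field is stored as one contiguous array.
/// This mirrors the general idea behind a columnar memory model, but it is
/// not the Arrow format itself.
#[derive(Debug, Default)]
struct TripColumns {
    passenger_count: Vec<i64>,
    fare_amount: Vec<f64>,
}

impl TripColumns {
    /// Pivot a slice of rows into per-field columns.
    fn from_rows(rows: &[TripRow]) -> Self {
        let mut cols = TripColumns::default();
        for row in rows {
            cols.passenger_count.push(row.passenger_count);
            cols.fare_amount.push(row.fare_amount);
        }
        cols
    }
}

/// Row-based aggregate: walks every record and picks out one field.
fn sum_fares_rows(rows: &[TripRow]) -> f64 {
    rows.iter().map(|r| r.fare_amount).sum()
}

/// Columnar aggregate: scans a single dense slice of f64 values.
fn sum_fares_columns(cols: &TripColumns) -> f64 {
    cols.fare_amount.iter().sum()
}

fn main() {
    let rows = vec![
        TripRow { passenger_count: 1, fare_amount: 7.5 },
        TripRow { passenger_count: 2, fare_amount: 12.0 },
        TripRow { passenger_count: 1, fare_amount: 4.5 },
    ];
    let cols = TripColumns::from_rows(&rows);

    // Both layouts hold the same data and produce the same aggregate.
    assert_eq!(sum_fares_rows(&rows), sum_fares_columns(&cols));
    println!(
        "{} trips, total fare: {}",
        cols.passenger_count.len(),
        sum_fares_columns(&cols)
    );
}
```

Both functions compute the same result; the difference is only in how the data is laid out in memory. In a real deployment the columns would be Arrow arrays rather than `Vec`s.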

Ballista can be deployed as a standalone cluster and also supports Kubernetes. In either case, the scheduler can be configured to use etcd as a backing store to (eventually) provide redundancy if a scheduler fails.

Project Status and Roadmap

Ballista is currently a proof-of-concept and provides batch execution of SQL queries. Although it is already capable of executing complex queries, it is not yet scalable or robust.

There is an excellent discussion in apache#30 about the future of the project and we encourage you to participate and add your feedback there if you are interested in using or contributing to Ballista.

The current initiatives being considered are:

Continue to improve the current batch-based execution
Add support for low-latency query execution based on a streaming model
Adopt substrait.io to allow other query engines to be integrated

Getting Started

The easiest way to get started is to run one of the standalone or distributed examples. After that, refer to the Getting Started Guide.

Architecture Overview

Refer to the developer documentation for the Architecture Overview.
Watch the "Ballista: Distributed Compute with Rust and Apache Arrow" talk from the New York Open Statistical Programming Meetup (Feb 2021).

Contribution Guide

Please see the Contribution Guide for information about contributing to Ballista.

