holocene is a follow-up to eocene, where we implement a vectorized, push-based query engine using Arrow as the data format.
In the context of database workloads, vectorized execution means processing batches of records. Most often the term refers to something along the lines of the Volcano model where, instead of next() returning a single record, next() returns multiple records.
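To make the batch-returning next() concrete, here is a minimal sketch of a pull-based (Volcano-style) plan in plain Python. The class names `Scan` and `Filter`, the `collect` helper, and the use of lists of dicts in place of Arrow record batches are all assumptions for illustration; none of this is holocene's actual API.

```python
from typing import Callable, List, Optional

Row = dict
Batch = List[Row]  # stand-in for an Arrow record batch (assumption)


class Scan:
    """Leaf operator: hands out the input rows in fixed-size batches."""

    def __init__(self, rows: List[Row], batch_size: int) -> None:
        self.rows = rows
        self.batch_size = batch_size
        self.offset = 0

    def next(self) -> Optional[Batch]:
        # Return up to batch_size rows per call, or None once exhausted.
        if self.offset >= len(self.rows):
            return None
        batch = self.rows[self.offset:self.offset + self.batch_size]
        self.offset += self.batch_size
        return batch


class Filter:
    """Pulls one batch from its child and keeps the rows that match."""

    def __init__(self, child: Scan, predicate: Callable[[Row], bool]) -> None:
        self.child = child
        self.predicate = predicate

    def next(self) -> Optional[Batch]:
        batch = self.child.next()
        if batch is None:
            return None
        # The result may be an empty batch; None is reserved for end of input.
        return [row for row in batch if self.predicate(row)]


def collect(root) -> List[Batch]:
    """Drive the plan from the root by calling next() until it returns None."""
    out = []
    while (batch := root.next()) is not None:
        out.append(batch)
    return out
```

The only difference from the classic one-record Volcano iterator is the return type of next(): the call overhead is paid once per batch rather than once per record.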
Actual vectorization, as in SIMD instructions, is sometimes used to implement faster compute kernels, but that does not mean the entire query plan is vectorized. The plan can, however, still be executed in parallel.
Push-based in this context describes a paradigm different from the Volcano model: instead of each operator pulling records from its input, operators push their results down the pipeline. This approach has the benefit that the query plan becomes a DAG that can be executed in parallel, except for pipeline breakers, which can be seen as join points.
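A push-based operator can be sketched as an object that receives a batch and forwards its result to the next operator. The names below (`CollectSink`, `PushFilter`, `PushProjection`, `run_source`) and the push/finish interface are hypothetical, and lists of dicts again stand in for Arrow record batches.

```python
from typing import Callable, List

Row = dict
Batch = List[Row]  # stand-in for an Arrow record batch (assumption)


class CollectSink:
    """Terminal operator: stores every batch pushed into it."""

    def __init__(self) -> None:
        self.batches: List[Batch] = []
        self.finished = False

    def push(self, batch: Batch) -> None:
        self.batches.append(batch)

    def finish(self) -> None:
        self.finished = True


class PushFilter:
    """Keeps matching rows and pushes them to the downstream operator."""

    def __init__(self, predicate: Callable[[Row], bool], downstream) -> None:
        self.predicate = predicate
        self.downstream = downstream

    def push(self, batch: Batch) -> None:
        kept = [row for row in batch if self.predicate(row)]
        if kept:  # do not forward empty batches
            self.downstream.push(kept)

    def finish(self) -> None:
        self.downstream.finish()


class PushProjection:
    """Keeps only the requested columns and pushes the result downstream."""

    def __init__(self, columns: List[str], downstream) -> None:
        self.columns = columns
        self.downstream = downstream

    def push(self, batch: Batch) -> None:
        self.downstream.push([{c: row[c] for c in self.columns} for row in batch])

    def finish(self) -> None:
        self.downstream.finish()


def run_source(batches: List[Batch], pipeline) -> None:
    """The source drives execution: it pushes each batch, then signals the end."""
    for batch in batches:
        pipeline.push(batch)
    pipeline.finish()
```

Control flow is inverted compared to the pull model: the source drives execution, and each operator only needs a reference to whatever comes after it.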
Vectorized, push-based models combine these two ideas and are very well suited to OLAP workloads.
    Parallelizable part of the pipeline,                      Pipeline breaking, since LIMIT
    each step pushes multiple records                         will be applied over all records
    down the pipeline
    +--------+                                                             |
    | Batch  |    +--------+    +------+    +--------+    +------------+   |    +-------+
    +--------+--->| Source |--->| Scan |--->| Filter |--->| Projection |   |    |       |
                  +--------+    +------+    +--------+    +------------+   |    |       |
    +--------+                                                             |    |       |
    | Batch  |    +--------+    +------+    +--------+    +------------+   |    |       |
    +--------+--->| Source |--->| Scan |--->| Filter |--->| Projection |---+--->| Limit |
                  +--------+    +------+    +--------+    +------------+   |    |       |
    +--------+                                                             |    |       |
    | Batch  |    +--------+    +------+    +--------+    +------------+   |    |       |
    +--------+--->| Source |--->| Scan |--->| Filter |--->| Projection |   |    |       |
                  +--------+    +------+    +--------+    +------------+   |    +-------+
                                                                           |
    +--------+                                                             |
    | Batch  |                                                             |
    +--------+                                                             |
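The diagram above can be approximated in a few lines of Python: each batch runs through its own filter and projection independently, and the limit acts as the join point that sees every pipeline's output. This is only a sketch under stated assumptions. The function names, the `value >= 10` predicate, the `id` column, and the thread pool are illustrative choices, and in standard CPython the threads show the structure rather than deliver a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List

Row = dict
Batch = List[Row]  # stand-in for an Arrow record batch (assumption)


def run_pipeline(batch: Batch) -> Batch:
    """Filter, then project, a single batch.

    Touches no shared state, so one call per batch can run in parallel.
    """
    kept = [row for row in batch if row["value"] >= 10]  # hypothetical filter
    return [{"id": row["id"]} for row in kept]           # hypothetical projection


def limit(batches: Iterable[Batch], n: int) -> List[Row]:
    """Pipeline breaker: consumes the output of all pipelines, keeps n rows."""
    out: List[Row] = []
    for batch in batches:
        for row in batch:
            if len(out) == n:
                return out
            out.append(row)
    return out


def execute(batches: List[Batch], n: int, workers: int = 3) -> List[Row]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the limit is deterministic.
        results = pool.map(run_pipeline, batches)
        return limit(results, n)
```

A real engine would likely stop scheduling new batches once the limit is satisfied; this sketch lets the pool finish the work it has already been given.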