GitHub - jiahuizz/databend: A modern Elasticity and Performance cloud data warehouse, activate your object storage for real-time insights.

A Modern Cloud Data Warehouse with the Elasticity and Performance both on Object Storage

Documentation | Benchmarking | Roadmap(v0.8)

What is Databend?
Design Overview
Getting Started
Community
Roadmap

What is Databend?

Databend is an open-source Elastic and Workload-Aware Modern Cloud Data Warehouse.

Databend uses the latest techniques in vectorized query processing to allow you to do blazing-fast data analytics on Object Storage.

Instant Elasticity

Databend separates the storage and compute, which allows you easily scale up or scale down based on your application's needs.
Blazing Performance

Databend leverages data-level parallelism(Vectorized Query Execution) and instruction-level parallelism(SIMD) technology, offering blazing performance data analytics.
Support for Semi-Structured Data

Databend supports ingestion of semi-structured data in various formats like CSV, JSON, and Parquet, which are located in the cloud or your local file system; Databend also supports semi-structured data types:VARIANT, OBJECT, ARRAY, which is easy to import and operate on semi-structured data (JSON).
MySQL/ClickHouse Compatible

Databend is ANSI SQL compliant and MySQL/ClickHouse wire protocol compatible, making it easy to connect with existing tools.
Easy to Use

Databend has no indexes to build, no manual tuning required, no manual figuring out partitions or shard data, it’s all done for you as data is loaded into the table.

Design Overview

This is the high-level architecture of Databend. It consists of three components:

meta service layer
compute layer
storage layer

Meta Service Layer

The meta service is a layer to service multiple tenants. This layer implements a persistent key-value store to store each tenant's state. In the current implementation, the meta service has many components:

Metadata, which manages all metadata of databases, tables, clusters, the transaction, etc.
Administration, which stores user info, user management, access control information, usage statistics, etc.
Security, which performs authorization and authentication to protect the privacy of users' data.

The code of Meta Service Layer mainly resides in the metasrv directory of the repository.

Compute Layer

The compute layer is the layer that carries out computation for query processing. This layer may consist of many clusters, and each cluster may consist of many nodes. Each node is a computing unit and is a collection of components:

Planner

The query planner builds an execution plan from the user's SQL statement and represents the query with different types of relational operators (such as Projection, Filter, Limit, etc.).

For example:

databend :) EXPLAIN SELECT avg(number) FROM numbers(100000) GROUP BY number % 3
┌─explain─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Projection: avg(number):Float64                                                                                                                                                         │
│   AggregatorFinal: groupBy=[[(number % 3)]], aggr=[[avg(number)]]                                                                                                                       │
│     AggregatorPartial: groupBy=[[(number % 3)]], aggr=[[avg(number)]]                                                                                                                   │
│       Expression: (number % 3):UInt8, number:UInt64 (Before GroupBy)                                                                                                                    │
│         ReadDataSource: scan schema: [number:UInt64], statistics: [read_rows: 100000, read_bytes: 800000, partitions_scanned: 11, partitions_total: 11], push_downs: [projections: [0]] │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Optimizer

A rule-based optimizer, some rules like predicate push down or pruning of unused columns.

Processors

A Pull&Push-Based query execution pipeline, which is built by planner instructions. Each pipeline executor is a processor(such as SourceTransform, FilterTransform, etc.), it has zero or more inputs and zero or more outputs, and connected as a pipeline, it also can be distributed on multiple nodes judged by your query workload.

For example:

databend :) EXPLAIN PIPELINE SELECT avg(number) FROM numbers(100000) GROUP BY number % 3
┌─explain────────────────────────────────────────────────────────────────────────────────┐
│ ProjectionTransform × 16 processors                                                    │
│   Mixed (GroupByFinalTransform × 1 processor) to (ProjectionTransform × 16 processors) │
│     GroupByFinalTransform × 1 processor                                                │
│       Merge (GroupByPartialTransform × 16 processors) to (GroupByFinalTransform × 1)   │
│         GroupByPartialTransform × 16 processors                                        │
│           ExpressionTransform × 16 processors                                          │
│             SourceTransform × 16 processors                                            │
└────────────────────────────────────────────────────────────────────────────────────────┘

Node is the smallest unit of the compute layer. A set of nodes can be registered as one cluster via namespace. Many clusters can attach the same database, so they can serve the query in parallel by different users. When you add new nodes to a cluster, the currently running computational tasks can be scaled(known as work-stealing) guarantee.

The Compute Layer codes are mainly in the query directory.

Storage Layer

Databend stores data in an efficient, columnar format as Parquet files. Each Parquet file is sorted by the primary key before being written to the underlying shared storage. For efficient pruning, Databend also creates indexes for each Parquet file:

min_max.idx The index file stores the minimum and maximum value of this Parquet file.
sparse.idx The index file store the <key, parquet-page> mapping for every [N] records granularity.

With the indexes, we can speed up the queries by reducing the I/O and CPU costs. Imagine that Parquet file f1 has min_max.idx of [3, 5) and Parquet file f2 has min_max.idx of [4, 6) in column x if the query predicate is WHERE x < 4, only f1 needs to be accessed and processed.

Getting Started

Deployment

Connect

Users

Tables

Views

Load Data

Use Case

Performance

How to Benchmark Databend

Community

For general help in using Databend, please refer to the official documentation. For additional help, you can use one of these channels to ask a question:

Slack (For live discussion with the Community)
Github (Feature/Bug reports, Contributions)
Twitter (Get the news fast)
Weekly (A weekly newsletter about Databend)

Roadmap

License

Databend is licensed under Apache 2.0.

Acknowledgement

Databend is inspired by ClickHouse and Snowflake, its computing model is based on apache-arrow.
The documentation website hosted by Vercel.

Name		Name	Last commit message	Last commit date
Latest commit History 11,707 Commits
.cargo		.cargo
.github		.github
common		common
docker		docker
docs		docs
metasrv		metasrv
query		query
scripts		scripts
tests		tests
tools		tools
website		website
.gitattributes		.gitattributes
.gitignore		.gitignore
.licenserc.yaml		.licenserc.yaml
.style.yapf		.style.yapf
.yamllint.yml		.yamllint.yml
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Cross.toml		Cross.toml
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
rust-toolchain.toml		rust-toolchain.toml
rustfmt.toml		rustfmt.toml
taplo.toml		taplo.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Documentation | Benchmarking | Roadmap(v0.8)

What is Databend?

Design Overview

Meta Service Layer

Compute Layer

Storage Layer

Getting Started

Deployment

Connect

Users

Tables

Views

Load Data

Use Case

Performance

Community

Roadmap

License

Acknowledgement

About

Releases

Packages

Languages

License

jiahuizz/databend

Folders and files

Latest commit

History

Repository files navigation

Documentation | Benchmarking | Roadmap(v0.8)

What is Databend?

Design Overview

Meta Service Layer

Compute Layer

Storage Layer

Getting Started

Deployment

Connect

Users

Tables

Views

Load Data

Use Case

Performance

Community

Roadmap

License

Acknowledgement

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages