Skip to content

Commit

Permalink
Add doc
Browse files Browse the repository at this point in the history
  • Loading branch information
akiradeveloper committed Apr 24, 2024
1 parent 95bb4d9 commit df382f6
Show file tree
Hide file tree
Showing 20 changed files with 518 additions and 144 deletions.
3 changes: 3 additions & 0 deletions .github/actions/build/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
FROM rust:latest
RUN cargo install mdbook --no-default-features --features search --vers "^0.4" --locked
CMD ["mdbook", "build", "doc"]
24 changes: 24 additions & 0 deletions .github/workflows/doc.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
name: Documentation
on:
workflow_dispatch:
push:
branches:
- master
paths:
- "doc/src/**"
- "doc/book.toml"

jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build
uses: ./.github/actions/build
- name: Deploy
uses: peaceiris/actions-gh-pages@v4
with:
github_token: ${{ secrets.GITHUB_TOKEN }}
PUBLISH_BRANCH: gh-pages
PUBLISH_DIR: ./doc/book

4 changes: 3 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -12,4 +12,6 @@ Cargo.lock
#Added by cargo
/target

.vscode
.vscode

.DS_Store
11 changes: 3 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,12 +1,14 @@
# lolraft

[![Crates.io](https://img.shields.io/crates/v/lolraft.svg)](https://crates.io/crates/lolraft)
[![Documentation](https://docs.rs/lolraft/badge.svg)](https://docs.rs/lolraft)
[![API doc](https://docs.rs/lolraft/badge.svg)](https://docs.rs/lolraft)
![CI](https://github.com/akiradeveloper/lolraft/workflows/CI/badge.svg)
[![MIT licensed](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/akiradeveloper/lolraft/blob/master/LICENSE)

A Multi-Raft implementation in Rust language.

[Documentation](https://akiradeveloper.github.io/lolraft/)

![146726060-63b12378-ecb7-49f9-8025-a65dbd37e9b2](https://github.com/akiradeveloper/lolraft/assets/785824/12a016fe-35a0-4d12-8ffa-955ef61b25b9)

## Features
Expand All @@ -18,13 +20,6 @@ A Multi-Raft implementation in Rust language.
- Based on [Tonic](https://github.com/hyperium/tonic) and efficient gRPC streaming is exploited in log replication and snapshot.
- [Phi Accrual Failure Detector](https://github.com/akiradeveloper/phi-detector) is used for leader failure detection. The adaptive algorithm allows you to not choose a fixed timeout number in prior to deployment and makes it possible to deploy Raft node in even geo-distributed environment.

## Architecture

To implement Multi-Raft, the architecture is split into two spaces. One in the lower side is called "Pure Raft" layer which is totally unaware of
gRPC and Multi-Raft. Therefore, it is called pure. The other side translates gRPC requests into pure requests and vice versa.

![スクリーンショット 2024-03-11 8 00 03](https://github.com/akiradeveloper/lolraft/assets/785824/28fd68c4-6aa9-4bc8-a1ac-c44eb8563751)

## Development

- `docker compose build` to build test application.
Expand Down
1 change: 1 addition & 0 deletions doc/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
book
6 changes: 6 additions & 0 deletions doc/book.toml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
[book]
authors = ["Akira Hayakawa"]
language = "en"
multilingual = false
src = "src"
title = "lolraft documentation"
9 changes: 9 additions & 0 deletions doc/src/SUMMARY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Summary

- [Multi-Raft](multi-raft.md)
- [Raft Process](raft-process.md)
- [Orbit pattern](orbit-pattern.md)
- [Application State](application-state.md)
- [Client Interaction](client-interaction.md)
- [Cluster Management](cluster-management.md)
- [Leadership](leadership.md)
37 changes: 37 additions & 0 deletions doc/src/application-state.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Application State

In RaftCore, `RaftApp` and `RaftLogStore` are especially important because
these two compose the application state.
This section explains the conceptual aspect of it.

![](images/application-state.png)

## RaftApp

`RaftApp` is an abstraction that represents the FSM (Finite State Machine) of the application
and the snapshot repository.
The snapshot repository isn't separated because the contents of the snapshot is
strongly coupled with the application state.

The application state can be updated by applying the log entries.
In this figure, the last applied entry is of index 55.

`RaftApp` can generate a snapshot arbitrarily to compact the log.
When `RaftApp` makes a snapshot and stores it in the snapshot repository.
The latest snapshot is immediately picked up by the Raft process and it
manipulates the log (right in the figure) by replacing the snapshot entry.
In this figure, the snapshot index is 51.

Old snapshots will be garbage collected.

Snapshots may be fetched from other nodes.
This happens when the Raft process is far behind the leader and
the leader doesn't have the log entries as they are garbage collected.

## RaftLogStore

`RaftLogStore` is an abstraction that represents the log entries.
In the figure, log entries from 45 to 50 are scheduled for garbage collection.
snapshot entry is of index 51 and it is guaranteed that the corresponding snapshot
exists in the snapshot repository. 52 to 55 are applied.
56 or later are not applied yet. They are either uncommitted or committed.
44 changes: 44 additions & 0 deletions doc/src/client-interaction.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# Client Interaction

You can send application-defined commands to the Raft cluster.

lolraft distinguishes the command into two types: read-write and read-only.
The read-only command is called **query** and queries can be processed through the optimized path.

## R/W commnand

R/W command is a normal application command that is inserted into the log to be applied later.

You can send a R/W command to the cluster with this API.

```proto
message WriteRequest {
uint32 lane_id = 1;
bytes message = 2;
string request_id = 3;
}
```

**request_id** is to avoid doubly application.
Let's think about this scenario:

1. The client sends a R/W command to add 1 to the value.
2. The leader server replicates the command to the followers but crashes before application (+response).
3. The client resends the command to a new leader after a timeout.
4. The result is adding 2 to the value whereas the expectation is 1.

To avoid this issue, request_id is added to identify the commands.

## R/O command

The R/O command can bypass the log because it is safe to execute the query after the
commit index at query time is applied. This is called **read_index** optimization.

You can send a R/O command to the cluster with the following API.

```proto
message ReadRequest {
uint32 lane_id = 1;
bytes message = 2;
}
```
33 changes: 33 additions & 0 deletions doc/src/cluster-management.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
# Cluster Management

## Single server change approach

There are two approaches to membership change.
One is by joint consensus and the other is what is called **single server change approach**.
lolraft implements the second one.

This approach exploits the log replication mechanism in membership change
and therefore can be implemented as an extension of the normal log replication
by defining AddServer and RemoveServer commands.
From the admin client, the following API can add or remove a Raft process in the cluster.

```proto
message AddServerRequest {
uint32 lane_id = 1;
string server_id = 2;
}
message RemoveServerRequest {
uint32 lane_id = 1;
string server_id = 2;
}
```

## Cluster bootstrapping

In Raft, any command is directed to the leader.
So the question is how to add a Raft process to the empty cluster.

The answer is so-called **cluster bootstrapping**.
On receiving the AddServer command and the receiving node recognizes the cluster is empty,
the node tries to form a new cluster with itself and immediately becomes a leader as a result of a self election.
Binary file added doc/src/images/application-state.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/src/images/multi-raft-tikv.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/src/images/multi-raft.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/src/images/orbit-pattern.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added doc/src/images/raft-process.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
43 changes: 43 additions & 0 deletions doc/src/leadership.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
# Leadership

## Command redirection

Raft is a leader-based consensus algorithm.
Only a single leader can exist in the cluster at a time and
all commands are directed to the leader.

In lolraft, if the receiving Raft process isn't the leader,
the command is redirected to the leader.

## Adaptive leader failure detection

Detecting the leader's failure is a very important issue in Raft algorithm.
The naive implementation can send heartbeats to the followers periodically and
followers can detect the leader's failure by timeout.
However, this approach requires the heartbeat interval and the timeout duration
to be set properly before deployment. This brings another complexity.
Not only that, these times can't be fixed to a single value when
the distance between nodes is heterogeneous such as geo-distributed environment.

To solve this problem, lolraft uses an adaptive failure detection algorithm called
**Phi accrual failure detector**.
With this approach, users are free from setting the timing parameters.

## Leadership transfer extension

In multi-raft, changing the cluster members is not a rare case.
An example of this case is rebalancing:
To balance the CPU/disk usage between nodes, Raft process may be
moved to other nodes.

If the Raft process to be removed is the leader, the cluster will not have a
leader until a new leader is elected which causes downtime.

To mitigate this problem, the admin client can send a TimeoutNow command to
any remaining Raft process to forcibly start a new election (by promoting to a candidate in Raft term).

```proto
message TimeoutNow {
uint32 lane_id = 1;
}
```
36 changes: 36 additions & 0 deletions doc/src/multi-raft.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
# Multi-Raft

[Raft](https://raft.github.io/) consensus algorithm is nowadays a widely used building block of distributed applications.
The algorithm is about consensus on the sequence of log entries, which are applied to a state machine.
When the two log entries are identical, then the resulting state is identical too.
In this sense, it is also called **replicated state machine**.

The typical usage of the Raft algorithm is to implement a distributed key-value store.
However, naive implementation will suffer from scalability issues due to the seriality of the algorithm.
Let's consider putting (k1,v1) and (k2,v2) in the data store, naive implementation can't handle these two concurrently
because the operations are serialized.

Sharding is a common solution to this kind of problem.
If the k1 and k2 are enough distant in the key space, then the operations can be handled concurrently
by sharded Raft clusters.
Since we can split the single Raft cluster into multiple small Raft clusters,
it naturally addresses the capacity scalability problem at the same time.
TiKV uses this technique ([ref](https://tikv.org/deep-dive/scalability/multi-raft/)).

![](images/multi-raft-tikv.png)

There are two ways of implementing a multi-raft.
One is deploying multiple Raft clusters independently.
One of the problems of this approach is that resources like IP addresses or ports must be
allocated per node to identify them.
This introduces unnecessary complexity in deployment.
Another problem is that shards can't share the resources efficiently.
In the implementation of a key-value store, the embedded datastore should be shared among the shards
for efficient I/O like write batching.
For these problems, I will take the second approach.

lolraft implements **in-process multi-raft**.
As the name implies, lolraft allows you to place multiple **Raft processes** in a single gRPC server process.
These Raft processes can form a Raft cluster on independent **lanes**.

![](images/multi-raft.png)
9 changes: 9 additions & 0 deletions doc/src/orbit-pattern.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Orbit Pattern

Inside RaftProcess, there is a state called `RaftCore` and
many threads to process it concurrently on an event-driven basis,
most of them are driven by timers.
Since this is like threads are orbiting around the core,
I call this **Orbit Pattern**.

![](images/orbit-pattern.png)
26 changes: 26 additions & 0 deletions doc/src/raft-process.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# Raft Process

The core of multi-raft is Raft process.
Each multi-raft server has one or more Raft processes.

To implement multi-raft,
lolraft implements Raft process as it is agnostic to
detailed node communications through gRPC.
Since the Raft process doesn't know about the IO,
we call it **Pure Raft**.

![](images/raft-process.png)

To make Raft process to communicate with other Raft processes
through network, `RaftDriver` must be provided.
Everything about actual network communication is encapsulated under `RaftDriver`.

```rust
impl RaftProcess {
pub async fn new(
app: impl RaftApp,
log_store: impl RaftLogStore,
ballot_store: impl RaftBallotStore,
driver: RaftDriver,
) -> Result<Self> {
```
Loading

0 comments on commit df382f6

Please sign in to comment.