-
Notifications
You must be signed in to change notification settings - Fork 18
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
95bb4d9
commit df382f6
Showing
20 changed files
with
518 additions
and
144 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
FROM rust:latest | ||
RUN cargo install mdbook --no-default-features --features search --vers "^0.4" --locked | ||
CMD ["mdbook", "build", "doc"] |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
name: Documentation | ||
on: | ||
workflow_dispatch: | ||
push: | ||
branches: | ||
- master | ||
paths: | ||
- "doc/src/**" | ||
- "doc/book.toml" | ||
|
||
jobs: | ||
deploy: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- uses: actions/checkout@v4 | ||
- name: Build | ||
uses: ./.github/actions/build | ||
- name: Deploy | ||
uses: peaceiris/actions-gh-pages@v4 | ||
with: | ||
github_token: ${{ secrets.GITHUB_TOKEN }} | ||
PUBLISH_BRANCH: gh-pages | ||
PUBLISH_DIR: ./doc/book | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,4 +12,6 @@ Cargo.lock | |
#Added by cargo | ||
/target | ||
|
||
.vscode | ||
.vscode | ||
|
||
.DS_Store |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
book |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
[book] | ||
authors = ["Akira Hayakawa"] | ||
language = "en" | ||
multilingual = false | ||
src = "src" | ||
title = "lolraft documentation" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# Summary | ||
|
||
- [Multi-Raft](multi-raft.md) | ||
- [Raft Process](raft-process.md) | ||
- [Orbit pattern](orbit-pattern.md) | ||
- [Application State](application-state.md) | ||
- [Client Interaction](client-interaction.md) | ||
- [Cluster Management](cluster-management.md) | ||
- [Leadership](leadership.md) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
# Application State | ||
|
||
In RaftCore, `RaftApp` and `RaftLogStore` are especially important because | ||
these two compose the application state. | ||
This section explains the conceptual aspect of it. | ||
|
||
![](images/application-state.png) | ||
|
||
## RaftApp | ||
|
||
`RaftApp` is an abstraction that represents the FSM (Finite State Machine) of the application | ||
and the snapshot repository. | ||
The snapshot repository isn't separated because the contents of the snapshot is | ||
strongly coupled with the application state. | ||
|
||
The application state can be updated by applying the log entries. | ||
In this figure, the last applied entry is of index 55. | ||
|
||
`RaftApp` can generate a snapshot arbitrarily to compact the log. | ||
When `RaftApp` makes a snapshot and stores it in the snapshot repository. | ||
The latest snapshot is immediately picked up by the Raft process and it | ||
manipulates the log (right in the figure) by replacing the snapshot entry. | ||
In this figure, the snapshot index is 51. | ||
|
||
Old snapshots will be garbage collected. | ||
|
||
Snapshots may be fetched from other nodes. | ||
This happens when the Raft process is far behind the leader and | ||
the leader doesn't have the log entries as they are garbage collected. | ||
|
||
## RaftLogStore | ||
|
||
`RaftLogStore` is an abstraction that represents the log entries. | ||
In the figure, log entries from 45 to 50 are scheduled for garbage collection. | ||
snapshot entry is of index 51 and it is guaranteed that the corresponding snapshot | ||
exists in the snapshot repository. 52 to 55 are applied. | ||
56 or later are not applied yet. They are either uncommitted or committed. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# Client Interaction | ||
|
||
You can send application-defined commands to the Raft cluster. | ||
|
||
lolraft distinguishes the command into two types: read-write and read-only. | ||
The read-only command is called **query** and queries can be processed through the optimized path. | ||
|
||
## R/W commnand | ||
|
||
R/W command is a normal application command that is inserted into the log to be applied later. | ||
|
||
You can send a R/W command to the cluster with this API. | ||
|
||
```proto | ||
message WriteRequest { | ||
uint32 lane_id = 1; | ||
bytes message = 2; | ||
string request_id = 3; | ||
} | ||
``` | ||
|
||
**request_id** is to avoid doubly application. | ||
Let's think about this scenario: | ||
|
||
1. The client sends a R/W command to add 1 to the value. | ||
2. The leader server replicates the command to the followers but crashes before application (+response). | ||
3. The client resends the command to a new leader after a timeout. | ||
4. The result is adding 2 to the value whereas the expectation is 1. | ||
|
||
To avoid this issue, request_id is added to identify the commands. | ||
|
||
## R/O command | ||
|
||
The R/O command can bypass the log because it is safe to execute the query after the | ||
commit index at query time is applied. This is called **read_index** optimization. | ||
|
||
You can send a R/O command to the cluster with the following API. | ||
|
||
```proto | ||
message ReadRequest { | ||
uint32 lane_id = 1; | ||
bytes message = 2; | ||
} | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
# Cluster Management | ||
|
||
## Single server change approach | ||
|
||
There are two approaches to membership change. | ||
One is by joint consensus and the other is what is called **single server change approach**. | ||
lolraft implements the second one. | ||
|
||
This approach exploits the log replication mechanism in membership change | ||
and therefore can be implemented as an extension of the normal log replication | ||
by defining AddServer and RemoveServer commands. | ||
From the admin client, the following API can add or remove a Raft process in the cluster. | ||
|
||
```proto | ||
message AddServerRequest { | ||
uint32 lane_id = 1; | ||
string server_id = 2; | ||
} | ||
message RemoveServerRequest { | ||
uint32 lane_id = 1; | ||
string server_id = 2; | ||
} | ||
``` | ||
|
||
## Cluster bootstrapping | ||
|
||
In Raft, any command is directed to the leader. | ||
So the question is how to add a Raft process to the empty cluster. | ||
|
||
The answer is so-called **cluster bootstrapping**. | ||
On receiving the AddServer command and the receiving node recognizes the cluster is empty, | ||
the node tries to form a new cluster with itself and immediately becomes a leader as a result of a self election. |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
# Leadership | ||
|
||
## Command redirection | ||
|
||
Raft is a leader-based consensus algorithm. | ||
Only a single leader can exist in the cluster at a time and | ||
all commands are directed to the leader. | ||
|
||
In lolraft, if the receiving Raft process isn't the leader, | ||
the command is redirected to the leader. | ||
|
||
## Adaptive leader failure detection | ||
|
||
Detecting the leader's failure is a very important issue in Raft algorithm. | ||
The naive implementation can send heartbeats to the followers periodically and | ||
followers can detect the leader's failure by timeout. | ||
However, this approach requires the heartbeat interval and the timeout duration | ||
to be set properly before deployment. This brings another complexity. | ||
Not only that, these times can't be fixed to a single value when | ||
the distance between nodes is heterogeneous such as geo-distributed environment. | ||
|
||
To solve this problem, lolraft uses an adaptive failure detection algorithm called | ||
**Phi accrual failure detector**. | ||
With this approach, users are free from setting the timing parameters. | ||
|
||
## Leadership transfer extension | ||
|
||
In multi-raft, changing the cluster members is not a rare case. | ||
An example of this case is rebalancing: | ||
To balance the CPU/disk usage between nodes, Raft process may be | ||
moved to other nodes. | ||
|
||
If the Raft process to be removed is the leader, the cluster will not have a | ||
leader until a new leader is elected which causes downtime. | ||
|
||
To mitigate this problem, the admin client can send a TimeoutNow command to | ||
any remaining Raft process to forcibly start a new election (by promoting to a candidate in Raft term). | ||
|
||
```proto | ||
message TimeoutNow { | ||
uint32 lane_id = 1; | ||
} | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
# Multi-Raft | ||
|
||
[Raft](https://raft.github.io/) consensus algorithm is nowadays a widely used building block of distributed applications. | ||
The algorithm is about consensus on the sequence of log entries, which are applied to a state machine. | ||
When the two log entries are identical, then the resulting state is identical too. | ||
In this sense, it is also called **replicated state machine**. | ||
|
||
The typical usage of the Raft algorithm is to implement a distributed key-value store. | ||
However, naive implementation will suffer from scalability issues due to the seriality of the algorithm. | ||
Let's consider putting (k1,v1) and (k2,v2) in the data store, naive implementation can't handle these two concurrently | ||
because the operations are serialized. | ||
|
||
Sharding is a common solution to this kind of problem. | ||
If the k1 and k2 are enough distant in the key space, then the operations can be handled concurrently | ||
by sharded Raft clusters. | ||
Since we can split the single Raft cluster into multiple small Raft clusters, | ||
it naturally addresses the capacity scalability problem at the same time. | ||
TiKV uses this technique ([ref](https://tikv.org/deep-dive/scalability/multi-raft/)). | ||
|
||
![](images/multi-raft-tikv.png) | ||
|
||
There are two ways of implementing a multi-raft. | ||
One is deploying multiple Raft clusters independently. | ||
One of the problems of this approach is that resources like IP addresses or ports must be | ||
allocated per node to identify them. | ||
This introduces unnecessary complexity in deployment. | ||
Another problem is that shards can't share the resources efficiently. | ||
In the implementation of a key-value store, the embedded datastore should be shared among the shards | ||
for efficient I/O like write batching. | ||
For these problems, I will take the second approach. | ||
|
||
lolraft implements **in-process multi-raft**. | ||
As the name implies, lolraft allows you to place multiple **Raft processes** in a single gRPC server process. | ||
These Raft processes can form a Raft cluster on independent **lanes**. | ||
|
||
![](images/multi-raft.png) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# Orbit Pattern | ||
|
||
Inside RaftProcess, there is a state called `RaftCore` and | ||
many threads to process it concurrently on an event-driven basis, | ||
most of them are driven by timers. | ||
Since this is like threads are orbiting around the core, | ||
I call this **Orbit Pattern**. | ||
|
||
![](images/orbit-pattern.png) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
# Raft Process | ||
|
||
The core of multi-raft is Raft process. | ||
Each multi-raft server has one or more Raft processes. | ||
|
||
To implement multi-raft, | ||
lolraft implements Raft process as it is agnostic to | ||
detailed node communications through gRPC. | ||
Since the Raft process doesn't know about the IO, | ||
we call it **Pure Raft**. | ||
|
||
![](images/raft-process.png) | ||
|
||
To make Raft process to communicate with other Raft processes | ||
through network, `RaftDriver` must be provided. | ||
Everything about actual network communication is encapsulated under `RaftDriver`. | ||
|
||
```rust | ||
impl RaftProcess { | ||
pub async fn new( | ||
app: impl RaftApp, | ||
log_store: impl RaftLogStore, | ||
ballot_store: impl RaftBallotStore, | ||
driver: RaftDriver, | ||
) -> Result<Self> { | ||
``` |
Oops, something went wrong.