From 195720f9fb0e4b5e27d0de67163f7d913e7cee69 Mon Sep 17 00:00:00 2001
From: Aaron Lehmann
Date: Tue, 18 Jul 2017 16:00:49 -0700
Subject: [PATCH] design: Add raft.md

This document explains how the Raft implementation works.

Signed-off-by: Aaron Lehmann
---
 design/raft.md | 240 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 240 insertions(+)
 create mode 100644 design/raft.md

diff --git a/design/raft.md b/design/raft.md
new file mode 100644
index 0000000000..e086fd40b3
--- /dev/null
+++ b/design/raft.md
@@ -0,0 +1,240 @@

= Raft implementation =

SwarmKit uses the Raft consensus protocol to synchronize state between manager
nodes and support high availability. The lowest-level portions of this are
provided by the `github.com/coreos/etcd/raft` package. SwarmKit's
`github.com/docker/swarmkit/manager/state/raft` package builds a complete
solution on top of this, adding things like saving and loading state on disk,
an RPC layer so nodes can pass Raft messages over a network, and dynamic
cluster membership.

== A quick review of Raft ==

The details of the Raft protocol are outside the scope of this document, but
it's well worth reviewing the [Raft paper](https://raft.github.io/raft.pdf).

Essentially, Raft gives us two things. It provides the mechanism to elect a
leader, which serves as the arbiter of all consensus decisions. It also
provides a distributed log that we can append entries to, subject to the
leader's approval. The distributed log is the basic building block for
agreeing on and distributing state. Once an entry in the log becomes
*committed*, it becomes an immutable part of the log that will survive any
future leader elections and changes to the cluster. We can think of a
committed log entry as a piece of state that the cluster has reached agreement
on.
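The commit rule at the heart of this can be sketched with a toy example. This
is purely illustrative, not SwarmKit or `etcd/raft` code; the `toyLog` type
and its quorum arithmetic are invented for exposition. The idea: an entry
appended by the leader becomes committed once a majority of members have
acknowledged it, and committed entries are never retracted.

```go
package main

import "fmt"

// toyLog is a toy model (not SwarmKit code) of a replicated log.
// An entry is committed once a majority of members acknowledge it.
type toyLog struct {
	entries   []string
	acks      []int // acknowledgement count per entry
	committed int   // number of committed entries
	members   int   // cluster size
}

func (l *toyLog) append(entry string) int {
	l.entries = append(l.entries, entry)
	l.acks = append(l.acks, 1) // the leader itself counts as one ack
	return len(l.entries) - 1
}

func (l *toyLog) ack(index int) {
	l.acks[index]++
	// Advance the commit point while a quorum has acknowledged the next entry.
	for l.committed < len(l.entries) && l.acks[l.committed] > l.members/2 {
		l.committed++
	}
}

func main() {
	l := &toyLog{members: 3}
	i := l.append("create service web")
	fmt.Println("committed entries:", l.committed) // no quorum yet
	l.ack(i)                                       // a second member acknowledges
	fmt.Println("committed entries:", l.committed) // majority of 3 reached
}
```

Once `committed` has advanced past an entry, nothing in this model (or in
Raft) moves it back; that permanence is what makes the log usable as agreed
cluster state.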
== Role of the leader ==

The leader has special responsibilities in the Raft protocol, but we also
assign it special functions in SwarmKit outside the context of Raft. For
example, the scheduler, orchestrators, dispatcher, and CA run on the leader
node. This is not a design requirement, but it simplifies things somewhat. If
these components ran in a distributed fashion, we would need some mechanism to
resolve conflicts between writes made by different nodes. Limiting
decision-making to the leader avoids the need for this, since we can be
certain that there is at most one leader at any time.

The basic rule is that anything which writes to the Raft-backed data store
needs to run on the leader. If a follower node tries to write to the data
store, the write will fail. Writes will also fail on a node that starts out as
the leader but loses its leadership position before the write finishes.

== Raft IDs vs. node IDs ==

Nodes in SwarmKit are identified by alphanumeric strings, but `etcd/raft` uses
integers to identify Raft nodes. Thus, managers have two distinct IDs. Raft
IDs are assigned dynamically when a node joins the Raft consensus group. A
node could potentially leave the Raft consensus group (through demotion), then
later get promoted and rejoin under a different Raft ID. In this case, the
node ID would stay the same, because it's a cryptographically verifiable
property of the node's certificate, but the Raft ID is assigned arbitrarily
and would change.

It's important to note that a Raft ID can't be reused after the node that was
using it leaves the consensus group. The Raft IDs of nodes that are no longer
part of the cluster are saved in a list to make sure they aren't reused. If a
node with a Raft ID on this list tries to use Raft RPCs, other nodes won't
honor these requests.

== Logs and snapshots ==

There are two sets of files on disk that provide persistent state for Raft.
There is a set of WAL (write-ahead log) files. These store a series of log
entries and Raft metadata, such as the current term, index, and committed
index. WAL files are automatically rotated when they reach a certain size.

To avoid having to retain every entry in the history of the log, snapshots
serialize a view of the state at a particular point in time. After a snapshot
gets taken, logs that predate the snapshot are no longer necessary, because
the snapshot captures all the information that's needed from the log up to
that point. The number of old snapshots and WALs to retain is configurable.

In SwarmKit's usage, WALs mostly contain protobuf-serialized data store
modifications. A log entry can contain a batch of creations, updates, and
deletions of objects from the data store. Some log entries contain other kinds
of metadata, like node additions or removals. Snapshots contain a complete
dump of the store, as well as any metadata from the log entries that needs to
be preserved. The saved metadata includes the Raft term and index, a list of
nodes in the cluster, and a list of nodes that have been removed from the
cluster.

WALs and snapshots are both stored encrypted, even if the autolock feature is
disabled. With autolock turned off, the data encryption key is stored on disk
in plaintext, in a header inside the TLS key. When autolock is turned on, the
data encryption key is encrypted with a key encryption key.

== Initializing a Raft cluster ==

The first manager of a cluster (`swarm init`) assigns itself a random Raft ID.
It creates a new WAL with its own Raft identity stored in the metadata field.
The metadata field is the only part of the WAL that differs between nodes. By
storing information such as the local Raft ID, it's easy to restore this
node-specific information after a restart. In principle it could be stored in
a separate file, but embedding it inside the WAL is most convenient.
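The idea of embedding node-specific identity in the WAL metadata field can be
sketched roughly as follows. The `raftMetadata` type and `newWALMetadata`
helper are hypothetical names invented for illustration; SwarmKit actually
serializes this as protobuf, and etcd's WAL layer simply treats the metadata
as an opaque byte slice passed in when the WAL is created.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// raftMetadata is an illustrative stand-in for the node-specific identity
// that gets embedded in the WAL's metadata field (SwarmKit uses protobuf,
// not JSON; the WAL layer only sees opaque bytes either way).
type raftMetadata struct {
	RaftID uint64 `json:"raft_id"`
	NodeID string `json:"node_id"`
}

// newWALMetadata serializes the identity so it can be handed to the WAL
// creation call and recovered after a restart.
func newWALMetadata(raftID uint64, nodeID string) ([]byte, error) {
	return json.Marshal(raftMetadata{RaftID: raftID, NodeID: nodeID})
}

func main() {
	meta, err := newWALMetadata(42, "node-abc123")
	if err != nil {
		panic(err)
	}
	fmt.Printf("metadata embedded in the WAL: %s\n", meta)
}
```

On restart, reading the WAL back yields the same bytes, which is how the node
recovers its own Raft ID without a separate configuration file.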
The node then starts the Raft state machine. From this point on, it's a fully
functional single-node Raft instance. Writes to the data store actually go
through Raft, though this is a trivial case because reaching consensus doesn't
involve communicating with any other nodes. The `Run` loop sees these writes
and serializes them to disk as requested by the `etcd/raft` package.

== Adding and removing nodes ==

New nodes can join an existing Raft consensus group by invoking the `Join` RPC
on the leader node. This corresponds to joining a swarm with a manager-level
token, or promoting a worker node to a manager. If successful, `Join` returns
a Raft ID for the new node and a list of other members of the consensus group.

On the leader side, `Join` tries to append a configuration change entry to the
Raft log, and waits until that entry becomes committed.

A new node creates an empty Raft log with its own node information in the
metadata field. Then it starts the state machine. By running the Raft
consensus protocol, the leader will discover that the new node doesn't have
any entries in its log, and will synchronize these entries to the new node
through some combination of sending snapshots and log entries. It can take a
little while for a new node to become a functional member of the consensus
group, because it needs to receive this data first.

Removing a node through demotion is a bit different. This requires two
coordinated changes: the node must renew its certificate to get a worker
certificate, and it must also be cleanly removed from the Raft consensus
group. To avoid inconsistent states, particularly in cases like demoting the
leader, a reconciliation loop in `manager/role_manager.go` handles this. To
initiate demotion, the user changes a node's `DesiredRole` to `Worker`.
The role manager detects any nodes that have been demoted but are still acting
as managers, and first removes them from the consensus group by calling
`RemoveMember`. Only once this has happened is it safe to change the `Role`
field to get a new certificate issued, because issuing a worker certificate to
a node still participating in the Raft group could cause loss of quorum.

`RemoveMember` works similarly to `Join`. It appends an entry to the Raft log
removing the member from the consensus group, and waits until this entry
becomes committed. Once a member is removed, its Raft ID can never be reused.

There is a special case when the leader is being demoted. It cannot reliably
remove itself, because doing so involves informing the other nodes that the
removal log entry has been committed, and if any of those messages are lost in
transit, the leader won't have an opportunity to retry sending them, since
demotion causes the Raft state machine to shut down. To solve this problem,
the leader demotes itself simply by transferring leadership to a different
manager node. When another node becomes the leader, the role manager will
start up on that node, and it will be able to demote the former leader without
this complication.

== The main Raft loop ==

The `Run` method acts as a main loop. It receives ticks from a ticker and
forwards them to the `etcd/raft` state machine, which relies on external code
for timekeeping. It also receives `Ready` structures from the `etcd/raft`
state machine on a channel.

A `Ready` message conveys the current state of the system, provides a set of
messages to send to peers, and includes any items that need to be acted on or
written to disk. It is basically `etcd/raft`'s mechanism for communicating
with the outside world and expressing its state to higher-level code.

There are five basic functions the `Run` method performs when it receives a
`Ready` message:

1. Write new entries or a new snapshot to disk.
2. Forward any messages for other peers to the right destinations over gRPC.
3. Update the data store based on new snapshots or newly committed log
   entries.
4. Evaluate the current leadership status, and signal to other code if it
   changes (for example, so that components like the orchestrator can be
   started or stopped).
5. If enough entries have accumulated between snapshots, create a new snapshot
   to compact the WALs. The snapshot is written asynchronously and notifies
   the `Run` method on completion.

== Communication between nodes ==

The `etcd/raft` package does not implement communication over a network. It
references nodes by ID, and it is up to higher-level code to convey messages
to the correct places.

SwarmKit uses gRPC to transfer these messages. The interface for this is very
simple: messages are only conveyed through a single RPC named
`ProcessRaftMessage`.

There is an additional RPC called `ResolveAddress` that deals with a corner
case that can happen when nodes are added to a cluster dynamically. If a node
was down while the current cluster leader was added, or didn't mark the log
entry that added the leader as committed (which is done lazily), this node
won't have the leader's address. It would receive RPCs from the leader, but
not be able to invoke RPCs on the leader, so communication would only happen
in one direction. It would normally be impossible for the node to catch up.
With `ResolveAddress`, it can query other cluster members for the leader's
address, and restore two-way communication. See
https://github.com/docker/swarmkit/issues/436 for more details on this
situation.

SwarmKit's `raft/transport` package abstracts the mechanism for keeping track
of peers and sending messages to them over gRPC in a specific message order.

== Integration between Raft and the data store ==

The Raft `Node` object implements the `Proposer` interface, which the data
store uses to propagate changes across the cluster.
The key method is `ProposeValue`, which appends information to the distributed
log.

The guts of `ProposeValue` are inside `processInternalRaftRequest`. This
method appends the message to the log, and then waits for it to become
committed. There is only one way `ProposeValue` can fail: the node where it's
running loses its position as the leader. If the node remains the leader,
there is no way a proposal can fail, since the leader controls which new
entries are added to the log, and can't retract an entry once it has been
appended. It can, however, take an indefinitely long time for a quorum of
members to acknowledge the new entry. There is no timeout on `ProposeValue`,
because a timeout wouldn't retract the log entry; having a timeout could put
us in a state where a write timed out but ends up going through later on. This
would make the data store inconsistent with what's actually in the Raft log,
which would be very bad.

When the log entry successfully becomes committed, `processEntry` triggers the
wait associated with this entry, which allows `processInternalRaftRequest` to
return. On a leadership change, all outstanding waits get cancelled.

== The Raft RPC proxy ==

As mentioned above, writes to the data store are only allowed on the leader
node. But any manager node can receive gRPC requests, and workers don't even
attempt to route those requests to the leader. Somehow, requests that involve
writing to the data store or seeing a consistent view of it need to be
redirected to the leader.

We generate wrappers around RPC handlers using the code in
`protobuf/plugin/raftproxy`. These wrappers check whether the current node is
the leader, and serve the RPC locally in that case. When some other node is
the leader, the wrapper instead invokes the same RPC on the leader, acting as
a proxy.
The proxy inserts identity information for the client node into the gRPC
headers of the request, so that clients can't achieve privilege escalation by
going through the proxy.

If one of these wrappers is registered with gRPC instead of the generated
server code itself, the server in question will automatically proxy its
requests to the leader. We use this for most APIs, such as the dispatcher,
control API, and CA. However, there are some cases where RPCs need to be
invoked directly instead of being proxied to the leader, and in these cases we
don't use the wrappers. Raft itself is a good example of this: if
`ProcessRaftMessage` were always forwarded to the leader, it would be
impossible for the leader to communicate with other nodes. Incidentally, this
is why the Raft RPCs are split between a `Raft` service and a `RaftMembership`
service. The membership RPCs `Join` and `Leave` need to run on the leader, but
RPCs such as `ProcessRaftMessage` must not be forwarded to the leader.
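The wrapper pattern can be sketched as follows. The names here (`cluster`,
`staticCluster`, `handleUpdateService`) are invented for illustration and are
not the generated raftproxy code; the real wrappers hold a gRPC connection to
the leader and attach the client's identity to the outgoing request headers.

```go
package main

import "fmt"

// cluster is a hypothetical stand-in for whatever lets a wrapper answer
// "am I the leader?" and reach the leader if not.
type cluster interface {
	IsLeader() bool
	LeaderAddr() string // stand-in for a gRPC connection to the leader
}

type staticCluster struct {
	leader bool
	addr   string
}

func (c staticCluster) IsLeader() bool     { return c.leader }
func (c staticCluster) LeaderAddr() string { return c.addr }

// handleUpdateService mimics the shape of a wrapped control-API handler:
// serve locally on the leader, otherwise re-invoke the RPC on the leader.
func handleUpdateService(c cluster, req string) string {
	if c.IsLeader() {
		return "served locally: " + req
	}
	// The real wrapper would invoke the same RPC over gRPC here, after
	// inserting the client's identity into the request headers.
	return "proxied to " + c.LeaderAddr() + ": " + req
}

func main() {
	fmt.Println(handleUpdateService(staticCluster{leader: true}, "update web"))
	fmt.Println(handleUpdateService(staticCluster{leader: false, addr: "10.0.0.2:2377"}, "update web"))
}
```

Registering such a wrapper in place of the generated server code is what makes
the forwarding transparent to callers; RPCs like `ProcessRaftMessage` simply
skip the wrapper and register the real handler directly.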