From 195720f9fb0e4b5e27d0de67163f7d913e7cee69 Mon Sep 17 00:00:00 2001
From: Aaron Lehmann
Date: Tue, 18 Jul 2017 16:00:49 -0700
Subject: [PATCH] design: Add raft.md

This document explains how the Raft implementation works.

Signed-off-by: Aaron Lehmann
---
 design/raft.md | 240 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 240 insertions(+)
 create mode 100644 design/raft.md

diff --git a/design/raft.md b/design/raft.md
new file mode 100644
index 0000000000..e086fd40b3
--- /dev/null
+++ b/design/raft.md
@@ -0,0 +1,240 @@

= Raft implementation =

SwarmKit uses the Raft consensus protocol to synchronize state between manager
nodes and support high availability. The lowest-level portions of this are
provided by the `github.com/coreos/etcd/raft` package. SwarmKit's
`github.com/docker/swarmkit/manager/state/raft` package builds a complete
solution on top of this, adding things like saving and loading state on disk,
an RPC layer so nodes can pass Raft messages over a network, and dynamic
cluster membership.

== A quick review of Raft ==

The details of the Raft protocol are outside the scope of this document, but
it's well worth reviewing the [Raft paper](https://raft.github.io/raft.pdf).

Essentially, Raft gives us two things. It provides the mechanism to elect a
leader, which serves as the arbiter of all consensus decisions. It also
provides a distributed log that we can append entries to, subject to the
leader's approval. The distributed log is the basic building block for
agreeing on and distributing state. Once an entry in the log becomes
*committed*, it becomes an immutable part of the log that will survive any
future leader elections and changes to the cluster. We can think of a
committed log entry as a piece of state that the cluster has reached agreement
on.
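The commit rule at the heart of this can be sketched with a toy example. This
is purely illustrative, not SwarmKit or `etcd/raft` code; the `toyLog` type
and its quorum arithmetic are invented for exposition. The idea: an entry
appended by the leader becomes committed once a majority of members have
acknowledged it, and committed entries are never retracted.

```go
package main

import "fmt"

// toyLog is a toy model (not SwarmKit code) of a replicated log.
// An entry is committed once a majority of members acknowledge it.
type toyLog struct {
	entries   []string
	acks      []int // acknowledgement count per entry
	committed int   // number of committed entries
	members   int   // cluster size
}

func (l *toyLog) append(entry string) int {
	l.entries = append(l.entries, entry)
	l.acks = append(l.acks, 1) // the leader itself counts as one ack
	return len(l.entries) - 1
}

func (l *toyLog) ack(index int) {
	l.acks[index]++
	// Advance the commit point while a quorum has acknowledged the next entry.
	for l.committed < len(l.entries) && l.acks[l.committed] > l.members/2 {
		l.committed++
	}
}

func main() {
	l := &toyLog{members: 3}
	i := l.append("create service web")
	fmt.Println("committed entries:", l.committed) // no quorum yet
	l.ack(i)                                       // a second member acknowledges
	fmt.Println("committed entries:", l.committed) // majority of 3 reached
}
```

Once `committed` has advanced past an entry, nothing in this model (or in
Raft) moves it back; that permanence is what makes the log usable as agreed
cluster state.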
== Role of the leader ==

The leader has special responsibilities in the Raft protocol, but we also
assign it special functions in SwarmKit outside the context of Raft. For
example, the scheduler, orchestrators, dispatcher, and CA run on the leader
node. This is not a design requirement, but it simplifies things somewhat. If
these components ran in a distributed fashion, we would need some mechanism to
resolve conflicts between writes made by different nodes. Limiting
decision-making to the leader avoids the need for this, since we can be
certain that there is at most one leader at any time.

The basic rule is that anything which writes to the Raft-backed data store
needs to run on the leader. If a follower node tries to write to the data
store, the write will fail. Writes will also fail on a node that starts out as
the leader but loses its leadership position before the write finishes.

== Raft IDs vs. node IDs ==

Nodes in SwarmKit are identified by alphanumeric strings, but `etcd/raft` uses
integers to identify Raft nodes. Thus, managers have two distinct IDs. Raft
IDs are assigned dynamically when a node joins the Raft consensus group. A
node could potentially leave the Raft consensus group (through demotion), then
later get promoted and rejoin under a different Raft ID. In this case, the
node ID would stay the same, because it's a cryptographically verifiable
property of the node's certificate, but the Raft ID is assigned arbitrarily
and would change.

It's important to note that a Raft ID can't be reused after the node that was
using it leaves the consensus group. The Raft IDs of nodes that are no longer
part of the cluster are saved in a list to make sure they aren't reused. If a
node with a Raft ID on this list tries to use Raft RPCs, other nodes won't
honor these requests.

== Logs and snapshots ==

There are two sets of files on disk that provide persistent state for Raft.
There is a set of WAL (write-ahead log) files. These store a series of log
entries and Raft metadata, such as the current term, index, and committed
index. WAL files are automatically rotated when they reach a certain size.

To avoid having to retain every entry in the history of the log, snapshots
serialize a view of the state at a particular point in time. After a snapshot
gets taken, logs that predate the snapshot are no longer necessary, because
the snapshot captures all the information that's needed from the log up to
that point. The number of old snapshots and WALs to retain is configurable.

In SwarmKit's usage, WALs mostly contain protobuf-serialized data store
modifications. A log entry can contain a batch of creations, updates, and
deletions of objects from the data store. Some log entries contain other kinds
of metadata, like node additions or removals. Snapshots contain a complete
dump of the store, as well as any metadata from the log entries that needs to
be preserved. The saved metadata includes the Raft term and index, a list of
nodes in the cluster, and a list of nodes that have been removed from the
cluster.

WALs and snapshots are both stored encrypted, even if the autolock feature is
disabled. With autolock turned off, the data encryption key is stored on disk
in plaintext, in a header inside the TLS key. When autolock is turned on, the
data encryption key is encrypted with a key encryption key.

== Initializing a Raft cluster ==

The first manager of a cluster (`swarm init`) assigns itself a random Raft ID.
It creates a new WAL with its own Raft identity stored in the metadata field.
The metadata field is the only part of the WAL that differs between nodes. By
storing information such as the local Raft ID, it's easy to restore this
node-specific information after a restart. In principle it could be stored in
a separate file, but embedding it inside the WAL is most convenient.
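The idea of embedding node-specific identity in the WAL metadata field can be
sketched roughly as follows. The `raftMetadata` type and `newWALMetadata`
helper are hypothetical names invented for illustration; SwarmKit actually
serializes this as protobuf, and etcd's WAL layer simply treats the metadata
as an opaque byte slice passed in when the WAL is created.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// raftMetadata is an illustrative stand-in for the node-specific identity
// that gets embedded in the WAL's metadata field (SwarmKit uses protobuf,
// not JSON; the WAL layer only sees opaque bytes either way).
type raftMetadata struct {
	RaftID uint64 `json:"raft_id"`
	NodeID string `json:"node_id"`
}

// newWALMetadata serializes the identity so it can be handed to the WAL
// creation call and recovered after a restart.
func newWALMetadata(raftID uint64, nodeID string) ([]byte, error) {
	return json.Marshal(raftMetadata{RaftID: raftID, NodeID: nodeID})
}

func main() {
	meta, err := newWALMetadata(42, "node-abc123")
	if err != nil {
		panic(err)
	}
	fmt.Printf("metadata embedded in the WAL: %s\n", meta)
}
```

On restart, reading the WAL back yields the same bytes, which is how the node
recovers its own Raft ID without a separate configuration file.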
The node then starts the Raft state machine. From this point on, it's a fully
functional single-node Raft instance. Writes to the data store actually go
through Raft, though this is a trivial case because reaching consensus doesn't
involve communicating with any other nodes. The `Run` loop sees these writes
and serializes them to disk as requested by the `etcd/raft` package.

== Adding and removing nodes ==

New nodes can join an existing Raft consensus group by invoking the `Join` RPC
on the leader node. This corresponds to joining a swarm with a manager-level
token, or promoting a worker node to a manager. If successful, `Join` returns
a Raft ID for the new node and a list of other members of the consensus group.

On the leader side, `Join` tries to append a configuration change entry to the
Raft log, and waits until that entry becomes committed.

A new node creates an empty Raft log with its own node information in the
metadata field. Then it starts the state machine. By running the Raft
consensus protocol, the leader will discover that the new node doesn't have
any entries in its log, and will synchronize these entries to the new node
through some combination of sending snapshots and log entries. It can take a
little while for a new node to become a functional member of the consensus
group, because it needs to receive this data first.

Removing a node through demotion is a bit different. This requires two
coordinated changes: the node must renew its certificate to get a worker
certificate, and it must also be cleanly removed from the Raft consensus
group. To avoid inconsistent states, particularly in cases like demoting the
leader, a reconciliation loop in `manager/role_manager.go` handles this. To
initiate demotion, the user changes a node's `DesiredRole` to `Worker`.
The role manager detects any nodes that have been demoted but are still acting
as managers, and first removes them from the consensus group by calling
`RemoveMember`. Only once this has happened is it safe to change the `Role`
field to get a new certificate issued, because issuing a worker certificate to
a node still participating in the Raft group could cause loss of quorum.

`RemoveMember` works similarly to `Join`. It appends an entry to the Raft log
removing the member from the consensus group, and waits until this entry
becomes committed. Once a member is removed, its Raft ID can never be reused.

There is a special case when the leader is being demoted. It cannot reliably
remove itself, because doing so involves informing the other nodes that the
removal log entry has been committed, and if any of those messages are lost in
transit, the leader won't have an opportunity to retry sending them, since
demotion causes the Raft state machine to shut down. To solve this problem,
the leader demotes itself simply by transferring leadership to a different
manager node. When another node becomes the leader, the role manager will
start up on that node, and it will be able to demote the former leader without
this complication.

== The main Raft loop ==

The `Run` method acts as a main loop. It receives ticks from a ticker and
forwards them to the `etcd/raft` state machine, which relies on external code
for timekeeping. It also receives `Ready` structures from the `etcd/raft`
state machine on a channel.

A `Ready` message conveys the current state of the system, provides a set of
messages to send to peers, and includes any items that need to be acted on or
written to disk. It is basically `etcd/raft`'s mechanism for communicating
with the outside world and expressing its state to higher-level code.

There are five basic functions the `Run` method performs when it receives a
`Ready` message:

1. Write new entries or a new snapshot to disk.
2. Forward any messages for other peers to the right destinations over gRPC.
3. Update the data store based on new snapshots or newly committed log
   entries.
4. Evaluate the current leadership status, and signal to other code if it
   changes (for example, so that components like the orchestrator can be
   started or stopped).
5. If enough entries have accumulated between snapshots, create a new snapshot
   to compact the WALs. The snapshot is written asynchronously and notifies
   the `Run` method on completion.

== Communication between nodes ==

The `etcd/raft` package does not implement communication over a network. It
references nodes by ID, and it is up to higher-level code to convey messages
to the correct places.

SwarmKit uses gRPC to transfer these messages. The interface for this is very
simple: messages are only conveyed through a single RPC named
`ProcessRaftMessage`.

There is an additional RPC called `ResolveAddress` that deals with a corner
case that can happen when nodes are added to a cluster dynamically. If a node
was down while the current cluster leader was added, or didn't mark the log
entry that added the leader as committed (which is done lazily), this node
won't have the leader's address. It would receive RPCs from the leader, but
not be able to invoke RPCs on the leader, so communication would only happen
in one direction. It would normally be impossible for the node to catch up.
With `ResolveAddress`, it can query other cluster members for the leader's
address, and restore two-way communication. See
https://github.com/docker/swarmkit/issues/436 for more details on this
situation.

SwarmKit's `raft/transport` package abstracts the mechanism for keeping track
of peers and sending messages to them over gRPC in a specific message order.

== Integration between Raft and the data store ==

The Raft `Node` object implements the `Proposer` interface, which the data
store uses to propagate changes across the cluster.
The key method is `ProposeValue`, which appends information to the distributed
log.

The guts of `ProposeValue` are inside `processInternalRaftRequest`. This
method appends the message to the log, and then waits for it to become
committed. There is only one way `ProposeValue` can fail: the node where it's
running loses its position as the leader. If the node remains the leader,
there is no way a proposal can fail, since the leader controls which new
entries are added to the log, and can't retract an entry once it has been
appended. It can, however, take an indefinitely long time for a quorum of
members to acknowledge the new entry. There is no timeout on `ProposeValue`,
because a timeout wouldn't retract the log entry; having a timeout could put
us in a state where a write timed out but ends up going through later on. This
would make the data store inconsistent with what's actually in the Raft log,
which would be very bad.

When the log entry successfully becomes committed, `processEntry` triggers the
wait associated with this entry, which allows `processInternalRaftRequest` to
return. On a leadership change, all outstanding waits get cancelled.

== The Raft RPC proxy ==

As mentioned above, writes to the data store are only allowed on the leader
node. But any manager node can receive gRPC requests, and workers don't even
attempt to route those requests to the leader. Somehow, requests that involve
writing to the data store or seeing a consistent view of it need to be
redirected to the leader.

We generate wrappers around RPC handlers using the code in
`protobuf/plugin/raftproxy`. These wrappers check whether the current node is
the leader, and serve the RPC locally in that case. When some other node is
the leader, the wrapper instead invokes the same RPC on the leader, acting as
a proxy.
The proxy inserts identity information for the client node into the gRPC
headers of the request, so that clients can't achieve privilege escalation by
going through the proxy.

If one of these wrappers is registered with gRPC instead of the generated
server code itself, the server in question will automatically proxy its
requests to the leader. We use this for most APIs, such as the dispatcher,
control API, and CA. However, there are some cases where RPCs need to be
invoked directly instead of being proxied to the leader, and in these cases we
don't use the wrappers. Raft itself is a good example of this: if
`ProcessRaftMessage` were always forwarded to the leader, it would be
impossible for the leader to communicate with other nodes. Incidentally, this
is why the Raft RPCs are split between a `Raft` service and a `RaftMembership`
service. The membership RPCs `Join` and `Leave` need to run on the leader, but
RPCs such as `ProcessRaftMessage` must not be forwarded to the leader.
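The wrapper pattern can be sketched as follows. The names here (`cluster`,
`staticCluster`, `handleUpdateService`) are invented for illustration and are
not the generated raftproxy code; the real wrappers hold a gRPC connection to
the leader and attach the client's identity to the outgoing request headers.

```go
package main

import "fmt"

// cluster is a hypothetical stand-in for whatever lets a wrapper answer
// "am I the leader?" and reach the leader if not.
type cluster interface {
	IsLeader() bool
	LeaderAddr() string // stand-in for a gRPC connection to the leader
}

type staticCluster struct {
	leader bool
	addr   string
}

func (c staticCluster) IsLeader() bool     { return c.leader }
func (c staticCluster) LeaderAddr() string { return c.addr }

// handleUpdateService mimics the shape of a wrapped control-API handler:
// serve locally on the leader, otherwise re-invoke the RPC on the leader.
func handleUpdateService(c cluster, req string) string {
	if c.IsLeader() {
		return "served locally: " + req
	}
	// The real wrapper would invoke the same RPC over gRPC here, after
	// inserting the client's identity into the request headers.
	return "proxied to " + c.LeaderAddr() + ": " + req
}

func main() {
	fmt.Println(handleUpdateService(staticCluster{leader: true}, "update web"))
	fmt.Println(handleUpdateService(staticCluster{leader: false, addr: "10.0.0.2:2377"}, "update web"))
}
```

Registering such a wrapper in place of the generated server code is what makes
the forwarding transparent to callers; RPCs like `ProcessRaftMessage` simply
skip the wrapper and register the real handler directly.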