
Implement snapshot restore #8131

Merged

notnoop merged 11 commits into master from f-snapshot-restore on Jun 15, 2020

Conversation

notnoop
Contributor

@notnoop notnoop commented Jun 7, 2020

Part 2 of #8047. This implements the snapshot restore API and CLI, allowing operators to upload a snapshot file to the /v1/operator/snapshot endpoint.

Exposing an API eases the operator experience and follows the pattern established by Consul and Vault. Under the hood it uses raft's Restore.
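
For illustration, a minimal sketch of exercising the endpoint from Go with plain net/http. The HTTP method (PUT), the agent address, and the snapshot file name are assumptions for this example, not taken from the PR:

package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// A snapshot file previously produced by the snapshot save API (assumed name).
	f, err := os.Open("backup.snap")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Stream the file as the request body to the restore endpoint.
	req, err := http.NewRequest(http.MethodPut, "http://127.0.0.1:4646/v1/operator/snapshot", f)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("restore status:", resp.Status)
}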

The overall approach is similar to Consul and Vault. One significant difference is the RPC format. In Consul, each streaming RPC must open a dedicated TCP connection, emit a request header, and then stream the snapshot file in the request body without any encoding. The streaming RPC relies on being able to half-close the connection to indicate the end of the snapshot file, and thus cannot use a multiplexed Yamux session. This quirk complicates RPC forwarding and rate limiting, and hurts reusability with other APIs.

Instead, this implementation uses StreamErrWrapper (already used for alloc fs operations) to stream the snapshot file. While it adds some overhead, it significantly simplifies the implementation and allows us to reuse the current RPC infrastructure.
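
To make the framing idea concrete, here is a rough sketch of how a payload-or-error wrapper can stream a file over a multiplexed session without needing a half-close. The type and field names are illustrative, not the actual Nomad StreamErrWrapper:

package streamsketch

import "io"

// streamFrame approximates the payload-or-error wrapper idea; the real Nomad
// type may differ in shape and encoding.
type streamFrame struct {
	Payload []byte // a chunk of the snapshot file, if any
	Error   string // set on the terminal frame when streaming fails
}

// frameEncoder stands in for whatever codec the RPC session uses.
type frameEncoder interface {
	Encode(v interface{}) error
}

// streamSnapshot sends the snapshot as framed messages, so end-of-stream is an
// explicit (empty) frame rather than a TCP half-close; the Yamux session stays
// reusable for other RPCs.
func streamSnapshot(enc frameEncoder, snap io.Reader) error {
	buf := make([]byte, 32*1024)
	for {
		n, err := snap.Read(buf)
		if n > 0 {
			chunk := append([]byte(nil), buf[:n]...)
			if sendErr := enc.Encode(&streamFrame{Payload: chunk}); sendErr != nil {
				return sendErr
			}
		}
		if err == io.EOF {
			return enc.Encode(&streamFrame{}) // empty frame marks end-of-stream
		}
		if err != nil {
			return enc.Encode(&streamFrame{Error: err.Error()})
		}
	}
}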

I'd encourage reviewing commit by commit - I've organized the commits into logical units.

Abandoned ideas

I considered an alternative approach: to restore a cluster to a previous state, the cluster agents would be restarted and provided a snapshot at launch time. The hope was that this approach would be simpler to reason about and have less unexpected behavior (no need to roll back a running state).

I've abandoned it for a few reasons:

First, it's a departure from other HashiCorp tooling that would unnecessarily confuse operators and require them to do custom handling for Nomad.

Second, the raft library does some clever handling to both ease snapshot restoration (a single API call to update the entire cluster) and ensure the safety of the operation (e.g. updating the index so that clients can detect a rollback). Reimplementing that logic in the CLI would be somewhat tricky.

Third, restoring a snapshot at launch time without using the Restore method above isn't trivial and may require changes to the Raft library to be effective.

As such, this approach would require more manual steps for operators, be inconsistent with other products, and be more difficult to implement.

@notnoop notnoop requested review from schmichael and tgross June 7, 2020 21:04
command/agent/operator_endpoint.go (review thread resolved)
@@ -189,6 +189,23 @@ WAIT:
goto RECONCILE
case member := <-reconcileCh:
s.reconcileMember(member)
case errCh := <-s.reassertLeaderCh:
Member

It's not clear to me why leadership fixes are being made as part of this PR. Can you give some context there?

Contributor Author

Thanks for pointing it out. Once a snapshot is restored (potentially representing a previous state of the cluster), we must recreate all leader stateful components (e.g. eval brokers, periodic job schedules, etc.), as the existing state is no longer valid.

We achieve this by simulating a leadership establishment event, as establishLeadership will properly recreate the state. operatorSnapshotRestore triggers such an event using the reassertLeaderCh in:

select {
// Tell the leader loop to reassert leader actions since we just
// replaced the state store contents.
case op.srv.reassertLeaderCh <- lerrCh:

I'll add more clarifying comments in that flow.
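
For context, here is a simplified sketch of what the leader-loop side of that channel conceptually does when handling a reassert request. This is not the exact Nomad code; it assumes revokeLeadership and establishLeadership are the existing leader lifecycle hooks and stopCh is the loop's shutdown channel:

case errCh := <-s.reassertLeaderCh:
	// A snapshot restore replaced the state store contents, so tear down
	// the leader-only components built from the old state (eval broker,
	// periodic dispatcher, etc.)...
	s.revokeLeadership()

	// ...rebuild them from the restored state, and report the outcome to
	// the RPC handler waiting on errCh.
	errCh <- s.establishLeadership(stopCh)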

Member

Totally makes sense, thanks!

nomad/testing.go (review thread resolved)
command/operator_snapshot_restore.go (outdated review thread, resolved)
command/agent/testagent.go (review thread resolved)
@@ -945,6 +945,10 @@ func decodeBody(resp *http.Response, out interface{}) error {

// encodeBody is used to encode a request body
func encodeBody(obj interface{}) (io.Reader, error) {
if reader, ok := obj.(io.Reader); ok {
Member

It might be a good idea to leave a comment here as to why this is bypassing the JSON encoding. It's obvious in the context of the PR, but I could see this confusing someone later (particularly if they're writing tests and have a reader body they want to encode).
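
A sketch of how that comment could read in context. The JSON fallback below is reconstructed from the typical shape of this helper and may not match the file exactly; the comment wording is just a suggestion:

// encodeBody is used to encode a request body
func encodeBody(obj interface{}) (io.Reader, error) {
	// If the caller already has a raw byte stream (e.g. a snapshot file for
	// the restore endpoint), pass it through untouched; JSON-encoding a large
	// binary stream would be both wrong and wasteful here.
	if reader, ok := obj.(io.Reader); ok {
		return reader, nil
	}

	buf := bytes.NewBuffer(nil)
	enc := json.NewEncoder(buf)
	if err := enc.Encode(obj); err != nil {
		return nil, err
	}
	return buf, nil
}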

command/agent/operator_endpoint.go (review thread resolved)
@@ -305,8 +305,12 @@ func (p *ConnPool) acquire(region string, addr net.Addr, version int) (*Conn, er
return nil, fmt.Errorf("rpc error: lead thread didn't get connection")
}

// getNewConn is used to return a new connection
func (p *ConnPool) getNewConn(region string, addr net.Addr, version int) (*Conn, error) {
type HalfCloserConn interface {
Contributor Author

@notnoop notnoop Jun 8, 2020

Oops - the changes in this file are unnecessary and should be reverted. This is left over from the other approach of requiring new raw connections that I mentioned in the PR description.

Member

@schmichael schmichael left a comment

Overall approach looks great. Feel free to merge once my comments are addressed or dismissed!

command/operator_snapshot_restore.go (3 outdated review threads, resolved)
Comment on lines +497 to +499
// Forward to appropriate region
if args.Region != op.srv.Region() {
err := op.forwardStreamingRPC(args.Region, "Operator.SnapshotRestore", args, conn)
Member

If we recorded a metric for operator bravery, we would increment it here. 😅 (Not that it shouldn't work fine... just that running a sensitive raft operation over the Internet is... brave.)

}

// This'll be used for feedback from the leader loop.
timeoutCh := time.After(time.Minute)
Member

Why 1 minute though? I'm unclear why this is 1m instead of 10s or 10m.

Here are my 3 assumptions, please correct me where I make mistakes:

  1. This timeout spans the local step-down/step-up process, not any Raft operations (eg the Barrier is outside of establishLeader so is not covered by this process).
  2. IIRC as part of Speed up leadership establishment #8036 we made sure all remote operations (eg Vault APIs) were done asynchronously, so this operation should only be bound by local CPU.
  3. Failing this timeout is Pretty Bad and leaves the cluster in an unclear state. I'm assuming the safest thing for an operator to do would be to start restoration from scratch?

If I'm right, I think 1 minute is fine because while the operation is CPU bound and therefore should be "fast" -- failing incurs a large cost (and seems likely to fail again if retried?) and therefore we should give the operation ample time to complete.

Also note that if I'm wrong about 3 and failure isn't costly, then we can just ignore my worrying about it entirely. :)

Update: Hm, I just noticed the timeout includes sending on the reassertLeaderCh, so it's more than just establishLeader. Since Restore is called on the leader, unless leadership has been lost, the leader should be selecting on reassertLeaderCh by the time this code tries to send on it? Just making sure I understand what operations this timeout covers.

Contributor Author

Cargo-culted 1 minute (and this code logic) from Consul without fully understanding why it was chosen: https://github.com/hashicorp/consul/blob/v1.7.4/agent/consul/snapshot_endpoint.go#L108 .

Indeed, this somewhat relies on establishLeader being "fast" - though establishLeader does make some raft transactions (e.g. Barrier, scheduler/autopilot/clusterid creation if none is present). I don't know why two Barriers are involved here.

That being said, the timeout is mostly for RPC purposes, so we return a 500 error rather than waiting for an unbounded time. At this point the snapshot has been committed to the Raft state, and any future leader will honor it.

The timeout here is mostly for managing the RPC and avoiding an unbounded wait, rather than for correctness or managing Raft state - if we've reached this point, restarting the restore from scratch shouldn't be necessary. Once we reach this code, increasing or reducing the timeout only dictates how fast the operator gets the 500 error - the internal Raft operations proceed as expected. Operators can wait until establishLeadership completes, or force a leader transition and hope that the next leader's establishLeadership will take a shorter time.

Technically there are two possibilities for the timeout to be reached:

  1. The timeout is reached because we lost leadership immediately after the Barrier call. The operator gets a 500 error, but the restore has succeeded and the new leader will honor it.
  • I'll follow up in a subsequent PR to make leader evictions lead to immediate failures rather than timeouts. The logic may be tricky and I don't want to hold this PR for it.
  2. The timeout is reached because establishLeadership takes a long time. In this case, the leader needs to fully wait for establishLeadership to finish - it must wait, otherwise we risk operating with pre-restore state. Operators can force-evict the leader and hope the issue is transient and that the new leader will be faster. If the underlying issue is real (e.g. a bug with Vault taking a long time), sadly there is no way to escape waiting until it succeeds - such a bug isn't specific to restore and will impact all leader transitions, hence we need to be very diligent about keeping leadership transitions fast.

Since Restore is called on the leader, unless leadership has been lost, the leader should be selecting on reassertLeaderCh by the time this code tries to send on it?

Yes - I'd expect the send on reassertLeaderCh to be received quickly. If not, that would be a serious bug in the leader.go logic that might prevent us from quickly detecting leadership transitions or membership changes - a very serious bug in its own right.

In my follow up in the next PR, I'll make this handling a bit more graceful.
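
To make the discussion above concrete, here is a sketch of the bounded wait being described. The names follow the snippets quoted in this thread; the exact error messages and handling in the PR may differ:

// This'll be used for feedback from the leader loop.
timeoutCh := time.After(time.Minute)
lerrCh := make(chan error, 1)

select {
// Tell the leader loop to reassert leader actions since we just
// replaced the state store contents.
case op.srv.reassertLeaderCh <- lerrCh:
case <-timeoutCh:
	return fmt.Errorf("timed out attempting to restore snapshot")
}

select {
// Wait for the leader loop to finish re-running establishLeadership.
case err := <-lerrCh:
	if err != nil {
		return err
	}
case <-timeoutCh:
	// The caller gets an error, but the restored snapshot is already
	// committed to Raft and a future leader will still honor it.
	return fmt.Errorf("timed out waiting for leadership reassertion to complete")
}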

Member

At this point the snapshot got committed to the raft state, and any future leader will honor it.

Perfect! I appreciate the detailed response.

I'll follow up in a subsequent PR to make leader evictions lead to immediate failures rather than timeouts. The logic may be tricky and I don't want to hold this PR for it.

Yeah, I wouldn't worry about it. This should be a rare operation and a leader eviction happening during it seems exceedingly rare unless you're trying to restore in the middle of an outage with a flaky network or something -- in which case this timeout is probably the least of your worries.

I don't think waiting up to a minute for an exceedingly rare error case is worth worrying about.

nomad/operator_endpoint.go Show resolved Hide resolved
}
}

if errW := wrapper.Error; errW != nil {
Member

Can wrapper have both a Payload and an Error? I suppose that makes sense for signalling EOF while also sending the final payload.

Contributor Author

It doesn't. A wrapper is currently either a Payload or an Error. I ensured that we handle EOF by sending the payload separately. However, I used the "be flexible in what you accept" approach on the RPC client side, just in case of future changes and to ease the anxiety of future readers of the code (who may ask the same question and think there is a bug).
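
As a sketch of that client-side flexibility (reusing the illustrative streamFrame shape from the earlier sketch, not the actual Nomad types): even though the server never sets both fields today, the receive loop tolerates a frame carrying both a payload and an error:

for {
	var frame streamFrame
	if err := dec.Decode(&frame); err != nil {
		return err
	}

	// Write any payload first, even if the same frame also carries an error.
	if len(frame.Payload) > 0 {
		if _, err := out.Write(frame.Payload); err != nil {
			return err
		}
	}

	// A terminal error ends the stream after its payload (if any) is flushed.
	if frame.Error != "" {
		return errors.New(frame.Error)
	}

	// An empty frame with no error marks a clean end-of-stream.
	if len(frame.Payload) == 0 {
		return nil
	}
}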

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
@notnoop notnoop merged commit c52a290 into master Jun 15, 2020
@notnoop notnoop deleted the f-snapshot-restore branch June 15, 2020 12:32
@github-actions

github-actions bot commented Jan 3, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 3, 2023