
Implement snapshot restore #8131

Merged

notnoop merged 11 commits into master from f-snapshot-restore on Jun 15, 2020

Conversation

notnoop
Contributor

@notnoop notnoop commented Jun 7, 2020

Part 2 of #8047. This implements the snapshot restore API and CLI, allowing operators to upload a snapshot file to the /v1/operator/snapshot endpoint.

Exposing an API eases the operator experience and follows the pattern established by Consul and Vault. Under the hood it uses raft's Restore.
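
For illustration, a minimal sketch of exercising the endpoint from Go with plain net/http. The HTTP method (PUT), the agent address, and the snapshot file name are assumptions for this example, not taken from the PR:

package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	// A snapshot file previously produced by the snapshot save API (assumed name).
	f, err := os.Open("backup.snap")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Stream the file as the request body to the restore endpoint.
	req, err := http.NewRequest(http.MethodPut, "http://127.0.0.1:4646/v1/operator/snapshot", f)
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("restore status:", resp.Status)
}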

The overall approach is similar to Consul and Vault. One significant difference is the RPC format. In Consul, each streaming RPC must open a dedicated TCP connection, emit a request header, and then stream the snapshot file in the request body without any encoding. The streaming RPC relies on being able to half-close the connection to indicate the end of the snapshot file, and thus cannot use a multiplexed Yamux session. This quirk complicates RPC forwarding and rate limiting, and hurts reusability with other APIs.

Instead, this implementation uses StreamErrWrapper (already used for alloc fs operations) to stream the snapshot file. While it adds some overhead, it significantly simplifies the implementation and allows us to reuse the current RPC infrastructure.
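
To make the framing idea concrete, here is a rough sketch of how a payload-or-error wrapper can stream a file over a multiplexed session without needing a half-close. The type and field names are illustrative, not the actual Nomad StreamErrWrapper:

package streamsketch

import "io"

// streamFrame approximates the payload-or-error wrapper idea; the real Nomad
// type may differ in shape and encoding.
type streamFrame struct {
	Payload []byte // a chunk of the snapshot file, if any
	Error   string // set on the terminal frame when streaming fails
}

// frameEncoder stands in for whatever codec the RPC session uses.
type frameEncoder interface {
	Encode(v interface{}) error
}

// streamSnapshot sends the snapshot as framed messages, so end-of-stream is an
// explicit (empty) frame rather than a TCP half-close; the Yamux session stays
// reusable for other RPCs.
func streamSnapshot(enc frameEncoder, snap io.Reader) error {
	buf := make([]byte, 32*1024)
	for {
		n, err := snap.Read(buf)
		if n > 0 {
			chunk := append([]byte(nil), buf[:n]...)
			if sendErr := enc.Encode(&streamFrame{Payload: chunk}); sendErr != nil {
				return sendErr
			}
		}
		if err == io.EOF {
			return enc.Encode(&streamFrame{}) // empty frame marks end-of-stream
		}
		if err != nil {
			return enc.Encode(&streamFrame{Error: err.Error()})
		}
	}
}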

I'd encourage reviewing commit by commit - I've organized the commits into logical units.

Abandoned ideas

I considered an alternative approach: to restore a cluster to a previous state, the cluster agents would be restarted and provided a snapshot at launch time. The hope was that this approach would be simpler to reason about and have less unexpected behavior (no need to roll back a running state).

I've abandoned it for a few reasons:

First, it's a departure from other HashiCorp tooling that would unnecessarily confuse operators and require them to do custom handling for Nomad.

Second, the raft library does some clever handling to both ease snapshot restoration (a single API call to update the entire cluster) and ensure the safety of the operation (e.g. updating the index so that clients can detect a rollback). Reimplementing that logic in the CLI would be somewhat tricky.

Third, restoring a snapshot at launch time without using the Restore method above isn't trivial and may require changes to the Raft library to be effective.

As such, this approach would require more manual steps for operators, be inconsistent with other products, and be more difficult to implement.

@notnoop notnoop requested review from schmichael and tgross June 7, 2020 21:04
command/agent/operator_endpoint.go (review thread resolved)
@@ -189,6 +189,23 @@ WAIT:
goto RECONCILE
case member := <-reconcileCh:
s.reconcileMember(member)
case errCh := <-s.reassertLeaderCh:
Member

It's not clear to me why leadership fixes are being made as part of this PR. Can you give some context there?

Contributor Author

Thanks for pointing it out. Once a snapshot is restored (potentially representing a previous state of the cluster), we must recreate all leader stateful components (e.g. eval brokers, periodic job schedules, etc.), as the existing state is no longer valid.

We achieve this by simulating a leadership establishment event, as establishLeadership will properly recreate the state. operatorSnapshotRestore triggers such an event using the reassertLeaderCh in:

select {
// Tell the leader loop to reassert leader actions since we just
// replaced the state store contents.
case op.srv.reassertLeaderCh <- lerrCh:

I'll add more clarifying comments in that flow.
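
For context, here is a simplified sketch of what the leader-loop side of that channel conceptually does when handling a reassert request. This is not the exact Nomad code; it assumes revokeLeadership and establishLeadership are the existing leader lifecycle hooks and stopCh is the loop's shutdown channel:

case errCh := <-s.reassertLeaderCh:
	// A snapshot restore replaced the state store contents, so tear down
	// the leader-only components built from the old state (eval broker,
	// periodic dispatcher, etc.)...
	s.revokeLeadership()

	// ...rebuild them from the restored state, and report the outcome to
	// the RPC handler waiting on errCh.
	errCh <- s.establishLeadership(stopCh)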

Member

Totally makes sense, thanks!

nomad/testing.go (review thread resolved)
command/operator_snapshot_restore.go (outdated review thread, resolved)
command/agent/testagent.go (review thread resolved)
@@ -945,6 +945,10 @@ func decodeBody(resp *http.Response, out interface{}) error {

// encodeBody is used to encode a request body
func encodeBody(obj interface{}) (io.Reader, error) {
if reader, ok := obj.(io.Reader); ok {
Member

It might be a good idea to leave a comment here as to why this is bypassing the JSON encoding. It's obvious in the context of the PR, but I could see this confusing someone later (particularly if they're writing tests and have a reader body they want to encode).
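
A sketch of how that comment could read in context. The JSON fallback below is reconstructed from the typical shape of this helper and may not match the file exactly; the comment wording is just a suggestion:

// encodeBody is used to encode a request body
func encodeBody(obj interface{}) (io.Reader, error) {
	// If the caller already has a raw byte stream (e.g. a snapshot file for
	// the restore endpoint), pass it through untouched; JSON-encoding a large
	// binary stream would be both wrong and wasteful here.
	if reader, ok := obj.(io.Reader); ok {
		return reader, nil
	}

	buf := bytes.NewBuffer(nil)
	enc := json.NewEncoder(buf)
	if err := enc.Encode(obj); err != nil {
		return nil, err
	}
	return buf, nil
}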

command/agent/operator_endpoint.go (review thread resolved)
@@ -305,8 +305,12 @@ func (p *ConnPool) acquire(region string, addr net.Addr, version int) (*Conn, er
return nil, fmt.Errorf("rpc error: lead thread didn't get connection")
}

// getNewConn is used to return a new connection
func (p *ConnPool) getNewConn(region string, addr net.Addr, version int) (*Conn, error) {
type HalfCloserConn interface {
Contributor Author

@notnoop notnoop Jun 8, 2020

Oops - the changes in this file are unnecessary and should be reverted. This is left over from the other approach of requiring new raw connections that I mentioned in the PR description.

Member

@schmichael schmichael left a comment

Overall approach looks great. Feel free to merge once my comments are addressed or dismissed!

command/operator_snapshot_restore.go (3 outdated review threads, resolved)
Comment on lines +497 to +499
// Forward to appropriate region
if args.Region != op.srv.Region() {
err := op.forwardStreamingRPC(args.Region, "Operator.SnapshotRestore", args, conn)
Member

If we recorded a metric for operator bravery, we would increment it here. 😅 (Not that it shouldn't work fine... just that running a sensitive raft operation over the Internet is... brave.)

}

// This'll be used for feedback from the leader loop.
timeoutCh := time.After(time.Minute)
Member

Why 1 minute though? I'm unclear why this is 1m instead of 10s or 10m.

Here are my 3 assumptions, please correct me where I make mistakes:

  1. This timeout spans the local step-down/step-up process, not any Raft operations (eg the Barrier is outside of establishLeader so is not covered by this process).
  2. IIRC as part of Speed up leadership establishment #8036 we made sure all remote operations (eg Vault APIs) were done asynchronously, so this operation should only be bound by local CPU.
  3. Failing this timeout is Pretty Bad and leaves the cluster in an unclear state. I'm assuming the safest thing for an operator to do would be to start restoration from scratch?

If I'm right, I think 1 minute is fine because while the operation is CPU bound and therefore should be "fast" -- failing incurs a large cost (and seems likely to fail again if retried?) and therefore we should give the operation ample time to complete.

Also note that if I'm wrong about 3 and failure isn't costly, then we can just ignore my worrying about it entirely. :)

Update: Hm, I just noticed the timeout includes sending on the reassertLeaderCh, so it's more than just establishLeader. Since Restore is called on the leader, unless leadership has been lost, the leader should be selecting on reassertLeaderCh by the time this code tries to send on it? Just making sure I understand what operations this timeout covers.

Contributor Author

Cargo-culted 1 minute (and this code logic) from Consul without fully understanding why it was chosen: https://github.com/hashicorp/consul/blob/v1.7.4/agent/consul/snapshot_endpoint.go#L108 .

Indeed, this somewhat relies on establishLeader being "fast" - though establishLeader does make some raft transactions (e.g. Barrier, scheduler/autopilot/clusterid creation if none is present). I don't know why two Barriers are involved here.

That being said, the timeout is mostly for RPC purposes, so we return a 500 error rather than waiting for an unbounded time. At this point the snapshot has been committed to the Raft state, and any future leader will honor it.

The timeout here is mostly for managing the RPC and avoiding an unbounded wait, rather than for correctness or managing Raft state - if we've reached this point, restarting the restore from scratch shouldn't be necessary. Once we reach this code, increasing or reducing the timeout only dictates how fast the operator gets the 500 error - the internal Raft operations proceed as expected. Operators can wait until establishLeadership completes, or force a leader transition and hope that the next leader's establishLeadership will take a shorter time.

Technically there are two possibilities for the timeout to be reached:

  1. The timeout is reached because we lost leadership immediately after the Barrier call. The operator gets a 500 error, but the restore has succeeded and the new leader will honor it.
  • I'll follow up in a subsequent PR to make leader evictions lead to immediate failures rather than timeouts. The logic may be tricky and I don't want to hold this PR for it.
  2. The timeout is reached because establishLeadership takes a long time. In this case, the leader needs to fully wait for establishLeadership to finish - it must wait, otherwise we risk operating with pre-restore state. Operators can force-evict the leader and hope the issue is transient and that the new leader will be faster. If the underlying issue is real (e.g. a bug with Vault taking a long time), sadly there is no way to escape waiting until it succeeds - such a bug isn't specific to restore and will impact all leader transitions, hence we need to be very diligent about keeping leadership transitions fast.

Since Restore is called on the leader, unless leadership has been lost, the leader should be selecting on reassertLeaderCh by the time this code tries to send on it?

Yes - I'd expect the send on reassertLeaderCh to be received quickly. If not, that would be a serious bug in the leader.go logic that might prevent us from quickly detecting leadership transitions or membership changes - a very serious bug in its own right.

In my follow up in the next PR, I'll make this handling a bit more graceful.
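
To make the discussion above concrete, here is a sketch of the bounded wait being described. The names follow the snippets quoted in this thread; the exact error messages and handling in the PR may differ:

// This'll be used for feedback from the leader loop.
timeoutCh := time.After(time.Minute)
lerrCh := make(chan error, 1)

select {
// Tell the leader loop to reassert leader actions since we just
// replaced the state store contents.
case op.srv.reassertLeaderCh <- lerrCh:
case <-timeoutCh:
	return fmt.Errorf("timed out attempting to restore snapshot")
}

select {
// Wait for the leader loop to finish re-running establishLeadership.
case err := <-lerrCh:
	if err != nil {
		return err
	}
case <-timeoutCh:
	// The caller gets an error, but the restored snapshot is already
	// committed to Raft and a future leader will still honor it.
	return fmt.Errorf("timed out waiting for leadership reassertion to complete")
}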

Member

At this point the snapshot got committed to the raft state, and any future leader will honor it.

Perfect! I appreciate the detailed response.

I'll follow up in a subsequent PR to make leader evictions lead to immediate failures rather than timeouts. The logic may be tricky and I don't want to hold this PR for it.

Yeah, I wouldn't worry about it. This should be a rare operation and a leader eviction happening during it seems exceedingly rare unless you're trying to restore in the middle of an outage with a flaky network or something -- in which case this timeout is probably the least of your worries.

I don't think waiting up to a minute for an exceedingly rare error case is worth worrying about.

nomad/operator_endpoint.go Show resolved Hide resolved
}
}

if errW := wrapper.Error; errW != nil {
Member

Can wrapper have both a Payload and an Error? I suppose that makes sense for signalling EOF while also sending the final payload.

Contributor Author

It doesn't. A wrapper is currently either a Payload or an Error. I ensured that we handle EOF by sending the payload separately. However, I used the "be flexible in what you accept" approach on the RPC client side, just in case of future changes and to ease the anxiety of future readers of the code (who may ask the same question and think there is a bug).
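
As a sketch of that client-side flexibility (reusing the illustrative streamFrame shape from the earlier sketch, not the actual Nomad types): even though the server never sets both fields today, the receive loop tolerates a frame carrying both a payload and an error:

for {
	var frame streamFrame
	if err := dec.Decode(&frame); err != nil {
		return err
	}

	// Write any payload first, even if the same frame also carries an error.
	if len(frame.Payload) > 0 {
		if _, err := out.Write(frame.Payload); err != nil {
			return err
		}
	}

	// A terminal error ends the stream after its payload (if any) is flushed.
	if frame.Error != "" {
		return errors.New(frame.Error)
	}

	// An empty frame with no error marks a clean end-of-stream.
	if len(frame.Payload) == 0 {
		return nil
	}
}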

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
@notnoop notnoop merged commit c52a290 into master Jun 15, 2020
@notnoop notnoop deleted the f-snapshot-restore branch June 15, 2020 12:32
@github-actions

github-actions bot commented Jan 3, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 3, 2023