
Transfer leadership when establishLeadership fails #5247

Merged
merged 16 commits on Jun 19, 2019

Conversation

hanshasselberg
Member

@hanshasselberg hanshasselberg commented Jan 22, 2019

Fixes #5047. I am throwing this out here to have a place to discuss the approaches.

TL;DR: leadership transfer is now implemented in raft, and this PR uses it.

Background:

Leadership is established in raft and there is basically no way to influence that from consul, and until now there has been no need to. But when we set up leadership on the consul side in establishLeadership, things can go wrong, and in the past we simply retried. This is problematic: even though raft considers this server the leader, it is not fully prepared. Consistent reads, for example, are not possible.

Abandoned Ideas:

  • calling revokeLeadership: that tears down some consul things, but has no influence on raft leadership
  • quitting leaderLoop: that resets the consul side of things, but doesn't change the raft leadership

Solutions:

  • this PR: calling DemoteVoter as the leader makes this server step down by marking the current leader as a nonvoter. Autopilot on the new leader then promotes the same server back to a voter (see the sketch after this list):
    func (a *Autopilot) handlePromotions(promotions []raft.Server) error {
        // This used to wait to only promote to maintain an odd quorum of
        // servers, but this was at odds with the dead server cleanup when doing
        // rolling updates (add one new server, wait, and then kill an old
        // server). The dead server cleanup would still count the old server as
        // a peer, which is conservative and the right thing to do, and this
        // would wait to promote, so you could get into a stalemate. It is safer
        // to promote early than remove early, so by promoting as soon as
        // possible we have chosen that as the solution here.
        for _, server := range promotions {
            a.logger.Printf("[INFO] autopilot: Promoting %s to voter", fmtServer(server))
            addFuture := a.delegate.Raft().AddVoter(server.ID, server.Address, 0, 0)
            if err := addFuture.Error(); err != nil {
                return fmt.Errorf("failed to add raft peer: %v", err)
            }
        }
        // If we promoted a server, trigger a check to remove dead servers.
        if len(promotions) > 0 {
            select {
            case a.removeDeadCh <- struct{}{}:
            default:
            }
        }
        return nil
    }
    The idea for this approach came from raft#218 (Suggestion: control over leader identity).
  • alternatively we could implement leadership transfer in raft as specified in https://ramcloud.stanford.edu/~ongaro/thesis.pdf chapter 3.10.
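
For reference, the DemoteVoter variant boils down to a single raft call. Here is a minimal sketch (the helper name and arguments are assumptions for illustration, not code from this PR):

    // demoteSelf sketches the abandoned approach: the failing leader marks
    // itself as a nonvoter, which forces a new election; autopilot on the new
    // leader is then expected to promote it back via AddVoter, as shown in
    // handlePromotions above.
    func demoteSelf(r *raft.Raft, selfID raft.ServerID) error {
        future := r.DemoteVoter(selfID, 0, 0) // prevIndex 0, default timeout
        if err := future.Error(); err != nil {
            return fmt.Errorf("failed to demote self: %v", err)
        }
        return nil
    }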

Concerns with DemoteVoter:

  • in an environment where not all servers are running autopilot, it is possible that the leader doesn't run it and therefore never promotes the demoted server back to a voter. Let's assume we would be left with 2 voters and 1 nonvoter. That would make electing a leader harder, but raft would figure it out eventually. During an upgrade, the same could happen one last time, leaving 1 voter and 2 nonvoters, which is the worst case. But when this happens, electing a leader is easy again. As soon as the leader runs autopilot, all nonvoters will be promoted back to voters. establishLeadership would have to fail often during that process to make this scenario possible. Without this fix, though, we would be stuck with a leader that is not performing all leader tasks, with no chance of recovery.
  • raft state is modified for this, and we rely on consul to promote the previous leader back to a voter, which feels fishy.

Concerns with leadership transfer:

none

This PR uses the leadership transfer feature. It requires re-vendoring raft, though, and currently fails because of that.
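
On the Consul side, the helper that triggers the transfer looks roughly like the sketch below (the retry count, names, and log wording are illustrative rather than the exact merged code; the raft call is the library's LeadershipTransfer future):

    // leadershipTransfer asks raft to hand leadership to the most up-to-date
    // follower and retries a few times before giving up.
    func (s *Server) leadershipTransfer() error {
        retryCount := 3
        var err error
        for i := 0; i < retryCount; i++ {
            future := s.raft.LeadershipTransfer()
            if err = future.Error(); err == nil {
                return nil
            }
            s.logger.Printf("[ERR] consul: failed to transfer leadership attempt %d/%d: %v", i+1, retryCount, err)
        }
        return fmt.Errorf("failed to transfer leadership in %d attempts", retryCount)
    }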

@hanshasselberg hanshasselberg added the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Jan 22, 2019
@hashicorp-cla

hashicorp-cla commented Jan 22, 2019

CLA assistant check
All committers have signed the CLA.

@hanshasselberg hanshasselberg changed the title Demote itself when establishLeadership fails Transfer leadership when establishLeadership fails Feb 14, 2019
hanshasselberg added a commit to hashicorp/raft that referenced this pull request May 17, 2019
This PR implements the leadership transfer extension described in the thesis, chapter 3.10.

Background:

Consul performs some setup after acquiring leadership. It is possible that the setup fails, but there is no good way to step down as a leader. It is possible to use DemoteVoter as shown in hashicorp/consul#5247, but this is suboptimal because it relies on Consul's autopilot to promote the old leader back to a voter.
Since the thesis describes a perfectly good way to do this, the leadership transfer extension, we decided to implement that instead. Doing it this way also helps other teams, since it is more generic.

The necessary steps to perform are:

1. Leader picks target to transition to
2. Leader stops accepting client requests
3. Leader makes sure to replicate logs to the target
4. Leader sends TimeoutNow RPC request
5. Target receives TimeoutNow request, which triggers an election
6a. If the election is successful, a message with the new term will make the old leader step down
6b. If the leadership transfer does not complete within the election timeout, the old leader resumes operation

Resources:

https://github.com/etcd-io/etcd/tree/master/raft
@hanshasselberg hanshasselberg force-pushed the establish_leader branch 2 times, most recently from 833070e to b8e9fa5 on May 23, 2019 at 15:41
Level: hclog.LevelFromString(s.config.LogLevel),
Output: s.config.LogOutput,
})
s.config.RaftConfig.Logger = raftLogger
Member Author


Raft's logging changed and we need to provide an hclog logger now.
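
For context, the full construction being diffed above looks roughly like this (the Name and TimeFormat values are assumptions, not necessarily what was merged):

    // Build an hclog logger for raft, since raft's Config.Logger now expects
    // an hclog.Logger rather than the standard library *log.Logger.
    raftLogger := hclog.New(&hclog.LoggerOptions{
        Name:       "raft",
        Level:      hclog.LevelFromString(s.config.LogLevel),
        Output:     s.config.LogOutput,
        TimeFormat: `2006/01/02 15:04:05`, // assumption: keep timestamps consistent with Consul's other log lines
    })
    s.config.RaftConfig.Logger = raftLogger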

@@ -178,7 +195,7 @@ RECONCILE:
if err := s.revokeLeadership(); err != nil {
s.logger.Printf("[ERR] consul: failed to revoke leadership: %v", err)
}
goto WAIT
return err
Member Author

@hanshasselberg hanshasselberg Jun 7, 2019


This is the first use case: we couldn't establish leadership, and instead of retrying after the interval in WAIT, an error is returned, which leads to a raft leadership transfer.

err := reassert()
errCh <- err
if err != nil {
return err
Member Author


This is the second use case: we couldn't reassert, and instead of retrying after the interval, an error is returned, which leads to a raft leadership transfer.

Member


Hmmm, when does this happen? I don't recall the specifics and wonder if by failing and stepping down we might end up causing new leadership stability issues?

Member Author


// reassertLeaderCh is used to signal the leader loop should re-run
// leadership actions after a snapshot restore.

This happens after a snapshot restore. Since the agent revokes leadership and immediately tries to establish it again, there is the possibility that it fails. When it does, we are in the same situation as above: raft thinks this agent is the leader, but consul disagrees.

@hanshasselberg hanshasselberg requested a review from a team June 7, 2019 09:22
@hanshasselberg hanshasselberg removed the needs-discussion Topic needs discussion with the larger Consul maintainers before committing to for a release label Jun 7, 2019
Contributor

@freddygv freddygv left a comment


Just have a couple questions inline 🔍

agent/consul/leader.go (2 resolved comment threads)
Member

@banks banks left a comment


Not sure about these, Hans, but just wondering if this solves the problem fully vs. just making it more likely that we get out of the bad state.

Please let me know if I'm missing something though - it's all pretty subtle!


s.leaderLoop(ch)
err := s.leaderLoop(ch)
if err != nil {
s.leadershipTransfer()
Member


What if leadership transfer fails? We exit the leader loop anyway and no longer act as the leader, but does raft still think we are the leader? That seems like a bad case to be in - roughly the same as the bug this is meant to be fixing although we at least made an attempt at telling raft we didn't want to be leader any more.

I can think of a couple of possible options here:

  • Make a method in Raft lib that allows the leader to "StepDown" - it will stop running heartbeats etc. but not stop being a voter - basically, force a leader into follower state.
  • Keep trying the leadership Transfer indefinitely because the cluster is broken until it works anyway.
  • Keep retrying for a limited length of time and then crash the whole process to force a step down.

Member Author

@hanshasselberg hanshasselberg Jun 13, 2019


I made a couple of changes to my PR, so that leaderLoop is only left if leadershipTransfer was successful. I think that mitigates the issues you are talking about.

What happens now is that if leadershipTransfer fails, the agent stays in leaderLoop and waits for ReconcileInterval before retrying establishLeadership (and leadershipTransfer again if that fails).
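
Putting the quoted diffs together, the failure path inside leaderLoop now looks roughly like this (reconstructed from the snippets in this review, not the literal merged code):

    if err := s.establishLeadership(); err != nil {
        s.logger.Printf("[ERR] consul: failed to establish leadership: %v", err)
        // Tear down whatever was partially set up.
        if err := s.revokeLeadership(); err != nil {
            s.logger.Printf("[ERR] consul: failed to revoke leadership: %v", err)
        }
        // Ask raft to hand leadership to another server. Only when that
        // succeeds do we leave leaderLoop; otherwise this server is still the
        // raft leader, so stay in the loop and retry establishLeadership after
        // a short interval.
        if err := s.leadershipTransfer(); err != nil {
            establishedLeader = false
            interval = time.After(5 * time.Second)
            goto WAIT
        }
        return err
    }
    establishedLeader = true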

Member


I think this is OK although ReconcileInterval is 60 seconds which seems a lot to wait around for when we know the cluster is down. 🤔

Should we instead consider looping indefinitely retrying every 5 seconds or something?

I think this is way better than before, and eventually it should recover, so it's good. Just wondering if it's easy to make that recovery quicker?

Member Author


I reset the interval to 5 seconds when leadership transfer fails, so that establishLeadership is retried faster than before.

@banks
Member

banks commented Jun 12, 2019

@i0rek could you also double check how the hclog integration pans out? It came up today that Vault and Nomad have different date formats in logs, which tripped someone up. Can you make sure we don't make it even worse by having raft log lines use a different date format from the rest of Consul's logs, please? I'm not sure whether that is the case, as I didn't look hard at where the plumbing/config for that sits, but I would just like to sanity check it.

// leader, but consul disagrees.
if err != nil {
if err := s.leadershipTransfer(); err != nil {
goto WAIT
Member


So now we attempt to transfer 3 times, but if it fails we still hang out in non-leader limbo land for a bit before retrying?

I guess this is what I mentioned as "retry indefinitely", and it should really work immediately if the rest of the cluster is in an OK state, so I think this is good.

Member Author


My thinking was that since leadershipTransfer failed, we try establishLeadership again. In general, establishLeadership is more likely to succeed than leadershipTransfer. Making the interval smaller, like 5 seconds, before it retries seems like a better solution than trying to transfer leadership indefinitely.

Member

@banks banks left a comment


Oh, I'll leave it at request changes to make sure we don't forget to check the log formatting for raft before merge.

@hanshasselberg
Member Author

@banks I checked the logs and fixed the timestamps, but sometimes there are two spaces before raft:

master:

2019/06/13 22:11:45 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2019/06/13 22:11:45 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2019/06/13 22:11:45 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
2019/06/13 22:11:45 [INFO] agent: started state syncer
2019/06/13 22:11:45 [INFO] agent: Started gRPC server on 127.0.0.1:8502 (tcp)
2019/06/13 22:11:45 [WARN] raft: Heartbeat timeout from "" reached, starting election
2019/06/13 22:11:45 [INFO] raft: Node at 127.0.0.1:8300 [Candidate] entering Candidate state in term 2
2019/06/13 22:11:45 [DEBUG] raft: Votes needed: 1
2019/06/13 22:11:45 [DEBUG] raft: Vote granted from 2ef30f59-536b-3efa-80d8-5bd3afb67585 in term 2. Tally: 1
2019/06/13 22:11:45 [INFO] raft: Election won. Tally: 1

pr:

2019/06/13 22:10:58 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2019/06/13 22:10:58 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2019/06/13 22:10:58 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
2019/06/13 22:10:58 [INFO] agent: started state syncer
2019/06/13 22:10:58 [INFO] agent: Started gRPC server on 127.0.0.1:8502 (tcp)
2019/06/13 22:10:58 [WARN]  raft: Heartbeat timeout from "" reached, starting election
2019/06/13 22:10:58 [INFO]  raft: Node at 127.0.0.1:8300 [Candidate] entering Candidate state in term 2
2019/06/13 22:10:58 [DEBUG] raft: Votes needed: 1
2019/06/13 22:10:58 [DEBUG] raft: Vote granted from bd52484a-2349-6478-2e1f-4eb4c64c924c in term 2. Tally: 1
2019/06/13 22:10:58 [INFO]  raft: Election won. Tally: 1

Member

@banks banks left a comment


Nice!

// establishedLeader needs to be set to
// false.
establishedLeader = false
interval = time.After(5 * time.Second)
Member


I think this is OK because we'll wait for 5 seconds in the next wait loop, then time out and goto RECONCILE. Right after that label we hit interval := time.After(s.config.ReconcileInterval) again, which presumably sets the interval back.

I must admit I'm a little confused about how := works after a GOTO - it's reassigning the same variable name but in the same scope on subsequent jumps? I wonder if this is some strange variant of variable shadowing even though they are in the same scope? Maybe Go just has a special case to allow this when using GOTO but not in serial code? If it works I guess it's fine 😄

Member Author


I am not sure what you mean. WAIT is well after RECONCILE and the interval variable declaration, so the code should just be using the same variable.

I reproed your question in a playground: https://play.golang.org/p/6AqssHXg3Wt. If you jump back to before a variable declaration, Go will create a new variable. Or did you mean something else?

Member

@banks banks Jun 14, 2019


Yeah I did a similar repro and it's fine, just was a strange one.

The path here is:

  • we set interval to be a timer chan that will go off in 5 seconds
  • we goto WAIT which enters a select on a few things including that chan
  • when the chan fires, that select branch does goto RECONCILE, which immediately re-assigns a timer chan for the original ReconcileInterval (using :=).

My original concern was that we might end up regaining leadership and then doing reconcile every 5 seconds after that but it's not the case due to the path mentioned above.

It also occurs to me that we have always had a re-assignment after goto RECONCILE, so it's not really any different than before; it's just that it was the only assignment before, and I wondered if some strange form of shadowing might cause issues. That appears not to be the case, so I think this is fine!

Member


For fun - Go does create a new variable with the same name.

You can see that here: https://play.golang.org/p/FU0ZxictDXE. Capturing the variable in a lambda and then looping with GOTO leaves the lambda holding the original value, not the redefined one after the GOTO jump.

It's just weird to me because it's in the same scope - shadowing across scopes seems fine but this seems to be a special case you can't normally do outside of GOTO.
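
A small standalone program showing the same effect (a sketch along the lines of the playground example above, not its exact contents):

    package main

    import "fmt"

    func main() {
        i := 0
        var captured func() int

    LOOP:
        x := i // executed again on every jump; each execution creates a fresh x
        if captured == nil {
            captured = func() int { return x } // closes over the first x
        }
        i++
        if i < 3 {
            goto LOOP
        }
        fmt.Println(x)          // 2: the x from the last jump
        fmt.Println(captured()) // 0: the closure still holds the original x
    }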

Member Author


Wow, now I see what you mean.

@silenceper

silenceper commented Jun 22, 2019

@mkeeler
This PR adds vendoring of github.com/hashicorp/consul, which conflicts with #5943.

@mkeeler
Member

mkeeler commented Jun 22, 2019

@silenceper Looks like you are correct. One of the other engineers has been working on a CircleCI check to ensure we aren't accidentally doing this (which appears to be a common mistake for us).

Successfully merging this pull request may close these issues.

Leader doesn't step down when establishLeadership fails
6 participants