In some rare cases two leaders can be present in the Nomad server cluster #4749

Closed
tantra35 opened this issue Oct 4, 2018 · 13 comments

@tantra35
Contributor

tantra35 commented Oct 4, 2018

Nomad version

0.8.6

Issue

We have seen this 2 or 3 times, and it looks like two Nomad leaders were present at the same time.

Look at the screenshot of Nomad telemetry below:
[screenshot: Nomad telemetry, leader heartbeat metrics]

There is an overlap in the heartbeats on the graphs above (two leaders present in the cluster at the same time). When the first leader switch happens, the goroutine count starts to grow on the host that previously held leadership.

The switch happens between consulnomad-02 (ip: 172.29.0.72) and ip-172-29-20-8.
Here are their logs:

ip-172-29-20-8 2018/09/28 20:53:25.617835 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:23.617823 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:21.617921 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:19.617768 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:17.617845 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:15 [ERR] raft: Failed to AppendEntries to {Voter 7ed4842b-2104-ebf7-0460-fb494b36a4dc 172.29.0.72:4647}: read tcp 172.29.20.8:46422->172.29.0.72:4647: i/o timeout
ip-172-29-20-8 2018/09/28 20:53:15.617840 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:13.617832 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:11.617934 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:09.617859 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:07.617864 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:05 [ERR] raft: Failed to AppendEntries to {Voter 7ed4842b-2104-ebf7-0460-fb494b36a4dc 172.29.0.72:4647}: read tcp 172.29.20.8:46384->172.29.0.72:4647: i/o timeout
ip-172-29-20-8 2018/09/28 20:53:05.617893 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:03.617843 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:53:01.617855 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:59.617841 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
consulnomad-02 2018/09/28 20:52:57.719210 [ERR] nomad: failed to establish leadership: unable to reconcile job summaries: reconciliation of job summaries failed: timed out enqueuing operation
ip-172-29-20-8 2018/09/28 20:52:57.617836 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:55 [ERR] raft: Failed to AppendEntries to {Voter 7ed4842b-2104-ebf7-0460-fb494b36a4dc 172.29.0.72:4647}: read tcp 172.29.20.8:46366->172.29.0.72:4647: i/o timeout
ip-172-29-20-8 2018/09/28 20:52:55.617841 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:53.617833 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:51.617873 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:49.617834 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:47.617878 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:45 [ERR] raft: Failed to AppendEntries to {Voter 7ed4842b-2104-ebf7-0460-fb494b36a4dc 172.29.0.72:4647}: read tcp 172.29.20.8:46220->172.29.0.72:4647: i/o timeout
ip-172-29-20-8 2018/09/28 20:52:45.617842 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:43.617862 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:41.618085 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:39.617862 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:37.617861 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:35 [ERR] raft: Failed to AppendEntries to {Voter 7ed4842b-2104-ebf7-0460-fb494b36a4dc 172.29.0.72:4647}: read tcp 172.29.20.8:46094->172.29.0.72:4647: i/o timeout
ip-172-29-20-8 2018/09/28 20:52:35.617850 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:33.617854 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:33 [ERR] raft: Failed to make RequestVote RPC to {Voter 7ed4842b-2104-ebf7-0460-fb494b36a4dc 172.29.0.72:4647}: read tcp 172.29.20.8:55032->172.29.0.72:4647: i/o timeout
ip-172-29-20-8 2018/09/28 20:52:31.617835 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:29.617872 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
consulnomad-02 2018/09/28 20:52:28.925019 [INFO] server.nomad: successfully contacted 17 Nomad Servers
ip-172-29-20-8 2018/09/28 20:52:27.617854 [WARN] nomad: error getting server health from "consulnomad-02.atf01": last request still outstanding
ip-172-29-20-8 2018/09/28 20:52:26.618043 [WARN] nomad: error getting server health from "consulnomad-02.atf01": context deadline exceeded
ip-172-29-20-8 2018/09/28 20:52:26.618410 [WARN] nomad: error getting server health from "ip-172-29-4-194.atf01": context deadline exceeded
ip-172-29-20-8 2018/09/28 20:52:26.618625 [WARN] nomad: error getting server health from "ip-172-29-20-8.atf01": context deadline exceeded
consulnomad-02 2018/09/28 20:52:25 [ERR] raft: peer {Voter d3f5ff1e-a344-9b12-e95b-dc7645d6210f 172.29.4.194:4647} has newer term, stopping replication
consulnomad-02 2018/09/28 20:52:25 [ERR] raft: peer {Voter b3b5b05c-63b7-9fa8-6b3a-8c0ea511080f 172.29.20.8:4647} has newer term, stopping replication
ip-172-29-20-8 2018/09/28 20:52:25 [WARN] raft: Failed to contact 7ed4842b-2104-ebf7-0460-fb494b36a4dc in 1.430396899s
consulnomad-02 2018/09/28 20:52:24.996028 [WARN] nomad: error getting server health from "ip-172-29-4-194.atf01": context deadline exceeded
consulnomad-02 2018/09/28 20:52:24.996060 [WARN] nomad: error getting server health from "ip-172-29-20-8.atf01": context deadline exceeded
ip-172-29-20-8 2018/09/28 20:52:24 [WARN] raft: Failed to contact 7ed4842b-2104-ebf7-0460-fb494b36a4dc in 985.177158ms
ip-172-29-20-8 2018/09/28 20:52:24 [WARN] raft: Failed to contact 7ed4842b-2104-ebf7-0460-fb494b36a4dc in 503.921191ms
consulnomad-02 2018/09/28 20:52:24 [WARN] raft: Failed to contact b3b5b05c-63b7-9fa8-6b3a-8c0ea511080f in 504.128433ms
consulnomad-02 2018/09/28 20:52:24 [WARN] raft: Failed to contact quorum of nodes, stepping down
consulnomad-02 2018/09/28 20:52:24 [WARN] raft: Failed to contact d3f5ff1e-a344-9b12-e95b-dc7645d6210f in 504.137129ms
ip-172-29-20-8 2018/09/28 20:52:23.609095 [INFO] nomad: cluster leadership acquired
ip-172-29-20-8 2018/09/28 20:52:23 [INFO] raft: pipelining replication to peer {Voter d3f5ff1e-a344-9b12-e95b-dc7645d6210f 172.29.4.194:4647}
ip-172-29-20-8 2018/09/28 20:52:23 [INFO] raft: Election won. Tally: 2
ip-172-29-20-8 2018/09/28 20:52:23 [INFO] raft: Added peer 7ed4842b-2104-ebf7-0460-fb494b36a4dc, starting replication
ip-172-29-20-8 2018/09/28 20:52:23 [INFO] raft: Added peer d3f5ff1e-a344-9b12-e95b-dc7645d6210f, starting replication
ip-172-29-20-8 2018/09/28 20:52:23 [INFO] raft: Node at 172.29.20.8:4647 [Leader] entering Leader state
ip-172-29-20-8 2018/09/28 20:52:23 [WARN] raft: Heartbeat timeout from "" reached, starting election
ip-172-29-20-8 2018/09/28 20:52:23 [INFO] raft: Node at 172.29.20.8:4647 [Candidate] entering Candidate state in term 439
consulnomad-02 2018/09/28 20:52:23 [INFO] raft: Election won. Tally: 2
consulnomad-02 2018/09/28 20:52:23 [INFO] raft: Node at 172.29.0.72:4647 [Leader] entering Leader state
consulnomad-02 2018/09/28 20:52:23 [INFO] raft: Added peer d3f5ff1e-a344-9b12-e95b-dc7645d6210f, starting replication
consulnomad-02 2018/09/28 20:52:23 [INFO] raft: Added peer b3b5b05c-63b7-9fa8-6b3a-8c0ea511080f, starting replication
consulnomad-02 2018/09/28 20:52:22 [INFO] raft: Node at 172.29.0.72:4647 [Candidate] entering Candidate state in term 438
consulnomad-02 2018/09/28 20:52:22 [WARN] raft: Heartbeat timeout from "" reached, starting election
consulnomad-02 2018/09/28 20:52:20 [INFO] raft: aborting pipeline replication to peer {Voter d3f5ff1e-a344-9b12-e95b-dc7645d6210f 172.29.4.194:4647}
consulnomad-02 2018/09/28 20:52:20 [INFO] raft: Node at 172.29.0.72:4647 [Follower] entering Follower state (Leader: "")
consulnomad-02 2018/09/28 20:52:20 [INFO] raft: aborting pipeline replication to peer {Voter b3b5b05c-63b7-9fa8-6b3a-8c0ea511080f 172.29.20.8:4647}
consulnomad-02 2018/09/28 20:52:20 [WARN] raft: Failed to contact d3f5ff1e-a344-9b12-e95b-dc7645d6210f in 500.741405ms
consulnomad-02 2018/09/28 20:52:20 [WARN] raft: Failed to contact b3b5b05c-63b7-9fa8-6b3a-8c0ea511080f in 585.329533ms
consulnomad-02 2018/09/28 20:52:20 [WARN] raft: Failed to contact quorum of nodes, stepping down
consulnomad-02 2018/09/28 20:52:20 [WARN] raft: Failed to contact b3b5b05c-63b7-9fa8-6b3a-8c0ea511080f in 505.566757ms
consulnomad-02 2018/09/28 20:52:19 [INFO] raft: pipelining replication to peer {Voter d3f5ff1e-a344-9b12-e95b-dc7645d6210f 172.29.4.194:4647}
consulnomad-02 2018/09/28 20:52:19 [INFO] raft: pipelining replication to peer {Voter b3b5b05c-63b7-9fa8-6b3a-8c0ea511080f 172.29.20.8:4647}
consulnomad-02 2018/09/28 20:52:19.989258 [INFO] nomad: cluster leadership acquired
consulnomad-02 2018/09/28 20:52:19 [INFO] raft: Added peer d3f5ff1e-a344-9b12-e95b-dc7645d6210f, starting replication
consulnomad-02 2018/09/28 20:52:19 [INFO] raft: Election won. Tally: 2
consulnomad-02 2018/09/28 20:52:19 [INFO] raft: Added peer b3b5b05c-63b7-9fa8-6b3a-8c0ea511080f, starting replication
consulnomad-02 2018/09/28 20:52:19 [INFO] raft: Node at 172.29.0.72:4647 [Leader] entering Leader state
consulnomad-02 2018/09/28 20:52:18 [WARN] raft: Election timeout reached, restarting election
consulnomad-02 2018/09/28 20:52:18 [INFO] raft: Node at 172.29.0.72:4647 [Candidate] entering Candidate state in term 437
consulnomad-02 2018/09/28 20:52:16 [INFO] raft: Node at 172.29.0.72:4647 [Candidate] entering Candidate state in term 436
consulnomad-02 2018/09/28 20:52:16 [WARN] raft: Heartbeat timeout from "" reached, starting election
consulnomad-02 2018/09/28 20:52:15.861139 [ERR] worker: failed to dequeue evaluation: eval broker disabled
consulnomad-02 2018/09/28 20:52:15.860901 [INFO] nomad: cluster leadership lost
consulnomad-02 2018/09/28 20:52:15 [INFO] raft: aborting pipeline replication to peer {Voter d3f5ff1e-a344-9b12-e95b-dc7645d6210f 172.29.4.194:4647}
consulnomad-02 2018/09/28 20:52:15 [INFO] raft: Node at 172.29.0.72:4647 [Follower] entering Follower state (Leader: "")
consulnomad-02 2018/09/28 20:52:15 [INFO] raft: aborting pipeline replication to peer {Voter b3b5b05c-63b7-9fa8-6b3a-8c0ea511080f 172.29.20.8:4647}
consulnomad-02 2018/09/28 20:52:15 [WARN] raft: Failed to contact quorum of nodes, stepping down
consulnomad-02 2018/09/28 20:52:15 [WARN] raft: Failed to contact d3f5ff1e-a344-9b12-e95b-dc7645d6210f in 500.816289ms
consulnomad-02 2018/09/28 20:52:15 [WARN] raft: Failed to contact b3b5b05c-63b7-9fa8-6b3a-8c0ea511080f in 548.61118ms
consulnomad-02 2018/09/28 20:52:15 [WARN] raft: Failed to contact b3b5b05c-63b7-9fa8-6b3a-8c0ea511080f in 500.069296ms
@tantra35
Contributor Author

tantra35 commented Oct 4, 2018

It seems that the code does not take into account that establishLeadership can run for a very long time, and during that window raft can flap between the leader and follower states.

[screenshot: establishLeadership duration on consulnomad-02]

On the graph above, we can see that establishLeadership on consulnomad-02 ran for about 38 seconds.
During this time, leadership was lost and acquired three times, in the following sequence:

  1. Leader lost at 20:52:20.580
  2. Leader acquired at 20:52:23.522
  3. Leader lost at 20:52:24.026

The last leadership loss led to a deadlock, which stops raft from functioning correctly and from answering heartbeats. This leads to goroutine growth, which in turn eats all RAM (in effect, a memory leak). Look at the graph below.

[screenshot: goroutine count and memory usage growth]
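
To illustrate the mechanism (a minimal sketch, not Nomad code): once a transition path is deadlocked, every handler that waits on it parks forever, so the goroutine count and the memory held by their stacks grow without bound.

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// stuck stands in for a leadership transition that never completes
	// (for example, a leader loop blocked inside a deadlock).
	stuck := make(chan struct{})

	// Each incoming "heartbeat" handler waits for the transition to finish.
	// Because it never does, every handler goroutine stays parked.
	for i := 0; i < 5; i++ {
		go func() {
			<-stuck // blocks forever
		}()
		time.Sleep(100 * time.Millisecond)
		fmt.Println("goroutines:", runtime.NumGoroutine())
	}
	// The printed count grows monotonically -- the "memory leak" is really
	// these parked goroutines and their stacks.
}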

@tantra35
Contributor Author

tantra35 commented Oct 9, 2018

To prevent raft leader flapping we applied the following rough patch (this is only a prototype to describe the idea, not working code):

diff --git a/nomad/leader.go b/nomad/leader.go
index 119e72ebf..6cbf37c65 100644
--- a/nomad/leader.go
+++ b/nomad/leader.go
@@ -46,11 +46,11 @@ var minAutopilotVersion = version.Must(version.NewVersion("0.8.0"))
 func (s *Server) monitorLeadership() {
 	var weAreLeaderCh chan struct{}
 	var leaderLoop sync.WaitGroup
+	var stopMonitorLeaderFlapingWg sync.WaitGroup
 	for {
 		select {
 		case isLeader := <-s.leaderCh:
-			switch {
-			case isLeader:
+			if isLeader {
 				if weAreLeaderCh != nil {
 					s.logger.Printf("[ERR] nomad: attempted to start the leader loop while running")
 					continue
@@ -58,13 +58,10 @@ func (s *Server) monitorLeadership() {
 
 				weAreLeaderCh = make(chan struct{})
 				leaderLoop.Add(1)
-				go func(ch chan struct{}) {
-					defer leaderLoop.Done()
-					s.leaderLoop(ch)
-				}(weAreLeaderCh)
+				go s.leaderLoop(weAreLeaderCh, &leaderLoop)
 				s.logger.Printf("[INFO] nomad: cluster leadership acquired")
 
-			default:
+			} else {
 				if weAreLeaderCh == nil {
 					s.logger.Printf("[ERR] nomad: attempted to stop the leader loop while not running")
 					continue
@@ -72,7 +69,33 @@ func (s *Server) monitorLeadership() {
 
 				s.logger.Printf("[DEBUG] nomad: shutting down leader loop")
 				close(weAreLeaderCh)
+
+				stopMonitorLeaderFlaping := make(chan struct{})
+				stopMonitorLeaderFlapingWg.Add(1)
+
+				go func() {
+					defer stopMonitorLeaderFlapingWg.Done()
+					beforeWaitLeaderState := isLeader
+
+					for {
+						select {
+						case isLeader = <-s.leaderCh:
+						case <- stopMonitorLeaderFlaping:
+							break
+						}
+					}
+
+					if beforeWaitLeaderState != isLeader {
+						select {
+							case s.leaderCh <- isLeader:
+							default:
+						}
+					}
+				}()
+
 				leaderLoop.Wait()
+				close(stopMonitorLeaderFlaping)
+				stopMonitorLeaderFlapingWg.Wait()
 				weAreLeaderCh = nil
 				s.logger.Printf("[INFO] nomad: cluster leadership lost")
 			}
@@ -85,7 +108,8 @@ func (s *Server) monitorLeadership() {
 
 // leaderLoop runs as long as we are the leader to run various
 // maintenance activities
-func (s *Server) leaderLoop(stopCh chan struct{}) {
+func (s *Server) leaderLoop(stopCh chan struct{}, wg *sync.WaitGroup) {
+	defer wg.Done()
 	var reconcileCh chan serf.Member
 	establishedLeader := false
 
@@ -135,6 +159,7 @@ RECONCILE:
 	// updates
 	reconcileCh = s.reconcileCh
 
+WAIT:
 	// Poll the stop channel to give it priority so we don't waste time
 	// trying to perform the other operations if we have been asked to shut
 	// down.
@@ -144,7 +169,6 @@ RECONCILE:
 	default:
 	}
 
-WAIT:
 	// Wait until leadership is lost
 	for {
 		select {

The key idea: since establishLeadership may take a very long time, multiple leader switches can happen in that window (perhaps due to the non-guaranteed network between t2 instances in AWS). So while monitorLeadership waits for the "leader goroutine" (leaderLoop) to stop, an additional goroutine is launched that "eats" all the leadership flapping events and replays only the final state afterwards. This should potentially remove the problem described in this issue. A standalone restatement of the pattern is sketched below.
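
Here is a minimal restatement of that pattern in isolation (a sketch, not the patch above; note the labeled break, which the prototype's plain break inside select would also need in order to actually leave the loop):

// drainLeadershipFlaps consumes leadership notifications while the old
// leader loop shuts down, so raft is never blocked sending them, and
// reports only the newest observed state once the shutdown is done.
// Assumption: leaderCh carries bool notifications, as raft.LeaderCh() does.
func drainLeadershipFlaps(leaderCh <-chan bool, stop <-chan struct{}, current bool, result chan<- bool) {
	final := current
drain:
	for {
		select {
		case v := <-leaderCh:
			final = v // keep only the most recent leadership state
		case <-stop:
			break drain // labeled break: exits the for loop, not just the select
		}
	}
	result <- final
}

// Intended use inside monitorLeadership (sketch):
//
//	stop, result := make(chan struct{}), make(chan bool, 1)
//	go drainLeadershipFlaps(s.leaderCh, stop, false, result)
//	leaderLoop.Wait() // wait for the old leader loop to finish shutting down
//	close(stop)
//	if <-result {
//		// leadership came back while we waited: re-enter the leader branch
//		// instead of silently dropping the notification
//	}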

@dadgar What do you think about this?

@stale

stale bot commented May 10, 2019

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@stale

stale bot commented Jun 9, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@stale

stale bot commented Sep 8, 2019

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

@stale

stale bot commented Oct 8, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@tgross
Member

tgross commented Jan 6, 2020

Re-opening this as per the discussion in #6620.

@tgross tgross self-assigned this Jan 6, 2020
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jan 6, 2020
@tgross tgross moved this from Needs Triage to Triaged in Nomad - Community Issues Triage Jan 6, 2020
@tgross tgross added this to the near-term milestone Jan 6, 2020
@tgross tgross added the type/bug label Jan 6, 2020
@tgross
Member

tgross commented Jan 6, 2020

A couple of notes, after some discussion w/ my colleagues:

  • Raft leadership is distinct from Nomad leadership; the raft leader becomes the Nomad leader, but there's an opportunity for overlap during the transition, as you've noted here (a simplified sketch of this distinction follows the list below). And in any case, in raft, leaders with different terms can run concurrently; this is the expected behavior during a partition, but the stale leader can't append new updates to the log, which is the raft invariant.
  • The check we do for Nomad leadership uses a stale view of the raft leadership
    nomad/rpc.go#L471
  • What I was saying in goroutine leaks when making nomad api calls #6620 (comment) is not quite correct: because any update operations (which require consistency) go through the raft log anyway, the fact that there's an overlap in Nomad leadership during the transition should impact performance but not correctness.
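
As an illustration of that distinction (a simplified sketch with a hypothetical establishedLeader flag, not the actual check in nomad/rpc.go): raft's view of leadership flips as soon as an election is won or lost, while the Nomad-level view only changes once establishLeadership or revokeLeadership completes, and that lag is the overlap window.

package sketch

import (
	"sync/atomic"

	"github.com/hashicorp/raft"
)

// server is a stand-in for Nomad's Server type (illustration only).
type server struct {
	raft              *raft.Raft
	establishedLeader int32 // hypothetical flag maintained by the leader loop
}

// isRaftLeader reflects raft's instantaneous view of leadership.
func (s *server) isRaftLeader() bool {
	return s.raft.State() == raft.Leader
}

// isEstablishedLeader reflects the Nomad-level view: it changes only after
// establishLeadership / revokeLeadership finish, so when those are slow it
// can lag isRaftLeader by many seconds.
func (s *server) isEstablishedLeader() bool {
	return atomic.LoadInt32(&s.establishedLeader) == 1
}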

It would be good if we can reduce the window for leadership transitions just from a performance perspective, but in #4749 (comment) it looks like the machines may also just be seriously underspecced -- the AWS t2 class machines really aren't suitable for this kind of multithreaded workload.

That being said, we still need to correct the problem here with the goroutine leak, which according to the work @tantra35 did in #6620 looks to be related to RPC connections being left open. That should narrow down the area where we need to investigate. I'll pick up the reproductions I was doing in #6620 and see if I can invoke some leader election flapping.

@tgross tgross modified the milestones: near-term, unscheduled Jan 9, 2020
@notnoop
Contributor

notnoop commented Jan 22, 2020

Hi @tantra35. Thank you so much for your patience and level of detail here. It helped a lot in identifying some of the issues.

I believe there are a few issues at play here that are worth fixing:

First, it does look like raft leader flapping can result in a deadlock, making much of Nomad inoperable, and it may lead to goroutine growth while raft waits on the deadlocked path. I have some write-up and details in #6977. It's basically your fix, while trying to avoid blocking raft. Consul apparently had a similar issue in hashicorp/consul#6852. This somewhat explains why we didn't see the log messages I'd expect about stepping down, and why blocked evals spiked (because the server thought itself the leader) but never drained (because no server would pick them up or process them).
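
For illustration only, a rough sketch of the general approach as I understand it (not the actual patch in #6977): never let raft block on its leadership notification channel; a dedicated goroutine absorbs every notification immediately and hands the slow consumer only the most recent value.

// forwardLatestLeadership reads every value from raft's notification channel
// as soon as it arrives, so raft itself never blocks on the send, and exposes
// only the latest leadership state to the (possibly slow) consumer.
func forwardLatestLeadership(raftNotifyCh <-chan bool, shutdownCh <-chan struct{}) <-chan bool {
	out := make(chan bool, 1) // capacity 1: holds only the newest state
	go func() {
		for {
			select {
			case v := <-raftNotifyCh:
				// Drop any stale, undelivered value, then store the new one.
				select {
				case <-out:
				default:
				}
				out <- v
			case <-shutdownCh:
				return
			}
		}
	}()
	return out
}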

Second, there is the hung RPC connections issue of #6620. I conjecture that raft hanging can cause some RPCs to hang, especially since your patch here seems to have addressed the problem there and the blocked_eval spike is correlated with hung RPC connections. I'm still researching how your cluster got into this situation, but now I have some ideas for how to reproduce it.

Third, even if an RPC connection hangs due to a raft deadlock, I would have expected yamux to time out and drop the connection eventually. We are looking into this too and may have a bug there. Sadly, when chaining calls, net/rpc and hashicorp/net-rpc-msgpackrpc don't provide a mechanism (that I know of) to propagate cancellation, so it's expected that in-flight RPCs complete even if the triggering HTTP API call is cancelled, but I would have expected the RPC call to return quickly or time out.
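
A hedged sketch of the usual caller-side workaround when the RPC layer itself cannot propagate cancellation (a generic pattern, not something Nomad does today): bound how long the caller waits, accepting that the in-flight goroutine may linger, which is exactly why a hang deeper in raft or yamux still needs its own fix.

package sketch

import (
	"context"
	"fmt"
	"time"
)

// callWithTimeout bounds how long the caller waits on a blocking RPC-style
// call. The call keeps running in its own goroutine; this only unblocks the
// caller, it does not cancel the in-flight work.
func callWithTimeout(timeout time.Duration, call func() error) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	errCh := make(chan error, 1) // buffered so the goroutine can always send and exit
	go func() { errCh <- call() }()

	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return fmt.Errorf("rpc timed out after %s: %w", timeout, ctx.Err())
	}
}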

As I work on reproduction, would love the following info:

Thank you again for info and patience - we will get to the bottom of this :).

@tantra35
Contributor Author

tantra35 commented Jan 23, 2020

@notnoop Hello, first of all great work!!!

  1. We use Nomad 0.9.7 in production (0.10.2 is only in testing now, to decide whether it is ready enough to roll out to prod). Yes, the patch in Handle Nomad leadership flapping (attempt 2) #6977 is compatible with both versions, but for now we use our approach (described here), and for more than 30 days now we have not seen any problems; without the patch we saw this problem every 3-5 days.

  2. I can't understand why the leader flapped; the control-plane servers are not overloaded and sit at about 4-5% CPU load. We use t2.medium instances in AWS. Below I provide CPU monitoring stats over a 30-day interval (so you can see that the instance is not overloaded) from one of our control-plane servers (the others show the same statistics):
    [screenshot: 30-day CPU utilization of a control-plane server]. At the moment of flapping I can't see any anomaly in server behavior (that was the first thing I thought of), except some TCP retransmits (about 1-2 per minute, which does not look like a large number).

  3. We don't make any other API calls - only v1/client/allocation/<alloc-id>/stats (a minimal example of such a call is sketched below).
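
For context, a minimal illustration of such a call from Go with a client-side timeout (a sketch; the address, allocation ID, and timeout are placeholders and assumptions, not the exact collector code):

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	allocID := "<alloc-id>"                            // placeholder: a real allocation ID goes here
	client := &http.Client{Timeout: 10 * time.Second} // bound the wait in case the server hangs

	// Nomad client allocation stats endpoint (default HTTP port 4646 assumed).
	url := fmt.Sprintf("http://127.0.0.1:4646/v1/client/allocation/%s/stats", allocID)
	resp, err := client.Get(url)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}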

Also, I must say that despite our approach we still observed a leakage of sockets, but this was due to cross-region API calls (we had one place from which we made API calls to multiple regions), and it did not correlate with leader flapping.

But that was some time ago; we changed our statistics collection scheme - we placed a collector in every region, which eliminates cross-region calls - and then we upgraded to Nomad 0.9.x (where we forgot to add the patch described here and got the problems again).

Also, we patched yamux (https://github.com/hashicorp/yamux) to add a timeout to the OpenStream and Open calls, and replaced these calls in client_rpc.go (https://github.com/hashicorp/nomad/blob/master/nomad/client_rpc.go#L220) as shown below. This patch eliminates hanging forever in cases where a connection cannot be established inside a yamux stream - just a guarantee that it will never hang here, because we observed exactly this in the stack traces in #6620:

diff --git "a/nomad/client_rpc.go" "b/nomad/client_rpc.go"
index ca8db2336..405858bd4 100644
--- "a/nomad/client_rpc.go"
+++ "b/nomad/client_rpc.go"
@@ -217,7 +217,7 @@ func (s *Server) serverWithNodeConn(nodeID, region string) (*serverParts, error)
 // Yamux session for the node and the method to be called.
 func NodeRpc(session *yamux.Session, method string, args, reply interface{}) error {
 	// Open a new session
-	stream, err := session.Open()
+	stream, err := session.OpenTimeout(10*time.Second)
 	if err != nil {
 		return err
 	}
@@ -244,7 +244,7 @@ func NodeRpc(session *yamux.Session, method string, args, reply interface{}) err
 // the callers responsibility to close the connection if there is no error.
 func NodeStreamingRpc(session *yamux.Session, method string) (net.Conn, error) {
 	// Open a new session
-	stream, err := session.Open()
+	stream, err := session.OpenTimeout(10*time.Second)
 	if err != nil {
 		return nil, err
 	} 

Here is the yamux patch:

diff --git a/vendor/github.com/hashicorp/yamux/session.go b/vendor/github.com/hashicorp/yamux/session.go
index a80ddec35..37c2f2ea1 100644
--- a/vendor/github.com/hashicorp/yamux/session.go
+++ b/vendor/github.com/hashicorp/yamux/session.go
@@ -151,8 +151,41 @@ func (s *Session) Open() (net.Conn, error) {
 	return conn, nil
 }
 
-// OpenStream is used to create a new stream
+func (s *Session) OpenTimeout(timeout time.Duration) (net.Conn, error) {
+	conn, err := s.OpenStreamTimeout(timeout)
+	if err != nil {
+		return nil, err
+	}
+	return conn, nil
+}
+
 func (s *Session) OpenStream() (*Stream, error) {
+	return s._openStream(nil, func(stream *Stream) error {
+		return stream.sendWindowUpdate()
+	})
+}
+
+func (s *Session) OpenStreamTimeout(timeout time.Duration) (*Stream, error) {
+	t := timerPool.Get()
+	timer := t.(*time.Timer)
+	timer.Reset(timeout)
+	defer func() {
+		if ! timer.Stop() {
+			select {
+			case <-timer.C:
+			default:
+			}
+		}
+		timerPool.Put(t)
+	}()
+
+	return s._openStream(timer.C, func(stream *Stream) error {
+		return stream.sendWindowUpdateTimeout(timer.C)
+	})
+}
+
+// OpenStream is used to create a new stream
+func (s *Session) _openStream(timeout <-chan time.Time, fn func(*Stream) error) (*Stream, error) {
 	if s.IsClosed() {
 		return nil, ErrSessionShutdown
 	}
@@ -165,6 +198,8 @@ func (s *Session) OpenStream() (*Stream, error) {
 	case s.synCh <- struct{}{}:
 	case <-s.shutdownCh:
 		return nil, ErrSessionShutdown
+	case <-timeout:
+		return nil, ErrTimeout
 	}
 
 GET_ID:
@@ -185,7 +220,7 @@ GET_ID:
 	s.streamLock.Unlock()
 
 	// Send the window update to create
-	if err := stream.sendWindowUpdate(); err != nil {
+	if err := fn(stream); err != nil {
 		select {
 		case <-s.synCh:
 		default:
@@ -340,20 +375,25 @@ func (s *Session) waitForSendErr(hdr header, body io.Reader, errCh chan error) e
 	timer := t.(*time.Timer)
 	timer.Reset(s.config.ConnectionWriteTimeout)
 	defer func() {
-		timer.Stop()
-		select {
-		case <-timer.C:
-		default:
+		if !timer.Stop() {
+			select {
+			case <-timer.C:
+			default:
+			}
 		}
 		timerPool.Put(t)
 	}()
 
+	return s.waitForSendErrTimeout(timer.C, hdr, body, errCh)
+}
+
+func (s *Session) waitForSendErrTimeout(timeout <-chan time.Time, hdr header, body io.Reader, errCh chan error) error {
 	ready := sendReady{Hdr: hdr, Body: body, Err: errCh}
 	select {
 	case s.sendCh <- ready:
 	case <-s.shutdownCh:
 		return ErrSessionShutdown
-	case <-timer.C:
+	case <-timeout:
 		return ErrConnectionWriteTimeout
 	}
 
@@ -362,7 +402,7 @@ func (s *Session) waitForSendErr(hdr header, body io.Reader, errCh chan error) e
 		return err
 	case <-s.shutdownCh:
 		return ErrSessionShutdown
-	case <-timer.C:
+	case <-timeout:
 		return ErrConnectionWriteTimeout
 	}
 }
@@ -625,6 +665,7 @@ func (s *Session) incomingStream(id uint32) error {
 func (s *Session) closeStream(id uint32) {
 	s.streamLock.Lock()
 	if _, ok := s.inflight[id]; ok {
+		delete(s.inflight, id)
 		select {
 		case <-s.synCh:
 		default:
diff --git a/vendor/github.com/hashicorp/yamux/stream.go b/vendor/github.com/hashicorp/yamux/stream.go
index aa2391973..b2a523599 100644
--- a/vendor/github.com/hashicorp/yamux/stream.go
+++ b/vendor/github.com/hashicorp/yamux/stream.go
@@ -237,6 +237,16 @@ func (s *Stream) sendFlags() uint16 {
 // sendWindowUpdate potentially sends a window update enabling
 // further writes to take place. Must be invoked with the lock.
 func (s *Stream) sendWindowUpdate() error {
+	return s._sendWindowUpdate(s.session.waitForSendErr)
+}
+
+func (s *Stream) sendWindowUpdateTimeout(timeout <-chan time.Time) error {
+	return s._sendWindowUpdate(func(hdr header, body io.Reader, errCh chan error) error {
+		return s.session.waitForSendErrTimeout(timeout, hdr, body, errCh)
+	})
+}
+
+func (s *Stream) _sendWindowUpdate(fn func(hdr header, body io.Reader, errCh chan error) error) error {
 	s.controlHdrLock.Lock()
 	defer s.controlHdrLock.Unlock()
 
@@ -264,7 +274,7 @@ func (s *Stream) sendWindowUpdate() error {
 
 	// Send the header
 	s.controlHdr.encode(typeWindowUpdate, flags, s.id, delta)
-	if err := s.session.waitForSendErr(s.controlHdr, nil, s.controlErr); err != nil {
+	if err := fn(s.controlHdr, nil, s.controlErr); err != nil {
 		return err
 	}
 	return nil
diff --git a/vendor/github.com/hashicorp/yamux/util.go b/vendor/github.com/hashicorp/yamux/util.go
index 8a73e9249..ff1a31b7c 100644
--- a/vendor/github.com/hashicorp/yamux/util.go
+++ b/vendor/github.com/hashicorp/yamux/util.go
@@ -9,7 +9,12 @@ var (
 	timerPool = &sync.Pool{
 		New: func() interface{} {
 			timer := time.NewTimer(time.Hour * 1e6)
-			timer.Stop()
+			if !timer.Stop() {
+				select {
+					case <-timer.C:
+					default:
+				}
+			}
 			return timer
 		},
 	}

@tgross
Member

tgross commented Mar 3, 2020

@tantra35 I just wanted to follow up with you on this issue. On Feb 19 we shipped 0.10.4. My colleague @notnoop landed a fix for leadership flapping (#6977), which, combined with the RPC fix we shipped in 0.10.3, we believe should fix the problems in this issue and in #6620.

We continue to recommend against using t2 instance classes, but we think this should wrap up this issue for you.

@tgross tgross modified the milestones: unscheduled, 0.10.4 Mar 3, 2020
@tgross tgross moved this from Triaged to Waiting Reply in Nomad - Community Issues Triage Mar 3, 2020
@tgross
Member

tgross commented Mar 5, 2020

Based on the comments in #6620 (comment), I'm going to close this issue and we can follow up on any remaining issues in #4606.

@tgross tgross closed this as completed Mar 5, 2020
Nomad - Community Issues Triage automation moved this from Waiting Reply to Done Mar 5, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 12, 2022