Remove unnecessary locking and serverlist syncing in heartbeats #5654

Merged: 3 commits, May 8, 2019

Conversation

preetapan (Member)

This removes an unnecessary shared lock between discovery and heartbeating
which was causing heartbeats to be missed on retries when a single server
fails. It also includes a drive-by fix to start the periodic server shuffler goroutine.

I took some of the approach in #4688, but have kept the triggerDiscovery logic upon failure, with the change that the server list is not retouched if the discovered servers are the same as the existing servers.

Fixes #4666

Prior to this change, running a three-server cluster where one of the servers is arbitrarily paused caused many unnecessary reschedule events due to the clients being considered lost. With this change, I see a dramatic difference, with the caveat that if a server is paused when a client is about to send its heartbeat and there is not enough grace period left for it to retry another server, it will still be considered a missed heartbeat.
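As a rough illustration of the "don't retouch the server list if discovery returns the same servers" check described above, here is a minimal, standalone sketch that compares a discovered list against the current one by address. The helper name and the use of plain address strings are assumptions for this example, not the client's actual types:

```go
package main

import "fmt"

// serversEqualByAddr reports whether a freshly discovered server list and the
// current one contain the same addresses, ignoring order. Plain address
// strings and the helper name are assumptions for this sketch; the real
// client compares richer server values.
func serversEqualByAddr(discovered, current []string) bool {
	if len(discovered) != len(current) {
		return false
	}
	counts := make(map[string]int, len(current))
	for _, addr := range current {
		counts[addr]++
	}
	for _, addr := range discovered {
		if counts[addr] == 0 {
			return false
		}
		counts[addr]--
	}
	return true
}

func main() {
	current := []string{"10.0.0.1:4647", "10.0.0.2:4647", "10.0.0.3:4647"}
	discovered := []string{"10.0.0.3:4647", "10.0.0.1:4647", "10.0.0.2:4647"}
	// Identical sets: skip SetServers so the client's existing order is kept.
	fmt.Println(serversEqualByAddr(discovered, current)) // true
}
```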

client/client.go Outdated
@@ -2432,6 +2427,26 @@ DISCOLOOP:

consulLogger.Info("discovered following servers", "servers", nomadServers)

// Check if the list of servers discovered is identical to the list we already have
Member

Hm, I think this code would be better placed in servers.SetServers. That way it could be done inside the server manager's lock and prevent concurrent RPCs from rotating the list and making this comparison stale.

Not a big deal as there's no data race here, just the possibility of a comparison against stale data which shouldn't do much harm.

Member

(It would be easily unit testable in server manager as well)

preetapan (Member Author), May 7, 2019

The method for comparing whether the two server lists are equal in servers.SetServers does a sort:

m.servers.Sort()

This approach avoids that sort step/modification. I can move the code over, though, and change the logic in SetServers to avoid sorting the existing servers, which modifies the observable server list.
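A hedged sketch of what the moved code could look like: the comparison runs inside the manager's lock (per the suggestion above) and sorts copies only, so the observable, shuffled server list is never reordered. The `Manager` type, its fields, and the use of address strings are illustrative assumptions, not Nomad's actual server manager:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Manager is a simplified stand-in for the client's server manager.
type Manager struct {
	mu      sync.Mutex
	servers []string
}

// serversAreEqual compares sorted copies so neither the manager's list nor
// the caller's slice is reordered. Called with m.mu already held.
func (m *Manager) serversAreEqual(incoming []string) bool {
	if len(m.servers) != len(incoming) {
		return false
	}
	a := append([]string(nil), m.servers...)
	b := append([]string(nil), incoming...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// SetServers does the comparison inside the lock, so a concurrent rotation
// cannot make the comparison stale.
func (m *Manager) SetServers(incoming []string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.serversAreEqual(incoming) {
		return false // unchanged; keep the existing (possibly shuffled) order
	}
	m.servers = incoming
	return true
}

func main() {
	m := &Manager{servers: []string{"10.0.0.2:4647", "10.0.0.1:4647"}}
	fmt.Println(m.SetServers([]string{"10.0.0.1:4647", "10.0.0.2:4647"})) // false: identical
	fmt.Println(m.servers[0])                                            // "10.0.0.2:4647": order untouched
}
```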

client/client.go Outdated
}
}
if allFound && len(nomadServers) == len(serverList) {
c.logger.Info("Not replacing server list, current server list is identical to servers discovered in Consul")
Member

Debug log level at most. This should only matter when debugging heartbeating, which is hopefully very rare!

Suggested change
-c.logger.Info("Not replacing server list, current server list is identical to servers discovered in Consul")
+c.logger.Debug("Not replacing server list, current server list is identical to servers discovered in Consul")

@@ -16,5 +16,5 @@ server {
}

# Self-elect, should be 3 or 5 for production
-bootstrap_expect = 3
+bootstrap_expect = 1
Member

This is the dev cluster config, so bootstrap_expect should match server2's and server3's.

You can override the value on the command line to use this config for a single-server cluster, or we could add a new config for single-server clusters.

dev/cluster/client2.hcl (resolved)
// Determine if they are equal
-equal := servers.Equal(m.servers)
+equal := m.serversAreEqual(servers)
preetapan (Member Author)

@schmichael this is ready for another look

@@ -315,6 +315,9 @@ func NewClient(cfg *config.Config, consulCatalog consul.CatalogAPI, consulServic
// Initialize the server manager
c.servers = servers.New(c.logger, c.shutdownCh, c)

// Start server manager rebalancing go routine
go c.servers.Start()
Contributor

Curious how this call was dropped? Was it a regression?

preetapan (Member Author)

I don't think it was ever there, so for long-running clients only failed RPCs or discovering new servers would cause reshuffling.
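For illustration, a minimal sketch of what a periodic shuffler goroutine started via `go c.servers.Start()` might look like; the ticker interval, shutdown channel, and shuffle details are assumptions, not the real manager's rebalancing logic:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// Manager is a simplified stand-in; the real server manager rebalances on a
// timer with its own interval and shutdown handling.
type Manager struct {
	mu         sync.Mutex
	servers    []string
	shutdownCh chan struct{}
}

// Start periodically reshuffles the server list until shut down.
func (m *Manager) Start() {
	ticker := time.NewTicker(5 * time.Minute) // assumed interval
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			m.shuffle()
		case <-m.shutdownCh:
			return
		}
	}
}

func (m *Manager) shuffle() {
	m.mu.Lock()
	defer m.mu.Unlock()
	rand.Shuffle(len(m.servers), func(i, j int) {
		m.servers[i], m.servers[j] = m.servers[j], m.servers[i]
	})
}

func main() {
	m := &Manager{
		servers:    []string{"10.0.0.1:4647", "10.0.0.2:4647", "10.0.0.3:4647"},
		shutdownCh: make(chan struct{}),
	}
	// Without this call, the list would only be reshuffled on RPC failures
	// or when discovery finds new servers.
	go m.Start()

	m.shuffle() // invoke once directly so the example shows an effect
	m.mu.Lock()
	fmt.Println(m.servers)
	m.mu.Unlock()
	close(m.shutdownCh)
}
```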

client/servers/manager_test.go (resolved)
copy.Sort()
servers.Sort()

return copy.Equal(servers)
Contributor

Isn't it better to compare the server lists by IP address only, omitting their DC? Servers coming from the consulDiscovery() method have no DC set, whereas the current servers do.
See https://github.com/hashicorp/nomad/blob/master/client/client.go#L2436

Member

Great catch! As far as I can tell the code to prefer the local DC's server has been broken since v0.8.0. I'm going to remove the DC field and add an issue+TODO.
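A tiny sketch of the mismatch being discussed, assuming a simplified server struct: an entry discovered via Consul carries no DC, so naive struct equality never matches an existing entry that does, while comparing by address alone (or dropping the DC field) does:

```go
package main

import "fmt"

// Server is a simplified illustration, not Nomad's actual server type.
type Server struct {
	Addr string
	DC   string
}

func main() {
	current := Server{Addr: "10.0.0.1:4647", DC: "dc1"}
	discovered := Server{Addr: "10.0.0.1:4647"} // DC is empty after Consul discovery

	fmt.Println(current == discovered)           // false: DC differs even though it's the same server
	fmt.Println(current.Addr == discovered.Addr) // true: comparing by address only
}
```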

notnoop pushed a commit that referenced this pull request Jul 12, 2019
This ensures that server-to-server streaming RPC calls use the tls
wrapped connections.

Prior to this, the `streamingRpcImpl` function uses TLS for setting the header
and invoking the RPC method, but returns the unwrapped connection.
Thus, streaming writes fail with TLS errors.

This TLS streaming bug has existed since 0.8.0[1], but PR #5654[2]
exacerbated it in 0.9.2.  Prior to PR #5654, the nomad client used to
shuffle servers at every heartbeat -- `servers.Manager.setServers`[3]
always shuffled servers and was called by the heartbeat code[4].
Shuffling servers meant that a nomad client would eventually heartbeat
and establish a connection against all nomad servers.  When handling
streaming RPC calls, nomad servers used these local connections to
communicate directly with the client.  The server-to-server forwarding
logic was left mostly unexercised.

With PR #5654, a nomad client may connect to only a single server, which
caused the server-to-server streaming RPC forwarding code to be exercised
more and unearthed the problem.

[1] https://github.com/hashicorp/nomad/blob/v0.8.0/nomad/rpc.go#L501-L515
[2] #5654
[3] https://github.com/hashicorp/nomad/blob/v0.9.1/client/servers/manager.go#L198-L216
[4] https://github.com/hashicorp/nomad/blob/v0.9.1/client/client.go#L1603
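A minimal sketch of the pattern the fix describes, with illustrative names (this is not Nomad's `streamingRpcImpl`): the TLS handshake and any header exchange happen on the wrapped connection, and the wrapped connection, not the raw one, is what gets returned for subsequent streaming writes:

```go
package main

import (
	"crypto/tls"
	"net"
)

// dialStreaming is a hedged sketch of the bug class described above: if the
// handshake and header exchange happen on a TLS-wrapped connection but the
// function hands back the raw net.Conn, later streaming writes bypass TLS and
// fail. Returning the wrapped conn keeps the whole stream encrypted.
func dialStreaming(addr string, tlsCfg *tls.Config) (net.Conn, error) {
	raw, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}

	conn := tls.Client(raw, tlsCfg)
	if err := conn.Handshake(); err != nil {
		raw.Close()
		return nil, err
	}

	// Write any RPC header/preamble on the wrapped connection...
	if _, err := conn.Write([]byte("rpc-header")); err != nil { // placeholder header
		conn.Close()
		return nil, err
	}

	// ...and return the *wrapped* connection, not `raw`, so streaming writes
	// continue to go through TLS.
	return conn, nil
}

func main() {
	// Usage sketch only; a real caller would supply a server address and TLS config.
	_ = dialStreaming
}
```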