Remove unnecessary locking and serverlist syncing in heartbeats #5654

Merged: 3 commits, May 8, 2019

Conversation

preetapan (Member)

This removes an unnecessary shared lock between discovery and heartbeating
which was causing heartbeats to be missed on retries when a single server
fails. It also includes a drive-by fix to start the periodic server shuffler goroutine.

I took some of the approach in #4688, but have kept the triggerDiscovery logic upon failure, with the change that the server list is not retouched if the discovered servers are the same as the existing servers.

Fixes #4666

Prior to this change, running a three-server cluster where one of the servers is arbitrarily paused caused many unnecessary reschedule events due to the clients being considered lost. With this change, I see a dramatic difference, with the caveat that if a server is paused when a client is about to send its heartbeat and there is not enough grace period left for it to retry another server, it will still be considered a missed heartbeat.
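As a rough illustration of the "don't retouch the server list if discovery returns the same servers" check described above, here is a minimal, standalone sketch that compares a discovered list against the current one by address. The helper name and the use of plain address strings are assumptions for this example, not the client's actual types:

```go
package main

import "fmt"

// serversEqualByAddr reports whether a freshly discovered server list and the
// current one contain the same addresses, ignoring order. Plain address
// strings and the helper name are assumptions for this sketch; the real
// client compares richer server values.
func serversEqualByAddr(discovered, current []string) bool {
	if len(discovered) != len(current) {
		return false
	}
	counts := make(map[string]int, len(current))
	for _, addr := range current {
		counts[addr]++
	}
	for _, addr := range discovered {
		if counts[addr] == 0 {
			return false
		}
		counts[addr]--
	}
	return true
}

func main() {
	current := []string{"10.0.0.1:4647", "10.0.0.2:4647", "10.0.0.3:4647"}
	discovered := []string{"10.0.0.3:4647", "10.0.0.1:4647", "10.0.0.2:4647"}
	// Identical sets: skip SetServers so the client's existing order is kept.
	fmt.Println(serversEqualByAddr(discovered, current)) // true
}
```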

client/client.go Outdated
@@ -2432,6 +2427,26 @@ DISCOLOOP:

consulLogger.Info("discovered following servers", "servers", nomadServers)

// Check if the list of servers discovered is identical to the list we already have
Member

Hm, I think this code would be better placed in servers.SetServers. That way it could be done inside the server manager's lock and prevent concurrent RPCs from rotating the list and making this comparison stale.

Not a big deal as there's no data race here, just the possibility of a comparison against stale data which shouldn't do much harm.

Member

(It would be easily unit testable in server manager as well)

preetapan (Member Author), May 7, 2019

The method for comparing whether the two server lists are equal in servers.SetServers does a sort:

m.servers.Sort()

This approach avoids that sort step/modification. I can move the code over, though, and change the logic in SetServers to avoid sorting the existing servers, which modifies the observable server list.
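A hedged sketch of what the moved code could look like: the comparison runs inside the manager's lock (per the suggestion above) and sorts copies only, so the observable, shuffled server list is never reordered. The `Manager` type, its fields, and the use of address strings are illustrative assumptions, not Nomad's actual server manager:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Manager is a simplified stand-in for the client's server manager.
type Manager struct {
	mu      sync.Mutex
	servers []string
}

// serversAreEqual compares sorted copies so neither the manager's list nor
// the caller's slice is reordered. Called with m.mu already held.
func (m *Manager) serversAreEqual(incoming []string) bool {
	if len(m.servers) != len(incoming) {
		return false
	}
	a := append([]string(nil), m.servers...)
	b := append([]string(nil), incoming...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// SetServers does the comparison inside the lock, so a concurrent rotation
// cannot make the comparison stale.
func (m *Manager) SetServers(incoming []string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.serversAreEqual(incoming) {
		return false // unchanged; keep the existing (possibly shuffled) order
	}
	m.servers = incoming
	return true
}

func main() {
	m := &Manager{servers: []string{"10.0.0.2:4647", "10.0.0.1:4647"}}
	fmt.Println(m.SetServers([]string{"10.0.0.1:4647", "10.0.0.2:4647"})) // false: identical
	fmt.Println(m.servers[0])                                            // "10.0.0.2:4647": order untouched
}
```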

client/client.go Outdated
}
}
if allFound && len(nomadServers) == len(serverList) {
c.logger.Info("Not replacing server list, current server list is identical to servers discovered in Consul")
Member

Debug log level at most. This should only matter when debugging heartbeating, which is hopefully very rare!

Suggested change
-c.logger.Info("Not replacing server list, current server list is identical to servers discovered in Consul")
+c.logger.Debug("Not replacing server list, current server list is identical to servers discovered in Consul")

@@ -16,5 +16,5 @@ server {
}

# Self-elect, should be 3 or 5 for production
-bootstrap_expect = 3
+bootstrap_expect = 1
Member

This is the dev cluster config, so bootstrap_expect should match server2's and server3's.

You can override the value on the command line to use this config for a single-server cluster, or we could add a new config for single-server clusters.

dev/cluster/client2.hcl (resolved)
// Determine if they are equal
-equal := servers.Equal(m.servers)
+equal := m.serversAreEqual(servers)
preetapan (Member Author)

@schmichael this is ready for another look

@@ -315,6 +315,9 @@ func NewClient(cfg *config.Config, consulCatalog consul.CatalogAPI, consulServic
// Initialize the server manager
c.servers = servers.New(c.logger, c.shutdownCh, c)

// Start server manager rebalancing go routine
go c.servers.Start()
Contributor

Curious how this call was dropped? Was it a regression?

preetapan (Member Author)

I don't think it was ever there, so for long-running clients only failed RPCs or discovering new servers would cause reshuffling.
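For illustration, a minimal sketch of what a periodic shuffler goroutine started via `go c.servers.Start()` might look like; the ticker interval, shutdown channel, and shuffle details are assumptions, not the real manager's rebalancing logic:

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// Manager is a simplified stand-in; the real server manager rebalances on a
// timer with its own interval and shutdown handling.
type Manager struct {
	mu         sync.Mutex
	servers    []string
	shutdownCh chan struct{}
}

// Start periodically reshuffles the server list until shut down.
func (m *Manager) Start() {
	ticker := time.NewTicker(5 * time.Minute) // assumed interval
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			m.shuffle()
		case <-m.shutdownCh:
			return
		}
	}
}

func (m *Manager) shuffle() {
	m.mu.Lock()
	defer m.mu.Unlock()
	rand.Shuffle(len(m.servers), func(i, j int) {
		m.servers[i], m.servers[j] = m.servers[j], m.servers[i]
	})
}

func main() {
	m := &Manager{
		servers:    []string{"10.0.0.1:4647", "10.0.0.2:4647", "10.0.0.3:4647"},
		shutdownCh: make(chan struct{}),
	}
	// Without this call, the list would only be reshuffled on RPC failures
	// or when discovery finds new servers.
	go m.Start()

	m.shuffle() // invoke once directly so the example shows an effect
	m.mu.Lock()
	fmt.Println(m.servers)
	m.mu.Unlock()
	close(m.shutdownCh)
}
```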

client/servers/manager_test.go (resolved)
copy.Sort()
servers.Sort()

return copy.Equal(servers)
Contributor

Isn't it better to compare the server lists by IP address only, omitting their DC? Servers coming from the consulDiscovery() method have no DC set, whereas the current servers do.
See https://github.com/hashicorp/nomad/blob/master/client/client.go#L2436

Member

Great catch! As far as I can tell the code to prefer the local DC's server has been broken since v0.8.0. I'm going to remove the DC field and add an issue+TODO.
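A tiny sketch of the mismatch being discussed, assuming a simplified server struct: an entry discovered via Consul carries no DC, so naive struct equality never matches an existing entry that does, while comparing by address alone (or dropping the DC field) does:

```go
package main

import "fmt"

// Server is a simplified illustration, not Nomad's actual server type.
type Server struct {
	Addr string
	DC   string
}

func main() {
	current := Server{Addr: "10.0.0.1:4647", DC: "dc1"}
	discovered := Server{Addr: "10.0.0.1:4647"} // DC is empty after Consul discovery

	fmt.Println(current == discovered)           // false: DC differs even though it's the same server
	fmt.Println(current.Addr == discovered.Addr) // true: comparing by address only
}
```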

notnoop pushed a commit that referenced this pull request Jul 12, 2019
This ensures that server-to-server streaming RPC calls use the tls
wrapped connections.

Prior to this, the `streamingRpcImpl` function uses TLS for setting the header
and invoking the RPC method, but returns the unwrapped connection.
Thus, streaming writes fail with TLS errors.

This TLS streaming bug has existed since 0.8.0[1], but PR #5654[2]
exacerbated it in 0.9.2.  Prior to PR #5654, the nomad client used to
shuffle servers at every heartbeat -- `servers.Manager.setServers`[3]
always shuffled servers and was called by the heartbeat code[4].
Shuffling servers meant that a nomad client would eventually heartbeat
and establish a connection against all nomad servers.  When handling
streaming RPC calls, nomad servers used these local connections to
communicate directly with the client.  The server-to-server forwarding
logic was left mostly unexercised.

With PR #5654, a nomad client may connect to only a single server, which
caused the server-to-server streaming RPC forwarding code to be exercised
more and unearthed the problem.

[1] https://github.com/hashicorp/nomad/blob/v0.8.0/nomad/rpc.go#L501-L515
[2] #5654
[3] https://github.com/hashicorp/nomad/blob/v0.9.1/client/servers/manager.go#L198-L216
[4] https://github.com/hashicorp/nomad/blob/v0.9.1/client/client.go#L1603
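A minimal sketch of the pattern the fix describes, with illustrative names (this is not Nomad's `streamingRpcImpl`): the TLS handshake and any header exchange happen on the wrapped connection, and the wrapped connection, not the raw one, is what gets returned for subsequent streaming writes:

```go
package main

import (
	"crypto/tls"
	"net"
)

// dialStreaming is a hedged sketch of the bug class described above: if the
// handshake and header exchange happen on a TLS-wrapped connection but the
// function hands back the raw net.Conn, later streaming writes bypass TLS and
// fail. Returning the wrapped conn keeps the whole stream encrypted.
func dialStreaming(addr string, tlsCfg *tls.Config) (net.Conn, error) {
	raw, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}

	conn := tls.Client(raw, tlsCfg)
	if err := conn.Handshake(); err != nil {
		raw.Close()
		return nil, err
	}

	// Write any RPC header/preamble on the wrapped connection...
	if _, err := conn.Write([]byte("rpc-header")); err != nil { // placeholder header
		conn.Close()
		return nil, err
	}

	// ...and return the *wrapped* connection, not `raw`, so streaming writes
	// continue to go through TLS.
	return conn, nil
}

func main() {
	// Usage sketch only; a real caller would supply a server address and TLS config.
	_ = dialStreaming
}
```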