
Avoid de-registering slowly restored services #5837

Merged

notnoop merged 2 commits into master from b-consul-restore-sync-2 on Jul 17, 2019
Conversation

@notnoop (Contributor) commented Jun 14, 2019

When a Nomad client restarts or is upgraded, it restores state from its
running tasks and starts the Consul sync loop. If the sync loop runs
early, it may deregister services from Consul prematurely, even when
Consul has the running service marked as healthy.

This is not ideal, as re-registering the service means potentially
waiting a whole service health check interval before declaring the
service healthy.

We attempt to mitigate this by introducing an initialization probation
period. During this time, we only deregister services and checks that
were explicitly deregistered, and leave unrecognized ones alone. This
serves as a grace period for the restore to complete, or for operators to
recover should they realize they restarted the client with the wrong Nomad
data directory.
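
For illustration, here's a minimal sketch of how such a probation window might gate deregistration in the sync loop. The identifiers (`initialProbation`, `explicitlyDeregistered`, `canDeregister`) are assumptions for this sketch, not necessarily the names used in this PR:

```go
// Sketch of probation-gated deregistration; names are illustrative.
package consul

import (
	"sync"
	"time"
)

// initialProbation is how long after client start unrecognized services
// are left alone, giving state restore time to complete.
const initialProbation = time.Minute

type ServiceClient struct {
	mu sync.Mutex

	// probationExpiry marks the end of the startup probation window.
	probationExpiry time.Time

	// explicitlyDeregistered records service IDs Nomad itself removed;
	// these are safe to deregister even during probation.
	explicitlyDeregistered map[string]bool
}

func NewServiceClient() *ServiceClient {
	return &ServiceClient{
		probationExpiry:        time.Now().Add(initialProbation),
		explicitlyDeregistered: make(map[string]bool),
	}
}

// RemoveService records an explicit removal so the sync loop knows the
// deregistration is intentional, not a service that simply hasn't been
// restored yet.
func (c *ServiceClient) RemoveService(id string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.explicitlyDeregistered[id] = true
}

// canDeregister reports whether a service ID that Nomad does not
// currently recognize may be removed during sync.
func (c *ServiceClient) canDeregister(id string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	// After probation, unrecognized Nomad-managed services are removed
	// as usual; during probation, only explicit removals proceed.
	return time.Now().After(c.probationExpiry) || c.explicitlyDeregistered[id]
}
```

The key property is that during the window, sync can only remove IDs Nomad itself asked to remove; everything else waits out the probation.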

I explored changing the order in which the Consul client sync goroutine starts, but given that task Consul registrations happen in restore/run goroutines, it would be a drastic change to the APIs to make the Consul agent wait until all restored services have initialized correctly.

@notnoop (Contributor, Author) commented Jun 14, 2019

FWIW, I had an alternative implementation in #5838, but I like this one more, as I believe it's a bit simpler and conveys the problem clearly (that we are only concerned about startup, not the steady state of the system).

@schmichael (Member) left a comment


This seems like a fine fix for an ugly bug.

Alternate approach

I think I'd prefer the approach of delaying go ServiceClient.Run() until after Client.restoreState() has returned (since restoration is synchronous), but that would also require moving RegisterTask calls into TaskRunner.Restore() for running tasks. I think that approach is optimal, but the changes are not as nicely encapsulated as they are in this PR's approach.

It would also require the server to manually call go ServiceClient.Run() itself, which complicates the dev agent, where you need to share a single ServiceClient. A quick check in the server to see if the client is enabled before starting Run would be sufficient -- or Run could detect multiple calls and no-op on subsequent ones (see the sketch below).
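
A minimal sketch of that idempotent-Run idea, assuming a `sync.Once` guard; field and method names here are assumptions, not this PR's code:

```go
// Sketch: make Run safe to call from both client and server agents.
package consul

import "sync"

type ServiceClient struct {
	// runOnce guarantees the sync loop starts at most once, so a dev
	// agent sharing one ServiceClient can call Run from both sides.
	runOnce sync.Once
}

// run would contain the real sync loop; stubbed for the sketch.
func (c *ServiceClient) run() { /* sync loop */ }

// Run starts the sync loop on the first call; later calls are no-ops.
func (c *ServiceClient) Run() {
	c.runOnce.Do(c.run)
}
```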

I guess we'd probably want to add a TaskRunner RestoreHook to encapsulate this, since all of the service registration code is currently in a Poststart hook... that makes this alternate approach even more complicated to implement, unfortunately.

This would also delay registering the Nomad service itself until the agent was closer to a working state, but I can't think of any reason that would be a significant benefit.

That being said, I'd rather have this fixed well than worry about the approach, and this is done and tested! So feel free to ship as is.

(Two outdated, resolved review comments on command/agent/consul/client.go)
@notnoop (Contributor, Author) commented Jul 16, 2019

I think I'd prefer the approach of delaying go ServiceClient.Run() until after Client.restoreState() has returned (since restoration is synchronous)

I considered this approach, but I found it too complicated, as you said. Though state restore is synchronous, TR.Run and consequently the Consul service hooks run in goroutines, so it's not sufficient to simply reorder functions. We would need to add some gating/WaitGroups to track when all task service hooks have finished executing after state restore, before we invoke service synchronization, which would require substantial changes to the client/TR/AR APIs as well as the agent code.
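
For illustration, a rough sketch of the gating that would be needed; the task names and the WaitGroup plumbing here are assumptions, not actual Nomad client APIs:

```go
// Sketch: gate the Consul sync loop on restored tasks re-registering.
package main

import (
	"fmt"
	"sync"
)

func main() {
	restoredTasks := []string{"web", "cache"} // illustrative task names

	var registered sync.WaitGroup
	registered.Add(len(restoredTasks))

	for _, name := range restoredTasks {
		go func(task string) {
			defer registered.Done()
			// In real code, the task runner's restore/run path would
			// re-register the task's services with the ServiceClient here.
			fmt.Println("re-registered services for", task)
		}(name)
	}

	// Start the sync loop only after every restored task has had the
	// chance to re-register, so sync never sees a falsely empty set.
	registered.Wait()
	fmt.Println("starting Consul sync loop")
}
```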

@notnoop merged commit 15caf5c into master on Jul 17, 2019
@notnoop deleted the b-consul-restore-sync-2 branch on Jul 17, 2019 at 04:02
notnoop pushed a commit that referenced this pull request Jul 18, 2019
github-actions bot commented Feb 7, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited the conversation to collaborators on Feb 7, 2023