client: do not restart restored tasks until server is contacted #5669

Merged: 6 commits, May 14, 2019

Conversation

schmichael (Member)

Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.

The PR also includes a small mock_driver fix and a systemd unit file improvement from failed testing efforts.

I gave up on e2e testing reboots: there were many non-deterministic aspects, and even with the correct assertions for each possible series of events I couldn't find a way to prevent false positives (the old buggy code would still often pass the test due to the specific setup of the test environment).
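
For context, a minimal Go sketch of the waiting behavior this PR introduces. It is not the actual client/allocrunner/taskrunner code: the struct layout, the shutdownCtx field, and the maybeWaitOnServers name are assumptions for illustration, while serversContactedCh and the trace message come from the diff excerpts discussed in the review below.

    package taskrunner

    import (
        "context"

        hclog "github.com/hashicorp/go-hclog"
    )

    // TaskRunner lists only the fields relevant to this sketch; the real
    // struct has many more.
    type TaskRunner struct {
        logger             hclog.Logger
        shutdownCtx        context.Context // assumed shutdown signal
        serversContactedCh <-chan struct{} // closed once servers have been contacted
    }

    // maybeWaitOnServers is a hypothetical helper: a restored task that
    // failed to reattach blocks here until the client has pulled allocations
    // from the servers (or is shutting down) before deciding whether to
    // restart.
    func (tr *TaskRunner) maybeWaitOnServers() {
        select {
        case <-tr.shutdownCtx.Done():
            // Client is shutting down; never restart the task.
            return
        case <-tr.serversContactedCh:
            // Closed by the client after the first successful pull of
            // allocations from the servers.
            tr.logger.Trace("server contacted; unblocking waiting task")
        }
    }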

}

return h.TaskStatus(), nil

schmichael (Member, Author):

Ended up not using this code during testing, but it seems generally safe and useful.

helper/gate/gate.go (outdated; resolved)
schmichael requested a review from notnoop on May 10, 2019
return
case <-tr.serversContactedCh:
tr.logger.Trace("server contacted; unblocking waiting task")
}
schmichael (Member, Author):

Should there be a timeout after which we allow the task to restart? Some people may be relying on the fact that disconnected Nomad nodes will continue to run their tasks.

Alternatively we could have a client config and/or jobspec parameter for controlling this behavior.
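
For illustration only, a sketch of what such a timeout escape hatch could look like, reusing the sketched TaskRunner from the example near the top of this page (add "time" to its imports). This variant was not adopted (see the discussion below), and restartTimeout is a hypothetical parameter.

    // waitWithRestartTimeout is a hypothetical variant that was NOT adopted:
    // after restartTimeout elapses without server contact, the restored task
    // is allowed to restart anyway, preserving the old disconnected-node
    // behavior. restartTimeout would presumably come from client config or
    // the jobspec, as suggested above.
    func (tr *TaskRunner) waitWithRestartTimeout(restartTimeout time.Duration) {
        select {
        case <-tr.shutdownCtx.Done():
            return
        case <-tr.serversContactedCh:
            tr.logger.Trace("server contacted; unblocking waiting task")
        case <-time.After(restartTimeout):
            tr.logger.Warn("servers not contacted before timeout; restarting anyway")
        }
    }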

Contributor:

I'm not sure - but I'd probably avoid adding more knobs to maintain until we hear clear demand.

To me the problem is having a partially running alloc, with some tasks not restarted, while the Nomad agent isn't running, potentially for hours. I find it easier to reason about recovery if a client that starts without being able to reach a server behaves the same as a client that doesn't start at all. Adding a timeout seems to complicate the story for such a narrow case without really addressing the bigger problem of a partially failing/running alloc while the client is gone.

schmichael (Member, Author):

My one big concern is that when this is released it will break the ability for disconnected nodes to restart their tasks.

Right now completely disconnected nodes can fully reboot and will restart anything they were running before regardless of whether or not contact is ever made with servers.

It's difficult to know if anyone is relying on this behavior because it is much more of an emergent property of the client's code than an intentional and documented design choice.

schmichael (Member, Author):

Discussed offline and decided to block indefinitely. If a client can't contact servers in a timely fashion (seconds), it's likely to have its non-system allocs marked as lost and rescheduled elsewhere anyway, so there's even less reason to restart at that point.

notnoop (Contributor) left a comment:

The code and approach lgtm.

// serversContactedCh is passed to TaskRunners so they can detect when
// servers have been contacted for the first time in case of a failed
// restore.
serversContactedCh <-chan struct{}
Contributor:

This is more precisely "fetched the expected running allocations from the server", not heartbeat/registration/etc. Not sure if it's worth making the distinction or where to add it.

schmichael (Member, Author):

Yeah, we've never had good nomenclature for this. The code base often talks about "syncing" allocs, but since client->server and server->client updates are completely independent code paths it's always ambiguous to say "syncing".

  • syncedFromServersCh
  • allocsPulledCh
  • allocsRetrievedCh
  • unblockedRestoredCh - we could go with a name related to its use instead of the event/state it represents, but that doesn't seem as good.

Any thoughts/preferences? allocsPulledCh might be my preference as we use the term pulled in related code/logging.

Contributor:

I'd be OK with clarifying in a comment while keeping the field name the same.


schmichael added a commit that referenced this pull request on May 14, 2019
Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.

Refactoring of 104067b

Switch the MarkLive method for a chan that is closed by the client.
Thanks to @notnoop for the idea!

The old approach called a method on most existing ARs and TRs on every
runAllocs call. The new approach does a once.Do call in runAllocs to
accomplish the same thing with less work. Able to remove the gate
abstraction that did much more than was needed.

Registration and restoring allocs don't share state or depend on each
other in any way (syncing allocs with servers is done outside of
registration).

Since restoring is synchronous, start the registration goroutine first.

For nodes with lots of allocs to restore or close to their heartbeat
deadline, this could be the difference between becoming "lost" or not.
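
A minimal sketch of the client-side signal described in the commit messages above. The Client struct shape and the runAllocs signature are simplified assumptions; serversContactedCh and the once.Do-in-runAllocs idea are taken from the text above.

    package client

    import "sync"

    // Client lists only the fields relevant to this sketch.
    type Client struct {
        // serversContactedCh is closed exactly once, the first time
        // allocations are pulled from the servers, unblocking any restored
        // task runners waiting on it.
        serversContactedCh   chan struct{}
        serversContactedOnce sync.Once
    }

    func (c *Client) runAllocs() {
        // The first successful pull of allocations from the servers closes
        // the channel; later calls are no-ops thanks to once.Do.
        c.serversContactedOnce.Do(func() {
            close(c.serversContactedCh)
        })
        // ... diff the pulled allocations against running ones and reconcile ...
    }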