client: do not restart restored tasks until server is contacted #5669

Merged: 6 commits, May 14, 2019

Conversation

schmichael (Member)

Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.

The PR also includes a small mock_driver fix and a systemd unit file improvement from failed testing efforts.

I gave up on e2e testing reboots: there were many non-deterministic aspects, and even with the correct assertions for each possible series of events I couldn't find a way to prevent false positives (the old buggy code would still often pass the test due to the specific setup of the test environment).
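
For context, a minimal Go sketch of the waiting behavior this PR introduces. It is not the actual client/allocrunner/taskrunner code: the struct layout, the shutdownCtx field, and the maybeWaitOnServers name are assumptions for illustration, while serversContactedCh and the trace message come from the diff excerpts discussed in the review below.

    package taskrunner

    import (
        "context"

        hclog "github.com/hashicorp/go-hclog"
    )

    // TaskRunner lists only the fields relevant to this sketch; the real
    // struct has many more.
    type TaskRunner struct {
        logger             hclog.Logger
        shutdownCtx        context.Context // assumed shutdown signal
        serversContactedCh <-chan struct{} // closed once servers have been contacted
    }

    // maybeWaitOnServers is a hypothetical helper: a restored task that
    // failed to reattach blocks here until the client has pulled allocations
    // from the servers (or is shutting down) before deciding whether to
    // restart.
    func (tr *TaskRunner) maybeWaitOnServers() {
        select {
        case <-tr.shutdownCtx.Done():
            // Client is shutting down; never restart the task.
            return
        case <-tr.serversContactedCh:
            // Closed by the client after the first successful pull of
            // allocations from the servers.
            tr.logger.Trace("server contacted; unblocking waiting task")
        }
    }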

}

return h.TaskStatus(), nil

schmichael (Member, Author):

Ended up not using this code during testing, but it seems generally safe and useful.

helper/gate/gate.go (outdated; resolved)
schmichael requested a review from notnoop on May 10, 2019
return
case <-tr.serversContactedCh:
tr.logger.Trace("server contacted; unblocking waiting task")
}
schmichael (Member, Author):

Should there be a timeout after which we allow the task to restart? Some people may be relying on the fact that disconnected Nomad nodes will continue to run their tasks.

Alternatively we could have a client config and/or jobspec parameter for controlling this behavior.
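
For illustration only, a sketch of what such a timeout escape hatch could look like, reusing the sketched TaskRunner from the example near the top of this page (add "time" to its imports). This variant was not adopted (see the discussion below), and restartTimeout is a hypothetical parameter.

    // waitWithRestartTimeout is a hypothetical variant that was NOT adopted:
    // after restartTimeout elapses without server contact, the restored task
    // is allowed to restart anyway, preserving the old disconnected-node
    // behavior. restartTimeout would presumably come from client config or
    // the jobspec, as suggested above.
    func (tr *TaskRunner) waitWithRestartTimeout(restartTimeout time.Duration) {
        select {
        case <-tr.shutdownCtx.Done():
            return
        case <-tr.serversContactedCh:
            tr.logger.Trace("server contacted; unblocking waiting task")
        case <-time.After(restartTimeout):
            tr.logger.Warn("servers not contacted before timeout; restarting anyway")
        }
    }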

Contributor:

I'm not sure - but I'd probably avoid adding more knobs to maintain until we hear clear demand.

To me the problem is having a partially running alloc, with some tasks not restarted, while the Nomad agent isn't running, potentially for hours. I find it easier to reason about recovery if a client that starts without being able to reach a server behaves the same as a client that doesn't start at all. Adding a timeout seems to complicate the story for such a narrow case without really addressing the bigger problem of a partially failing/running alloc while the client is gone.

schmichael (Member, Author):

My one big concern is that when this is released it will break the ability for disconnected nodes to restart their tasks.

Right now completely disconnected nodes can fully reboot and will restart anything they were running before regardless of whether or not contact is ever made with servers.

It's difficult to know if anyone is relying on this behavior because it is much more of an emergent property of the client's code than an intentional and documented design choice.

schmichael (Member, Author):

Discussed offline and decided to block indefinitely. If a client can't contact servers in a timely fashion (seconds), it's likely to have its non-system allocs marked as lost and rescheduled elsewhere anyway, so there's even less reason to restart at that point.

notnoop (Contributor) left a comment:

The code and approach lgtm.

// serversContactedCh is passed to TaskRunners so they can detect when
// servers have been contacted for the first time in case of a failed
// restore.
serversContactedCh <-chan struct{}
Contributor:

This is more precisely "fetched the expected running allocations from the server", not heartbeat/registration/etc. Not sure if it's worth making the distinction or where to add it.

schmichael (Member, Author):

Yeah, we've never had good nomenclature for this. The code base often talks about "syncing" allocs, but since client->server and server->client updates are completely independent code paths it's always ambiguous to say "syncing".

  • syncedFromServersCh
  • allocsPulledCh
  • allocsRetrievedCh
  • unblockedRestoredCh - we could go with a name related to its use instead of the event/state it represents, but that doesn't seem as good.

Any thoughts/preferences? allocsPulledCh might be my preference as we use the term pulled in related code/logging.

Contributor:

I'd be OK with clarifying in a comment while keeping the field name the same.


schmichael added a commit that referenced this pull request on May 14, 2019
Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.

Refactoring of 104067b

Switch the MarkLive method for a chan that is closed by the client.
Thanks to @notnoop for the idea!

The old approach called a method on most existing ARs and TRs on every
runAllocs call. The new approach does a once.Do call in runAllocs to
accomplish the same thing with less work. Able to remove the gate
abstraction that did much more than was needed.

Registration and restoring allocs don't share state or depend on each
other in any way (syncing allocs with servers is done outside of
registration).

Since restoring is synchronous, start the registration goroutine first.

For nodes with lots of allocs to restore or close to their heartbeat
deadline, this could be the difference between becoming "lost" or not.
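
A minimal sketch of the client-side signal described in the commit messages above. The Client struct shape and the runAllocs signature are simplified assumptions; serversContactedCh and the once.Do-in-runAllocs idea are taken from the text above.

    package client

    import "sync"

    // Client lists only the fields relevant to this sketch.
    type Client struct {
        // serversContactedCh is closed exactly once, the first time
        // allocations are pulled from the servers, unblocking any restored
        // task runners waiting on it.
        serversContactedCh   chan struct{}
        serversContactedOnce sync.Once
    }

    func (c *Client) runAllocs() {
        // The first successful pull of allocations from the servers closes
        // the channel; later calls are no-ops thanks to once.Do.
        c.serversContactedOnce.Do(func() {
            close(c.serversContactedCh)
        })
        // ... diff the pulled allocations against running ones and reconcile ...
    }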