document/improve on client restarts with missing state #9512

Open
tgross opened this issue Dec 3, 2020 · 0 comments
Labels
stage/needs-discussion theme/docs Documentation issues and enhancements

Comments

Member

tgross commented Dec 3, 2020

When a Nomad client is stopped, the allocations on that client host are left running. So long as the client isn't offline long enough to be considered "lost", when the client restarts it rummages around in its local state store to recreate handles to the running tasks. If a task stops while the Nomad client is down (whether stopped by the user or simply by crashing), the Nomad client has to restore the task when it comes back up. Any failure to do so is definitely a Nomad bug.

However, we've seen operators who remove the client data directory between restarts. There are two ways we've seen this go wrong:

  • If the client's data directory is removed while the client is shut down, the Nomad client has no way of recreating the handles to running tasks. This also means that Nomad can't shut down or restart those tasks, which could result in stale versions of applications being left running.
  • If the client's data directory is removed and the task containers are removed manually, but some other resource like an un-garbage-collected mount is left behind, this can prevent Nomad from scheduling the workload.
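
To make the first failure mode concrete, here is a toy model of what the client has to work with on restart. This is only a sketch under stated assumptions, not Nomad's actual restore code: `reconcile`, `Reconciliation`, and the task names are hypothetical, and the real state store holds far more than a set of task names.

```python
from dataclasses import dataclass, field


@dataclass
class Reconciliation:
    """Hypothetical result of comparing persisted state with reality."""
    reattach: set = field(default_factory=set)  # in state and still running
    restore: set = field(default_factory=set)   # in state but no longer running
    orphaned: set = field(default_factory=set)  # running but unknown to the client


def reconcile(persisted_tasks, running_tasks):
    """Classify tasks on client restart (toy model, not Nomad's logic)."""
    persisted = set(persisted_tasks)
    running = set(running_tasks)
    return Reconciliation(
        reattach=persisted & running,
        restore=persisted - running,
        orphaned=running - persisted,
    )


# Normal restart: "db" crashed while the client was down, so it gets restored.
normal = reconcile({"web", "db"}, {"web"})

# Data directory removed while the client was down: the persisted set is
# empty, so every running task is invisible to the client.
wiped = reconcile(set(), {"web", "db"})
```

In the `wiped` case nothing lands in `reattach` or `restore`, which mirrors the problem described above: the tasks keep running, but the client has no handle with which to stop or restart them.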

Many operators (typically those who are running on public cloud infra) will replace the client host entirely during client upgrades. But those who do not should, generally speaking, not remove the data dir on the client. If they do, they need to be aware of all the resources that can be leaked. We don't have good documentation warning about this or giving guidance on it.
