Garbage collection removes allocations that are still running #4940

Closed
joshuaclausen opened this issue Nov 29, 2018 · 1 comment · Fixed by #4965

@joshuaclausen

Nomad version

Nomad v0.8.6 (ab54ebc+CHANGES)

Operating system and Environment details

Windows Server 2012r2 and 2016

Issue

Allocations with a ClientStatus="running" and a DesiredStatus="stop" are removed by garbage collection, even if the process managed by the allocation continues running.

If the garbage collection is triggered on a server via a forced garbage collection, the server is no longer aware of the allocation, although the client still seems to track it in some fashion.

If the garbage collection is done on a client, then it seems the server still thinks the allocation exists, while the client seems to get into a weird state. The client will, for example, try to delete the allocation directory, but if the allocation's process is logging to that directory, then the client will delete everything but the file that is being used by the process.

This seems to have some impact on job updates, since a replacement allocation will be created with ClientStatus="pending" and DesiredStatus="run", but it will not actually start running until the allocation it is replacing reaches ClientStatus="complete" and DesiredStatus="stop". In my case, I'm seeing hundreds of allocations that get stuck in the pending->run state, never to actually start up, so it occurred to me this could be related.

This issue may come down to the precise definition of when an allocation is in the "terminal state". The docs don't seem to define it exactly: is it when an allocation has ClientStatus="running" and DesiredStatus="stop" (as current behavior seems to indicate), or is it when an allocation has ClientStatus="complete" and DesiredStatus="stop" (which is what I had been expecting)?

It seems the expected behavior would be to never garbage collect an allocation if its ClientStatus="running", except, possibly, after some configurable threshold. I don't think I'm hitting that kind of threshold, since I can reproduce it with the steps below within minutes after an allocation has been started.
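To make the two candidate definitions concrete, here is a minimal Go sketch. It is not Nomad's actual code: the `Alloc` type and the functions `gcEligibleLoose` and `gcEligibleStrict` are hypothetical names made up for illustration, and only the status values mentioned in this issue are modeled (real allocations have more).

```go
package main

import "fmt"

// Alloc is a hypothetical, stripped-down allocation record.
// It is NOT Nomad's real struct; it only carries the two fields
// discussed in this issue.
type Alloc struct {
	ID            string
	ClientStatus  string // "pending", "running", "complete"
	DesiredStatus string // "run", "stop"
}

// gcEligibleLoose models the behavior observed in this issue:
// an allocation is treated as terminal as soon as the server
// wants it stopped, regardless of what the client reports.
func gcEligibleLoose(a Alloc) bool {
	return a.DesiredStatus == "stop" || a.ClientStatus == "complete"
}

// gcEligibleStrict models the behavior I had been expecting:
// an allocation is only terminal once the client reports that
// the process has actually finished.
func gcEligibleStrict(a Alloc) bool {
	return a.ClientStatus == "complete"
}

func main() {
	allocs := []Alloc{
		{ID: "draining", ClientStatus: "running", DesiredStatus: "stop"},
		{ID: "finished", ClientStatus: "complete", DesiredStatus: "stop"},
		{ID: "healthy", ClientStatus: "running", DesiredStatus: "run"},
	}
	for _, a := range allocs {
		fmt.Printf("%-9s loose=%v strict=%v\n",
			a.ID, gcEligibleLoose(a), gcEligibleStrict(a))
	}
}
```

The two predicates only disagree on the "draining" case (ClientStatus="running", DesiredStatus="stop"), which is exactly the state in which I'm seeing allocations get collected.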

Reproduction steps

  1. Deploy a jobspec that runs a script that handles the nomad stop signal but does not exit for minutes or hours (simulate a long-duration graceful draining operation).
  2. Observe the allocation changes to having a ClientStatus="running" and a DesiredStatus="stop".
  3. Force a garbage collection from the server.
  4. Observe the allocation disappears. Test with "nomad status "
@joshuaclausen joshuaclausen changed the title from "Garbage collection removes allocations that are draining" to "Garbage collection removes allocations that are still running" Nov 29, 2018
dadgar added a commit that referenced this issue Dec 5, 2018
This PR fixes an edge case where we could GC an allocation that was in a
desired stop state but had not terminated yet. This can be hit if the
client hasn't shutdown the allocation yet or if the allocation is still
shutting down (long kill_timeout).

Fixes #4940
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 27, 2022