
Nomad should differentiate between crashed/rebooted and disconnected nomad client for jobs with max_client_disconnect #15144

Closed
yaroslav-007 opened this issue Nov 4, 2022 · 10 comments · Fixed by #20029

Comments

@yaroslav-007

Nomad should differentiate between a crashed/rebooted and a disconnected Nomad client for jobs with max_client_disconnect when deciding whether to relocate allocations

Nomad version

nomad version : 1.4.2

Issue

Nomad reschedules allocations of jobs with max_client_disconnect back onto a crashed system where the allocation had stopped

Reproduction steps

Vagrant project attached. Follow the README file in the attachment

Expected Result

Nomad should differentiate between a crashed and a disconnected Nomad client for jobs with max_client_disconnect. It should be able to see that there is no running allocation on the reconnected client and therefore take no further action.

Actual Result

Nomad reschedules the job (with max_client_disconnect) when the crashed system reconnects.

Attachment
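
For context, here is a minimal sketch of the kind of job spec involved. The actual reproduction lives in the attached Vagrant project, so the job name, Docker image, and disconnect window below are illustrative assumptions, not the values from the attachment:

```hcl
# Minimal illustration only - the real reproduction job is in the attachment.
job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 1

    # Allocations on a client that stops heartbeating are marked "unknown"
    # (rather than "lost") for up to this long, and a temporary replacement
    # is scheduled on another client in the meantime.
    max_client_disconnect = "1h"

    task "web" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}
```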

@tgross
Member

tgross commented Nov 21, 2022

@yaroslav-007 can you clarify the expected behavior here? Nomad servers can't know whether a Nomad client is netsplit vs crashed, because either way the client is no longer communicating with the server. And it doesn't actually matter in terms of the intended behavior of max_client_disconnect, which is to temporarily replace the allocations until the client comes back.

@yaroslav-007
Author

Hello @tgross,
Nomad should be able to detect that the allocations/Docker containers are not running on the recovered client. In that case the temporary allocations (created because of the disconnected client) could continue to run, with no need to redeploy them on the recovered node. Redeploying seems like an unnecessary action in that case.

@tgross
Member

tgross commented Nov 22, 2022

Ah yes, now I understand! I think @lgfa29 fixed that behavior in #15068, which we just shipped today in Nomad 1.4.3.

@lgfa29
Contributor

lgfa29 commented Nov 22, 2022

More specifically, this could have been caused by #12680 where a new allocation was being placed unnecessarily. Hopefully this was fixed in 1.4.3 😄

@yaroslav-007
Author

Hello,

I have checked with the latest version (1.4.3) and it seems Nomad continues to reschedule the jobs.

@tgross
Member

tgross commented Nov 28, 2022

Awesome, thanks for following up!

@tgross
Member

tgross commented Nov 30, 2022

Apologies, @yaroslav-007 followed up with me out-of-band and clarified that 1.4.3 has not fixed the bug. I'm going to reopen this and make sure it gets marked for roadmapping.

@tgross reopened this Nov 30, 2022
@tgross added the stage/accepted label (Confirmed, and intend to work on. No timeline commitment though.) and removed the stage/needs-discussion label on Nov 30, 2022
@tgross added the hcc/cst label (Admin - internal) on Nov 30, 2022
@mikenomitch
Contributor

I asked @yaroslav-007 to retest this since we've had a lot of work on this recently. Copy/pasting his Slack reply here:

The behaviour is still the same. Just to make sure: the idea of the customer is that when client A crashes/gets rebooted, client B takes the allocation (presumably as a temporary action, since Nomad thinks client A is suffering from a network disconnect). When Nomad on client A recovers, the customer expects Nomad to say: this system was not disconnected but rather crashed, so I will not relocate the allocation running on client B, as that would cause unnecessary downtime/action.

Thanks @yaroslav-007

@Juanadelacuesta self-assigned this Jan 3, 2024
@Juanadelacuesta
Member

After speaking with @yaroslav-007 we came to the conclusion that what the customer wants is more control over which allocation is kept once the disconnected node starts responding again:

  • If it was a network glitch, keep the original allocation, trusting that it never stopped working; killing it would generate an outage.
  • If it was a system crash, keep the replacement that is now serving, because restarting the original one might cause an outage.
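
As a sketch only, such per-job control could look something like the following. This block did not exist in Nomad at the time of this discussion; the block and field names here are hypothetical illustrations of the kind of knob being requested (the issue header notes it was eventually closed by #20029):

```hcl
# Hypothetical sketch - not a documented Nomad syntax at the time of this
# comment; the block and field names are illustrative only.
group "web" {
  disconnect {
    # How long to wait for a disconnected client before treating its
    # allocations as lost.
    lost_after = "1h"

    # Which allocation to keep once the node reconnects:
    #   "keep_original"    - trust that the original kept running (network glitch)
    #   "keep_replacement" - keep the replacement that is serving now (crash)
    reconcile = "keep_original"
  }
}
```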

@Juanadelacuesta added the stage/needs-discussion label and removed the stage/accepted label on Jan 5, 2024
@tgross
Member

tgross commented Jan 5, 2024

I spent a bunch of time in client/client.go somewhat recently and I have some thoughts on this that might help us, but not a solution. 😀

Something to keep in mind here is that you can restart a client agent and all the allocations will continue to run. There's no way for a client to detect a "system crash". The only thing the client can know on restart is that an allocation is no longer running because it fails to restore the allocrunner. Which is fine, because the client could be gracefully stopped and the allocation could crash, and we'd have the same situation as a "system crash" as far as the server is concerned.

So when the client restarts, it does the following:

  • start a register and heartbeat goroutine (ref client.go#L596)
  • attempt to restore handles to the running allocations in the client state DB (ref client.go#L599)
    • If the allocrunner can't be restored, the client tells the server the allocation has failed (ref client.go#L1255-L1263)
  • Then the client starts two concurrent goroutines (ref client.go#L1842):
    • One waiting for updates from the local allocation runners. This goroutine batches up allocation status updates we send to the server.
    • One polling for updates from the server via GetAllocs RPC.

It looks to me that the problem is that we mark the node as reconnecting when we get a heartbeat, so there's a race between when the first heartbeat is sent and when the stopped allocation is marked as failed. The trivial fix would be to do the restore before we start registration and heartbeat, but I'm not sure what side effects that would have. I do know that we've often discussed delaying the restore until after the GetAllocs RPC is made, so that we only try to run allocations that the server wants us to run. But that would be the opposite of what we're trying to do here.

At first I thought maybe the solution is to not mark the allocation as reconnecting until we get an update from the client on the allocation status? But if the client was disconnected because of a network split and not crashed, it would not have any allocation status updates to send, just a heartbeat! We'd be able to fix the crashed case but would introduce a bug in the non-crashed case.

Again, I don't have a solution but hopefully this context helps 😀
