
Nomad should differentiate between crashed/rebooted and disconnected nomad client for jobs with max_client_disconnect #15144

Closed
yaroslav-007 opened this issue Nov 4, 2022 · 10 comments · Fixed by #20029

Comments

@yaroslav-007

Nomad should differentiate between a crashed/rebooted and a disconnected Nomad client for jobs with max_client_disconnect when deciding whether to relocate allocations

Nomad version

nomad version : 1.4.2

Issue

Nomad reschedules allocations of jobs with max_client_disconnect back onto a crashed system where the allocation had stopped

Reproduction steps

Vagrant project attached. Follow the README file in the attachment

Expected Result

Nomad should differentiate between a crashed and a disconnected Nomad client for jobs with max_client_disconnect. It should be able to see that there is no running allocation on the reconnected client and therefore take no further action.

Actual Result

Nomad reschedules the job (with max_client_disconnect) when the crashed system reconnects.

Attachment
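
For context, here is a minimal sketch of the kind of job spec involved. The actual reproduction lives in the attached Vagrant project, so the job name, Docker image, and disconnect window below are illustrative assumptions, not the values from the attachment:

```hcl
# Minimal illustration only - the real reproduction job is in the attachment.
job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 1

    # Allocations on a client that stops heartbeating are marked "unknown"
    # (rather than "lost") for up to this long, and a temporary replacement
    # is scheduled on another client in the meantime.
    max_client_disconnect = "1h"

    task "web" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}
```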

@tgross
Member

tgross commented Nov 21, 2022

@yaroslav-007 can you clarify the expected behavior here? Nomad servers can't know whether a Nomad client is netsplit vs crashed, because either way the client is no longer communicating with the server. And it doesn't actually matter in terms of the intended behavior of max_client_disconnect, which is to temporarily replace the allocations until the client comes back.

@yaroslav-007
Author

Hello @tgross,
Nomad should be able to detect that the allocations/Docker containers are not running on the recovered client. In that case the temporary allocations (created because of the disconnected client) could continue to run, with no need to redeploy them on the recovered node. Redeploying seems like an unnecessary action in that case.

@tgross
Member

tgross commented Nov 22, 2022

Ah yes, now I understand! I think @lgfa29 fixed that behavior in #15068, which we just shipped today in Nomad 1.4.3.

@lgfa29
Contributor

lgfa29 commented Nov 22, 2022

More specifically, this could have been caused by #12680 where a new allocation was being placed unnecessarily. Hopefully this was fixed in 1.4.3 😄

@yaroslav-007
Author

Hello,

I have checked with the latest version (1.4.3) and it seems Nomad continues to reschedule the jobs.

@tgross
Member

tgross commented Nov 28, 2022

Awesome, thanks for following up!

@tgross
Member

tgross commented Nov 30, 2022

Apologies, @yaroslav-007 followed up with me out-of-band and clarified that 1.4.3 has not fixed the bug. I'm going to reopen this and make sure it gets marked for roadmapping.

@tgross reopened this Nov 30, 2022
@tgross added the stage/accepted label (Confirmed, and intend to work on. No timeline commitment though.) and removed the stage/needs-discussion label on Nov 30, 2022
@tgross added the hcc/cst label (Admin - internal) on Nov 30, 2022
@mikenomitch
Contributor

I asked @yaroslav-007 to retest this since we've had a lot of work on this recently. Copy/pasting his Slack reply here:

The behaviour is still the same. Just to make sure: the idea of the customer is that when client A crashes/gets rebooted, client B takes the allocation (presumably as a temporary action, since Nomad thinks client A is suffering from a network disconnect). When Nomad on client A recovers, the customer expects Nomad to say: this system was not disconnected but rather crashed, so I will not relocate the allocation running on client B, as that would cause unnecessary downtime/action.

Thanks @yaroslav-007

@Juanadelacuesta self-assigned this Jan 3, 2024
@Juanadelacuesta
Member

After speaking with @yaroslav-007 we came to the conclusion that what the customer wants is more control over which allocation is kept once the disconnected node starts responding again:

  • If it was a network glitch, keep the original allocation, trusting that it never stopped working; killing it would generate an outage.
  • If it was a system crash, keep the replacement that is now serving, because restarting the original one might cause an outage.
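
As a sketch only, such per-job control could look something like the following. This block did not exist in Nomad at the time of this discussion; the block and field names here are hypothetical illustrations of the kind of knob being requested (the issue header notes it was eventually closed by #20029):

```hcl
# Hypothetical sketch - not a documented Nomad syntax at the time of this
# comment; the block and field names are illustrative only.
group "web" {
  disconnect {
    # How long to wait for a disconnected client before treating its
    # allocations as lost.
    lost_after = "1h"

    # Which allocation to keep once the node reconnects:
    #   "keep_original"    - trust that the original kept running (network glitch)
    #   "keep_replacement" - keep the replacement that is serving now (crash)
    reconcile = "keep_original"
  }
}
```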

@Juanadelacuesta added the stage/needs-discussion label and removed the stage/accepted label on Jan 5, 2024
@tgross
Member

tgross commented Jan 5, 2024

I spent a bunch of time in client/client.go somewhat recently and I have some thoughts on this that might help us, but not a solution. 😀

Something to keep in mind here is that you can restart a client agent and all the allocations will continue to run. There's no way for a client to detect a "system crash". The only thing the client can know on restart is that an allocation is no longer running because it fails to restore the allocrunner. Which is fine, because the client could be gracefully stopped and the allocation could crash, and we'd have the same situation as a "system crash" as far as the server is concerned.

So when the client restarts, it does the following:

  • start a register and heartbeat goroutine (ref client.go#L596)
  • attempt to restore handles to the running allocations in the client state DB (ref client.go#L599)
    • If the allocrunner can't be restored, the client tells the server the allocation has failed (ref client.go#L1255-L1263)
  • Then the client starts two concurrent goroutines (ref client.go#L1842):
    • One waiting for updates from the local allocation runners. This goroutine batches up allocation status updates we send to the server.
    • One polling for updates from the server via GetAllocs RPC.

It looks to me that the problem is that we mark the node as reconnecting when we get a heartbeat, so there's a race between when the first heartbeat is sent and when the stopped allocation is marked as failed. The trivial fix would be to do the restore before we start registration and heartbeat, but I'm not sure what side effects that would have. I do know that we've often discussed delaying the restore until after the GetAllocs RPC is made, so that we only try to run allocations that the server wants us to run. But that would be the opposite of what we're trying to do here.

At first I thought maybe the solution is to not mark the allocation as reconnecting until we get an update from the client on the allocation status? But if the client was disconnected because of a network split and not crashed, it would not have any allocation status updates to send, just a heartbeat! We'd be able to fix the crashed case but would introduce a bug in the non-crashed case.

Again, I don't have a solution but hopefully this context helps 😀
