
Improved Allocation Handling on Lost Clients #10953

Closed
mikenomitch opened this issue Jul 27, 2021 · 2 comments · Fixed by #12476
Comments

mikenomitch (Contributor) commented Jul 27, 2021

Proposal

An option should be added so that allocations on “lost” Nomad clients are not restarted when those clients reconnect to the server cluster.

Use-cases

This helps Nomad deployments in high-latency environments with clients geographically distant from the server cluster. Nomad clients on LTE connections (on IoT devices, for instance) might regularly lose connectivity for minutes at a time. In these cases, when the Nomad client reconnects to the server cluster, ideally everything resumes functioning as normal.

Proposal Details

Currently, if a client fails to heartbeat to the server cluster within the heartbeat_grace period and stop_after_client_disconnect is not set, the allocation continues running on that client.
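For context, both of those knobs already exist: heartbeat_grace is set in the server block of the agent configuration, and stop_after_client_disconnect is a group-level job option. A minimal job-spec sketch with illustrative names and values (not taken from this issue):

```hcl
job "example" {
  datacenters = ["dc1"]

  group "edge" {
    # Existing option: stop the allocation on the client once it has been
    # disconnected from the servers for this long. When left unset (the
    # default), the allocation keeps running locally while disconnected.
    stop_after_client_disconnect = "5m"

    task "app" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}
```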

Under some conditions, a replacement allocation will be scheduled on a new client node. If a client node reconnects and a replacement allocation is running elsewhere (so the total number of running allocations exceeds the expected count), ideally the allocation on the node with the lower affinity score would be stopped. In the case of equal affinities, I think it makes sense for the original allocation to continue, but random selection would also be fine.

Under some conditions, a replacement allocation cannot be scheduled (for example, when no other node matches the constraints). In this case, the node would ideally just reconnect and the existing allocation would keep running without a restart. Currently, it is restarted.

I don’t think this would require any new configuration, but if some users want to keep the restart behavior, then a “restart_on_client_reconnect” boolean could be added to the job config.
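If that opt-out were added, it would presumably sit alongside stop_after_client_disconnect at the group level. A purely hypothetical sketch of the proposed flag (restart_on_client_reconnect is not an existing Nomad parameter):

```hcl
group "edge" {
  # Hypothetical flag sketched in this proposal; NOT an existing Nomad
  # option. Setting it would opt back into today's behaviour of
  # restarting the allocation when its client reconnects.
  restart_on_client_reconnect = true

  task "app" {
    driver = "docker"

    config {
      image = "nginx:alpine"
    }
  }
}
```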

Changing the meaning of lost

Allowing lost allocations to transition back to running is a significant change that breaks backward compatibility. To be clear, I (@schmichael) think it's worth it, but it will take a considerable amount of testing and documentation to ensure a smooth transition.

Currently Allocation.ClientStatus=lost is a terminal state along with complete (intentionally stopped or a batch job that completed successfully) and failed (as determined by the restart policy).

Everywhere that calls Allocation.ClientTerminalStatus will need to be audited for correctness.

Any project, such as the Autoscaler, that relies on differentiating terminal from non-terminal allocation statuses also needs a migration plan.

Reproduction Steps

When no replacement allocation is made:

  • Create a Nomad cluster with a single client
  • Run an allocation on that client
  • From the client, break the connection to the server. The simplest way is probably to edit pf.conf and block the IP addresses of the server cluster.
  • Note that the client is down and the allocation is now pending.
  • Re-connect the client to the server (remove any pf.conf changes and re-run pfctl)
  • Wait until the client reconnects and is marked as ready
  • Note that when the client comes back, a new allocation will replace the old one.

[screenshot: new-alloc]

Ideally, this would be the same allocation as before continuing to run.

With a replacement allocation

  • Create a Nomad cluster with two clients
  • Run a job with a count of one and give it an affinity for one of the clients (see the sketch after this list).
  • Note that the allocation is running on that client.
  • From that client, break the connection to the server.
  • Note that the allocation is rescheduled onto the other client.
  • On both clients, the allocation should still be running (but the server cluster will only know about one of them).
  • Re-connect the client to the server
  • Wait until the client reconnects and is marked as ready
  • Note that the allocation is killed on the reconnecting client and continues to run on the other client.
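For the second step above, the affinity could look something like this sketch (the attribute, value, and weight are only illustrative):

```hcl
job "pinned" {
  datacenters = ["dc1"]

  group "app" {
    count = 1

    # Prefer (but do not require) one specific client so the scheduler's
    # node ranking favours it when placing the single allocation.
    affinity {
      attribute = "${node.unique.name}"
      value     = "client-1"
      weight    = 100
    }

    task "app" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}
```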

Ideally, the allocation on the node with the better rank is the one that would continue to run.

I think case one is more important than case two, as there could be some corner cases I'm not thinking about in case two.

@mikenomitch mikenomitch changed the title Improved Lost Allocation Handling Improved Allocation Handling on Lost Clients Jul 27, 2021
jrasell (Member) commented Jul 28, 2021

The allocation.ClientStatus is used within the Nomad Autoscaler in a couple of places, none of which I believe will be adversely affected by this change. These places are the Nomad APM and the scaleutils node-selector, which utilise this field in order to:

  • a) filter out allocation resources from utilisation totals, and
  • b) filter nodes which have the least number of non-terminal allocations.

In the first situation, this new behaviour might cause some flapping during scale evaluations, which could easily be counteracted by configuration and policy settings. The second situation has the potential for nodes to be marked as empty and eligible for termination; however, I believe the filtering of the node pool based on node status would protect against this. At a higher level, I don't believe the environments described here fit a typical autoscaling environment.
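For example, the flapping concern in the first situation could likely be dampened with a longer cooldown and evaluation interval in the scaling policy. A rough sketch of a group-level policy using the Nomad APM (values are illustrative):

```hcl
group "app" {
  count = 3

  scaling {
    enabled = true
    min     = 1
    max     = 10

    policy {
      # A generous cooldown and evaluation interval give a briefly lost
      # client time to reconnect before its allocations skew the metrics
      # used for scaling decisions.
      cooldown            = "5m"
      evaluation_interval = "1m"

      check "cpu" {
        source = "nomad-apm"
        query  = "avg_cpu"

        strategy "target-value" {
          target = 70
        }
      }
    }
  }
}
```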

In the situation described, where a node becomes lost for some minutes and the autoscaler is enabled, it may be prudent to add additional safety checks around the stability of the job group before attempting to scale. Currently, job groups that are mid-deployment are protected from scaling; we might want to extend this to also cover job groups where allocations are starting or stopping to replace lost or re-found allocations.

In a more general integration sense, where allocations are consumed and processed via blocking queries, periodic listing calls, or the event stream, some mitigation may be needed as detailed above, depending on the application's internal behaviour. That being said, if all updates trigger the correct API responses (an event is emitted whenever the alloc status changes), then a well-behaved processor should be able to deal with such changes correctly.

github-actions bot commented Oct 9, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 9, 2022