
Improved Allocation Handling on Lost Clients #10953

Closed
mikenomitch opened this issue Jul 27, 2021 · 2 comments · Fixed by #12476
Comments

mikenomitch (Contributor) commented Jul 27, 2021

Proposal

An option should be added so that allocations on “lost” Nomad clients are not restarted when those clients reconnect to the server cluster.

Use-cases

This helps Nomad deployments in high-latency environments with clients geographically distant from the server cluster. Nomad clients on LTE connections (on IoT devices, for instance) might regularly lose connectivity for minutes at a time. In these cases, when the Nomad client reconnects to the server cluster, ideally everything resumes functioning as normal.

Proposal Details

Currently, if a client fails to heartbeat to the server cluster within the heartbeat_grace period and stop_after_client_disconnect is not set, the allocation continues running on that client.
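For context, both of those knobs already exist: heartbeat_grace is set in the server block of the agent configuration, and stop_after_client_disconnect is a group-level job option. A minimal job-spec sketch with illustrative names and values (not taken from this issue):

```hcl
job "example" {
  datacenters = ["dc1"]

  group "edge" {
    # Existing option: stop the allocation on the client once it has been
    # disconnected from the servers for this long. When left unset (the
    # default), the allocation keeps running locally while disconnected.
    stop_after_client_disconnect = "5m"

    task "app" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}
```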

Under some conditions, a replacement allocation will be scheduled on a new client node. If a client node reconnects and a replacement allocation is running elsewhere (so the total number of running allocations exceeds the expected count), ideally the allocation on the node with the lower affinity score would be stopped. In the case of equal affinities, I think it makes sense for the original allocation to continue, but random selection would also be fine.

Under some conditions, a replacement allocation cannot be scheduled (for example, when no other node matches the constraints). In this case, the node would ideally just reconnect and the existing allocation would keep running without a restart. Currently, it is restarted.

I don’t think this would require any new configuration, but if some users want to keep the restart behavior, then a “restart_on_client_reconnect” boolean could be added to the job config.
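If that opt-out were added, it would presumably sit alongside stop_after_client_disconnect at the group level. A purely hypothetical sketch of the proposed flag (restart_on_client_reconnect is not an existing Nomad parameter):

```hcl
group "edge" {
  # Hypothetical flag sketched in this proposal; NOT an existing Nomad
  # option. Setting it would opt back into today's behaviour of
  # restarting the allocation when its client reconnects.
  restart_on_client_reconnect = true

  task "app" {
    driver = "docker"

    config {
      image = "nginx:alpine"
    }
  }
}
```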

Changing the meaning of lost

Allowing lost allocations to transition back to running is a significant change that breaks backward compatibility. To be clear, I (@schmichael) think it's worth it, but it will take a considerable amount of testing and documentation to ensure a smooth transition.

Currently Allocation.ClientStatus=lost is a terminal state along with complete (intentionally stopped or a batch job that completed successfully) and failed (as determined by the restart policy).

Everywhere that calls Allocation.ClientTerminalStatus will need to be audited for correctness.

Any project, such as the Autoscaler, that relies on differentiating terminal from non-terminal allocation statuses also needs a migration plan.

Reproduction Steps

When no replacement allocation is made:

  • Create a Nomad cluster with a single client
  • Run an allocation on that client
  • From the client, break the connection to the server. The simplest way is probably to edit pf.conf and block the IP addresses of the server cluster.
  • Note that the client is down and the allocation is now pending.
  • Re-connect the client to the server (remove any pf.conf changes and re-run pfctl)
  • Wait until the client reconnects and is marked as ready
  • Note that when the client comes back, a new allocation will replace the old one.

[screenshot: new-alloc]

Ideally, this would be the same allocation as before continuing to run.

With a replacement allocation

  • Create a Nomad cluster with two clients
  • Run a job with a count of one and give it an affinity for one of the clients (see the sketch after this list).
  • Note that the allocation is running on that client.
  • From that client, break the connection to the server.
  • Note that the allocation is rescheduled onto the other client.
  • On both clients, the allocation should still be running (but the server cluster will only know about one of them).
  • Re-connect the client to the server
  • Wait until the client reconnects and is marked as ready
  • Note that the allocation is killed on the reconnecting client and continues to run on the other client.
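For the second step above, the affinity could look something like this sketch (the attribute, value, and weight are only illustrative):

```hcl
job "pinned" {
  datacenters = ["dc1"]

  group "app" {
    count = 1

    # Prefer (but do not require) one specific client so the scheduler's
    # node ranking favours it when placing the single allocation.
    affinity {
      attribute = "${node.unique.name}"
      value     = "client-1"
      weight    = 100
    }

    task "app" {
      driver = "docker"

      config {
        image = "nginx:alpine"
      }
    }
  }
}
```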

Ideally, the allocation on the node with the better rank is the one that would continue to run.

I think case one is more important than case two, as there could be some corner cases I'm not thinking about in case two.

@mikenomitch mikenomitch changed the title Improved Lost Allocation Handling Improved Allocation Handling on Lost Clients Jul 27, 2021
jrasell (Member) commented Jul 28, 2021

The allocation.ClientStatus is used within the Nomad Autoscaler in a couple of places, none of which I believe will be adversely affected by this change. These places are the Nomad APM and the scaleutils node-selector, which utilise this field in order to:

  • a) filter out allocation resources from utilisation totals, and
  • b) filter nodes which have the least number of non-terminal allocations.

In the first situation, this new behaviour might cause some flapping during scale evaluations, which could easily be counteracted by configuration and policy settings. The second situation has the potential for nodes to be marked as empty and eligible for termination; however, I believe the filtering of the node pool based on node status would protect against this. At a higher level, I don't believe the environments described here fit a typical autoscaling environment.
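For example, the flapping concern in the first situation could likely be dampened with a longer cooldown and evaluation interval in the scaling policy. A rough sketch of a group-level policy using the Nomad APM (values are illustrative):

```hcl
group "app" {
  count = 3

  scaling {
    enabled = true
    min     = 1
    max     = 10

    policy {
      # A generous cooldown and evaluation interval give a briefly lost
      # client time to reconnect before its allocations skew the metrics
      # used for scaling decisions.
      cooldown            = "5m"
      evaluation_interval = "1m"

      check "cpu" {
        source = "nomad-apm"
        query  = "avg_cpu"

        strategy "target-value" {
          target = 70
        }
      }
    }
  }
}
```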

In the situation described, where a node becomes lost for some minutes and the autoscaler is enabled, it may be prudent to add additional safety checks around the stability of the job group before attempting to scale. Currently, job groups that are mid-deployment are protected from scaling; we might want to extend this to also cover job groups where allocations are starting or stopping to replace lost or re-found allocations.

In a more general integration sense, where allocations are consumed and processed via blocking queries, periodic listing calls, or the event stream, some mitigation may be needed as detailed above, depending on the application's internal behaviour. That being said, if all updates trigger the correct API responses (an event is emitted whenever the alloc status changes), then a well-behaved processor should be able to deal with such changes correctly.

github-actions bot commented Oct 9, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 9, 2022