Add max client disconnect docs
Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>
2 people authored and tgross committed Apr 6, 2022
1 parent 6791147 commit 27035cc
Showing 1 changed file with 90 additions and 13 deletions.
103 changes: 90 additions & 13 deletions website/content/docs/job-specification/group.mdx
@@ -81,19 +81,24 @@ job "docs" {
own [`shutdown_delay`](/docs/job-specification/task#shutdown_delay)
which waits between deregistering task services and stopping the task.

- `stop_after_client_disconnect` `(string: "")` - Specifies a duration
after which a Nomad client that cannot communicate with the servers
will stop allocations based on this task group. By default, a client
will not stop an allocation until explicitly told to by a server. A
client that fails to heartbeat to a server within the
[`heartbeat_grace`] window and any allocations running on it will be
marked "lost" and Nomad will schedule replacement
allocations. However, these replaced allocations will continue to
run on the non-responsive client; an operator may desire that these
replaced allocations are also stopped in this case — for example,
allocations requiring exclusive access to an external resource. When
specified, the Nomad client will stop them after this duration. The
Nomad client process must be running for this to occur.
- `stop_after_client_disconnect` `(string: "")` - Specifies a duration after
which a Nomad client that cannot communicate with the servers will stop this
group's allocations. By default, a client will not stop an allocation until
explicitly told to by a server. If a client fails to heartbeat to a server
within the [`heartbeat_grace`] window, the client is marked "down", the
allocations running on it are marked "lost", and Nomad schedules replacement
allocations. The replaced allocations normally continue to run on the
non-responsive client. You may want them to stop instead, for example when
allocations require exclusive access to an external resource. When this
parameter is set, the Nomad client stops them after the specified duration.
The Nomad client process must be running for this to occur. This setting
cannot be used with [`max_client_disconnect`].

- `max_client_disconnect` `(string: "")` - Specifies a duration during which a
Nomad client will attempt to reconnect allocations after it fails to heartbeat
in the [`heartbeat_grace`] window. See [the example code
below][max-client-disconnect] for more details. This setting cannot be used
with [`stop_after_client_disconnect`].

- `task` <code>([Task][]: &lt;required&gt;)</code> - Specifies one or more tasks to run
within this group. This can be specified multiple times, to add a task as part
@@ -255,6 +260,75 @@ group "second" {
}
```

### Max Client Disconnect

`max_client_disconnect` specifies a duration during which a Nomad client will
attempt to reconnect allocations after it fails to heartbeat in the
[`heartbeat_grace`] window.

By default, allocations running on a client that fails to heartbeat will be
marked "lost". When a client reconnects, its allocations, which may still be
healthy, will restart because they have been marked "lost". This can cause
issues with stateful tasks or tasks with long restart times.

Instead, an operator may desire that these allocations reconnect without a
restart. When `max_client_disconnect` is specified, the Nomad server will mark
clients that fail to heartbeat as "disconnected" rather than "down", and will
mark allocations on a disconnected client as "unknown" rather than "lost". These
allocations may continue to run on the disconnected client. Replacement
allocations will be scheduled according to the allocations' reschedule policy
until the disconnected client reconnects. Once a disconnected client reconnects,
Nomad will compare the "unknown" allocations with their replacements and keep
the one with the best node score. If the `max_client_disconnect` duration
expires before the client reconnects, the allocations will be marked "lost".
A client that holds "unknown" allocations stays "disconnected" rather than
"down" until the last of their `max_client_disconnect` durations has expired.

In the example code below, if both of these task groups were placed on the same
client and that client experienced a network outage, both groups' allocations
would be marked "unknown", and the client "disconnected", at two minutes
because of the `heartbeat_grace` value of "2m". If the network outage continued
for eight hours, and the client continued to fail to heartbeat, the client would
remain in a "disconnected" state, as the first group's `max_client_disconnect`
is twelve hours. The second group's allocation, with a `max_client_disconnect`
of six hours, would already have been marked "lost" by then. Once all groups'
`max_client_disconnect` durations are exceeded, in this case after twelve hours,
the client node will be marked as "down" and the remaining allocation will be
marked as "lost". If the client had reconnected before a group's
`max_client_disconnect` duration had passed, that group's allocation would
gracefully reconnect without a restart.

Max Client Disconnect is useful for edge deployments, or scenarios when
operators want zero on-client downtime due to node connectivity issues. This
setting cannot be used with [`stop_after_client_disconnect`].

```hcl
# server_config.hcl
# heartbeat_grace is a server agent setting, so it belongs in the server block.
server {
  enabled         = true
  heartbeat_grace = "2m"
}
```

```hcl
# jobspec.nomad
group "first" {
  max_client_disconnect = "12h"

  task "first-task" {
    ...
  }
}

group "second" {
  max_client_disconnect = "6h"

  task "second-task" {
    ...
  }
}
```
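
For contrast, a group that needs exclusive access to an external resource might
set `stop_after_client_disconnect` instead, since the two settings cannot be
combined in one group. The fragment below is a minimal sketch, not part of the
original change: the group name, task name, duration, and Docker image are
hypothetical placeholders, and the group is assumed to sit inside a `job` block.

```hcl
# jobspec.nomad (illustrative fragment; all names and values are assumptions)
group "exclusive" {
  # If the client cannot reach the servers for five minutes, the client
  # stops this group's allocations on its own.
  stop_after_client_disconnect = "5m"

  task "exclusive-task" {
    driver = "docker"

    config {
      image = "redis:7" # placeholder image
    }
  }
}
```

With this setting the allocation is stopped locally rather than kept running,
so a replacement scheduled elsewhere should not contend with it for the
external resource once the duration has elapsed.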

[task]: /docs/job-specification/task 'Nomad task Job Specification'
[job]: /docs/job-specification/job 'Nomad job Job Specification'
[constraint]: /docs/job-specification/constraint 'Nomad constraint Job Specification'
@@ -264,6 +338,9 @@ group "second" {
[affinity]: /docs/job-specification/affinity 'Nomad affinity Job Specification'
[ephemeraldisk]: /docs/job-specification/ephemeral_disk 'Nomad ephemeral_disk Job Specification'
[`heartbeat_grace`]: /docs/configuration/server#heartbeat_grace
[`max_client_disconnect`]: /docs/job-specification/group#max_client_disconnect
[max-client-disconnect]: /docs/job-specification/group#max-client-disconnect 'the example code below'
[`stop_after_client_disconnect`]: /docs/job-specification/group#stop_after_client_disconnect
[meta]: /docs/job-specification/meta 'Nomad meta Job Specification'
[migrate]: /docs/job-specification/migrate 'Nomad migrate Job Specification'
[network]: /docs/job-specification/network 'Nomad network Job Specification'