docs: add plan for node rejected details and more (#12564)
- Moved federation docs to the bottom since *everyone* is potentially
  affected by the other sections on the page, but only users of
  federation are affected by it.
- Added section on the plan for node rejected bug since it is fairly
  easy to diagnose and removing affected nodes is a fairly reliable
  workaround.
- Mention 5s cliff for wait_for_index.
- Remove the lie that we do not have job status metrics! How old was
  that?!
- Reinforce the importance of monitoring basic system resources.
schmichael committed Apr 14, 2022
1 parent 4ca9803 commit 19bac3c
Showing 1 changed file with 85 additions and 46 deletions: website/content/docs/operations/monitoring-nomad.mdx

# Monitoring Nomad

The Nomad client and server agents collect a wide range of runtime metrics.
These metrics are useful for monitoring the health and performance of Nomad
clusters. Careful monitoring can spot trends before they cause problems and
help debug issues when they arise.

All Nomad agents, both servers and clients, report basic system and Go runtime
metrics.

All Nomad servers report many metrics, but some metrics are specific to the
current leader. Since leadership may change at any time, leader-specific
metrics should be monitored on all servers. Missing (or zero) metrics from
non-leaders may be safely ignored.

Nomad clients have separate metrics for the host they run on as well as for
each allocation being run. Both of these metric sets [must be explicitly
enabled][telemetry-stanza].

By default, the Nomad agent collects telemetry data at a [1 second
interval][collection-interval]. Note that Nomad supports [gauges, counters, and
timers][metric-types].
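
As a minimal sketch, a [telemetry stanza][telemetry-stanza] that enables both
sets of client metrics and makes the default collection interval explicit
might look like the following:

```hcl
telemetry {
  # Publish host metrics (CPU, memory, disk, network) for the client node.
  publish_node_metrics = true

  # Publish per-allocation and per-task metrics.
  publish_allocation_metrics = true

  # 1s is the default; shown here only to make the interval explicit.
  collection_interval = "1s"
}
```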

There are three ways to obtain metrics from Nomad:

- Query the [/v1/metrics API endpoint][metrics-api-endpoint] to return metrics
for the current Nomad process. This endpoint supports Prometheus formatted
metrics.

- Send the USR1 signal to the Nomad process. This will dump the current
telemetry information to STDERR (on Linux).

- Configure Nomad to automatically forward metrics to a third-party provider
such as [DataDog][datadog-telem], [Prometheus][prometheus-telem],
[statsd][statsd-telem], and [Circonus][circonus-telem], as sketched below.
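
As an illustrative sketch of the forwarding option above, the telemetry stanza
below exposes Prometheus-formatted metrics for scraping and pushes metrics to
a statsd sink (the address shown is a placeholder):

```hcl
telemetry {
  # Expose Prometheus-formatted metrics on the /v1/metrics endpoint;
  # Prometheus scrapes Nomad rather than having metrics pushed to it.
  prometheus_metrics = true

  # Push metrics to a statsd-compatible sink; the address is illustrative.
  statsd_address = "127.0.0.1:8125"
}
```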

## Alerting


# Key Performance Indicators

Nomad servers' memory, CPU, disk, and network usage all scale linearly with
cluster size and scheduling throughput. The most important aspect of keeping
Nomad operating normally is monitoring these system resources to ensure the
servers are not resource constrained.

The sections below cover a number of other important metrics.

## Consensus Protocol (Raft)

Expand Down Expand Up @@ -111,28 +117,46 @@ The `nomad.raft.fsm.apply` metric is an indicator of the time it takes
for a server to apply Raft entries to the internal state machine. If
this number trends upwards, look at the `nomad.nomad.fsm.*` metrics to
see if a specific Raft entry is increasing in latency. You can compare
this to warn-level logs on the Nomad servers for `attempting to apply
large raft entry`. If a specific type of message appears here, there
may be a job with a large job specification or dispatch payload that
is increasing the time it takes to apply Raft messages. Try shrinking the size
of the job either by putting distinct task groups into separate jobs,
downloading templates instead of embedding them, or reducing the `count` on
task groups.
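
For example, a hypothetical sketch of downloading a template at task start
instead of embedding it in the job specification (the URL, paths, and image
are illustrative):

```hcl
task "app" {
  driver = "docker"

  config {
    image = "example/app:1.0" # illustrative
  }

  # Fetch the template when the task starts instead of embedding its full
  # contents in the job specification (and therefore in every Raft entry
  # that carries the job).
  artifact {
    source      = "https://example.com/templates/app.conf.tmpl"
    destination = "local/"
  }

  # Render the downloaded file rather than a large inline `data` block.
  template {
    source      = "local/app.conf.tmpl"
    destination = "local/app.conf"
  }
}
```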

## Scheduling

The [Scheduling] documentation describes the workflow of how evaluations become
scheduled plans and placed allocations.

### Progress

There is a class of bug possible in Nomad where the two parts of the scheduling
pipeline, the workers and the leader's plan applier, *disagree* about the
validity of a plan. In the pathological case this can cause a job to never
finish scheduling, as workers produce the same plan and the plan applier
repeatedly rejects it.

While this class of bug is very rare, it can be detected by repeated log lines
on the Nomad servers containing `plan for node rejected`:

```
nomad: plan for node rejected: node_id=0fa84370-c713-b914-d329-f6485951cddc reason="reserved port collision" eval_id=098a5
```

While it is possible for these log lines to occur occasionally due to normal
cluster conditions, they should not appear repeatedly and prevent the job from
eventually running. To find the affected job, look up the logged evaluation
ID, for example with `nomad eval status <eval_id>`.

If this log *does* appear repeatedly with the same `node_id` referenced, try
[draining] the node (for example, `nomad node drain -enable -yes <node_id>`)
and shutting it down. Misconfigurations not caught by validation can cause
nodes to enter this state: [#11830][gh-11830].

### Performance

The following metrics allow observing changes in throughput at the various
points in the scheduling process.

- **nomad.worker.invoke_scheduler.<type\>** - The time to run the
scheduler of the given type. Each scheduler worker handles one…

- …entirely in memory on the leader. If this metric increases, examine
the CPU and memory resources of the leader.

- **nomad.plan.wait_for_index** - The time required for the planner to wait for
the Raft index of the plan to be processed. If this metric increases, refer
to the [Consensus Protocol (Raft)] section above. If this metric approaches 5
seconds, scheduling operations may fail and be retried. If possible, reduce
scheduling load until metrics improve.

- **nomad.plan.submit** - The time to submit a scheduler plan from the
worker to the leader. This operation requires writing to Raft and…

## Job and Task Status

See [Job Summary Metrics] for monitoring the health and status of workloads
running on Nomad.

## Runtime Metrics

…general indicators of load and memory pressure.

It is recommended to alert on upticks in any of the above, server memory usage
in particular.

## Federated Deployments (Serf)

Nomad uses the membership and failure detection capabilities of the Serf library
to maintain a single, global gossip pool for all servers in a federated
deployment. An uptick in `member.flap` and/or `msg.suspect` is a reliable indicator
that membership is unstable.

If these metrics increase, look at CPU load on the servers and network
latency and packet loss for the [Serf] address.

[alerting-rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[allocation-metrics]: /docs/telemetry/metrics#allocation-metrics
[circonus-telem]: /docs/configuration/telemetry#circonus
[collection-interval]: /docs/configuration/telemetry#collection_interval
[datadog-alerting]: https://www.datadoghq.com/blog/monitoring-101-alerting/
[datadog-telem]: /docs/configuration/telemetry#datadog
[draining]: https://learn.hashicorp.com/tutorials/nomad/node-drain
[gh-11830]: https://github.com/hashicorp/nomad/pull/11830
[metric-types]: /docs/telemetry/metrics#metric-types
[metrics-api-endpoint]: /api-docs/metrics
[prometheus-telem]: /docs/configuration/telemetry#prometheus
[serf]: /docs/configuration#serf-1
[statsd-exporter]: https://github.com/prometheus/statsd_exporter
[statsd-telem]: /docs/configuration/telemetry#statsd
[statsite-telem]: /docs/configuration/telemetry#statsite
[tagged-metrics]: /docs/telemetry/metrics#tagged-metrics
[telemetry-stanza]: /docs/configuration/telemetry
[Consensus Protocol (Raft)]: /docs/operations/telemetry#consensus-protocol-raft
[Job Summary Metrics]: /docs/operations/metrics-reference#job-summary-metrics
[Scheduling]: /docs/internals/scheduling/scheduling
