diff --git a/website/source/docs/telemetry/index.html.md b/website/source/docs/telemetry/index.html.md index 3d33c27f2201..3f95ab7bf90b 100644 --- a/website/source/docs/telemetry/index.html.md +++ b/website/source/docs/telemetry/index.html.md @@ -3,812 +3,22 @@ layout: "docs" page_title: "Telemetry" sidebar_current: "docs-telemetry" description: |- - Learn about the telemetry data available in Nomad. + Telemetry docs home page --- # Telemetry -The Nomad agent collects various runtime metrics about the performance of -different libraries and subsystems. These metrics are aggregated on a ten -second interval and are retained for one minute. +The Nomad client and server agents collect a wide range of runtime metrics +related to the performance of the system. Operators can use this data to gain +real-time visibility into their cluster and improve performance. Additionally, +Nomad operators can set up monitoring and alerting based on these metrics in +order to respond to any changes in the cluster state. -This data can be accessed via an HTTP endpoint or via sending a signal to the -Nomad process. +Please refer to the documentation listed below or in the sidebar to learn more +about how you can leverage the telemetry Nomad exposes. -Via HTTP, as of Nomad version 0.7, this data is available at `/metrics`. See -[Metrics](/api/metrics.html) for more information. +* [Overview][overview] +* [Metrics][metrics] - -To view this data via sending a signal to the Nomad process: on Unix, -this is `USR1` while on Windows it is `BREAK`. Once Nomad receives the signal, -it will dump the current telemetry information to the agent's `stderr`. - -This telemetry information can be used for debugging or otherwise -getting a better view of what Nomad is doing. - -Telemetry information can be streamed to both [statsite](https://github.com/armon/statsite) -as well as statsd based on providing the appropriate configuration options. 
- -To configure the telemetry output please see the [agent -configuration](/docs/configuration/telemetry.html). - -Below is sample output of a telemetry dump: - -```text -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_blocked': 0.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.plan.queue_depth': 0.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.malloc_count': 7568.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_runs': 8.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_ready': 0.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.sys_bytes': 3999992.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.heap_objects': 4135.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.heartbeat.active': 1.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_unacked': 0.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_waiting': 0.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.alloc_bytes': 634056.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.free_count': 3433.000 -[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_pause_ns': 6572135.000 -[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.memberlist.msg.alive': Count: 1 Sum: 1.000 -[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.serf.member.join': Count: 1 Sum: 1.000 -[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.barrier': Count: 1 Sum: 1.000 -[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000 -[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Query': Count: 6 Sum: 0.000 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.fsm.register_node': Count: 1 Sum: 1.296 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Intent': Count: 6 Sum: 0.000 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.runtime.gc_pause_ns': Count: 8 Min: 126492.000 Mean: 821516.875 
Max: 3126670.000 Stddev: 1139250.294 Sum: 6572135.000 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.leader.dispatchLog': Count: 3 Min: 0.007 Mean: 0.018 Max: 0.039 Stddev: 0.018 Sum: 0.054 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcileMember': Count: 1 Sum: 0.007 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcile': Count: 1 Sum: 0.025 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.fsm.apply': Count: 1 Sum: 1.306 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.get_allocs': Count: 1 Sum: 0.110 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.worker.dequeue_eval': Count: 29 Min: 0.003 Mean: 363.426 Max: 503.377 Stddev: 228.126 Sum: 10539.354 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Event': Count: 6 Sum: 0.000 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.commitTime': Count: 3 Min: 0.013 Mean: 0.037 Max: 0.079 Stddev: 0.037 Sum: 0.110 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.barrier': Count: 1 Sum: 0.071 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.register': Count: 1 Sum: 1.626 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.eval.dequeue': Count: 21 Min: 500.610 Mean: 501.753 Max: 503.361 Stddev: 1.030 Sum: 10536.813 -[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204 -``` - -# Key Metrics - -When telemetry is being streamed to statsite or statsd, `interval` is defined to -be their flush interval. Otherwise, the interval can be assumed to be 10 seconds -when retrieving metrics using the above described signals. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
MetricDescriptionUnitType
`nomad.runtime.num_goroutines`Number of goroutines and general load pressure indicator# of goroutinesGauge
`nomad.runtime.alloc_bytes`Memory utilization# of bytesGauge
`nomad.runtime.heap_objects`Number of objects on the heap. General memory pressure indicator# of heap objectsGauge
`nomad.raft.apply`Number of Raft transactionsRaft transactions / `interval`Counter
`nomad.raft.lastIndex`Index of the last log in stable storageSequence numberGauge
`nomad.raft.appliedIndex`Index of the last applied logSequence numberGauge
`nomad.raft.replication.appendEntries`Raft transaction commit timems / Raft Log AppendTimer
`nomad.raft.leader.lastContact`Time since last contact to leader. General indicator of Raft latencyms / Leader ContactTimer
`nomad.broker.total_ready`Number of evaluations ready to be processed# of evaluationsGauge
`nomad.broker.total_unacked`Evaluations dispatched for processing but incomplete# of evaluationsGauge
`nomad.broker.total_blocked` - Evaluations that are blocked until an existing evaluation for the same job - completes - # of evaluationsGauge
`nomad.plan.queue_depth`Number of scheduler Plans waiting to be evaluated# of plansGauge
`nomad.plan.submit` - Time to submit a scheduler Plan. Higher values cause lower scheduling - throughput - ms / Plan SubmitTimer
`nomad.plan.evaluate` - Time to validate a scheduler Plan. Higher values cause lower scheduling - throughput. Similar to `nomad.plan.submit` but does not include RPC time - or time in the Plan Queue - ms / Plan EvaluationTimer
`nomad.state.snapshotIndex`Latest index in the server's in memory state storeSequence numberGauge
`nomad.worker.invoke_scheduler.`Time to run the scheduler of the given typems / Scheduler RunTimer
`nomad.worker.wait_for_index` - Time waiting for Raft log replication from leader. High delays result in - lower scheduling throughput - ms / Raft Index WaitTimer
`nomad.heartbeat.active` - Number of active heartbeat timers. Each timer represents a Nomad Client - connection - # of heartbeat timersGauge
`nomad.heartbeat.invalidate` - The length of time it takes to invalidate a Nomad Client due to failed - heartbeats - ms / Heartbeat InvalidationTimer
`nomad.rpc.query`Number of RPC queriesRPC Queries / `interval`Counter
`nomad.rpc.request`Number of RPC requests being handledRPC Requests / `interval`Counter
`nomad.rpc.request_error`Number of RPC requests being handled that result in an errorRPC Errors / `interval`Counter
- -# Client Metrics - -The Nomad client emits metrics related to the resource usage of the allocations -and tasks running on it and the node itself. Operators have to explicitly turn -on publishing host and allocation metrics. Publishing allocation and host -metrics can be turned on by setting the value of `publish_allocation_metrics` -`publish_node_metrics` to `true`. - - -By default the collection interval is 1 second but it can be changed by the -changing the value of the `collection_interval` key in the `telemetry` -configuration block. - -Please see the [agent configuration](/docs/configuration/telemetry.html) -page for more details. - -As of Nomad 0.9, Nomad will emit additional labels for [parameterized](/docs/job-specification/parameterized.html) and -[periodic](/docs/job-specification/parameterized.html) jobs. Nomad -emits the parent job id as a new label `parent_id`. Also, the labels `dispatch_id` -and `periodic_id` are emitted, containing the ID of the specific invocation of the -parameterized or periodic job respectively. For example, a dispatch job with the id -`myjob/dispatch-1312323423423`, will have the following labels. - - - - - - - - - - - - - - - - - - -
LabelValue
job`myjob/dispatch-1312323423423`
parent_idmyjob
dispatch_id1312323423423
- -## Host Metrics (post Nomad version 0.7) - -Starting in version 0.7, Nomad will emit tagged metrics, in the below format: - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
MetricDescriptionUnitTypeLabels
`nomad.client.allocated.cpu`Total amount of CPU shares the scheduler has allocated to tasksMHzGaugenode_id, datacenter
`nomad.client.unallocated.cpu`Total amount of CPU shares free for the scheduler to allocate to tasksMHzGaugenode_id, datacenter
`nomad.client.allocated.memory`Total amount of memory the scheduler has allocated to tasksMegabytesGaugenode_id, datacenter
`nomad.client.unallocated.memory`Total amount of memory free for the scheduler to allocate to tasksMegabytesGaugenode_id, datacenter
`nomad.client.allocated.disk`Total amount of disk space the scheduler has allocated to tasksMegabytesGaugenode_id, datacenter
`nomad.client.unallocated.disk`Total amount of disk space free for the scheduler to allocate to tasksMegabytesGaugenode_id, datacenter
`nomad.client.allocated.network`Total amount of bandwidth the scheduler has allocated to tasks on the - given deviceMegabitsGaugenode_id, datacenter, device
`nomad.client.unallocated.network`Total amount of bandwidth free for the scheduler to allocate to tasks on - the given deviceMegabitsGaugenode_id, datacenter, device
`nomad.client.host.memory.total`Total amount of physical memory on the nodeBytesGaugenode_id, datacenter
`nomad.client.host.memory.available`Total amount of memory available to processes which includes free and - cached memoryBytesGaugenode_id, datacenter
`nomad.client.host.memory.used`Amount of memory used by processesBytesGaugenode_id, datacenter
`nomad.client.host.memory.free`Amount of memory which is freeBytesGaugenode_id, datacenter
`nomad.client.uptime`Uptime of the host running the Nomad clientSecondsGaugenode_id, datacenter
`nomad.client.host.cpu.total`Total CPU utilizationPercentageGaugenode_id, datacenter, cpu
`nomad.client.host.cpu.user`CPU utilization in the user spacePercentageGaugenode_id, datacenter, cpu
`nomad.client.host.cpu.system`CPU utilization in the system spacePercentageGaugenode_id, datacenter, cpu
`nomad.client.host.cpu.idle`Idle time spent by the CPUPercentageGaugenode_id, datacenter, cpu
`nomad.client.host.disk.size`Total size of the deviceBytesGaugenode_id, datacenter, disk
`nomad.client.host.disk.used`Amount of space which has been usedBytesGaugenode_id, datacenter, disk
`nomad.client.host.disk.available`Amount of space which is availableBytesGaugenode_id, datacenter, disk
`nomad.client.host.disk.used_percent`Percentage of disk space usedPercentageGaugenode_id, datacenter, disk
`nomad.client.host.disk.inodes_percent`Disk space consumed by the inodesPercentGaugenode_id, datacenter, disk
`nomad.client.allocs.start`Number of allocations startingIntegerCounternode_id, job, task_group
`nomad.client.allocs.running`Number of allocations starting to runIntegerCounternode_id, job, task_group
`nomad.client.allocs.failed`Number of allocations failingIntegerCounternode_id, job, task_group
`nomad.client.allocs.restart`Number of allocations restartingIntegerCounternode_id, job, task_group
`nomad.client.allocs.complete`Number of allocations completingIntegerCounternode_id, job, task_group
`nomad.client.allocs.destroy`Number of allocations being destroyedIntegerCounternode_id, job, task_group
- -Nomad 0.9 adds an additional "node_class" label from the client's -`NodeClass` attribute. This label is set to the string "none" if empty. - -## Host Metrics (deprecated post Nomad 0.7) - -The below are metrics emitted by Nomad in versions prior to 0.7. These metrics -can be emitted in the below format post-0.7 (as well as the new format, -detailed above) but any new metrics will only be available in the new format. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
MetricDescriptionUnitType
`nomad.client.allocated.cpu.`Total amount of CPU shares the scheduler has allocated to tasksMHzGauge
`nomad.client.unallocated.cpu.`Total amount of CPU shares free for the scheduler to allocate to tasksMHzGauge
`nomad.client.allocated.memory.`Total amount of memory the scheduler has allocated to tasksMegabytesGauge
`nomad.client.unallocated.memory.`Total amount of memory free for the scheduler to allocate to tasksMegabytesGauge
`nomad.client.allocated.disk.`Total amount of disk space the scheduler has allocated to tasksMegabytesGauge
`nomad.client.unallocated.disk.`Total amount of disk space free for the scheduler to allocate to tasksMegabytesGauge
`nomad.client.allocated.network..`Total amount of bandwidth the scheduler has allocated to tasks on the - given deviceMegabitsGauge
`nomad.client.unallocated.network..`Total amount of bandwidth free for the scheduler to allocate to tasks on - the given deviceMegabitsGauge
`nomad.client.host.memory..total`Total amount of physical memory on the nodeBytesGauge
`nomad.client.host.memory..available`Total amount of memory available to processes which includes free and - cached memoryBytesGauge
`nomad.client.host.memory..used`Amount of memory used by processesBytesGauge
`nomad.client.host.memory..free`Amount of memory which is freeBytesGauge
`nomad.client.uptime.`Uptime of the host running the Nomad clientSecondsGauge
`nomad.client.host.cpu...total`Total CPU utilizationPercentageGauge
`nomad.client.host.cpu...user`CPU utilization in the user spacePercentageGauge
`nomad.client.host.cpu...system`CPU utilization in the system spacePercentageGauge
`nomad.client.host.cpu...idle`Idle time spent by the CPUPercentageGauge
`nomad.client.host.disk...size`Total size of the deviceBytesGauge
`nomad.client.host.disk...used`Amount of space which has been usedBytesGauge
`nomad.client.host.disk...available`Amount of space which is availableBytesGauge
`nomad.client.host.disk...used_percent`Percentage of disk space usedPercentageGauge
`nomad.client.host.disk...inodes_percent`Disk space consumed by the inodesPercentGauge
- -## Allocation Metrics - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
MetricDescriptionUnitType
`nomad.client.allocs.....memory.rss`Amount of RSS memory consumed by the taskBytesGauge
`nomad.client.allocs.....memory.cache`Amount of memory cached by the taskBytesGauge
`nomad.client.allocs.....memory.swap`Amount of memory swapped by the taskBytesGauge
`nomad.client.allocs.....memory.max_usage`Maximum amount of memory ever used by the taskBytesGauge
`nomad.client.allocs.....memory.kernel_usage`Amount of memory used by the kernel for this taskBytesGauge
`nomad.client.allocs.....memory.kernel_max_usage`Maximum amount of memory ever used by the kernel for this taskBytesGauge
`nomad.client.allocs.....cpu.total_percent`Total CPU resources consumed by the task across all coresPercentageGauge
`nomad.client.allocs.....cpu.system`Total CPU resources consumed by the task in the system spacePercentageGauge
`nomad.client.allocs.....cpu.user`Total CPU resources consumed by the task in the user spacePercentageGauge
`nomad.client.allocs.....cpu.throttled_time`Total time that the task was throttledNanosecondsGauge
`nomad.client.allocs.....cpu.total_ticks`CPU ticks consumed by the process in the last collection intervalIntegerGauge
- -# Job Metrics - -Job metrics are emitted by the Nomad leader server. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
MetricDescriptionUnitTypeLabels
`nomad.job_summary.queued`Number of queued allocations for a jobIntegerGaugejob, task_group
`nomad.job_summary.complete`Number of complete allocations for a jobIntegerGaugejob, task_group
`nomad.job_summary.failed`Number of failed allocations for a jobIntegerGaugejob, task_group
`nomad.job_summary.running`Number of running allocations for a jobIntegerGaugejob, task_group
`nomad.job_summary.starting`Number of starting allocations for a jobIntegerGaugejob, task_group
`nomad.job_summary.lost`Number of lost allocations for a jobIntegerGaugejob, task_group
- -# Metric Types - - - - - - - - - - - - - - - - - - - - - - -
TypeDescriptionQuantiles
Gauge - Gauge types report an absolute number at the end of the aggregation - interval - false
Counter - Counts are incremented and flushed at the end of the aggregation - interval and then are reset to zero - true
Timer - Timers measure the time to complete a task and will include quantiles, - means, standard deviation, etc per interval. - true
+[metrics]: /docs/telemetry/metrics.html +[overview]: /docs/telemetry/overview.html \ No newline at end of file diff --git a/website/source/docs/telemetry/metrics.html.md b/website/source/docs/telemetry/metrics.html.md new file mode 100644 index 000000000000..a99ad844d197 --- /dev/null +++ b/website/source/docs/telemetry/metrics.html.md @@ -0,0 +1,805 @@ +--- +layout: "docs" +page_title: "Metrics" +sidebar_current: "docs-telemetry-metrics" +description: |- + Learn about the different metrics available in Nomad. +--- + +# Metrics + +The Nomad agent collects various runtime metrics about the performance of +different libraries and subsystems. These metrics are aggregated on a ten +second interval and are retained for one minute. + +This data can be accessed via an HTTP endpoint or by sending a signal to the +Nomad process. + +As of Nomad version 0.7, this data is available via HTTP at `/metrics`. See +[Metrics](/api/metrics.html) for more information. + + +To view this data by sending a signal to the Nomad process: on Unix the +signal is `USR1`, while on Windows it is `BREAK`. Once Nomad receives the signal, +it will dump the current telemetry information to the agent's `stderr`. + +This telemetry information can be used for debugging or otherwise +getting a better view of what Nomad is doing. + +Telemetry information can be streamed to both [statsite](https://github.com/armon/statsite) +and statsd by providing the appropriate configuration options. + +To configure the telemetry output, please see the [agent +configuration](/docs/configuration/telemetry.html). 
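As a quick illustration of the agent configuration referenced above, a `telemetry` stanza that streams metrics to a local statsite or statsd sink might look like the sketch below. The sink addresses are placeholders; substitute your own endpoints.

```hcl
telemetry {
  # Stream metrics to statsite and/or statsd (example addresses).
  statsite_address = "127.0.0.1:8125"
  statsd_address   = "127.0.0.1:8125"

  # How often the client collects host and allocation metrics (1s by default).
  collection_interval = "1s"

  # Host and allocation metrics are not published unless explicitly enabled.
  publish_allocation_metrics = true
  publish_node_metrics       = true
}
```

With an agent running, the same data is also available from the `/metrics` HTTP endpoint on the agent's address (port 4646 by default).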
+ +Below is sample output of a telemetry dump: + +```text +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_blocked': 0.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.plan.queue_depth': 0.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.malloc_count': 7568.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_runs': 8.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_ready': 0.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.sys_bytes': 3999992.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.heap_objects': 4135.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.heartbeat.active': 1.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_unacked': 0.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_waiting': 0.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.alloc_bytes': 634056.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.free_count': 3433.000 +[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_pause_ns': 6572135.000 +[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.memberlist.msg.alive': Count: 1 Sum: 1.000 +[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.serf.member.join': Count: 1 Sum: 1.000 +[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.barrier': Count: 1 Sum: 1.000 +[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000 +[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Query': Count: 6 Sum: 0.000 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.fsm.register_node': Count: 1 Sum: 1.296 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Intent': Count: 6 Sum: 0.000 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.runtime.gc_pause_ns': Count: 8 Min: 126492.000 Mean: 821516.875 Max: 3126670.000 Stddev: 1139250.294 Sum: 6572135.000 +[2015-09-17 16:59:40 -0700 PDT][S] 
'nomad.raft.leader.dispatchLog': Count: 3 Min: 0.007 Mean: 0.018 Max: 0.039 Stddev: 0.018 Sum: 0.054 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcileMember': Count: 1 Sum: 0.007 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcile': Count: 1 Sum: 0.025 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.fsm.apply': Count: 1 Sum: 1.306 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.get_allocs': Count: 1 Sum: 0.110 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.worker.dequeue_eval': Count: 29 Min: 0.003 Mean: 363.426 Max: 503.377 Stddev: 228.126 Sum: 10539.354 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Event': Count: 6 Sum: 0.000 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.commitTime': Count: 3 Min: 0.013 Mean: 0.037 Max: 0.079 Stddev: 0.037 Sum: 0.110 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.barrier': Count: 1 Sum: 0.071 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.register': Count: 1 Sum: 1.626 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.eval.dequeue': Count: 21 Min: 500.610 Mean: 501.753 Max: 503.361 Stddev: 1.030 Sum: 10536.813 +[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204 +``` + +## Key Metrics + +When telemetry is being streamed to statsite or statsd, `interval` is defined to +be their flush interval. Otherwise, the interval can be assumed to be 10 seconds +when retrieving metrics using the above described signals. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
MetricDescriptionUnitType
`nomad.runtime.num_goroutines`Number of goroutines and general load pressure indicator# of goroutinesGauge
`nomad.runtime.alloc_bytes`Memory utilization# of bytesGauge
`nomad.runtime.heap_objects`Number of objects on the heap. General memory pressure indicator# of heap objectsGauge
`nomad.raft.apply`Number of Raft transactionsRaft transactions / `interval`Counter
`nomad.raft.replication.appendEntries`Raft transaction commit timems / Raft Log AppendTimer
`nomad.raft.leader.lastContact`Time since last contact to leader. General indicator of Raft latencyms / Leader ContactTimer
`nomad.broker.total_ready`Number of evaluations ready to be processed# of evaluationsGauge
`nomad.broker.total_unacked`Evaluations dispatched for processing but incomplete# of evaluationsGauge
`nomad.broker.total_blocked` + Evaluations that are blocked until an existing evaluation for the same job + completes + # of evaluationsGauge
`nomad.plan.queue_depth`Number of scheduler Plans waiting to be evaluated# of plansGauge
`nomad.plan.submit` + Time to submit a scheduler Plan. Higher values cause lower scheduling + throughput + ms / Plan SubmitTimer
`nomad.plan.evaluate` + Time to validate a scheduler Plan. Higher values cause lower scheduling + throughput. Similar to `nomad.plan.submit` but does not include RPC time + or time in the Plan Queue + ms / Plan EvaluationTimer
`nomad.worker.invoke_scheduler.`Time to run the scheduler of the given typems / Scheduler RunTimer
`nomad.worker.wait_for_index` + Time waiting for Raft log replication from leader. High delays result in + lower scheduling throughput + ms / Raft Index WaitTimer
`nomad.heartbeat.active` + Number of active heartbeat timers. Each timer represents a Nomad Client + connection + # of heartbeat timersGauge
`nomad.heartbeat.invalidate` + The length of time it takes to invalidate a Nomad Client due to failed + heartbeats + ms / Heartbeat InvalidationTimer
`nomad.rpc.query`Number of RPC queriesRPC Queries / `interval`Counter
`nomad.rpc.request`Number of RPC requests being handledRPC Requests / `interval`Counter
`nomad.rpc.request_error`Number of RPC requests being handled that result in an errorRPC Errors / `interval`Counter
+ +## Client Metrics + +The Nomad client emits metrics related to the resource usage of the allocations +and tasks running on it and of the node itself. Operators have to explicitly turn +on publishing host and allocation metrics. Publishing allocation and host +metrics can be turned on by setting the values of `publish_allocation_metrics` +and `publish_node_metrics` to `true`. + + +By default the collection interval is 1 second, but it can be changed by +setting the `collection_interval` key in the `telemetry` +configuration block. + +Please see the [agent configuration](/docs/configuration/telemetry.html) +page for more details. + +As of Nomad 0.9, Nomad will emit additional labels for [parameterized](/docs/job-specification/parameterized.html) and +[periodic](/docs/job-specification/periodic.html) jobs. Nomad +emits the parent job ID as a new label `parent_id`. Also, the labels `dispatch_id` +and `periodic_id` are emitted, containing the ID of the specific invocation of the +parameterized or periodic job respectively. For example, a dispatch job with the ID +`myjob/dispatch-1312323423423` will have the following labels. + + + + + + + + + + + + + + + + + + 
LabelValue
job`myjob/dispatch-1312323423423`
parent_idmyjob
dispatch_id1312323423423
+ +## Host Metrics (post Nomad version 0.7) + +Starting in version 0.7, Nomad will emit [tagged metrics][tagged-metrics], in the below format: + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
MetricDescriptionUnitTypeLabels
`nomad.client.allocated.cpu`Total amount of CPU shares the scheduler has allocated to tasksMHzGaugenode_id, datacenter
`nomad.client.unallocated.cpu`Total amount of CPU shares free for the scheduler to allocate to tasksMHzGaugenode_id, datacenter
`nomad.client.allocated.memory`Total amount of memory the scheduler has allocated to tasksMegabytesGaugenode_id, datacenter
`nomad.client.unallocated.memory`Total amount of memory free for the scheduler to allocate to tasksMegabytesGaugenode_id, datacenter
`nomad.client.allocated.disk`Total amount of disk space the scheduler has allocated to tasksMegabytesGaugenode_id, datacenter
`nomad.client.unallocated.disk`Total amount of disk space free for the scheduler to allocate to tasksMegabytesGaugenode_id, datacenter
`nomad.client.allocated.network`Total amount of bandwidth the scheduler has allocated to tasks on the + given deviceMegabitsGaugenode_id, datacenter, device
`nomad.client.unallocated.network`Total amount of bandwidth free for the scheduler to allocate to tasks on + the given deviceMegabitsGaugenode_id, datacenter, device
`nomad.client.host.memory.total`Total amount of physical memory on the nodeBytesGaugenode_id, datacenter
`nomad.client.host.memory.available`Total amount of memory available to processes which includes free and + cached memoryBytesGaugenode_id, datacenter
`nomad.client.host.memory.used`Amount of memory used by processesBytesGaugenode_id, datacenter
`nomad.client.host.memory.free`Amount of memory which is freeBytesGaugenode_id, datacenter
`nomad.client.uptime`Uptime of the host running the Nomad clientSecondsGaugenode_id, datacenter
`nomad.client.host.cpu.total`Total CPU utilizationPercentageGaugenode_id, datacenter, cpu
`nomad.client.host.cpu.user`CPU utilization in the user spacePercentageGaugenode_id, datacenter, cpu
`nomad.client.host.cpu.system`CPU utilization in the system spacePercentageGaugenode_id, datacenter, cpu
`nomad.client.host.cpu.idle`Idle time spent by the CPUPercentageGaugenode_id, datacenter, cpu
`nomad.client.host.disk.size`Total size of the deviceBytesGaugenode_id, datacenter, disk
`nomad.client.host.disk.used`Amount of space which has been usedBytesGaugenode_id, datacenter, disk
`nomad.client.host.disk.available`Amount of space which is availableBytesGaugenode_id, datacenter, disk
`nomad.client.host.disk.used_percent`Percentage of disk space usedPercentageGaugenode_id, datacenter, disk
`nomad.client.host.disk.inodes_percent`Disk space consumed by the inodesPercentGaugenode_id, datacenter, disk
`nomad.client.allocs.start`Number of allocations startingIntegerCounternode_id, job, task_group
`nomad.client.allocs.running`Number of allocations starting to runIntegerCounternode_id, job, task_group
`nomad.client.allocs.failed`Number of allocations failingIntegerCounternode_id, job, task_group
`nomad.client.allocs.restart`Number of allocations restartingIntegerCounternode_id, job, task_group
`nomad.client.allocs.complete`Number of allocations completingIntegerCounternode_id, job, task_group
`nomad.client.allocs.destroy`Number of allocations being destroyedIntegerCounternode_id, job, task_group
+ +Nomad 0.9 adds an additional `node_class` label from the client's +`NodeClass` attribute. This label is set to the string "none" if empty. + +## Host Metrics (deprecated post Nomad 0.7) + +The below are metrics emitted by Nomad in versions prior to 0.7. These metrics +can be emitted in the below format post-0.7 (as well as the new format, +detailed above) but any new metrics will only be available in the new format. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Metric | Description | Unit | Type |
|--------|-------------|------|------|
| `nomad.client.allocated.cpu.<HostID>` | Total amount of CPU shares the scheduler has allocated to tasks | MHz | Gauge |
| `nomad.client.unallocated.cpu.<HostID>` | Total amount of CPU shares free for the scheduler to allocate to tasks | MHz | Gauge |
| `nomad.client.allocated.memory.<HostID>` | Total amount of memory the scheduler has allocated to tasks | Megabytes | Gauge |
| `nomad.client.unallocated.memory.<HostID>` | Total amount of memory free for the scheduler to allocate to tasks | Megabytes | Gauge |
| `nomad.client.allocated.disk.<HostID>` | Total amount of disk space the scheduler has allocated to tasks | Megabytes | Gauge |
| `nomad.client.unallocated.disk.<HostID>` | Total amount of disk space free for the scheduler to allocate to tasks | Megabytes | Gauge |
| `nomad.client.allocated.network.<Device-Name>.<HostID>` | Total amount of bandwidth the scheduler has allocated to tasks on the given device | Megabits | Gauge |
| `nomad.client.unallocated.network.<Device-Name>.<HostID>` | Total amount of bandwidth free for the scheduler to allocate to tasks on the given device | Megabits | Gauge |
| `nomad.client.host.memory.<HostID>.total` | Total amount of physical memory on the node | Bytes | Gauge |
| `nomad.client.host.memory.<HostID>.available` | Total amount of memory available to processes which includes free and cached memory | Bytes | Gauge |
| `nomad.client.host.memory.<HostID>.used` | Amount of memory used by processes | Bytes | Gauge |
| `nomad.client.host.memory.<HostID>.free` | Amount of memory which is free | Bytes | Gauge |
| `nomad.client.uptime.<HostID>` | Uptime of the host running the Nomad client | Seconds | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.total` | Total CPU utilization | Percentage | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.user` | CPU utilization in the user space | Percentage | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.system` | CPU utilization in the system space | Percentage | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.idle` | Idle time spent by the CPU | Percentage | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.size` | Total size of the device | Bytes | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.used` | Amount of space which has been used | Bytes | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.available` | Amount of space which is available | Bytes | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.used_percent` | Percentage of disk space used | Percentage | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.inodes_percent` | Disk space consumed by the inodes | Percent | Gauge |
## Allocation Metrics

| Metric | Description | Unit | Type |
|--------|-------------|------|------|
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.rss` | Amount of RSS memory consumed by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.cache` | Amount of memory cached by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.swap` | Amount of memory swapped by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.max_usage` | Maximum amount of memory ever used by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_usage` | Amount of memory used by the kernel for this task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_max_usage` | Maximum amount of memory ever used by the kernel for this task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_percent` | Total CPU resources consumed by the task across all cores | Percentage | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.system` | Total CPU resources consumed by the task in the system space | Percentage | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.user` | Total CPU resources consumed by the task in the user space | Percentage | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.throttled_time` | Total time that the task was throttled | Nanoseconds | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_ticks` | CPU ticks consumed by the process in the last collection interval | Integer | Gauge |
## Job Metrics

Job metrics are emitted by the Nomad leader server.

| Metric | Description | Unit | Type | Labels |
|--------|-------------|------|------|--------|
| `nomad.job_summary.queued` | Number of queued allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.complete` | Number of complete allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.failed` | Number of failed allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.running` | Number of running allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.starting` | Number of starting allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.lost` | Number of lost allocations for a job | Integer | Gauge | job, task_group |
## Metric Types

| Type | Description | Quantiles |
|------|-------------|-----------|
| Gauge | Gauge types report an absolute number at the end of the aggregation interval | false |
| Counter | Counts are incremented and flushed at the end of the aggregation interval and then are reset to zero | true |
| Timer | Timers measure the time to complete a task and will include quantiles, means, standard deviation, etc. per interval | true |
## Tagged Metrics

As of version 0.7, Nomad emits metrics in a tagged format. Each metric can
support more than one tag, making it possible to match over metrics for
datapoints such as a particular datacenter and return all metrics carrying
that tag. Nomad supports labels for namespaces as well.

[tagged-metrics]: /docs/telemetry/metrics.html#tagged-metrics

diff --git a/website/source/docs/telemetry/overview.html.md b/website/source/docs/telemetry/overview.html.md
new file mode 100644
index 000000000000..018eedc4c224
--- /dev/null
+++ b/website/source/docs/telemetry/overview.html.md
@@ -0,0 +1,162 @@
---
layout: "docs"
page_title: "Overview"
sidebar_current: "docs-telemetry-overview"
description: |-
  Overview of runtime metrics available in Nomad along with monitoring and
  alerting.
---

# Telemetry Overview

The Nomad client and server agents collect a wide range of runtime metrics
related to the performance of the system. On the server side, leaders and
followers have metrics in common as well as metrics that are specific to their
roles. Clients have separate metrics for the host and for allocations/tasks,
both of which must be [explicitly enabled][telemetry-stanza]. There are also
runtime metrics that are common to all servers and clients.

By default, the Nomad agent collects telemetry data at a [1 second
interval][collection-interval]. Note that Nomad supports [gauges, counters,
and timers][metric-types].

There are three ways to obtain metrics from Nomad:

* Query the [/metrics API endpoint][metrics-api-endpoint] to return metrics
  for the current Nomad process (as of Nomad 0.7). This endpoint supports
  Prometheus-formatted metrics.
* Send the USR1 signal to the Nomad process. This will dump the current
  telemetry information to stderr (on Linux).
* Configure Nomad to automatically forward metrics to a third-party provider.
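As a sketch of the third option, forwarding is configured in the agent's
`telemetry` stanza. The sink address below is an illustrative placeholder, not
a recommendation:

```hcl
telemetry {
  # How often metrics are aggregated before being flushed to sinks.
  collection_interval = "1s"

  # Emit per-allocation and per-node metrics (disabled by default).
  publish_allocation_metrics = true
  publish_node_metrics       = true

  # Serve Prometheus-formatted metrics from the /v1/metrics endpoint.
  prometheus_metrics = true

  # Forward metrics to a statsd sink (placeholder address).
  statsd_address = "statsd.example.com:8125"
}
```

See the [telemetry stanza documentation][telemetry-stanza] for the full list
of supported sinks and options.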
Nomad 0.7 added support for [tagged metrics][tagged-metrics], improving the
integrations with [DataDog][datadog-telem] and [Prometheus][prometheus-telem].
Metrics can also be forwarded to [Statsite][statsite-telem],
[StatsD][statsd-telem], and [Circonus][circonus-telem].

## Alerting

The recommended practice for alerting is to leverage the alerting capabilities
of your monitoring provider. Rather than natively supporting alerting, Nomad's
intention is to surface metrics that enable users to configure the necessary
alerts in their existing monitoring systems. Here are a few common patterns:

* Export metrics from Nomad to Prometheus using the [StatsD
  exporter][statsd-exporter], define [alerting rules][alerting-rules] in
  Prometheus, and use [Alertmanager][alertmanager] for summarization and
  routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is
  supported for [Datadog][datadog-alerting].

* Periodically submit test jobs into Nomad to determine if your application
  deployment pipeline is working end-to-end. This pattern is well-suited to
  batch processing workloads.

* Deploy Nagios on Nomad. Centrally manage Nomad job files and add the Nagios
  monitor when a new Nomad job is added. When a job is removed, remove the
  Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a
  job-specific alerting system.

* Write a script that looks at the history of each batch job to determine
  whether or not the job is in an unhealthy state, updating your monitoring
  system as appropriate. In many cases, it may be acceptable for a given batch
  job to fail occasionally, as long as it returns to passing.

# Key Performance Indicators

The sections below cover a number of important metrics.

## Consensus Protocol (Raft)

Nomad uses the Raft consensus protocol for leader election and state
replication.
Spurious leader elections can be caused by networking issues between the
servers or insufficient CPU resources. Users in cloud environments often bump
their servers up to the next instance class with improved networking and CPU
to stabilize leader elections. The `nomad.raft.leader.lastContact` metric is a
general indicator of Raft latency which can be used to observe how Raft timing
is performing and guide the decision to upgrade to more powerful servers.
`nomad.raft.leader.lastContact` should not get too close to the leader lease
timeout of 500ms.

## Federated Deployments (Serf)

Nomad uses the membership and failure detection capabilities of the Serf
library to maintain a single, global gossip pool for all servers in a
federated deployment. An uptick in `member.flap` and/or `msg.suspect` is a
reliable indicator that membership is unstable.

## Scheduling

The following metrics allow an operator to observe changes in throughput at
the various points in the scheduling process (evaluation, scheduling/planning,
and placement):

* **nomad.broker.total_blocked** - The number of blocked evaluations.
* **nomad.worker.invoke_scheduler.\<type\>** - The time to run the scheduler
  of the given type.
* **nomad.plan.evaluate** - The time to evaluate a scheduler Plan.
* **nomad.plan.submit** - The time to submit a scheduler Plan.
* **nomad.plan.queue_depth** - The number of scheduler Plans waiting to be
  evaluated.

Upticks in any of the above metrics indicate a decrease in scheduler
throughput.

## Capacity

The importance of monitoring resource availability is workload specific. Batch
processing workloads often operate under the assumption that the cluster
should be at or near capacity, with queued jobs running as soon as adequate
resources become available. Clusters that are primarily responsible for long
running services with an uptime requirement may want to maintain headroom of
20% or more.
The following metrics can be used to assess capacity across the cluster on a
per client basis:

* **nomad.client.allocated.cpu**
* **nomad.client.unallocated.cpu**
* **nomad.client.allocated.disk**
* **nomad.client.unallocated.disk**
* **nomad.client.allocated.iops**
* **nomad.client.unallocated.iops**
* **nomad.client.allocated.memory**
* **nomad.client.unallocated.memory**

## Task Resource Consumption

The metrics listed [here][allocation-metrics] can be used to track resource
consumption on a per task basis. For user facing services, it is common to
alert when the CPU is at or above the reserved resources for the task.

## Job and Task Status

We do not currently surface metrics for job and task/allocation status,
although we will consider adding metrics where it makes sense.

## Runtime Metrics

Runtime metrics apply to all clients and servers. The following metrics are
general indicators of load and memory pressure:

* **nomad.runtime.num_goroutines**
* **nomad.runtime.heap_objects**
* **nomad.runtime.alloc_bytes**

It is recommended to alert on upticks in any of the above, server memory usage
in particular.
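The "alert on upticks" guidance above can be made concrete with a small
helper. This is a hypothetical sketch (not part of Nomad or any monitoring
tool) that flags when consecutive samples of a gauge grow faster than a chosen
threshold:

```python
def uptick(previous: float, current: float, threshold: float = 0.20) -> bool:
    """Return True when a sampled gauge grew by more than `threshold`
    (a fraction, e.g. 0.20 = 20%) between two collection intervals."""
    if previous <= 0:
        # Any growth from zero counts as an uptick.
        return current > 0
    return (current - previous) / previous > threshold

# Example with two nomad.runtime.heap_objects samples:
print(uptick(4000, 5200))  # 30% growth -> True, worth alerting on
print(uptick(4000, 4100))  # 2.5% growth -> False
```

In practice you would feed this kind of comparison from your monitoring
provider's query language rather than a script, but the threshold logic is the
same.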
[alerting-rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[allocation-metrics]: /docs/telemetry/metrics.html#allocation-metrics
[circonus-telem]: /docs/configuration/telemetry.html#circonus
[collection-interval]: /docs/configuration/telemetry.html#collection_interval
[datadog-alerting]: https://www.datadoghq.com/blog/monitoring-101-alerting/
[datadog-telem]: /docs/configuration/telemetry.html#datadog
[prometheus-telem]: /docs/configuration/telemetry.html#prometheus
[metrics-api-endpoint]: /api/metrics.html
[metric-types]: /docs/telemetry/metrics.html#metric-types
[statsd-exporter]: https://github.com/prometheus/statsd_exporter
[statsd-telem]: /docs/configuration/telemetry.html#statsd
[statsite-telem]: /docs/configuration/telemetry.html#statsite
[tagged-metrics]: /docs/telemetry/metrics.html#tagged-metrics
[telemetry-stanza]: /docs/configuration/telemetry.html

diff --git a/website/source/layouts/docs.erb b/website/source/layouts/docs.erb
index 4c9b6b750f13..fc7e3b31564b 100644
--- a/website/source/layouts/docs.erb
+++ b/website/source/layouts/docs.erb
@@ -506,6 +506,14 @@
> Telemetry
+ >