From 1040fc24fc671c3ae903edfc0a259ef3755aaa03 Mon Sep 17 00:00:00 2001
From: Aaron Abbott
Date: Wed, 9 Sep 2020 19:16:01 +0000
Subject: [PATCH] System metrics semantic conventions

Conventions from [OTEP 119](https://github.com/open-telemetry/oteps/pull/119)
---
 CHANGELOG.md                                  |   2 +
 .../metrics/semantic_conventions/README.md    |   7 +-
 .../semantic_conventions/process-metrics.md   |  21 +++
 .../semantic_conventions/runtime-metrics.md   |  42 +++++
 .../semantic_conventions/system-metrics.md    | 153 ++++++++++++++++++
 5 files changed, 224 insertions(+), 1 deletion(-)
 create mode 100644 specification/metrics/semantic_conventions/process-metrics.md
 create mode 100644 specification/metrics/semantic_conventions/runtime-metrics.md
 create mode 100644 specification/metrics/semantic_conventions/system-metrics.md

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 2d96cd1ef85..67fb3629f64 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -23,6 +23,8 @@ New:
   ([#697](https://github.com/open-telemetry/opentelemetry-specification/pull/697))
 * API was extended to allow adding arbitrary event attributes ([#874](https://github.com/open-telemetry/opentelemetry-specification/pull/874))
 * `exception.escaped` was added ([#784](https://github.com/open-telemetry/opentelemetry-specification/pull/784))
+- Add semantic conventions for system metrics
+  ([#937](https://github.com/open-telemetry/opentelemetry-specification/pull/937))
 
 Updates:
 
diff --git a/specification/metrics/semantic_conventions/README.md b/specification/metrics/semantic_conventions/README.md
index 6c63439e4e3..0588fcde98f 100644
--- a/specification/metrics/semantic_conventions/README.md
+++ b/specification/metrics/semantic_conventions/README.md
@@ -1,6 +1,11 @@
 # Metrics Semantic Conventions
 
-TODO: Add semantic conventions for metric names and labels.
+The following semantic conventions surrounding metrics are defined:
+
+* [HTTP Metrics](http-metrics.md): Semantic conventions and instruments for HTTP metrics.
+* [System Metrics](system-metrics.md): Semantic conventions and instruments for standard system metrics.
+* [Process Metrics](process-metrics.md): Semantic conventions and instruments for standard process metrics.
+* [Runtime Metrics](runtime-metrics.md): Semantic conventions and instruments for runtime metrics.
 
 Apart from semantic conventions for metrics and [traces](../../trace/semantic_conventions/README.md), OpenTelemetry also
 defines the concept of overarching [Resources](../../resource/sdk.md) with their own
diff --git a/specification/metrics/semantic_conventions/process-metrics.md b/specification/metrics/semantic_conventions/process-metrics.md
new file mode 100644
index 00000000000..66479b22f77
--- /dev/null
+++ b/specification/metrics/semantic_conventions/process-metrics.md
@@ -0,0 +1,21 @@
+# Semantic Conventions for Process Metrics
+
+This document describes instruments and labels for common process level
+metrics in OpenTelemetry. Also consider the general [semantic conventions for
+system metrics](system-metrics.md#semantic-conventions) when creating
+instruments not explicitly defined in this document.
+
+
+
+
+
+- [Metric Instruments](#metric-instruments)
+  * [Standard Process Metrics - `process.`](#standard-process-metrics---process)
+
+
+
+## Metric Instruments
+
+### Standard Process Metrics - `process.`
+
+TODO
diff --git a/specification/metrics/semantic_conventions/runtime-metrics.md b/specification/metrics/semantic_conventions/runtime-metrics.md
new file mode 100644
index 00000000000..7f4bd729ad0
--- /dev/null
+++ b/specification/metrics/semantic_conventions/runtime-metrics.md
@@ -0,0 +1,42 @@
+# Semantic Conventions for Runtime Metrics
+
+This document describes instruments and labels for common runtime level
+metrics in OpenTelemetry. Also consider the general [semantic conventions for
+system metrics](system-metrics.md#semantic-conventions) when creating
+instruments not explicitly defined in this document.
+
+
+
+
+
+- [Metric Instruments](#metric-instruments)
+  * [Runtime Metrics - `runtime.`](#runtime-metrics---runtime)
+    + [Runtime Specific Metrics - `runtime.{environment}.`](#runtime-specific-metrics---runtimeenvironment)
+
+
+
+## Metric Instruments
+
+### Runtime Metrics - `runtime.`
+
+Runtime environments vary widely in their terminology, implementation, and
+relative values for a given metric. For example, Go and Python are both
+garbage collected languages, but comparing heap usage between the two
+runtimes directly is not meaningful. For this reason, this document does not
+propose any standard top-level runtime metric instruments. See [OTEP
+108](https://github.com/open-telemetry/oteps/pull/108/files) for additional
+discussion.
+
+#### Runtime Specific Metrics - `runtime.{environment}.`
+
+Runtime level metrics specific to a certain runtime environment should be
+prefixed with `runtime.{environment}.` and follow the semantic conventions
+outlined in [semantic conventions for system
+metrics](system-metrics.md#semantic-conventions). For example, Go runtime
+metrics use `runtime.go.` as a prefix.
+
+Some programming languages have multiple runtime environments that vary
+significantly in their implementation, for example [Python has many
+implementations](https://wiki.python.org/moin/PythonImplementations). For
+these languages, consider using specific `environment` prefixes to avoid
+ambiguity, like `runtime.cpython.` and `runtime.pypy.`.
diff --git a/specification/metrics/semantic_conventions/system-metrics.md b/specification/metrics/semantic_conventions/system-metrics.md
new file mode 100644
index 00000000000..ca6f9f85e65
--- /dev/null
+++ b/specification/metrics/semantic_conventions/system-metrics.md
@@ -0,0 +1,153 @@
+# Semantic Conventions for System Metrics
+
+This document describes instruments and labels for common system level
+metrics in OpenTelemetry. Also included are general semantic conventions for
+system, process, and runtime metrics, which should be considered when
+creating instruments not explicitly defined in the specification.
+
+
+
+
+
+- [Semantic Conventions](#semantic-conventions)
+  * [Instrument Names](#instrument-names)
+  * [Units](#units)
+- [Metric Instruments](#metric-instruments)
+  * [Standard System Metrics - `system.`](#standard-system-metrics---system)
+    + [`system.cpu.`](#systemcpu)
+    + [`system.memory.`](#systemmemory)
+    + [`system.swap.`](#systemswap)
+    + [`system.disk.`](#systemdisk)
+    + [`system.filesystem.`](#systemfilesystem)
+    + [`system.network.`](#systemnetwork)
+    + [`system.process.`](#systemprocess)
+    + [OS Specific System Metrics - `system.{os}.`](#os-specific-system-metrics---systemos)
+
+
+
+## Semantic Conventions
+
+The following semantic conventions aim to keep naming consistent. They
+provide guidelines for most of the cases in this specification and should be
+followed for other instruments not explicitly defined in this document.
+
+### Instrument Names
+
+- **usage** - an instrument that measures an amount used out of a known total
+amount should be called `entity.usage`. For example,
+`system.filesystem.usage` for the amount of disk space used. A measure of
+the amount of an unlimited resource consumed is differentiated from
+**usage**. This may be time, data, etc.
+- **utilization** - an instrument that measures a *value ratio* of usage
+(like percent, but in the range `[0, 1]`) should be called
+`entity.utilization`. For example, `system.memory.utilization` for the ratio
+of memory in use.
+- **time** - an instrument that measures passage of time should be called
+`entity.time`. For example, `system.cpu.time` with varying values of label
+`state` for idle, user, etc.
+- **io** - an instrument that measures bidirectional data flow should be
+called `entity.io` and have labels for direction. For example,
+`system.network.io`.
+- Other instruments that do not fit the above descriptions may be named more
+freely. For example, `system.swap.page_faults` and `system.network.packets`.
+Units do not need to be specified in the names since they are included during
+instrument creation, but can be added if there is ambiguity.
+
+### Units
+
+- Instruments for utilization metrics (that measure the ratio out of a total)
+SHOULD use units of `1`. Such values represent a *value ratio* and are always
+in the range `[0, 1]`.
+- Instruments that measure an integer count of something SHOULD use semantic
+units like `packets`, `errors`, `faults`, etc.
+
+## Metric Instruments
+
+### Standard System Metrics - `system.`
+
+#### `system.cpu.`
+
+**Description:** System level processor metrics.
+| Name                   | Units   | Instrument Type   | Value Type | Label Key | Label Values                        |
+| ---------------------- | ------- | ----------------- | ---------- | --------- | ----------------------------------- |
+| system.cpu.time        | seconds | SumObserver       | Double     | state     | idle, user, system, interrupt, etc. |
+|                        |         |                   |            | cpu       | 1 - #cores                          |
+| system.cpu.utilization | 1       | UpDownSumObserver | Double     | state     | idle, user, system, interrupt, etc. |
+|                        |         |                   |            | cpu       | 1 - #cores                          |
+
+#### `system.memory.`
+
+**Description:** System level memory metrics.
+| Name                      | Units | Instrument Type   | Value Type | Label Key | Label Values             |
+| ------------------------- | ----- | ----------------- | ---------- | --------- | ------------------------ |
+| system.memory.usage       | bytes | UpDownSumObserver | Int64      | state     | used, free, cached, etc. |
+| system.memory.utilization | 1     | ValueObserver     | Double     | state     | used, free, cached, etc. |
+
+#### `system.swap.`
+
+**Description:** System level swap/paging metrics.
+| Name                         | Units      | Instrument Type   | Value Type | Label Key | Label Values |
+| ---------------------------- | ---------- | ----------------- | ---------- | --------- | ------------ |
+| system.swap.usage            | pages      | UpDownSumObserver | Int64      | state     | used, free   |
+| system.swap.utilization      | 1          | ValueObserver     | Double     | state     | used, free   |
+| system.swap.page\_faults     | faults     | SumObserver       | Int64      | type      | major, minor |
+| system.swap.page\_operations | operations | SumObserver       | Int64      | type      | major, minor |
+|                              |            |                   |            | direction | in, out      |
+
+#### `system.disk.`
+
+**Description:** System level disk performance metrics.
+| Name                         | Units      | Instrument Type | Value Type | Label Key | Label Values |
+| ---------------------------- | ---------- | --------------- | ---------- | --------- | ------------ |
+| system.disk.io               | bytes      | SumObserver     | Int64      | device    | (identifier) |
+|                              |            |                 |            | direction | read, write  |
+| system.disk.operations       | operations | SumObserver     | Int64      | device    | (identifier) |
+|                              |            |                 |            | direction | read, write  |
+| system.disk.time             | seconds    | SumObserver     | Double     | device    | (identifier) |
+|                              |            |                 |            | direction | read, write  |
+| system.disk.merged           | 1          | SumObserver     | Int64      | device    | (identifier) |
+|                              |            |                 |            | direction | read, write  |
+
+#### `system.filesystem.`
+
+**Description:** System level filesystem metrics.
+| Name                          | Units | Instrument Type   | Value Type | Label Key | Label Values         |
+| ----------------------------- | ----- | ----------------- | ---------- | --------- | -------------------- |
+| system.filesystem.usage       | bytes | UpDownSumObserver | Int64      | device    | (identifier)         |
+|                               |       |                   |            | state     | used, free, reserved |
+| system.filesystem.utilization | 1     | ValueObserver     | Double     | device    | (identifier)         |
+|                               |       |                   |            | state     | used, free, reserved |
+
+#### `system.network.`
+
+**Description:** System level network metrics.
+| Name                            | Units       | Instrument Type   | Value Type | Label Key | Label Values                                                                                   |
+| ------------------------------- | ----------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- |
+| system.network.dropped\_packets | packets     | SumObserver       | Int64      | device    | (identifier)                                                                                   |
+|                                 |             |                   |            | direction | transmit, receive                                                                              |
+| system.network.packets          | packets     | SumObserver       | Int64      | device    | (identifier)                                                                                   |
+|                                 |             |                   |            | direction | transmit, receive                                                                              |
+| system.network.errors           | errors      | SumObserver       | Int64      | device    | (identifier)                                                                                   |
+|                                 |             |                   |            | direction | transmit, receive                                                                              |
+| system.network.io               | bytes       | SumObserver       | Int64      | device    | (identifier)                                                                                   |
+|                                 |             |                   |            | direction | transmit, receive                                                                              |
+| system.network.connections      | connections | UpDownSumObserver | Int64      | device    | (identifier)                                                                                   |
+|                                 |             |                   |            | protocol  | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols)                      |
+|                                 |             |                   |            | state     | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) |
+
+#### `system.process.`
+
+**Description:** System level aggregate process metrics. For metrics at the
+individual process level, see [process metrics](process-metrics.md).
+| Name                 | Units     | Instrument Type | Value Type | Label Key | Label Values                                                                                   |
+| -------------------- | --------- | --------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- |
+| system.process.count | processes | SumObserver     | Int64      | status    | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) |
+
+#### OS Specific System Metrics - `system.{os}.`
+
+Instrument names for system level metrics that have different and conflicting
+meaning across multiple OSes should be prefixed with `system.{os}.` and
+follow the hierarchies listed above for different entities like CPU, memory,
+and network. For example, an instrument for measuring the load average on
+Linux could be named `system.linux.cpu.load`, reusing the `cpu` name proposed
+above.
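
Reviewer note, not part of the patch: the relationship between `entity.usage` and `entity.utilization`, and the `system.{os}.` naming rule, can be illustrated with a small Python sketch. It does not use any OpenTelemetry API. The helper names `utilization_from_usage` and `os_specific_name` and the sample byte counts are hypothetical, and it assumes that the per-state usage values add up to the known total.

```python
from typing import Dict


def utilization_from_usage(usage_by_state: Dict[str, int]) -> Dict[str, float]:
    """Turn per-state usage amounts (e.g. bytes) into ratios in [0, 1].

    Assumption for this sketch: the states partition the total, so the
    total is the sum of all per-state values.
    """
    total = sum(usage_by_state.values())
    if total <= 0:
        raise ValueError("total usage must be positive")
    # Each ratio carries the unit `1`, as the Units section recommends.
    return {state: amount / total for state, amount in usage_by_state.items()}


def os_specific_name(os_name: str, entity: str, metric: str) -> str:
    """Build an OS-specific instrument name following `system.{os}.`."""
    return f"system.{os_name}.{entity}.{metric}"


# Hypothetical readings for `system.memory.usage`, keyed by the `state` label.
memory_usage = {"used": 6, "free": 2}
memory_utilization = utilization_from_usage(memory_usage)

# Hypothetical Linux-only instrument that reuses the `cpu` entity name.
load_name = os_specific_name("linux", "cpu", "load")
```

Here `memory_utilization` holds the values an instrument named `system.memory.utilization` would report for the same `state` labels, and `load_name` matches the `system.linux.cpu.load` example from the text.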