System metrics semantic conventions
Conventions from [OTEP 119](open-telemetry/oteps#119)
aabmass committed Sep 9, 2020
1 parent 0229140 commit 1040fc2
Showing 5 changed files with 224 additions and 1 deletion.
2 changes: 2 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -23,6 +23,8 @@ New:
([#697](https://github.com/open-telemetry/opentelemetry-specification/pull/697))
* API was extended to allow adding arbitrary event attributes ([#874](https://github.com/open-telemetry/opentelemetry-specification/pull/874))
* `exception.escaped` was added ([#784](https://github.com/open-telemetry/opentelemetry-specification/pull/784))
- Add semantic conventions for system metrics
([#937](https://github.com/open-telemetry/opentelemetry-specification/pull/937))

Updates:

7 changes: 6 additions & 1 deletion specification/metrics/semantic_conventions/README.md
@@ -1,6 +1,11 @@
# Metrics Semantic Conventions

TODO: Add semantic conventions for metric names and labels.
The following semantic conventions surrounding metrics are defined:

* [HTTP Metrics](http-metrics.md): Semantic conventions and instruments for HTTP metrics.
* [System Metrics](system-metrics.md): Semantic conventions and instruments for standard system metrics.
* [Process Metrics](process-metrics.md): Semantic conventions and instruments for standard process metrics.
* [Runtime Metrics](runtime-metrics.md): Semantic conventions and instruments for runtime metrics.

Apart from semantic conventions for metrics and [traces](../../trace/semantic_conventions/README.md),
OpenTelemetry also defines the concept of overarching [Resources](../../resource/sdk.md) with their own
21 changes: 21 additions & 0 deletions specification/metrics/semantic_conventions/process-metrics.md
@@ -0,0 +1,21 @@
# Semantic Conventions for Process Metrics

This document describes instruments and labels for common process level
metrics in OpenTelemetry. Also consider the general [semantic conventions for
system metrics](system-metrics.md#semantic-conventions) when creating
instruments not explicitly defined in this document.

<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` -->

<!-- toc -->

- [Metric Instruments](#metric-instruments)
  * [Standard Process Metrics - `process.`](#standard-process-metrics---process)

<!-- tocstop -->

## Metric Instruments

### Standard Process Metrics - `process.`

TODO
42 changes: 42 additions & 0 deletions specification/metrics/semantic_conventions/runtime-metrics.md
@@ -0,0 +1,42 @@
# Semantic Conventions for Runtime Metrics

This document describes instruments and labels for common runtime level
metrics in OpenTelemetry. Also consider the general [semantic conventions for
system metrics](system-metrics.md#semantic-conventions) when creating
instruments not explicitly defined in this document.

<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` -->

<!-- toc -->

- [Metric Instruments](#metric-instruments)
  * [Runtime Metrics - `runtime.`](#runtime-metrics---runtime)
    + [Runtime Specific Metrics - `runtime.{environment}.`](#runtime-specific-metrics---runtimeenvironment)

<!-- tocstop -->

## Metric Instruments

### Runtime Metrics - `runtime.`

Runtime environments vary widely in their terminology, implementation, and
relative values for a given metric. For example, Go and Python are both
garbage-collected languages, but directly comparing heap usage between the
two runtimes is not meaningful. For this reason, this document does not
propose any standard top-level runtime metric instruments. See [OTEP
108](https://github.com/open-telemetry/oteps/pull/108/files) for additional
discussion.

#### Runtime Specific Metrics - `runtime.{environment}.`

Runtime level metrics specific to a certain runtime environment should be
prefixed with `runtime.{environment}.` and follow the semantic conventions
outlined in [semantic conventions for system
metrics](system-metrics.md#semantic-conventions). For example, Go runtime
metrics use `runtime.go.` as a prefix.

Some programming languages have multiple runtime environments that vary
significantly in their implementation, for example [Python has many
implementations](https://wiki.python.org/moin/PythonImplementations). For
these languages, consider using specific `environment` prefixes to avoid
ambiguity, like `runtime.cpython.` and `runtime.pypy.`.
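
As an illustrative sketch only (it is not part of the specification), the prefix rule above could be applied like this. The helper names `runtime_prefix` and `current_python_prefix` and the instrument name `gc.count` are hypothetical; `platform.python_implementation()` is the standard-library call that reports the running Python implementation, and lowercasing its result is an assumption made here.

```python
import platform


def runtime_prefix(environment: str) -> str:
    """Return the `runtime.{environment}.` prefix for a runtime environment."""
    env = environment.strip().lower()
    if not env or "." in env or " " in env:
        raise ValueError(f"invalid runtime environment name: {environment!r}")
    return f"runtime.{env}."


def current_python_prefix() -> str:
    """Prefix for the Python implementation running this code.

    platform.python_implementation() returns names such as "CPython" or
    "PyPy", which keeps `runtime.cpython.` and `runtime.pypy.` distinct.
    """
    return runtime_prefix(platform.python_implementation())


# Hypothetical instrument name, for illustration only.
gc_count_name = runtime_prefix("go") + "gc.count"
```

Deriving the prefix from the implementation name rather than from the language name is what avoids the ambiguity described above.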
153 changes: 153 additions & 0 deletions specification/metrics/semantic_conventions/system-metrics.md
@@ -0,0 +1,153 @@
# Semantic Conventions for System Metrics

This document describes instruments and labels for common system level
metrics in OpenTelemetry. Also included are general semantic conventions for
system, process, and runtime metrics, which should be considered when
creating instruments not explicitly defined in the specification.

<!-- Re-generate TOC with `markdown-toc --no-first-h1 -i` -->

<!-- toc -->

- [Semantic Conventions](#semantic-conventions)
  * [Instrument Names](#instrument-names)
  * [Units](#units)
- [Metric Instruments](#metric-instruments)
  * [Standard System Metrics - `system.`](#standard-system-metrics---system)
    + [`system.cpu.`](#systemcpu)
    + [`system.memory.`](#systemmemory)
    + [`system.swap.`](#systemswap)
    + [`system.disk.`](#systemdisk)
    + [`system.filesystem.`](#systemfilesystem)
    + [`system.network.`](#systemnetwork)
    + [`system.process.`](#systemprocess)
    + [OS Specific System Metrics - `system.{os}.`](#os-specific-system-metrics---systemos)

<!-- tocstop -->

## Semantic Conventions

The following semantic conventions aim to keep naming consistent. They
provide guidelines for most of the cases in this specification and should be
followed for other instruments not explicitly defined in this document.

### Instrument Names

- **usage** - an instrument that measures an amount used out of a known total
amount should be called `entity.usage`. For example,
`system.filesystem.usage` for the amount of disk space used. The amount
consumed of an unlimited resource, such as time or data, is not **usage** in
this sense.
- **utilization** - an instrument that measures a *value ratio* of usage
(like percent, but in the range `[0, 1]`) should be called
`entity.utilization`. For example, `system.memory.utilization` for the ratio
of memory in use.
- **time** - an instrument that measures passage of time should be called
`entity.time`. For example, `system.cpu.time` with varying values of label
`state` for idle, user, etc.
- **io** - an instrument that measures bidirectional data flow should be
called `entity.io` and have labels for direction. For example,
`system.network.io`.
- Other instruments that do not fit the above descriptions may be named more
freely. For example, `system.swap.page_faults` and `system.network.packets`.
Units do not need to be specified in the names since they are included during
instrument creation, but can be added if there is ambiguity.
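
The naming rules above can be sketched as a small lookup, shown here only as an illustration. The names `RESERVED_SUFFIXES`, `instrument_name`, and `describe` are hypothetical helpers, not part of any OpenTelemetry API.

```python
# Suffixes with a reserved meaning in the naming conventions above.
RESERVED_SUFFIXES = {
    "usage": "amount used out of a known total",
    "utilization": "value ratio of usage in the range [0, 1]",
    "time": "passage of time",
    "io": "bidirectional data flow, with a direction label",
}


def instrument_name(entity: str, suffix: str) -> str:
    """Join an entity such as `system.memory` with a measurement suffix."""
    return f"{entity.rstrip('.')}.{suffix}"


def describe(name: str) -> str:
    """Return the conventional meaning of the last component of a name."""
    suffix = name.rsplit(".", 1)[-1]
    return RESERVED_SUFFIXES.get(suffix, "free-form name")


filesystem_usage = instrument_name("system.filesystem", "usage")
```

Names such as `system.swap.page_faults` fall through to the free-form case, matching the last rule in the list.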

### Units

- Instruments for utilization metrics (that measure the ratio out of a total)
SHOULD use units of `1`. Such values represent a *value ratio* and are always
in the range `[0, 1]`.
- Instruments that measure an integer count of something SHOULD use semantic
units like `packets`, `errors`, `faults`, etc.
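
A minimal sketch of how a utilization value with unit `1` might be derived from usage values follows. It assumes that the listed states partition the total (for example, used + free + cached = total memory); the function name and the numbers are made up for illustration.

```python
from typing import Dict


def utilization(usage_by_state: Dict[str, int]) -> Dict[str, float]:
    """Convert per-state usage (e.g. bytes) into value ratios in [0, 1]."""
    total = sum(usage_by_state.values())
    if total <= 0:
        raise ValueError("total usage must be positive")
    return {state: used / total for state, used in usage_by_state.items()}


# Made-up numbers for illustration: bytes of memory per state.
memory_usage = {"used": 6 * 2**30, "free": 1 * 2**30, "cached": 1 * 2**30}
memory_utilization = utilization(memory_usage)
# memory_utilization["used"] is 0.75, reported with unit `1` rather than 75%.
```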

## Metric Instruments

### Standard System Metrics - `system.`

#### `system.cpu.`

**Description:** System level processor metrics.

| Name | Units | Instrument Type | Value Type | Label Key | Label Values |
| ---------------------- | ------- | ----------------- | ---------- | --------- | ----------------------------------- |
| system.cpu.time | seconds | SumObserver | Double | state | idle, user, system, interrupt, etc. |
| | | | | cpu | 1 - #cores |
| system.cpu.utilization | 1 | ValueObserver | Double | state | idle, user, system, interrupt, etc. |
| | | | | cpu | 1 - #cores |
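
One possible way to relate the two instruments, sketched under assumptions: take two cumulative `system.cpu.time` observations keyed by the `cpu` and `state` labels and divide each state's delta by the total delta for that CPU. The specification does not prescribe this calculation; the function name and sample values are hypothetical.

```python
from typing import Dict, Tuple

# Label set (cpu, state) -> cumulative seconds, as `system.cpu.time` reports.
CpuTimes = Dict[Tuple[int, str], float]


def cpu_utilization(previous: CpuTimes, current: CpuTimes) -> CpuTimes:
    """Per-CPU, per-state ratio of time spent between two observations."""
    deltas = {key: current[key] - previous.get(key, 0.0) for key in current}
    # Total elapsed CPU time per logical CPU across all states.
    totals: Dict[int, float] = {}
    for (cpu, _state), delta in deltas.items():
        totals[cpu] = totals.get(cpu, 0.0) + delta
    return {
        (cpu, state): (delta / totals[cpu] if totals[cpu] > 0 else 0.0)
        for (cpu, state), delta in deltas.items()
    }


# Made-up samples for CPU 1, taken 10 seconds apart.
before = {(1, "user"): 100.0, (1, "system"): 40.0, (1, "idle"): 860.0}
after = {(1, "user"): 104.0, (1, "system"): 41.0, (1, "idle"): 865.0}
ratios = cpu_utilization(before, after)
# ratios[(1, "user")] is 0.4: 4 of the 10 elapsed seconds were in user mode.
```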

#### `system.memory.`

**Description:** System level memory metrics.

| Name | Units | Instrument Type | Value Type | Label Key | Label Values |
| ------------------------- | ----- | ----------------- | ---------- | --------- | ------------------------ |
| system.memory.usage | bytes | UpDownSumObserver | Int64 | state | used, free, cached, etc. |
| system.memory.utilization | 1 | ValueObserver | Double | state | used, free, cached, etc. |

#### `system.swap.`

**Description:** System level swap/paging metrics.

| Name | Units | Instrument Type | Value Type | Label Key | Label Values |
| ---------------------------- | ---------- | ----------------- | ---------- | --------- | ------------ |
| system.swap.usage | pages | UpDownSumObserver | Int64 | state | used, free |
| system.swap.utilization | 1 | ValueObserver | Double | state | used, free |
| system.swap.page\_faults | faults | SumObserver | Int64 | type | major, minor |
| system.swap.page\_operations | operations | SumObserver | Int64 | type | major, minor |
| | | | | direction | in, out |

#### `system.disk.`

**Description:** System level disk performance metrics.

| Name | Units | Instrument Type | Value Type | Label Key | Label Values |
| ---------------------------- | ---------- | --------------- | ---------- | --------- | ------------ |
| system.disk.io<!--notlink--> | bytes | SumObserver | Int64 | device | (identifier) |
| | | | | direction | read, write |
| system.disk.operations | operations | SumObserver | Int64 | device | (identifier) |
| | | | | direction | read, write |
| system.disk.time | seconds | SumObserver | Double | device | (identifier) |
| | | | | direction | read, write |
| system.disk.merged | 1 | SumObserver | Int64 | device | (identifier) |
| | | | | direction | read, write |

#### `system.filesystem.`

**Description:** System level filesystem metrics.

| Name | Units | Instrument Type | Value Type | Label Key | Label Values |
| ----------------------------- | ----- | ----------------- | ---------- | --------- | -------------------- |
| system.filesystem.usage | bytes | UpDownSumObserver | Int64 | device | (identifier) |
| | | | | state | used, free, reserved |
| system.filesystem.utilization | 1 | ValueObserver | Double | device | (identifier) |
| | | | | state | used, free, reserved |

#### `system.network.`

**Description:** System level network metrics.

| Name | Units | Instrument Type | Value Type | Label Key | Label Values |
| ------------------------------- | ----------- | ----------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- |
| system.network.dropped\_packets | packets | SumObserver | Int64 | device | (identifier) |
| | | | | direction | transmit, receive |
| system.network.packets | packets | SumObserver | Int64 | device | (identifier) |
| | | | | direction | transmit, receive |
| system.network.errors | errors | SumObserver | Int64 | device | (identifier) |
| | | | | direction | transmit, receive |
| system<!--notlink-->.network.io | bytes | SumObserver | Int64 | device | (identifier) |
| | | | | direction | transmit, receive |
| system.network.connections | connections | UpDownSumObserver | Int64 | device | (identifier) |
| | | | | protocol | tcp, udp, [etc.](https://en.wikipedia.org/wiki/Transport_layer#Protocols) |
| | | | | state | [e.g. for tcp](https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Protocol_operation) |
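
To illustrate how the `device` and `direction` labels combine on a single instrument, the sketch below models `system.network.io` observations as a mapping from label set to cumulative bytes and sums them per direction. The device names, values, and the `total_by_direction` helper are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple

# (device, direction) -> cumulative bytes for `system.network.io`.
# Device names and values are made up for illustration.
network_io: Dict[Tuple[str, str], int] = {
    ("eth0", "transmit"): 1_500,
    ("eth0", "receive"): 4_000,
    ("lo", "transmit"): 250,
    ("lo", "receive"): 250,
}


def total_by_direction(points: Dict[Tuple[str, str], int]) -> Dict[str, int]:
    """Sum observations across devices, keeping the direction label."""
    totals: Dict[str, int] = defaultdict(int)
    for (_device, direction), value in points.items():
        totals[direction] += value
    return dict(totals)


io_totals = total_by_direction(network_io)
```

Because both directions share one instrument name, dropping the `device` label is a plain sum and needs no second metric.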

#### `system.process.`

**Description:** System level aggregate process metrics. For metrics at the
individual process level, see [process metrics](process-metrics.md).

| Name | Units | Instrument Type | Value Type | Label Key | Label Values |
| -------------------- | --------- | --------------- | ---------- | --------- | ---------------------------------------------------------------------------------------------- |
| system.process.count | processes | SumObserver | Int64 | status | running, sleeping, [etc.](https://man7.org/linux/man-pages/man1/ps.1.html#PROCESS_STATE_CODES) |

#### OS Specific System Metrics - `system.{os}.`

Instrument names for system level metrics that have different and conflicting
meanings across multiple OSes should be prefixed with `system.{os}.` and
follow the hierarchies listed above for different entities like CPU, memory,
and network. For example, an instrument for measuring the load average on
Linux could be named `system.linux.cpu.load`, reusing the `cpu` name proposed
above.
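
A minimal sketch of building such a name, assuming the `{os}` component is the lowercased operating system name. `os_specific_name` is a hypothetical helper; `platform.system()` is the standard-library call used as a fallback when no name is given.

```python
import platform


def os_specific_name(suffix: str, os_name: str = "") -> str:
    """Build a `system.{os}.` instrument name such as `system.linux.cpu.load`."""
    # Assumption: fall back to the current OS as reported by the standard library.
    os_part = (os_name or platform.system()).strip().lower()
    if not os_part:
        raise ValueError("could not determine the operating system name")
    return f"system.{os_part}.{suffix.lstrip('.')}"


load_name = os_specific_name("cpu.load", os_name="Linux")
```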
