Add metrics spec supplementary guidelines #1966 (merged, 19 commits, Sep 30, 2021)

New file: `specification/metrics/supplementary-guidelines.md` (+251 lines)

# Supplementary Guidelines

Note: this document is NOT a spec; it is provided to support the Metrics
[API](./api.md) and [SDK](./sdk.md) specifications. It does NOT add any extra
requirements to the existing specifications.

Table of Contents:

* [Guidelines for instrumentation library
authors](#guidelines-for-instrumentation-library-authors)
* [Guidelines for SDK authors](#guidelines-for-sdk-authors)
* [Aggregation temporality](#aggregation-temporality)
* [Memory management](#memory-management)

## Guidelines for instrumentation library authors

TBD

## Guidelines for SDK authors

### Aggregation temporality

The OpenTelemetry Metrics [Data Model](./datamodel.md) and [SDK](./sdk.md) are
designed to support both Cumulative and Delta
[Temporality](./datamodel.md#temporality). It is important to understand that
the choice of temporality will impact how the SDK manages memory usage. Let's
take the following HTTP request example:

* During the time range (T<sub>0</sub>, T<sub>1</sub>]:
* verb = `GET`, status = `200`, duration = `50 (ms)`
* verb = `GET`, status = `200`, duration = `100 (ms)`
* verb = `GET`, status = `500`, duration = `1 (ms)`
* During the time range (T<sub>1</sub>, T<sub>2</sub>]:
* no HTTP request has been received
* During the time range (T<sub>2</sub>, T<sub>3</sub>]:
* verb = `GET`, status = `500`, duration = `5 (ms)`
* verb = `GET`, status = `500`, duration = `2 (ms)`
* During the time range (T<sub>3</sub>, T<sub>4</sub>]:
* verb = `GET`, status = `200`, duration = `100 (ms)`
* During the time range (T<sub>4</sub>, T<sub>5</sub>]:
* verb = `GET`, status = `200`, duration = `100 (ms)`
* verb = `GET`, status = `200`, duration = `30 (ms)`
* verb = `GET`, status = `200`, duration = `50 (ms)`

Let's imagine we export the metrics as a [Histogram](./datamodel.md#histogram),
and to simplify the story we will only use one histogram bucket `(-Inf, +Inf)`:

If we export the metrics using **Delta Temporality**:

* (T<sub>0</sub>, T<sub>1</sub>]
* dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max:
`100 (ms)`
* dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max:
`1 (ms)`
* (T<sub>1</sub>, T<sub>2</sub>]
* nothing, since no Measurement was received
* (T<sub>2</sub>, T<sub>3</sub>]
* dimensions: {verb = `GET`, status = `500`}, count: `2`, min: `2 (ms)`, max:
`5 (ms)`
* (T<sub>3</sub>, T<sub>4</sub>]
* dimensions: {verb = `GET`, status = `200`}, count: `1`, min: `100 (ms)`,
max: `100 (ms)`
* (T<sub>4</sub>, T<sub>5</sub>]
* dimensions: {verb = `GET`, status = `200`}, count: `3`, min: `30 (ms)`, max:
`100 (ms)`

You can see that the SDK **only needs to track what has happened after the
latest collection/export cycle**. For example, once the SDK starts to process
measurements in (T<sub>1</sub>, T<sub>2</sub>], it can completely forget what
happened during (T<sub>0</sub>, T<sub>1</sub>].
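
To make this concrete, here is a minimal sketch (in Go, with hypothetical type
and function names - this is not taken from any actual OpenTelemetry SDK) of a
delta-style aggregator that can throw its state away at every collection:

```go
package main

import "fmt"

// Attrs is a hypothetical, comparable key for one dimension permutation,
// e.g. "verb=GET,status=200".
type Attrs string

// histogramPoint holds the running aggregation for one permutation.
type histogramPoint struct {
	count    int64
	min, max float64
}

// DeltaAggregator only tracks measurements received since the last Collect.
type DeltaAggregator struct {
	points map[Attrs]*histogramPoint
}

func NewDeltaAggregator() *DeltaAggregator {
	return &DeltaAggregator{points: map[Attrs]*histogramPoint{}}
}

func (a *DeltaAggregator) Record(attrs Attrs, value float64) {
	p, ok := a.points[attrs]
	if !ok {
		a.points[attrs] = &histogramPoint{count: 1, min: value, max: value}
		return
	}
	p.count++
	if value < p.min {
		p.min = value
	}
	if value > p.max {
		p.max = value
	}
}

// Collect hands the accumulated points to the exporter, then forgets them:
// with Delta temporality, nothing from previous cycles is needed again.
func (a *DeltaAggregator) Collect() map[Attrs]*histogramPoint {
	out := a.points
	a.points = map[Attrs]*histogramPoint{} // start fresh for the next cycle
	return out
}

func main() {
	agg := NewDeltaAggregator()
	agg.Record("verb=GET,status=200", 50)
	agg.Record("verb=GET,status=200", 100)
	agg.Record("verb=GET,status=500", 1)
	fmt.Println(len(agg.Collect())) // 2 points for (T0, T1]
	fmt.Println(len(agg.Collect())) // 0 - (T1, T2] starts from scratch
}
```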

If we export the metrics using **Cumulative Temporality**:

* (T<sub>0</sub>, T<sub>1</sub>]
* dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max:
`100 (ms)`
* dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max:
`1 (ms)`
* (T<sub>0</sub>, T<sub>2</sub>]
* dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max:
`100 (ms)`
* dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max:
`1 (ms)`
* (T<sub>0</sub>, T<sub>3</sub>]
* dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max:
`100 (ms)`
* dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max:
`5 (ms)`
* (T<sub>0</sub>, T<sub>4</sub>]
* dimensions: {verb = `GET`, status = `200`}, count: `3`, min: `50 (ms)`, max:
`100 (ms)`
* dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max:
`5 (ms)`
* (T<sub>0</sub>, T<sub>5</sub>]
* dimensions: {verb = `GET`, status = `200`}, count: `6`, min: `30 (ms)`, max:
`100 (ms)`
* dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max:
`5 (ms)`

You can see that we are performing Delta->Cumulative conversion, so the SDK
**has to track what has happened prior to the latest collection/export cycle**;
in the worst case, the SDK **will have to remember what has happened since the
very beginning of the process**.
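
For contrast, here is what a cumulative counterpart of the sketch above might
look like (again with hypothetical names, reusing the `Attrs` and
`histogramPoint` types from the previous sketch) - note that `Collect` must not
discard anything:

```go
// CumulativeAggregator records measurements exactly like DeltaAggregator,
// but it reports totals since the start time, so it must keep every
// permutation it has ever seen.
type CumulativeAggregator struct {
	points map[Attrs]*histogramPoint // grows monotonically, never cleared
}

// Collect snapshots the all-time totals; the underlying map is retained,
// which is exactly where the memory cost of Cumulative temporality lives.
func (a *CumulativeAggregator) Collect() map[Attrs]histogramPoint {
	out := make(map[Attrs]histogramPoint, len(a.points))
	for attrs, p := range a.points {
		out[attrs] = *p // copy; the original keeps accumulating
	}
	return out
}
```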

Imagine a long-running service where we collect metrics with 7 dimensions,
each of which can take 30 different values. We might eventually end up having
to remember the complete set of all 30<sup>7</sup> = `21,870,000,000`
permutations! This **cardinality explosion** is a well-known challenge in the
metrics space.

Making things even worse, if we export all the permutations even when there
are no recent updates, the export batch could become huge and very costly. For
example, do we really need/want to export the same values again for
(T<sub>0</sub>, T<sub>2</sub>] in the above case?

So here are some suggestions that we encourage SDK implementers to consider:

* You want to control the memory usage rather than allow it to grow
  indefinitely/unbounded - regardless of which aggregation temporality is
  being used.
* You want to improve memory efficiency by being able to **forget about things
  that are no longer needed**.
* You probably don't want to keep exporting the same thing over and over again
  if there are no updates. You might want to consider [Resets and
  Gaps](./datamodel.md#resets-and-gaps) - for example, if a Cumulative metrics
  stream hasn't received any updates for a long period of time, would it be
  okay to reset the start time? (A sketch of this idea follows the list.)
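
Continuing the earlier sketches (and assuming each `histogramPoint` also
carries a hypothetical `lastUpdate time.Time` field), a cumulative store could
evict streams that have been idle for too long; if such a stream reappears
later, it simply gets a new start time, which shows up as a well-defined reset:

```go
// evictStale forgets cumulative streams that have not been updated for
// maxIdle. A stream that comes back later starts over with a new start
// time - a reset, per the data model's Resets and Gaps semantics.
func (a *CumulativeAggregator) evictStale(now time.Time, maxIdle time.Duration) {
	for attrs, p := range a.points {
		if now.Sub(p.lastUpdate) > maxIdle {
			delete(a.points, attrs) // memory is reclaimed
		}
	}
}
```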

In the above case, we have Measurements reported by a [Histogram
Instrument](./api.md#histogram). What if we collect measurements from an
[Asynchronous Counter](./api.md#asynchronous-counter)?

The following example shows the number of [page
faults](https://en.wikipedia.org/wiki/Page_fault) of each thread since the
thread started:

* During the time range (T<sub>0</sub>, T<sub>1</sub>]:
* pid = `1001`, tid = `1`, #PF = `50`
* pid = `1001`, tid = `2`, #PF = `30`
* During the time range (T<sub>1</sub>, T<sub>2</sub>]:
* pid = `1001`, tid = `1`, #PF = `53`
* pid = `1001`, tid = `2`, #PF = `38`
* During the time range (T<sub>2</sub>, T<sub>3</sub>]:
* pid = `1001`, tid = `1`, #PF = `56`
* pid = `1001`, tid = `2`, #PF = `42`
* During the time range (T<sub>3</sub>, T<sub>4</sub>]:
* pid = `1001`, tid = `1`, #PF = `60`
* pid = `1001`, tid = `2`, #PF = `47`
* During the time range (T<sub>4</sub>, T<sub>5</sub>]:
* thread 1 died, thread 3 started
* pid = `1001`, tid = `2`, #PF = `53`
* pid = `1001`, tid = `3`, #PF = `5`

If we export the metrics using **Cumulative Temporality**:

* (T<sub>0</sub>, T<sub>1</sub>]
* dimensions: {pid = `1001`, tid = `1`}, sum: `50`
* dimensions: {pid = `1001`, tid = `2`}, sum: `30`
* (T<sub>0</sub>, T<sub>2</sub>]
* dimensions: {pid = `1001`, tid = `1`}, sum: `53`
* dimensions: {pid = `1001`, tid = `2`}, sum: `38`
* (T<sub>0</sub>, T<sub>3</sub>]
* dimensions: {pid = `1001`, tid = `1`}, sum: `56`
* dimensions: {pid = `1001`, tid = `2`}, sum: `42`
* (T<sub>0</sub>, T<sub>4</sub>]
* dimensions: {pid = `1001`, tid = `1`}, sum: `60`
* dimensions: {pid = `1001`, tid = `2`}, sum: `47`
* (T<sub>0</sub>, T<sub>5</sub>]
* dimensions: {pid = `1001`, tid = `2`}, sum: `53`
* dimensions: {pid = `1001`, tid = `3`}, sum: `5`

It is quite straightforward - we just take the data being reported by the
asynchronous instruments and send it. We might want to consider whether [Resets
and Gaps](./datamodel.md#resets-and-gaps) should be used to denote the end of a
metric stream - e.g. when thread 1 dies, its thread ID might be reused by the
operating system, and we probably don't want to confuse the metrics backend.

If we export the metrics using **Delta Temporality**:

* (T<sub>0</sub>, T<sub>1</sub>]
* dimensions: {pid = `1001`, tid = `1`}, delta: `50`
* dimensions: {pid = `1001`, tid = `2`}, delta: `30`
* (T<sub>1</sub>, T<sub>2</sub>]
* dimensions: {pid = `1001`, tid = `1`}, delta: `3`
* dimensions: {pid = `1001`, tid = `2`}, delta: `8`
* (T<sub>2</sub>, T<sub>3</sub>]
* dimensions: {pid = `1001`, tid = `1`}, delta: `3`
* dimensions: {pid = `1001`, tid = `2`}, delta: `4`
* (T<sub>3</sub>, T<sub>4</sub>]
* dimensions: {pid = `1001`, tid = `1`}, delta: `4`
* dimensions: {pid = `1001`, tid = `2`}, delta: `5`
* (T<sub>4</sub>, T<sub>5</sub>]
* dimensions: {pid = `1001`, tid = `2`}, delta: `6`
* dimensions: {pid = `1001`, tid = `3`}, delta: `5`

You can see that we are performing Cumulative->Delta conversion, which requires
us to remember the last value of **every single permutation we have encountered
so far** - if we don't, we won't be able to calculate the delta value as
`current value - last value`. As you can tell, this is very expensive.
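
A minimal sketch of such a converter (hypothetical names, for a monotonic sum)
makes the cost visible: the `last` map can only grow, one entry per permutation
ever seen:

```go
package main

// DeltaConverter performs Cumulative->Delta conversion for a monotonic sum.
// It must remember the last reported value of every permutation.
type DeltaConverter struct {
	last map[string]int64
}

func NewDeltaConverter() *DeltaConverter {
	return &DeltaConverter{last: map[string]int64{}}
}

// Convert turns one cumulative report into per-permutation deltas.
// Permutations that disappear (e.g. a dead thread) linger in c.last
// forever unless they are explicitly evicted.
func (c *DeltaConverter) Convert(report map[string]int64) map[string]int64 {
	deltas := make(map[string]int64, len(report))
	for attrs, value := range report {
		deltas[attrs] = value - c.last[attrs] // a brand-new stream reads last as 0
		c.last[attrs] = value
	}
	return deltas
}
```

Feeding the page-fault reports above through such a converter yields exactly
the deltas listed - e.g. `53 - 50 = 3` for tid `1` in (T<sub>1</sub>,
T<sub>2</sub>], and `5 - 0 = 5` for the brand-new tid `3`.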

Making it more interesting, once min/max values are involved, it is
**mathematically impossible** to reliably deduce Delta temporality from
Cumulative temporality. For example:

* If the maximum value is 10 during (T<sub>0</sub>, T<sub>2</sub>] and the
maximum value is 20 during (T<sub>0</sub>, T<sub>3</sub>], we know that the
maximum value during (T<sub>2</sub>, T<sub>3</sub>] must be 20.
* If the maximum value is 20 during (T<sub>0</sub>, T<sub>2</sub>] and the
maximum value is also 20 during (T<sub>0</sub>, T<sub>3</sub>], we wouldn't
know what the maximum value is during (T<sub>2</sub>, T<sub>3</sub>], unless
we know that there is no value (count = 0).

So here are some suggestions that we encourage SDK implementers to consider:

* You probably don't want to encourage your users to do Cumulative->Delta
  conversion. Actually, you might want to discourage them from doing this.
* If you have to do Cumulative->Delta conversion and you encounter min/max,
  rather than dropping the data on the floor, you might want to convert it to
  something useful - e.g. a [Gauge](./datamodel.md#gauge).

### Memory management

Memory management is a wide topic; here we will only cover some of the most
important aspects for an OpenTelemetry SDK.

**Choose a design that gives the SDK less to memorize** - avoid keeping things
in memory unless there is a real need. One good example is the choice of
[aggregation temporality](#aggregation-temporality).

**Design a better memory layout**, so that storage is efficient and access is
fast. This is normally specific to the target programming language and
platform - for example, aligning memory to the CPU cache line, keeping hot
memory close together, and keeping memory close to the hardware (e.g. the
non-paged pool,
[NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access)).
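
As one concrete (and hypothetical) illustration of cache-line awareness in Go:
padding a hot per-stream counter to the typical 64-byte cache line prevents two
counters updated by different CPU cores from sharing a line ("false sharing"):

```go
// hotCounter is padded so that adjacent counters in a slice never share
// a 64-byte cache line, avoiding false sharing between CPU cores.
type hotCounter struct {
	value int64
	_     [56]byte // 8-byte value + 56 bytes of padding = 64 bytes
}
```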

**Pre-allocate and pool the memory**, so the SDK doesn't have to allocate
memory on the fly. This is especially useful for language runtimes that have
garbage collectors, as it keeps the hot path in the code from triggering
garbage collection.
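
For example, in Go one might pool measurement buffers with `sync.Pool` so the
hot recording path reuses memory instead of allocating (a sketch with
hypothetical names):

```go
package main

import "sync"

// batchPool hands out reusable measurement buffers so that recording
// does not allocate on the hot path and does not create GC pressure.
var batchPool = sync.Pool{
	New: func() any { return make([]float64, 0, 1024) },
}

func recordBatch(values []float64) {
	batch := batchPool.Get().([]float64)[:0] // reuse the pooled backing array
	batch = append(batch, values...)
	// ... hand the batch to the aggregation pipeline ...
	batchPool.Put(batch) // return it for reuse instead of discarding it
}
```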

**Limit the memory usage and handle critical memory conditions.** The general
expectation is that a telemetry SDK should not fail the application. This can
be done via a dimension-capping algorithm - e.g. start to combine/drop some
data points when the SDK hits the memory limit, and provide a mechanism to
report the data loss.
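
A dimension-capping sketch (hypothetical names and limit) might look like
this: once the number of tracked permutations hits the cap, new permutations
are folded into a single overflow bucket, so memory stays bounded and the data
loss remains visible:

```go
package main

// maxPoints is a hypothetical limit; a real SDK would make this configurable.
const maxPoints = 2000

type cappedStore struct {
	points   map[string]int64
	overflow int64 // values for permutations beyond the cap are combined here
	dropped  bool  // lets the SDK report that data loss has occurred
}

func (s *cappedStore) add(attrs string, value int64) {
	if _, ok := s.points[attrs]; !ok && len(s.points) >= maxPoints {
		s.overflow += value // combine rather than track a new permutation
		s.dropped = true
		return
	}
	s.points[attrs] += value
}
```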

**Provide configuration options to the application owner.** The answer to
_"what is an efficient memory usage"_ ultimately depends on the goals of the
application owner. For example, application owners might want to spend more
memory in order to keep more permutations of metric dimensions, or they might
want to use memory aggressively for certain dimensions that are important and
keep a conservative limit for dimensions that are less important.