Add metrics spec supplementary guidelines (#1966)
* add metrics spec supplementary guidelines
* fix wording
* tweak wording
* improve readability
* fix typo
* spellcheck
* fix typo
* add more examples
* try to clarify a bit
* improve wording
* cover more topics
* fix typo
* fix typo
* layout
* fix typo
* Update specification/metrics/supplementary-guidelines.md

Co-authored-by: Alan West <3676547+alanwest@users.noreply.github.com>
Showing 1 changed file with 251 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,251 @@ | ||
# Supplementary Guidelines | ||
|
||
Note: this document is NOT a spec. It is provided to support the Metrics | ||
[API](./api.md) and [SDK](./sdk.md) specifications and does NOT add any extra | ||
requirements to the existing specifications. | ||
|
||
Table of Contents: | ||
|
||
* [Guidelines for instrumentation library | ||
  authors](#guidelines-for-instrumentation-library-authors) | ||
* [Guidelines for SDK authors](#guidelines-for-sdk-authors) | ||
  * [Aggregation temporality](#aggregation-temporality) | ||
  * [Memory management](#memory-management) | ||
|
||
## Guidelines for instrumentation library authors | ||
|
||
TBD | ||
|
||
## Guidelines for SDK authors | ||
|
||
### Aggregation temporality | ||
|
||
The OpenTelemetry Metrics [Data Model](./datamodel.md) and [SDK](./sdk.md) are | ||
designed to support both Cumulative and Delta | ||
[Temporality](./datamodel.md#temporality). It is important to understand that | ||
the choice of temporality affects how the SDK can manage memory usage. Let's take the | ||
following HTTP requests example: | ||
|
||
* During the time range (T<sub>0</sub>, T<sub>1</sub>]: | ||
  * verb = `GET`, status = `200`, duration = `50 (ms)` | ||
  * verb = `GET`, status = `200`, duration = `100 (ms)` | ||
  * verb = `GET`, status = `500`, duration = `1 (ms)` | ||
* During the time range (T<sub>1</sub>, T<sub>2</sub>]: | ||
  * no HTTP request has been received | ||
* During the time range (T<sub>2</sub>, T<sub>3</sub>]: | ||
  * verb = `GET`, status = `500`, duration = `5 (ms)` | ||
  * verb = `GET`, status = `500`, duration = `2 (ms)` | ||
* During the time range (T<sub>3</sub>, T<sub>4</sub>]: | ||
  * verb = `GET`, status = `200`, duration = `100 (ms)` | ||
* During the time range (T<sub>4</sub>, T<sub>5</sub>]: | ||
  * verb = `GET`, status = `200`, duration = `100 (ms)` | ||
  * verb = `GET`, status = `200`, duration = `30 (ms)` | ||
  * verb = `GET`, status = `200`, duration = `50 (ms)` | ||
|
||
Let's imagine we export the metrics as [Histogram](./datamodel.md#histogram), | ||
and to simplify the story we will only have one histogram bucket `(-Inf, +Inf)`: | ||
|
||
If we export the metrics using **Delta Temporality**: | ||
|
||
* (T<sub>0</sub>, T<sub>1</sub>] | ||
  * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: | ||
    `100 (ms)` | ||
  * dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max: | ||
    `1 (ms)` | ||
* (T<sub>1</sub>, T<sub>2</sub>] | ||
  * nothing since we don't have any Measurement received | ||
* (T<sub>2</sub>, T<sub>3</sub>] | ||
  * dimensions: {verb = `GET`, status = `500`}, count: `2`, min: `2 (ms)`, max: | ||
    `5 (ms)` | ||
* (T<sub>3</sub>, T<sub>4</sub>] | ||
  * dimensions: {verb = `GET`, status = `200`}, count: `1`, min: `100 (ms)`, | ||
    max: `100 (ms)` | ||
* (T<sub>4</sub>, T<sub>5</sub>] | ||
  * dimensions: {verb = `GET`, status = `200`}, count: `3`, min: `30 (ms)`, max: | ||
    `100 (ms)` | ||
|
||
You can see that the SDK **only needs to track what has happened after the | ||
latest collection/export cycle**. For example, when the SDK starts to process | ||
measurements in (T<sub>1</sub>, T<sub>2</sub>], it can completely forget about | ||
what has happened during (T<sub>0</sub>, T<sub>1</sub>]. | ||
|
||
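The Delta behavior described above can be sketched in a few lines. This is only an illustration under assumptions: `DeltaHistogram` and `HistogramPoint` are hypothetical names, not part of the OpenTelemetry API or SDK, and the single `(-Inf, +Inf)` bucket is reduced to count/min/max.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class HistogramPoint:
    """State of the single (-Inf, +Inf) bucket for one attribute set."""
    count: int = 0
    min_value: float = float("inf")
    max_value: float = float("-inf")


class DeltaHistogram:
    """Hypothetical aggregator, not an OpenTelemetry API."""

    def __init__(self) -> None:
        self._points = defaultdict(HistogramPoint)

    def record(self, value: float, **attributes: str) -> None:
        point = self._points[tuple(sorted(attributes.items()))]
        point.count += 1
        point.min_value = min(point.min_value, value)
        point.max_value = max(point.max_value, value)

    def collect(self) -> dict:
        # Hand the finished interval to the exporter and start over with empty
        # state: nothing recorded before this call has to stay in memory.
        finished, self._points = dict(self._points), defaultdict(HistogramPoint)
        return finished


histogram = DeltaHistogram()
histogram.record(50, verb="GET", status="200")
histogram.record(100, verb="GET", status="200")
histogram.record(1, verb="GET", status="500")
first_interval = histogram.collect()   # (T0, T1]: two attribute sets
second_interval = histogram.collect()  # (T1, T2]: empty, nothing was recorded
```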
If we export the metrics using **Cumulative Temporality**: | ||
|
||
* (T<sub>0</sub>, T<sub>1</sub>] | ||
  * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: | ||
    `100 (ms)` | ||
  * dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max: | ||
    `1 (ms)` | ||
* (T<sub>0</sub>, T<sub>2</sub>] | ||
  * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: | ||
    `100 (ms)` | ||
  * dimensions: {verb = `GET`, status = `500`}, count: `1`, min: `1 (ms)`, max: | ||
    `1 (ms)` | ||
* (T<sub>0</sub>, T<sub>3</sub>] | ||
  * dimensions: {verb = `GET`, status = `200`}, count: `2`, min: `50 (ms)`, max: | ||
    `100 (ms)` | ||
  * dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max: | ||
    `5 (ms)` | ||
* (T<sub>0</sub>, T<sub>4</sub>] | ||
  * dimensions: {verb = `GET`, status = `200`}, count: `3`, min: `50 (ms)`, max: | ||
    `100 (ms)` | ||
  * dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max: | ||
    `5 (ms)` | ||
* (T<sub>0</sub>, T<sub>5</sub>] | ||
  * dimensions: {verb = `GET`, status = `200`}, count: `6`, min: `30 (ms)`, max: | ||
    `100 (ms)` | ||
  * dimensions: {verb = `GET`, status = `500`}, count: `3`, min: `1 (ms)`, max: | ||
    `5 (ms)` | ||
|
||
You can see that we are performing Delta->Cumulative conversion, and the SDK | ||
**has to track what has happened prior to the latest collection/export cycle**. | ||
In the worst case, the SDK **will have to remember what has happened since the | ||
beginning of the process**. | ||
|
||
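For contrast, here is a minimal sketch of the Cumulative case, again with a hypothetical `CumulativeHistogram` class that is not an OpenTelemetry API. The only difference from a Delta aggregator is that `collect` returns a snapshot and keeps every attribute set it has ever seen.

```python
class CumulativeHistogram:
    """Hypothetical aggregator, not an OpenTelemetry API."""

    def __init__(self) -> None:
        self._points = {}  # attribute set -> [count, min, max]

    def record(self, value: float, **attributes: str) -> None:
        key = tuple(sorted(attributes.items()))
        point = self._points.setdefault(key, [0, value, value])
        point[0] += 1
        point[1] = min(point[1], value)
        point[2] = max(point[2], value)

    def collect(self) -> dict:
        # Return a snapshot but keep the state: every attribute set recorded
        # since the start of the process stays in memory.
        return {key: tuple(point) for key, point in self._points.items()}


histogram = CumulativeHistogram()
histogram.record(50, verb="GET", status="200")
histogram.record(100, verb="GET", status="200")
histogram.record(1, verb="GET", status="500")
t1 = histogram.collect()  # (T0, T1]
t2 = histogram.collect()  # (T0, T2]: identical to t1, exported again
histogram.record(5, verb="GET", status="500")
histogram.record(2, verb="GET", status="500")
t3 = histogram.collect()  # (T0, T3]
```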
Imagine we have a long-running service that collects metrics with 7 | ||
dimensions, where each dimension can have 30 different values. We might eventually | ||
end up having to remember the complete set of all `21,870,000,000` (30<sup>7</sup>) permutations! | ||
This **cardinality explosion** is a well-known challenge in the metrics space. | ||
|
||
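The arithmetic behind that number is shown below. The per-stream memory cost is an assumed, purely illustrative figure, included only to give a sense of scale.

```python
dimensions = 7
values_per_dimension = 30

# Every dimension can independently take any of its values.
streams = values_per_dimension ** dimensions
print(f"{streams:,}")  # 21,870,000,000

# Assumption for illustration only: 64 bytes of aggregation state per stream.
assumed_bytes_per_stream = 64
print(f"{streams * assumed_bytes_per_stream / 1e12:.1f} TB")  # 1.4 TB
```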
To make matters worse, if we export every permutation even when there are no recent | ||
updates, the export batch could become huge and very costly. For | ||
example, do we really need/want to export the same thing for (T<sub>0</sub>, | ||
T<sub>2</sub>] in the above case? | ||
|
||
So here are some suggestions that we encourage SDK implementers to consider: | ||
|
||
* You want to control the memory usage rather than allow it to grow indefinitely | ||
/ unbounded - regardless of what aggregation temporality is being used. | ||
* You want to improve the memory efficiency by being able to **forget about | ||
things that are no longer needed**. | ||
* You probably don't want to keep exporting the same thing over and over again | ||
if there are no updates. You might want to consider [Resets and | ||
Gaps](./datamodel.md#resets-and-gaps). For example, if a Cumulative metric | ||
stream hasn't received any updates for a long period of time, would it be okay | ||
to reset the start time? | ||
|
||
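One possible way to act on the suggestions above is sketched here. It is an assumption-laden illustration, not a recommended design: `CumulativeSums` and its `max_idle` parameter are hypothetical, time is passed in explicitly to keep the example deterministic, and a stream that reappears after being forgotten simply gets a new start time.

```python
class CumulativeSums:
    """Hypothetical cumulative sum storage that forgets idle streams."""

    def __init__(self, max_idle: float) -> None:
        self._max_idle = max_idle
        self._streams = {}  # key -> [start_time, last_update, total]

    def add(self, key: str, value: float, now: float) -> None:
        stream = self._streams.get(key)
        if stream is None:
            # New or previously forgotten stream: begin a fresh cumulative
            # series whose start time is "now" (a reset).
            stream = self._streams[key] = [now, now, 0]
        stream[1] = now
        stream[2] += value

    def collect(self, now: float) -> dict:
        # Forget streams that have been idle for longer than max_idle, so
        # they are neither kept in memory nor exported again.
        stale = [key for key, stream in self._streams.items()
                 if now - stream[1] > self._max_idle]
        for key in stale:
            del self._streams[key]
        # Each exported point is (start_time, end_time, cumulative_sum).
        return {key: (stream[0], now, stream[2])
                for key, stream in self._streams.items()}


sums = CumulativeSums(max_idle=10)
sums.add("GET/200", 1, now=0)
active = sums.collect(now=5)      # stream is still fresh and is exported
forgotten = sums.collect(now=20)  # idle for 20 > 10, so it is dropped
sums.add("GET/200", 2, now=25)
restarted = sums.collect(now=26)  # same key, new start time
```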
In the above case, we have Measurements reported by a [Histogram | ||
Instrument](./api.md#histogram). What if we collect measurements from an | ||
[Asynchronous Counter](./api.md#asynchronous-counter)? | ||
|
||
The following example shows the number of [page | ||
faults](https://en.wikipedia.org/wiki/Page_fault) of each thread, counted from | ||
the moment the thread started: | ||
|
||
* During the time range (T<sub>0</sub>, T<sub>1</sub>]: | ||
  * pid = `1001`, tid = `1`, #PF = `50` | ||
  * pid = `1001`, tid = `2`, #PF = `30` | ||
* During the time range (T<sub>1</sub>, T<sub>2</sub>]: | ||
  * pid = `1001`, tid = `1`, #PF = `53` | ||
  * pid = `1001`, tid = `2`, #PF = `38` | ||
* During the time range (T<sub>2</sub>, T<sub>3</sub>]: | ||
  * pid = `1001`, tid = `1`, #PF = `56` | ||
  * pid = `1001`, tid = `2`, #PF = `42` | ||
* During the time range (T<sub>3</sub>, T<sub>4</sub>]: | ||
  * pid = `1001`, tid = `1`, #PF = `60` | ||
  * pid = `1001`, tid = `2`, #PF = `47` | ||
* During the time range (T<sub>4</sub>, T<sub>5</sub>]: | ||
  * thread 1 died, thread 3 started | ||
  * pid = `1001`, tid = `2`, #PF = `53` | ||
  * pid = `1001`, tid = `3`, #PF = `5` | ||
|
||
If we export the metrics using **Cumulative Temporality**: | ||
|
||
* (T<sub>0</sub>, T<sub>1</sub>] | ||
  * dimensions: {pid = `1001`, tid = `1`}, sum: `50` | ||
  * dimensions: {pid = `1001`, tid = `2`}, sum: `30` | ||
* (T<sub>0</sub>, T<sub>2</sub>] | ||
  * dimensions: {pid = `1001`, tid = `1`}, sum: `53` | ||
  * dimensions: {pid = `1001`, tid = `2`}, sum: `38` | ||
* (T<sub>0</sub>, T<sub>3</sub>] | ||
  * dimensions: {pid = `1001`, tid = `1`}, sum: `56` | ||
  * dimensions: {pid = `1001`, tid = `2`}, sum: `42` | ||
* (T<sub>0</sub>, T<sub>4</sub>] | ||
  * dimensions: {pid = `1001`, tid = `1`}, sum: `60` | ||
  * dimensions: {pid = `1001`, tid = `2`}, sum: `47` | ||
* (T<sub>0</sub>, T<sub>5</sub>] | ||
  * dimensions: {pid = `1001`, tid = `2`}, sum: `53` | ||
  * dimensions: {pid = `1001`, tid = `3`}, sum: `5` | ||
|
||
This is quite straightforward: we just take the data reported by the | ||
asynchronous instruments and send it. We might want to consider whether [Resets and | ||
Gaps](./datamodel.md#resets-and-gaps) should be used to denote the end of a | ||
metric stream. For example, after thread 1 died, its thread ID might be reused by the | ||
operating system, and we probably don't want to confuse the metrics backend. | ||
|
||
If we export the metrics using **Delta Temporality**: | ||
|
||
* (T<sub>0</sub>, T<sub>1</sub>] | ||
  * dimensions: {pid = `1001`, tid = `1`}, delta: `50` | ||
  * dimensions: {pid = `1001`, tid = `2`}, delta: `30` | ||
* (T<sub>1</sub>, T<sub>2</sub>] | ||
  * dimensions: {pid = `1001`, tid = `1`}, delta: `3` | ||
  * dimensions: {pid = `1001`, tid = `2`}, delta: `8` | ||
* (T<sub>2</sub>, T<sub>3</sub>] | ||
  * dimensions: {pid = `1001`, tid = `1`}, delta: `3` | ||
  * dimensions: {pid = `1001`, tid = `2`}, delta: `4` | ||
* (T<sub>3</sub>, T<sub>4</sub>] | ||
  * dimensions: {pid = `1001`, tid = `1`}, delta: `4` | ||
  * dimensions: {pid = `1001`, tid = `2`}, delta: `5` | ||
* (T<sub>4</sub>, T<sub>5</sub>] | ||
  * dimensions: {pid = `1001`, tid = `2`}, delta: `6` | ||
  * dimensions: {pid = `1001`, tid = `3`}, delta: `5` | ||
|
||
You can see that we are performing Cumulative->Delta conversion, and it requires | ||
us to remember the last value of **every single permutation we've encountered so | ||
far**, because if we don't, we won't be able to calculate the delta value using | ||
`current value - last value`. As you can tell, this can be very expensive in memory. | ||
|
||
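The `current value - last value` bookkeeping can be sketched as follows, using the page fault numbers from the example. `CumulativeToDelta` is a hypothetical helper, not an OpenTelemetry API, and this naive version never forgets a key, which is exactly the memory cost described above.

```python
class CumulativeToDelta:
    """Hypothetical converter, not an OpenTelemetry API."""

    def __init__(self) -> None:
        # Last cumulative value of every attribute set seen so far.
        self.last_values = {}

    def convert(self, observations: dict) -> dict:
        deltas = {}
        for key, current in observations.items():
            # An attribute set seen for the first time starts from zero.
            deltas[key] = current - self.last_values.get(key, 0)
            self.last_values[key] = current
        return deltas


converter = CumulativeToDelta()
observed = [  # keys are (pid, tid), values are cumulative page faults
    {(1001, 1): 50, (1001, 2): 30},
    {(1001, 1): 53, (1001, 2): 38},
    {(1001, 1): 56, (1001, 2): 42},
    {(1001, 1): 60, (1001, 2): 47},
    {(1001, 2): 53, (1001, 3): 5},  # thread 1 died, thread 3 started
]
deltas = [converter.convert(batch) for batch in observed]
```

After the last interval `converter.last_values` still holds an entry for thread 1, even though that thread no longer exists.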
To make things more interesting, if we have min/max values, it is **mathematically | ||
impossible** to reliably deduce the Delta temporality values from the Cumulative | ||
temporality values. For example: | ||
|
||
* If the maximum value is 10 during (T<sub>0</sub>, T<sub>2</sub>] and the | ||
maximum value is 20 during (T<sub>0</sub>, T<sub>3</sub>], we know that the | ||
maximum value during (T<sub>2</sub>, T<sub>3</sub>] must be 20. | ||
* If the maximum value is 20 during (T<sub>0</sub>, T<sub>2</sub>] and the | ||
maximum value is also 20 during (T<sub>0</sub>, T<sub>3</sub>], we wouldn't | ||
know what the maximum value is during (T<sub>2</sub>, T<sub>3</sub>], unless | ||
we know that there is no value (count = 0). | ||
|
||
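The second case can be demonstrated with a small sketch: two different sequences of measurements produce the same cumulative maximums, so no converter could tell them apart. The helper name `cumulative_max` and the sample values are made up for illustration.

```python
def cumulative_max(intervals):
    """Return the running maximum after each interval's measurements."""
    result, running = [], float("-inf")
    for values in intervals:
        running = max([running, *values])
        result.append(running)
    return result


# Two hypothetical histories: measurements in (T0, T2], then in (T2, T3].
history_a = [[20], [3]]   # true max during (T2, T3] is 3
history_b = [[20], [20]]  # true max during (T2, T3] is 20

# Both produce the same cumulative view, so the per-interval max is lost.
same_view = cumulative_max(history_a) == cumulative_max(history_b)
```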
So here are some suggestions that we encourage SDK implementers to consider: | ||
|
||
* You probably don't want to encourage your users to do Cumulative->Delta | ||
conversion. Actually, you might want to discourage them from doing this. | ||
* If you have to do Cumulative->Delta conversion and you encounter min/max, | ||
rather than dropping the data on the floor, you might want to convert it to | ||
something useful - e.g. [Gauge](./datamodel.md#gauge). | ||
|
||
### Memory management | ||
|
||
Memory management is a wide topic. Here we will only cover some of the most | ||
important things for an OpenTelemetry SDK. | ||
|
||
**Choose a better design so the SDK has fewer things to remember**, and avoid | ||
keeping things in memory unless it is really necessary. One good example is the | ||
choice of [aggregation temporality](#aggregation-temporality). | ||
|
||
**Design a better memory layout**, so the storage is efficient and accessing the | ||
storage can be fast. This is normally specific to the target programming | ||
language and platform. For example, aligning the memory to the CPU cache line, | ||
keeping hot data close together, keeping the memory close to the | ||
hardware (e.g. non-paged pool, | ||
[NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access)). | ||
|
||
**Pre-allocate and pool the memory**, so the SDK doesn't have to allocate memory | ||
on-the-fly. This is especially useful for language runtimes that have garbage | ||
collectors, as it helps keep the hot path in the code from triggering garbage | ||
collection. | ||
|
||
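A minimal sketch of the pooling idea follows. `PointPool` is a hypothetical class, and Python is used only for readability: the actual allocation behavior depends heavily on the language runtime, so treat this as an outline of the pattern rather than a performance recipe.

```python
class PointPool:
    """Hypothetical fixed-size pool of reusable [count, sum] data points."""

    def __init__(self, capacity: int) -> None:
        # All points are allocated once, up front.
        self._free = [[0, 0.0] for _ in range(capacity)]

    def acquire(self):
        # Return None instead of allocating when the pool is exhausted, so
        # the caller decides how to degrade (e.g. drop or combine data).
        return self._free.pop() if self._free else None

    def release(self, point) -> None:
        # Reset the point so it can be handed out again.
        point[0], point[1] = 0, 0.0
        self._free.append(point)


pool = PointPool(capacity=2)
first = pool.acquire()
second = pool.acquire()
exhausted = pool.acquire()  # None: no allocation on the hot path
first[0], first[1] = 3, 180.0
pool.release(first)
reused = pool.acquire()     # the same object as `first`, reset to zero
```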
**Limit the memory usage, and handle critical memory conditions.** The general | ||
expectation is that a telemetry SDK should not fail the application. This can be | ||
done via some dimension-capping algorithm - e.g. start to combine/drop some data | ||
points when the SDK hits the memory limit, and provide a mechanism to report the | ||
data loss. | ||
|
||
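One simple dimension-capping approach is sketched below, as an assumption rather than a prescribed algorithm: once a hypothetical `max_streams` limit is reached, measurements for new attribute sets are combined into a single overflow stream, and a counter reports how many measurements lost their attributes. The `CappedCounter` class and the `overflow` attribute name are made up for this example.

```python
# Hypothetical attribute set that collects everything beyond the limit.
OVERFLOW = (("overflow", "true"),)


class CappedCounter:
    """Hypothetical counter with a cap on the number of attribute sets."""

    def __init__(self, max_streams: int) -> None:
        self._max_streams = max_streams
        self.sums = {}
        self.overflowed = 0  # measurements whose attributes were dropped

    def add(self, value: int, **attributes: str) -> None:
        key = tuple(sorted(attributes.items()))
        if key not in self.sums and len(self.sums) >= self._max_streams:
            # At the limit: keep the total correct, but give up the
            # attributes and record that data was lost.
            self.overflowed += 1
            key = OVERFLOW
        self.sums[key] = self.sums.get(key, 0) + value


counter = CappedCounter(max_streams=2)
for status in ["200", "500", "404", "403", "200"]:
    counter.add(1, status=status)
```

With a limit of two, `200` and `500` keep their own streams, while `404` and `403` are folded into the overflow stream and counted in `counter.overflowed`.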
**Provide configurations to the application owner.** The answer to _"what is an | ||
efficient memory usage"_ ultimately depends on the goal of the application | ||
owner. For example, the application owners might want to spend more memory in | ||
order to keep more permutations of metrics dimensions, or they might want to use | ||
memory aggressively for certain dimensions that are important, and keep a | ||
conservative limit for dimensions that are less important. |