diff --git a/oteps/metrics/0131-otlp-export-behavior.md b/oteps/metrics/0131-otlp-export-behavior.md
new file mode 100644
index 000000000..29aeedeb5
--- /dev/null
+++ b/oteps/metrics/0131-otlp-export-behavior.md
@@ -0,0 +1,43 @@
+# OTLP Exporters Configurable Export Behavior
+
+Add support for configurable export behavior in OTLP exporters.
+
+The required behaviors are 1) exporting cumulative values since start time by default, and 2) exporting delta values per collection interval when configured.
+
+## Motivation
+
+1. **Export behavior should be configurable**: Metric backends such as Prometheus, Cortex, and other backends that ingest Prometheus time series through the Prometheus remote write API require cumulative values for cumulative and additive metrics at each collection interval. To export metrics generated by the SDK through the Collector to these backends, the values arriving from the SDK should be cumulative. Note that, in comparison, backends like Statsd expect delta values for each collection interval. To support different backend requirements, OTLP metric export behavior needs to be configurable, with cumulative values exported by default. See discussion in [#731](https://github.com/open-telemetry/opentelemetry-specification/issues/731).
+2. **Cumulative export should be the default behavior since it is more reliable**: Cumulative export also addresses the problem of missing delta values for an UpDownCounter. The final consumer of the UpDownCounter metrics is almost always interested in the cumulative value. If the Metrics SDK exports deltas and lets the consumer aggregate them into cumulative values, then any deltas lost in transit will lead to inaccurate final values. This loss may affect whether an alert fires. On the other hand, exporting cumulative values guarantees that only resolution is lost; the value received by the final consumer will eventually be correct.
+   1. *Note:* The [Metrics SIG](https://docs.google.com/document/d/1LfDVyBJlIewwm3a0JtDtEjkusZjzQE3IAix8b0Fxy3Y/edit#heading=h.fxqkpi2ya3br) *July 23 and July 30 meetings concluded that cumulative export behavior is more reliable.* For example, Bogdan Drutu in [#725](https://github.com/open-telemetry/opentelemetry-specification/issues/725) notes “When exporting delta values of an UpdownCounter instrument, the export pipeline becomes a single point of failure for the alerts, any dropped "delta" will influence the "current" value of the metric in an undefined way.”
+
+## Explanation
+
+To support Prometheus backends, which use cumulative values, as well as other backends that use delta values, the SDK needs to be configurable and support an OTLP exporter that exports cumulative values by default and delta values when configured. The implication is that the OTLP metric protocol should support both cumulative and delta reporting strategies.
+
+Users should be allowed to declare an environment variable or configuration field that determines this setting for OTLP exporters.
+
+## Internal details
+
+An OTLP exporter can report the export behavior it needs to the Metrics SDK. The SDK can then merge the previous state of each metric with its current value and return the appropriate values to the exporter.
+
+Configurable export behavior is already coded in the Metrics Processor component in the [Go SDK](https://github.com/open-telemetry/opentelemetry-go/pull/840). However, this functionality is hardcoded today and would need to be rewritten to handle user-defined configuration. See the OTLP metrics definition in [PR #193](https://github.com/open-telemetry/opentelemetry-proto/pull/193), which supports both export behaviors.
+
+## Trade-offs and mitigations
+
+High memory usage: To support cumulative exports, the SDK needs to maintain state for each cumulative metric. This means users with high-cardinality metrics can experience high memory usage.
+
+The high-cardinality metrics use case could be addressed by adding the metrics aggregation processor in the Collector. This would enable the Collector, when configured as an Agent, to support converting delta OTLP to cumulative OTLP. This functionality requires a single agent for each metric-generating client so that all delta values of a metric are converted by the same Collector instance.
+
+## Prior art and alternatives
+
+One solution that has been discussed is to convert deltas to cumulative values in the Collector, both when it runs as an agent and when it runs as a standalone service. However, supporting conversion in a standalone Collector requires a routing mechanism across all Collector instances to ensure that delta values of the same cumulative metric are aggregated by the same Collector instance.
+
+## Open questions
+
+As stated in the previous section, delta-to-cumulative conversion in the Collector is needed to support Prometheus-type backends. This may be necessary in the Collector in the future because the Collector may also accept metrics from other sources that report delta values. On the other hand, if sources are reporting cumulative values, cumulative-to-delta conversion is needed to support Statsd-type backends.
+
+The future implementation for conversions in the Collector is still under discussion. There is a proposal to add a [Metric Aggregation Processor](https://github.com/open-telemetry/opentelemetry-collector/issues/1422) in the Collector, which recommends a solution for delta-to-cumulative conversion.
+
+## Future possibilities
+
+A possible future improvement is to support dynamic configuration from a configuration server that determines the appropriate export strategy for OTLP clients at startup.
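
The behavior described under "Internal details" and the reliability argument from "Motivation" can be illustrated with a minimal Python sketch. It is not the Go SDK's Metrics Processor: the environment variable name `EXAMPLE_OTLP_EXPORT_BEHAVIOR`, the `export_behavior` helper, and the `SumState` class are all hypothetical names chosen for this example, and the sketch handles only simple additive values.

```python
import os

# Hypothetical variable name, used only for this sketch; the OTEP does not name one.
ENV_VAR = "EXAMPLE_OTLP_EXPORT_BEHAVIOR"


def export_behavior(environ=os.environ):
    """Return "cumulative" (the default) or "delta" based on the environment."""
    value = environ.get(ENV_VAR, "cumulative").strip().lower()
    if value not in ("cumulative", "delta"):
        raise ValueError(f"unsupported export behavior: {value!r}")
    return value


class SumState:
    """Merges each interval's deltas with previous state when exporting cumulatively."""

    def __init__(self, behavior="cumulative"):
        self.behavior = behavior
        # One running total per metric key; this is the memory cost noted
        # under "Trade-offs and mitigations" for high-cardinality metrics.
        self.totals = {}

    def collect(self, deltas):
        """Take one collection interval's deltas and return the values to export."""
        if self.behavior == "delta":
            # Delta export needs no state: pass the interval's values through.
            return dict(deltas)
        for key, delta in deltas.items():
            self.totals[key] = self.totals.get(key, 0) + delta
        return dict(self.totals)


# Three collection intervals of an UpDownCounter-style metric; the true final value is 7.
intervals = [{"queue_size": 5}, {"queue_size": -2}, {"queue_size": 4}]

cumulative_state = SumState("cumulative")
cumulative_exports = [cumulative_state.collect(d) for d in intervals]

delta_state = SumState("delta")
delta_exports = [delta_state.collect(d) for d in intervals]

# Simulate losing the second export in transit.
received_cumulative = [cumulative_exports[0], cumulative_exports[2]]
received_deltas = [delta_exports[0], delta_exports[2]]

# With cumulative export the consumer simply reads the latest value.
final_from_cumulative = received_cumulative[-1]["queue_size"]
# With delta export the consumer must sum what it received.
final_from_deltas = sum(d["queue_size"] for d in received_deltas)
```

In this run the consumer of cumulative exports still ends at 7 after the loss, while the consumer summing deltas ends at 9, which is the failure mode the Motivation section describes. The sketch also shows why only the cumulative mode carries per-metric state in the SDK.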