open-telemetry · jmacd · Oct 28, 2023 · Oct 31, 2023 · Oct 31, 2023 · Dec 15, 2023
diff --git a/text/images/otel-pipeline-monitoring.png b/text/images/otel-pipeline-monitoring.png
diff --git a/text/metrics/0238-pipeline-monitoring.md b/text/metrics/0238-pipeline-monitoring.md
@@ -0,0 +1,181 @@
+# OpenTelemetry Telemetry Pipeline metrics
+
+Propose a uniform standard for telemetry pipeline metrics generated by
+OpenTelemetry SDKs and Collectors with support for several levels of
+detail.
+
+**WIP**: This document has been edited recently, based on reviewer
+feedback.  Since it has changed substantially, I removed a lot of
+text.  I will restore this document after sharing the revisions with
+reviewers.
+
+## Motivation
+
+OpenTelemetry desires to standardize conventions for the metrics
+emitted by SDKs about success and failure of telemetry reporting. At
+the same time, the OpenTelemetry Collector is becoming a stable and
+critical part of the ecosystem, and it has existing conventions which
+are expected to connect with metrics emitted by SDKs.
+
+We use the term "pipeline" to describe an arrangement of system
+components which produce, consume, and process telemetry on its way
+from the point of origin to the endpoint(s) in its journey.
+
+## Explanation
+
+### Detailed design
+
+The proposed metric instrument would be named distinctly depending on
+whether it is emitted by a Collector or an SDK, to prevent accidental
+aggregation of these timeseries.  The specified counter names would be:
+
+- `otelsdk.producer.items`: count of successful and failed items of
+  telemetry produced, by signal type, by an OpenTelemetry SDK.
+- `otelcol.receiver.items`: count of successful and failed items of
+  telemetry received, by signal type, by an OpenTelemetry Collector
+  receiver component.
+- `otelcol.processor.items`: count of successful and failed items of
+  telemetry processed, by signal type, by an OpenTelemetry Collector
+  processor component.
+- `otelcol.exporter.items`: count of successful and failed items of
+  telemetry exported, by signal type, by an OpenTelemetry Collector
+  exporter component.
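As a reading aid, the naming rule above can be sketched as a small lookup. This is only an illustration: the `COUNTER_NAMES` table and the `counter_name` helper are hypothetical names, not part of the proposal or of any OpenTelemetry API. Only the four metric names come from the text.

```python
# Hypothetical lookup from component kind to the counter names proposed above.
# The kinds ("sdk", "receiver", ...) are illustrative labels, not spec values.
COUNTER_NAMES = {
    "sdk": "otelsdk.producer.items",
    "receiver": "otelcol.receiver.items",
    "processor": "otelcol.processor.items",
    "exporter": "otelcol.exporter.items",
}


def counter_name(kind: str) -> str:
    """Return the proposed counter name for a component kind."""
    try:
        return COUNTER_NAMES[kind]
    except KeyError:
        raise ValueError(f"unknown component kind: {kind!r}") from None


def is_sdk_metric(name: str) -> bool:
    """SDK and Collector counters use distinct prefixes, so they never merge."""
    return name.startswith("otelsdk.")
```

Because the SDK and Collector names do not share a prefix, a query over `otelcol.*` cannot accidentally sum SDK-side counts.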
+
+### Recommended conventional attributes
+
+- `otel.success` (boolean): This is true or false depending on whether the
+  component considers the outcome a success or a failure.
+- `otel.outcome` (string): This describes the outcome in a more specific
+  way than `otel.success`, with recommended values specified below.
+- `otel.signal` (string): This is the name of the signal (e.g., "logs",
+  "metrics", "traces").
+- `otel.name` (string): Name of the component in a pipeline.
- `otel.name` (string): Name of the component in a pipeline.
+- `otel.component` (string): Name of the component in a pipeline.
+- `otel.pipeline` (string): Name of the pipeline in a collector.
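To make the attribute list concrete, here is a minimal sketch of how one data point might be labeled and counted. It uses only the standard library and an in-memory `Counter`, not a real metrics SDK. The helpers `item_attributes` and `record` are hypothetical, and the sketch assumes the component-name attribute is spelled `otel.component` (the list above shows it being renamed from `otel.name`).

```python
from collections import Counter


def item_attributes(success, outcome, signal, component, pipeline=None):
    """Build the conventional attribute set for one counter update.

    Assumption: the component name is carried in `otel.component`.
    `otel.pipeline` is only set when a pipeline name is supplied,
    since it applies to Collector pipelines.
    """
    attrs = {
        "otel.success": bool(success),
        "otel.outcome": outcome,
        "otel.signal": signal,
        "otel.component": component,
    }
    if pipeline is not None:
        attrs["otel.pipeline"] = pipeline
    return attrs


def record(counter, attrs, n):
    """Add n items to the timeseries identified by attrs."""
    # Sort by attribute name so the key is independent of insertion order.
    counter[tuple(sorted(attrs.items()))] += n
```

Each distinct attribute combination becomes its own timeseries, which is why the attribute values need to stay within a small, well-known set.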
+
+### Specified `otel.outcome` attribute values
+
+The `otel.outcome` attribute indicates extra information about a
+success or failure.  A set of standard conventional attribute values
+is supplied and is considered a closed set.  If these outcomes do not
+accurately explain the reason for a success or failure outcome, the
+set SHOULD be extended by OpenTelemetry.
+
+For success=true:
+
+- `accepted`: Indicates a normal, synchronous request success case.
+  The item was consumed by the next stage of the pipeline, which
+  returned success.  Note the item could have been suppressed by a
+  subsequent component, but as far as this component knows, the
+  request was successful.
+- `suppressed:<any other outcome>`: Used when the true
+  outcome is not known at the time of counting, and the component
+  intentionally returns success to its producer.  Examples are given
+  below.
+
+For both success=true and success=false, there is a special outcome
+indicating items did not reach the next stage in the pipeline,
+considered "dropped".  When comparing pipeline metrics from one stage
+to the next, items dropped by a component are expected not
+to appear in the totals of the subsequent stage.
+
+- `dropped`: Processors may use this to indicate both success and
+  failure.  Examples of success include sampling processors and filtering
+  processors, which successfully avoid sending data based on
+  configuration.  For all components, dropped with success=false
+  indicates that the component introduced an original failure and did
+  not send to the next stage in the pipeline.
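The stage-to-stage comparison described above can be sketched as simple arithmetic. The function below is a hypothetical illustration. It assumes counts for one component are keyed by `(success, outcome)`, and that `suppressed:dropped` items also never reached the next stage.

```python
# Outcomes for which the item was never sent to the next stage.
# Treating "suppressed:dropped" this way is an assumption of this sketch.
NOT_SENT = {"dropped", "suppressed:dropped"}


def expected_next_stage_total(counts):
    """Estimate how many items the next stage should report in total.

    counts maps (success, outcome) -> number of items for one component.
    """
    total = sum(counts.values())
    not_sent = sum(n for (_, outcome), n in counts.items() if outcome in NOT_SENT)
    return total - not_sent


stage_counts = {
    (True, "accepted"): 90,
    (True, "dropped"): 7,    # e.g. removed by a sampling processor
    (False, "dropped"): 3,   # e.g. a local failure before sending
}
print(expected_next_stage_total(stage_counts))
```

A persistent gap between this estimate and the next stage's own totals would suggest loss between the two components.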
+
+For success=false, transient and potentially retryable:
+
+- `deadline_exceeded`: The item was in the process of being sent but the request
+  timed out, or its deadline was exceeded.
+- `resource_exhausted`: The item was handled by the next stage of the
+  pipeline, which returned an error code indicating that it was
+  overloaded.  If the resource being exhausted is local and the item
+  was not handled by the next stage of the pipeline, use `dropped`.
+- `retryable`: The item was handled by the next stage of the pipeline,
+  which returned a retryable error status not covered by any of the
+  above values.
+
+For success=false, permanent category:
+
+- `rejected`: The item was handled by the next stage of the pipeline,
+  which returned a permanent error status or partial success status
+  indicating that some items could not be accepted.
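A minimal sketch of how a consumer of these metrics might group the failure outcomes listed above into the transient and permanent categories. The helper names are hypothetical. Stripping the `suppressed:` prefix before classifying is an assumption of this sketch, not something the text requires.

```python
TRANSIENT_OUTCOMES = {"deadline_exceeded", "resource_exhausted", "retryable"}
PERMANENT_OUTCOMES = {"rejected"}
SUPPRESSED_PREFIX = "suppressed:"


def base_outcome(outcome: str) -> str:
    """Strip the suppressed: prefix, if present, to get the underlying outcome."""
    if outcome.startswith(SUPPRESSED_PREFIX):
        return outcome[len(SUPPRESSED_PREFIX):]
    return outcome


def is_transient(outcome: str) -> bool:
    """True for outcomes the text describes as potentially retryable."""
    return base_outcome(outcome) in TRANSIENT_OUTCOMES


def is_permanent(outcome: str) -> bool:
    """True for outcomes in the permanent failure category."""
    return base_outcome(outcome) in PERMANENT_OUTCOMES
```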
+
+
+#### Success, Outcome matrix
+
+| Success | Outcome                       | Meaning                                                           |
+|---------|-------------------------------|-------------------------------------------------------------------|
+| true    | accepted                      | Synchronous send succeeded                                        |
+| true    | dropped                       | Dropped by intention                                              |
+| false   | dropped                       | Producer saw the component return failure, request was not sent   |
+| false   | deadline_exceeded             | Producer saw the component return failure, request timed out      |
+| false   | resource_exhausted            | Producer saw the component return failure, insufficient resources |
+| false   | retryable                     | Producer saw the component return other non-permanent condition   |
+| false   | rejected                      | Producer saw the component return a permanent condition           |
+| true    | suppressed:accepted           | Producer saw success; eventually accepted                         |
+| true    | suppressed:dropped            | Producer saw success; request was not sent                        |
+| true    | suppressed:deadline_exceeded  | Producer saw success; request sent, timed out                     |
+| true    | suppressed:resource_exhausted | Producer saw success; request sent, insufficient resources        |
+| true    | suppressed:retryable          | Producer saw success; request sent, other non-permanent condition |
+| true    | suppressed:rejected           | Producer saw success; request sent, permanent condition           |
+| true    | suppressed:unknown            | Producer saw success; no effort to report true outcome            |
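Since the outcome set is closed, the matrix can be checked mechanically. The sketch below is a hypothetical validator, using the `suppressed:` prefix spelling from the outcome list. It accepts exactly the `(success, outcome)` pairs that appear in the matrix and nothing else.

```python
FAILURE_OUTCOMES = {
    "dropped",
    "deadline_exceeded",
    "resource_exhausted",
    "retryable",
    "rejected",
}
SUCCESS_OUTCOMES = {"accepted", "dropped"}
# Outcomes that may follow the suppressed: prefix, per the matrix rows.
SUPPRESSIBLE = FAILURE_OUTCOMES | {"accepted", "unknown"}
PREFIX = "suppressed:"


def is_valid_pair(success: bool, outcome: str) -> bool:
    """Return True if (success, outcome) is one of the matrix rows."""
    if outcome.startswith(PREFIX):
        # Suppressed outcomes are always reported with success=true.
        return bool(success) and outcome[len(PREFIX):] in SUPPRESSIBLE
    if success:
        return outcome in SUCCESS_OUTCOMES
    return outcome in FAILURE_OUTCOMES
```

A check like this could run in a test suite for a component's instrumentation to catch combinations such as `success=true` with `rejected`.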
+
+#### Examples of each outcome
+
+##### Success, Accepted
+
+This is the common success case.  The item(s) were sent to the next
+stage in the pipeline while blocking the producer.
+
+##### Success, Dropped
+
+A processor was configured with instructions not to pass certain data.
+
+##### Success, Suppressed-Accepted
+
+A component returned success to its producer, and later the outcome
+was successful.
+
+##### Failure, Dropped and Success, Suppressed-Dropped
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component never sent the item(s) due to limits in effect.  For
+example, shutdown was ordered and the queue could not be drained in
+time due to a limit on parallelism.
+
+##### Failure, Deadline exceeded and Success, Suppressed-Deadline exceeded
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the item(s) did not
+succeed before the deadline expired.  If there were attempts to retry,
+this is the outcome of the final attempt.
+
+##### Failure, Resource exhausted and Success, Suppressed-Resource exhausted
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the consumer
+indicated its (or its consumers') resources were exceeded.  If there
+were attempts to retry, this is the outcome of the final attempt.
+
+##### Failure, Retryable and Success, Suppressed-Retryable
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the consumer
+indicated some kind of transient condition other than deadline- or
+resource-related (e.g., connection not accepted).  If there were
+attempts to retry, this is the outcome of the final attempt.
+
+##### Failure, Rejected and Success, Suppressed-Rejected
+
+(If suppressed: A component returned success to its producer, then ...)
+
+The component attempted sending the item(s), but the consumer
+returned a permanent error.
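The examples above share one rule: the counted outcome is that of the final attempt, and a component that already returned success to its producer reports it under the `suppressed:` prefix. A minimal sketch of that rule follows. The function name and its list-of-attempts input are hypothetical, and the sketch assumes that no attempts at all means the item was never sent (`dropped`).

```python
def counted_outcome(attempt_outcomes, suppressed):
    """Return the (success, outcome) pair to count for one item.

    attempt_outcomes: outcome of each send attempt, in order; empty if the
        item was never sent (treated here as "dropped").
    suppressed: True if the component already returned success to its
        producer before the true outcome was known.
    """
    final = attempt_outcomes[-1] if attempt_outcomes else "dropped"
    if suppressed:
        # The producer saw success regardless of what happened later.
        return True, f"suppressed:{final}"
    return final == "accepted", final
```

For instance, an item retried after a `retryable` error that then times out is counted once as `deadline_exceeded`, not once per attempt.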