Semantic conventions for telemetry pipeline monitoring #238

Closed
wants to merge 15 commits into from
Binary file added text/images/otel-pipeline-monitoring.png
181 changes: 181 additions & 0 deletions text/metrics/0238-pipeline-monitoring.md
@@ -0,0 +1,181 @@
# OpenTelemetry Telemetry Pipeline metrics

Propose a uniform standard for telemetry pipeline metrics generated by
OpenTelemetry SDKs and Collectors with support for several levels of
detail.

**WIP**: This document has been edited recently, based on reviewer
feedback. Since it has changed substantially, I removed a lot of
text. I will restore this document after sharing the revisions with
reviewers.

## Motivation

OpenTelemetry desires to standardize conventions for the metrics
emitted by SDKs about success and failure of telemetry reporting. At
the same time, the OpenTelemetry Collector is becoming a stable and
critical part of the ecosystem, and it has existing conventions which
are expected to connect with metrics emitted by SDKs.

We use the term "pipeline" to describe an arrangement of system
components which produce, consume, and process telemetry on its way
from the point of origin to the endpoint(s) in its journey.

## Explanation

### Detailed design

The proposed metric instrument would be named distinctly depending on
whether it is a collector or an SDK, to prevent accidental aggregation
of these timeseries. The specified counter names would be:

- `otelsdk.producer.items`: count of successful and failed items of
telemetry produced, by signal type, by an OpenTelemetry SDK.
- `otelcol.receiver.items`: count of successful and failed items of
telemetry received, by signal type, by an OpenTelemetry Collector
receiver component.
- `otelcol.processor.items`: count of successful and failed items of
telemetry processed, by signal type, by an OpenTelemetry Collector
processor component.
- `otelcol.exporter.items`: count of successful and failed items of
telemetry exported, by signal type, by an OpenTelemetry Collector
exporter component.

Member (review comment on the `otelcol.receiver.items` line):

If I have a connector configured, I assume I get both the otelcol.exporter.items and otelcol.receiver.items metrics emitted for the connector, right?

Let's say I configured the Count connector on a traces pipeline, as described in the example in the component's README. The count connector then accepts traces on the traces/in pipeline and creates metrics on the metrics/out pipeline.

I imagine the otelcol.exporter.items metric for the count connector would count the incoming spans on the trace pipeline. What would be the otel.outcome for those correctly consumed spans? Would it be consumed or rather unsampled? These logs aren't shipped anywhere by the component, they are "swallowed" by the connector if I understand correctly.

I imagine the otelcol.receiver.items metric for the count connector would count the metrics created on the metrics pipeline, with the otel.outcome set to consumed.

Contributor Author (reply):

Is the consume operation synchronous? I think the traces/in pipeline will wait until the metrics/out pipeline finishes the consume operation, so the outcome for traces/in will depend on the outcome for metrics/out. If metrics/out fails w/ a retryable status, maybe the producer will retry.

Since the count connector can produce more or fewer metric data points than arriving spans, I do not expect the item counts to match between the exporter and receiver, but I think the outcomes could match for synchronous operations. If the operation is asynchronous, the rules discussed in this proposal would apply -- the traces/in might see consumed while the metrics/out sees some sort of failure.

I don't see any problems, per se, just that the monitoring equations for connectors don't apply. I can't assume that `items_in == items_dropped + items_out`.
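
The balance check mentioned in the review reply, `items_in == items_dropped + items_out`, can be sketched as follows. This is a minimal illustration, assuming the per-component totals have already been read from the counters; the function name `is_balanced` and the sample numbers are hypothetical and not part of the proposal.

```python
def is_balanced(items_in: int, items_dropped: int, items_out: int) -> bool:
    """Return True when every incoming item is accounted for as dropped or sent on."""
    return items_in == items_dropped + items_out


# Hypothetical filtering processor: 100 spans in, 30 dropped by configuration,
# 70 passed to the next stage.
processor_ok = is_balanced(items_in=100, items_dropped=30, items_out=70)

# Hypothetical count connector: 100 spans in, 4 metric data points out.
# The two sides count different kinds of items, so the check does not hold.
connector_ok = is_balanced(items_in=100, items_dropped=0, items_out=4)
```

For the connector case the check fails even though nothing went wrong, which is why the equation cannot be used to monitor connectors.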

### Recommended conventional attributes

- `otel.success` (boolean): This is true or false depending on whether the
component considers the outcome a success or a failure.
- `otel.outcome` (string): This describes the outcome in a more specific
way than `otel.success`, with recommended values specified below.
- `otel.signal` (string): This is the name of the signal (e.g., "logs",
"metrics", "traces").
- `otel.name` (string): Name of the component in a pipeline.
- `otel.pipeline` (string): Name of the pipeline in a collector.

Member (suggested change on the `otel.name` line): rename the attribute to `otel.component`, so that the line reads:

- `otel.component` (string): Name of the component in a pipeline.
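
To illustrate how the recommended attributes identify a time series, here is a minimal sketch that uses an in-memory `Counter` as a stand-in for a metrics SDK. The helper `record_items`, the component name `otlp`, and the pipeline name `traces` are hypothetical; only the metric and attribute names come from the proposal text.

```python
from collections import Counter

# Hypothetical in-memory stand-in for an SDK counter: each distinct
# combination of metric name and attribute values is one time series.
series = Counter()


def record_items(metric, count, *, success, outcome, signal, name, pipeline=None):
    """Add `count` items to the series identified by the attributes."""
    attrs = {
        "otel.success": success,
        "otel.outcome": outcome,
        "otel.signal": signal,
        "otel.name": name,
    }
    if pipeline is not None:  # only meaningful inside a collector
        attrs["otel.pipeline"] = pipeline
    key = (metric, tuple(sorted(attrs.items())))
    series[key] += count
    return key


# Two successful batches land in the same series; a rejected batch gets its own.
ok_key = record_items("otelcol.exporter.items", 10, success=True,
                      outcome="accepted", signal="traces",
                      name="otlp", pipeline="traces")
ok_key_again = record_items("otelcol.exporter.items", 5, success=True,
                            outcome="accepted", signal="traces",
                            name="otlp", pipeline="traces")
bad_key = record_items("otelcol.exporter.items", 2, success=False,
                       outcome="rejected", signal="traces",
                       name="otlp", pipeline="traces")
```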

### Specified `otel.outcome` attribute values

The `otel.outcome` attribute indicates extra information about a
success or failure. A set of standard conventional attribute values
is supplied and is considered a closed set. If these outcomes do not
accurately explain the reason for a success or failure outcome, they
SHOULD be extended by OpenTelemetry.

For success=true:

- `accepted`: Indicates a normal, synchronous request success case.
The item was consumed by the next stage of the pipeline, which
returned success. Note the item could have been suppressed by a
subsequent component, but as far as this component knows, the
request was successful.
- `suppressed:<any other outcome>`: Used when the true outcome is not
known at the time of counting and the component intentionally
returns success to its producer. Examples are given below.

For both success=true and success=false, there is a special outcome
indicating that items did not reach the next stage in the pipeline,
considered "dropped". When comparing pipeline metrics from one stage
to the next, items dropped by a component are expected not to appear
in the totals of the subsequent pipeline stage.

- `dropped`: Processors may use this to indicate both success and
failure. Examples of success include sampling processors and filtering
processors, which successfully avoid sending data based on
configuration. For all components, dropped with success=false
indicates that the component introduced an original failure and did
not send to the next stage in the pipeline.

For success=false, transient and potentially retryable:

- `deadline_exceeded`: The item was in the process of being sent but the request
timed out, or its deadline was exceeded.
- `resource_exhausted`: The item was handled by the next stage of the
pipeline, which returned an error code indicating that it was
overloaded. If the resource being exhausted is local and the item
was not handled by the next stage of the pipeline, use `dropped`.
- `retryable`: The item was handled by the next stage of the pipeline,
which returned a retryable error status not covered by any of the
above values.

For success=false, permanent category:

- `rejected`: The item was handled by the next stage of the pipeline,
which returned a permanent error status or partial success status
indicating that some items could not be accepted.
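
The grouping of failure outcomes above can be sketched as a small lookup. This is only an illustration of the categories as listed; the function name `failure_category` and the returned labels are hypothetical, and `dropped` is reported as its own category because the text does not class it as transient or permanent.

```python
TRANSIENT = {"deadline_exceeded", "resource_exhausted", "retryable"}
PERMANENT = {"rejected"}


def failure_category(outcome: str) -> str:
    """Classify a success=false outcome as 'transient', 'permanent', or 'dropped'."""
    if outcome in TRANSIENT:
        return "transient"
    if outcome in PERMANENT:
        return "permanent"
    if outcome == "dropped":
        return "dropped"
    raise ValueError(f"not a failure outcome: {outcome!r}")
```

A monitoring rule could, for example, alert on permanent failures immediately while tolerating short bursts of transient ones.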


#### Success, Outcome matrix

| Success | Outcome | Meaning |
|---------|------------------------------|-------------------------------------------------------------------|
| true | accepted | Synchronous send succeeded |
| true | dropped | Dropped by intention |
| false | dropped | Producer saw the component return failure, request was not sent |
| false | deadline_exceeded | Producer saw the component return failure, request timed out |
| false | resource_exhausted | Producer saw the component return failure, insufficient resources |
| false | retryable | Producer saw the component return other non-permanent condition |
| false | rejected | Producer saw the component return a permanent condition |
| true | suppressed:accepted | Producer saw success; eventually accepted |
| true | suppressed:dropped | Producer saw success; request was not sent |
| true | suppressed:deadline_exceeded | Producer saw success; request sent, timed out |
| true | suppressed:resource_exhausted | Producer saw success; request sent, insufficient resources |
| true | suppressed:retryable | Producer saw success; request sent, other non-permanent condition |
| true | suppressed:rejected | Producer saw success; request sent, permanent condition |
| true | suppressed:unknown | Producer saw success; no effort to report true outcome |
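
The rows of the matrix can be expressed as a validity check on `(otel.success, otel.outcome)` pairs. The sketch below is an assumption-laden illustration, not part of the proposal: the helper names `split_outcome` and `is_valid_pair` are hypothetical, and the prefix is spelled `suppressed:` as in the attribute value list.

```python
BASE_OUTCOMES = {
    "accepted", "dropped", "deadline_exceeded",
    "resource_exhausted", "retryable", "rejected",
}


def split_outcome(outcome):
    """Split 'suppressed:<x>' into (True, '<x>'); other values give (False, outcome)."""
    prefix, sep, rest = outcome.partition(":")
    if sep and prefix == "suppressed":
        return True, rest
    return False, outcome


def is_valid_pair(success, outcome):
    """Check a (success, outcome) pair against the rows of the matrix."""
    suppressed, base = split_outcome(outcome)
    if suppressed:
        # Suppressed outcomes are always reported with success=true.
        return success and base in BASE_OUTCOMES | {"unknown"}
    if base == "accepted":
        return success
    if base == "dropped":
        return True  # appears with both success=true and success=false
    return (not success) and base in BASE_OUTCOMES
```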

#### Examples of each outcome

##### Success, Accepted

This is the common success case. The item(s) were sent to the next
stage in the pipeline while blocking the producer.

##### Success, Dropped

A processor was configured with instructions not to pass certain data.

##### Success, Suppressed-Accepted

A component returned success to its producer, and later the outcome
was successful.

##### Failure, Dropped and Success, Suppressed-Dropped

(If suppressed: A component returned success to its producer, then ...)

The component never sent the item(s) due to limits in effect. For
example, shutdown was ordered and the queue could not be drained in
time due to a limit on parallelism.

##### Failure, Deadline exceeded and Success, Suppressed-Deadline exceeded

(If suppressed: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the item(s) did not
succeed before the deadline expired. If there were attempts to retry,
this is the outcome of the final attempt.

##### Failure, Resource exhausted and Success, Suppressed-Resource exhausted

(If suppressed: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the consumer
indicated its (or its consumers') resources were exceeded. If there
were attempts to retry, this is the outcome of the final attempt.

##### Failure, Retryable and Success, Suppressed-Retryable

(If suppressed: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the consumer
indicated some kind of transient condition other than deadline- or
resource-related (e.g., connection not accepted). If there were
attempts to retry, this is the outcome of the final attempt.

##### Failure, Rejected and Success, Suppressed-Rejected

(If suppressed: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the consumer returned
a permanent error.
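
The suppressed cases above can be sketched with a component that acknowledges items before it sends them. This is a minimal simulation under stated assumptions: the class `QueueingComponent`, the `fake_send` function, and the in-memory `counts` table are all hypothetical, and a real component would record these counts through a metrics SDK.

```python
from collections import Counter, deque

counts = Counter()  # keyed by (otel.success, otel.outcome)


class QueueingComponent:
    """Hypothetical component that acknowledges items before sending them."""

    def __init__(self, send):
        self._send = send  # callable returning the outcome of the real send
        self._queue = deque()

    def consume(self, item) -> bool:
        """Queue the item and report success to the producer immediately."""
        self._queue.append(item)
        return True

    def flush(self) -> None:
        """Send queued items and count each under 'suppressed:<true outcome>'."""
        while self._queue:
            outcome = self._send(self._queue.popleft())
            counts[(True, f"suppressed:{outcome}")] += 1


def fake_send(item):
    # Stand-in for the next pipeline stage: even items succeed, odd items fail.
    return "accepted" if item % 2 == 0 else "rejected"


component = QueueingComponent(fake_send)
acks = [component.consume(i) for i in range(4)]
component.flush()
```

The producer sees success for all four items, while the counter records which of them were eventually accepted or rejected.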