Semantic conventions for telemetry pipeline monitoring #238
Closed
15 commits
- cef7595 wip count success/failure/drops (jmacd)
- 4d3cfae rough draft (jmacd)
- 3a9ef27 updates (jmacd)
- 0062480 Wip (jmacd)
- b39b732 TODO WIP too much text now (jmacd)
- a117779 wip update (jmacd)
- 3ce0510 wip2 (jmacd)
- bbcf391 specify values (jmacd)
- 1334191 the long tail (start) (jmacd)
- f9d1e9a draft with no more TODOs (jmacd)
- 12a1a6b Merge branch 'main' of github.com:open-telemetry/oteps into jmacd/drops (jmacd)
- a23795f Merge branch 'main' of github.com:open-telemetry/oteps into jmacd/drops (jmacd)
- 1f48c3e wip (jmacd)
- 261323c add examples (jmacd)
- 6195761 small revision; needs more work (jmacd)
@@ -0,0 +1,181 @@
# OpenTelemetry Telemetry Pipeline metrics

Propose a uniform standard for telemetry pipeline metrics generated by
OpenTelemetry SDKs and Collectors with support for several levels of
detail.

**WIP**: This document has been edited recently, based on reviewer
feedback. Since it has changed substantially, I removed a lot of
text. I will restore this document after sharing the revisions with
reviewers.

## Motivation

OpenTelemetry aims to standardize conventions for the metrics
emitted by SDKs about the success and failure of telemetry reporting. At
the same time, the OpenTelemetry Collector is becoming a stable and
critical part of the ecosystem, and it has existing conventions which
are expected to connect with metrics emitted by SDKs.

We use the term "pipeline" to describe an arrangement of system
components which produce, consume, and process telemetry on its way
from the point of origin to the endpoint(s) in its journey.

## Explanation

### Detailed design

The proposed metric instrument would be named distinctly depending on
whether it is emitted by a Collector or an SDK, to prevent accidental
aggregation of these timeseries. The specified counter names would be:

- `otelsdk.producer.items`: count of successful and failed items of
  telemetry produced, by signal type, by an OpenTelemetry SDK.
- `otelcol.receiver.items`: count of successful and failed items of
  telemetry received, by signal type, by an OpenTelemetry Collector
  receiver component.
- `otelcol.processor.items`: count of successful and failed items of
  telemetry processed, by signal type, by an OpenTelemetry Collector
  processor component.
- `otelcol.exporter.items`: count of successful and failed items of
  telemetry exported, by signal type, by an OpenTelemetry Collector
  exporter component.

### Recommended conventional attributes

- `otel.success` (boolean): This is true or false depending on whether the
  component considers the outcome a success or a failure.
- `otel.outcome` (string): This describes the outcome in a more specific
  way than `otel.success`, with recommended values specified below.
- `otel.signal` (string): This is the name of the signal (e.g., "logs",
  "metrics", "traces").
- `otel.name` (string): Name of the component in a pipeline.
- `otel.pipeline` (string): Name of the pipeline in a collector.

### Specified `otel.outcome` attribute values

The `otel.outcome` attribute indicates extra information about a
success or failure. A set of standard conventional attribute values
is supplied and is considered a closed set. If these outcomes do not
accurately explain the reason for a success or failure outcome, the
set SHOULD be extended by OpenTelemetry.

For success=true:

- `accepted`: Indicates a normal, synchronous request success case.
  The item was consumed by the next stage of the pipeline, which
  returned success. Note the item could have been suppressed by a
  subsequent component, but as far as this component knows, the
  request was successful.
- `suppressed:<any other outcome>`: When the true
  outcome is not known at the time of counting, and the component
  intentionally returns success to its producer. Examples are given
  below.

For both success=true and success=false, there is a special outcome
indicating items did not reach the next stage in the pipeline,
considered "dropped". When comparing pipeline metrics from one stage
to the next, those which are dropped by a component are expected not
to appear in totals of the subsequent pipeline.

- `dropped`: Processors may use this to indicate both success and
  failure. Examples of success include sampling processors and filtering
  processors, which successfully avoid sending data based on
  configuration. For all components, dropped with success=false
  indicates that the component introduced an original failure and did
  not send to the next stage in the pipeline.
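
The stage-to-stage comparison described above can be sketched as a small calculation. This is only an illustration: the function name is hypothetical, and treating `suppressed:dropped` the same way as `dropped` is an assumption based on the `suppressed:<any other outcome>` form.

```python
# Outcomes assumed to mean "never sent to the next stage".
# Including "suppressed:dropped" here is an assumption of this sketch.
NOT_FORWARDED = frozenset({"dropped", "suppressed:dropped"})


def expected_downstream_items(counts_by_outcome):
    """Sum the item counts of one component, excluding dropped items.

    counts_by_outcome maps an otel.outcome value to an item count,
    already summed over otel.success.
    """
    return sum(
        n for outcome, n in counts_by_outcome.items()
        if outcome not in NOT_FORWARDED
    )
```

For example, a component reporting 90 `accepted`, 7 `dropped`, and 3 `rejected` items would be expected to contribute at most 93 items to the totals of the next stage.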

For success=false, transient and potentially retryable:

- `deadline_exceeded`: The item was in the process of being sent but the request
  timed out, or its deadline was exceeded.
- `resource_exhausted`: The item was handled by the next stage of the
  pipeline, which returned an error code indicating that it was
  overloaded. If the resource being exhausted is local and the item
  was not handled by the next stage of the pipeline, use `dropped`.
- `retryable`: The item was handled by the next stage of the pipeline,
  which returned a retryable error status not covered by any of the
  above values.

For success=false, permanent category:

- `rejected`: The item was handled by the next stage of the pipeline,
  which returned a permanent error status or partial success status
  indicating that some items could not be accepted.
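
The two failure categories above could be expressed as a lookup, for example when a dashboard or alert needs to separate transient from permanent failures. This is a hedged sketch; `failure_category` is a hypothetical helper, not something defined by this proposal.

```python
# Outcome values grouped by the failure categories listed above.
TRANSIENT = frozenset({"deadline_exceeded", "resource_exhausted", "retryable"})
PERMANENT = frozenset({"rejected"})


def failure_category(outcome):
    """Return "transient", "permanent", or None for other outcomes."""
    if outcome in TRANSIENT:
        return "transient"
    if outcome in PERMANENT:
        return "permanent"
    # Outcomes such as accepted or dropped belong to neither category.
    return None
```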

#### Success, Outcome matrix

| Success | Outcome                       | Meaning                                                           |
|---------|-------------------------------|-------------------------------------------------------------------|
| true    | accepted                      | Synchronous send succeeded                                        |
| true    | dropped                       | Dropped by intention                                              |
| false   | dropped                       | Producer saw the component return failure, request was not sent   |
| false   | deadline_exceeded             | Producer saw the component return failure, request timed out      |
| false   | resource_exhausted            | Producer saw the component return failure, insufficient resources |
| false   | retryable                     | Producer saw the component return other non-permanent condition   |
| false   | rejected                      | Producer saw the component return a permanent condition           |
| true    | suppressed:accepted           | Producer saw success; eventually accepted                         |
| true    | suppressed:dropped            | Producer saw success; request was not sent                        |
| true    | suppressed:deadline_exceeded  | Producer saw success; request sent, timed out                     |
| true    | suppressed:resource_exhausted | Producer saw success; request sent, insufficient resources        |
| true    | suppressed:retryable          | Producer saw success; request sent, other non-permanent condition |
| true    | suppressed:rejected           | Producer saw success; request sent, permanent condition           |
| true    | suppressed:unknown            | Producer saw success; no effort to report true outcome            |
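
The matrix can also be read as a validity check on (success, outcome) pairs. The following is a sketch under stated assumptions: it uses the `suppressed:` prefix as spelled in the attribute value list, and `is_valid_pair` is a hypothetical name.

```python
BASE_OUTCOMES = frozenset({
    "accepted", "dropped", "deadline_exceeded",
    "resource_exhausted", "retryable", "rejected",
})
PREFIX = "suppressed:"


def is_valid_pair(success, outcome):
    """Return True if the (success, outcome) pair is a row of the matrix."""
    if outcome.startswith(PREFIX):
        inner = outcome[len(PREFIX):]
        # Suppressed outcomes are always reported with success=true.
        return success and (inner in BASE_OUTCOMES or inner == "unknown")
    if outcome == "accepted":
        return success
    if outcome == "dropped":
        # Dropped is the only outcome allowed with either success value.
        return True
    # All remaining base outcomes are failures.
    return (not success) and outcome in BASE_OUTCOMES
```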

#### Examples of each outcome

##### Success, Accepted

This is the common success case. The item(s) were sent to the next
stage in the pipeline while blocking the producer.

##### Success, Dropped

A processor was configured with instructions not to pass certain data.

##### Success, Suppressed-Accepted

A component returned success to its producer, and later the outcome
was successful.

##### Failure, Dropped and Success, Suppressed-Dropped

(If suppressed: A component returned success to its producer, then ...)

The component never sent the item(s) due to limits in effect. For
example, shutdown was ordered and the queue could not be drained in
time due to a limit on parallelism.

##### Failure, Deadline exceeded and Success, Suppressed-Deadline exceeded

(If suppressed: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the item(s) did not
succeed before the deadline expired. If there were attempts to retry,
this is the outcome of the final attempt.

##### Failure, Resource exhausted and Success, Suppressed-Resource exhausted

(If suppressed: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the consumer
indicated its (or its consumers') resources were exceeded. If there
were attempts to retry, this is the outcome of the final attempt.

##### Failure, Retryable and Success, Suppressed-Retryable

(If suppressed: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the consumer
indicated some kind of transient condition other than deadline- or
resource-related (e.g., connection not accepted). If there were
attempts to retry, this is the outcome of the final attempt.

##### Failure, Rejected and Success, Suppressed-Rejected

(If suppressed: A component returned success to its producer, then ...)

The component attempted sending the item(s), but the consumer
returned a permanent error.
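
The suppressed cases above could look roughly like the following in an asynchronous component. This is an illustrative sketch only: `QueuedExporter`, its methods, and the `send` callable are hypothetical and do not correspond to any Collector or SDK API.

```python
from collections import Counter, deque


class QueuedExporter:
    """Hypothetical asynchronous component, used only for illustration."""

    def __init__(self, send):
        self._send = send        # callable(batch) -> outcome string
        self._queue = deque()
        self.counts = Counter()  # keyed by (success, outcome)

    def consume(self, batch):
        # Success is returned to the producer before the true outcome is known.
        self._queue.append(batch)
        return True

    def flush(self):
        while self._queue:
            batch = self._queue.popleft()
            outcome = self._send(batch)
            # The producer already saw success, so the items are counted with
            # success=true and the true outcome behind the "suppressed:" prefix.
            self.counts[(True, f"suppressed:{outcome}")] += len(batch)
```

A synchronous component would instead wait for `send` to return and count the items with the plain outcome and the matching success value.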
If I have a connector configured, I assume I get both the `otelcol.exporter.items` and `otelcol.receiver.items` metrics emitted for the connector, right?

Let's say I configured the Count connector on a traces pipeline, as described in the example in the component's README. The count connector then accepts traces on the `traces/in` pipeline and creates metrics on the `metrics/out` pipeline.

I imagine the `otelcol.exporter.items` metric for the count connector would count the incoming spans on the trace pipeline. What would be the `otel.outcome` for those correctly consumed spans? Would it be `consumed` or rather `unsampled`? These logs aren't shipped anywhere by the component, they are "swallowed" by the connector if I understand correctly.

I imagine the `otelcol.receiver.items` metric for the count connector would count the metrics created on the metrics pipeline, with the `otel.outcome` set to `consumed`.

Is the consume operation synchronous? I think the traces/in pipeline will wait until the metrics/out pipeline finishes the consume operation, so the outcome for traces/in will depend on the outcome for metrics/out. If metrics/out fails w/ a retryable status, maybe the producer will retry.

Since the count connector can produce more or fewer metric data points than arriving spans, I do not expect the item counts to match between the exporter and receiver, but I think the outcomes could match for synchronous operations. If the operation is asynchronous, the rules discussed in this proposal would apply -- the traces/in might see `consumed` while the metrics/out sees some sort of failure.

I don't see any problems, per se, just that the monitoring equations for connectors don't apply. I can't assume that the items_in == items_dropped + items_out.