Proposed updates to #184 #1
Conversation
> leakage. Multi-level collector topologies should allow configuration
> of distinct domains (e.g., `agent` and `gateway`).
>
> ### Basic level of detail
What is the value of having this level? It saves a single binary attribute, but there are plenty of other attributes that are required (domain, name, signal, etc.), so the added complexity in the spec doesn't seem warranted.
Saving a boolean attribute means having half as many (i.e., one fewer) timeseries. The information available in the attribute is almost redundant, so I think having a way to avoid that one additional timeseries matters.
When you have metrics on a pipeline, the information made available by a success
attribute (i.e., one additional series) can be inferred by comparing the subsequent component's totals. This is admittedly a recursive definition: for the subsequent component to establish its success/failure rate, it will need its own subsequent component's totals, and the final stage in a pipeline will likely not want to use basic-level metrics for this reason. If Total(x)
is the sum of the single metric for a component x, the recursive rule for deriving Success/Failure of that component is:
```
Dropped(this) = Total(this) - Total(next)
Failed(this)  = Dropped(this) + Failed(next)
Success(this) = Total(this) - Failed(this)
```
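As a minimal sketch of the recursion above (names are illustrative; the final stage supplies its own Failed count, since the recursion needs a base case):

```python
def derive(totals, failed_last):
    """Derive (dropped, failed, success) per component from single totals.

    totals: per-component item totals, ordered first stage to last stage.
    failed_last: failure count reported by the final stage (base case).
    """
    n = len(totals)
    dropped = [0] * n
    failed = [0] * n
    success = [0] * n
    # Base case: the final stage reports its own failures directly.
    failed[n - 1] = failed_last
    success[n - 1] = totals[n - 1] - failed_last
    # Walk backward: each stage's losses are visible in the next stage's total.
    for i in range(n - 2, -1, -1):
        dropped[i] = totals[i] - totals[i + 1]  # items lost between stages
        failed[i] = dropped[i] + failed[i + 1]  # Failed(this) = Dropped + Failed(next)
        success[i] = totals[i] - failed[i]      # Success(this) = Total - Failed(this)
    return list(zip(dropped, failed, success))

# Example: a processor saw 100 items, the exporter saw 90 and reports 5 failures.
print(derive([100, 90], failed_last=5))  # [(10, 15, 85), (0, 5, 85)]
```

The base case is why the comment above notes that the final stage will likely not want basic-level metrics: without its own failure count, the recursion cannot bottom out.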
> the exporter's `success=false` to determine the number of items
> dropped by the processor, for example.
>
> ### Detailed metrics
A similar comment to the one on basic: why have a separate level? What objective is worth complicating the spec this way?
This is about letting users trade off cost against what they need and want to observe: more metrics may be useful, but they are just additional expense when they are not being used.
I mentioned a personal side-story that led me to this realization in today's Spec SIG: monitoring a water system is similar to monitoring a telemetry pipeline, and it is also a situation where each individual meter is a substantial expense. The minimum number of meters necessary to calculate total leakage in the system is one meter for (total) system production and one meter per user with a service connection. From total in and total out we can compute leakage, which is equivalent to the calculation for dropped items.
This leads to the conclusion that the minimum-cost configuration for a telemetry pipeline, capable of computing a global Dropped statistic, would use Basic-level detail in each SDK, disabled metrics in all intermediate collectors, and Normal-level detail for the final component of the final collector in the pipeline. If the user is in a situation where the metrics from the SDKs are not comparable with the metrics from subsequent stages in the pipeline for any reason, they should use Normal-level detail in the SDK.
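The minimum-cost arrangement above can be checked with simple arithmetic (the numbers here are purely illustrative): with one Basic-level total per SDK and one Normal-level total at the final exporter, global dropped is the difference between what entered and what left the pipeline, just as leakage is production minus metered consumption:

```python
# Hypothetical totals: three SDKs (Basic detail, one counter each) and
# the final collector's exporter (Normal detail, one observed total).
sdk_totals = [1000, 2500, 500]  # items produced by each SDK ("meters" at the edge)
final_exported = 3900           # items the final exporter sent on ("system meter")

# Global dropped, analogous to leakage = total production - total consumption.
global_dropped = sum(sdk_totals) - final_exported
print(global_dropped)  # 100
```

No intermediate collector needs metrics enabled for this global statistic, which is exactly the cost-saving argument being made.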
I'm also aware of tracing pipelines where rate limits are enforced at the destination. In this scenario, if the response code is `resource_exhausted` I should turn up sampling; if it's `timeout` I should complain to my backend team about an SLO violation; and if it's `queue_full` it means I should reconfigure the SDK.
But why shouldn't this be handled with just the pre-aggregation rules ("views"?) instead of making it a problem for the exporters / components to know about different levels?
(This content will appear in a new location, I'm writing an OTEP.)
My assumption is that this would be implemented using views, and the text of a semantic convention would be explaining which views to configure at which level of detail.
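As a purely illustrative sketch of that assumption (the schema below is hypothetical and not the current semantic-convention tooling), a convention could enumerate which views each level of detail enables:

```yaml
# Hypothetical mapping of detail levels to metric views; all names invented.
levels:
  basic:
    views:
      - instrument: otelsdk.exporter.items       # single total per component
  normal:
    views:
      - instrument: otelsdk.exporter.items
        attributes: [success]                    # adds the success/failure split
  detailed:
    views:
      - instrument: otelsdk.exporter.items
        attributes: [success, response_code]     # per-response-code series
```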
Still rough: open-telemetry/oteps#238
@carlosalberto
After reviewing the collector's equivalent metrics and auditing the code framework there, I came up with these recommendations in the form of a (large) change to yours.
This is also very speculative: I edited the generated YAML file to demonstrate the outcome I would like, and it will take a small improvement to the build tools to achieve the intended results.
Following the collector's example, I am proposing three levels of detail, called "basic", "normal", and "detailed". In the current tooling we have "required" and "recommended", which I take as equivalent to "basic" and "normal"; to this I would add "detailed", so the tools need a small change. Please review the generated file as if I had generated it; the work then needs to be put back into `otel.yaml`, if I understand correctly.