Proposed updates to #184 #1
Conversation
> leakage. Multi-level collector topologies should allow configuration
> of distinct domains (e.g., `agent` and `gateway`).
>
> ### Basic level of detail
What is the value of having this level? It saves a single binary attribute, but there are plenty of other attributes that are required (domain, name, signal, etc.), so the added complexity in the spec doesn't seem warranted.
Saving a boolean attribute means having half as many (i.e., one fewer) timeseries. The information available in the attribute is almost redundant, so I think having a way to avoid that one additional timeseries matters.
When you have metrics on a pipeline, the information made available by a success
attribute (i.e., one additional series) can be inferred by comparing the subsequent component's totals. This is admittedly a recursive definition: for the subsequent component to establish its success/failure rate, it will need its own subsequent component's totals, and the final stage in a pipeline will likely not want to use basic-level metrics for this reason. If Total(x)
is the sum of the single metric for a component x, the recursive rule for deriving Success/Failure of that component is:
```
Dropped(this) = Total(this) - Total(next)
Failed(this)  = Dropped(this) + Failed(next)
Success(this) = Total(this) - Failed(this)
```
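As a minimal sketch of the recursion above (names are illustrative; the final stage supplies its own Failed count, since the recursion needs a base case):

```python
def derive(totals, failed_last):
    """Derive (dropped, failed, success) per component from single totals.

    totals: per-component item totals, ordered first stage to last stage.
    failed_last: failure count reported by the final stage (base case).
    """
    n = len(totals)
    dropped = [0] * n
    failed = [0] * n
    success = [0] * n
    # Base case: the final stage reports its own failures directly.
    failed[n - 1] = failed_last
    success[n - 1] = totals[n - 1] - failed_last
    # Walk backward: each stage's losses are visible in the next stage's total.
    for i in range(n - 2, -1, -1):
        dropped[i] = totals[i] - totals[i + 1]  # items lost between stages
        failed[i] = dropped[i] + failed[i + 1]  # Failed(this) = Dropped + Failed(next)
        success[i] = totals[i] - failed[i]      # Success(this) = Total - Failed(this)
    return list(zip(dropped, failed, success))

# Example: a processor saw 100 items, the exporter saw 90 and reports 5 failures.
print(derive([100, 90], failed_last=5))  # [(10, 15, 85), (0, 5, 85)]
```

The base case is why the comment above notes that the final stage will likely not want basic-level metrics: without its own failure count, the recursion cannot bottom out.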
> the exporter's `success=false` to determine the number of items
> dropped by the processor, for example.
>
> ### Detailed metrics
A similar comment to the one on basic: why have a separate level? What objective is worth complicating the spec this way?
This is about letting users trade off cost against what they need and want to observe: more metrics may be useful, but they are just additional expense when they are not being used.
I mentioned a personal side-story that led me to this realization in today's Spec SIG: monitoring a water system is similar to monitoring a telemetry pipeline, and it is also a situation where each individual meter is a substantial expense. The minimum number of meters necessary to calculate total leakage in the system is one meter for (total) system production and one meter per user with a service connection. From total in and total out we can compute leakage, which is equivalent to the calculation for dropped items.
This leads to the conclusion that the minimum-cost configuration for a telemetry pipeline, capable of computing a global Dropped statistic, would use Basic-level detail in each SDK, disabled metrics in all intermediate collectors, and Normal-level detail for the final component of the final collector in the pipeline. If the user is in a situation where the metrics from the SDKs are not comparable with the metrics from subsequent stages in the pipeline for any reason, they should use Normal-level detail in the SDK.
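The minimum-cost arrangement above can be checked with simple arithmetic (the numbers here are purely illustrative): with one Basic-level total per SDK and one Normal-level total at the final exporter, global dropped is the difference between what entered and what left the pipeline, just as leakage is production minus metered consumption:

```python
# Hypothetical totals: three SDKs (Basic detail, one counter each) and
# the final collector's exporter (Normal detail, one observed total).
sdk_totals = [1000, 2500, 500]  # items produced by each SDK ("meters" at the edge)
final_exported = 3900           # items the final exporter sent on ("system meter")

# Global dropped, analogous to leakage = total production - total consumption.
global_dropped = sum(sdk_totals) - final_exported
print(global_dropped)  # 100
```

No intermediate collector needs metrics enabled for this global statistic, which is exactly the cost-saving argument being made.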
I'm also aware of tracing pipelines where rate limits are enforced at the destination. In this scenario, if the response code is `resource_exhausted` I should turn up sampling; if it's `timeout` I should complain to my backend team about an SLO violation; and if it's `queue_full` it means I should reconfigure the SDK.
But why shouldn't this be handled with just the pre-aggregation rules ("views"?) instead of making it a problem for the exporters / components to know about different levels?
(This content will appear in a new location, I'm writing an OTEP.)
My assumption is that this would be implemented using views, and the text of a semantic convention would be explaining which views to configure at which level of detail.
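As a purely illustrative sketch of that assumption (the schema below is hypothetical and not the current semantic-convention tooling), a convention could enumerate which views each level of detail enables:

```yaml
# Hypothetical mapping of detail levels to metric views; all names invented.
levels:
  basic:
    views:
      - instrument: otelsdk.exporter.items       # single total per component
  normal:
    views:
      - instrument: otelsdk.exporter.items
        attributes: [success]                    # adds the success/failure split
  detailed:
    views:
      - instrument: otelsdk.exporter.items
        attributes: [success, response_code]     # per-response-code series
```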
Still rough: open-telemetry/oteps#238
@carlosalberto
After reviewing the collector's equivalent metrics and auditing the code framework there, I came up with these recommendations in the form of a (large) change to yours.
This is also very speculative: I edited the generated YAML file to demonstrate the outcome I would like, and it will take a small improvement to the build tools to achieve the intended results.
Following the collector's example, I am proposing three levels of detail, called "basic", "normal", and "detailed". In the current tooling we have "required" and "recommended", which I take as equivalent to "basic" and "normal"; to this I would add "detailed", so the tools need a small change. Please review the generated file as if I had generated it; the work then needs to be put back into `otel.yaml`, if I understand correctly.